Skip to:
Content

BuddyPress.org

Opened 9 years ago

Closed 9 years ago

Last modified 9 years ago

#6517 closed defect (bug) (fixed)

bp_create_excerpt returning mall-formed markup - mb_strlen issues?

Reported by: hnla's profile hnla Owned by: boonebgorges's profile boonebgorges
Milestone: 2.3.3 Priority: normal
Severity: normal Version:
Component: Core Keywords: has-patch needs-testing
Cc:

Description

Until very recently this BP function worked happily to take a piece of post-post_content wrapped in wp_ksses in turn wrapped in wpautop along with a truncation length and BP array of args to remove ellipses. This was to gain more control os a specific content piece with additional allowed_tags.

This arrangement has suddenly resulted in the $length being ignored and full content returned as well as stray closing tags appearing at the end of the content (resultant when markup elements present i.e anchors)

I re-factored my approach removing wp_ksses as we have recently upgraded BP allowed tags so it wasn't necessary and removing wpautop to use later on the returned excerpt (works fine).

Issue still remains that links are mall-formed and looking at the function where we attempt to gather / strip the tags to remove from the character/string length count we are failing to correctly return them, possibly failing on out final foreach closing tags loop.

This feels like it may be an issue with mb_strlen as I've seen it mentioned in another ticket and Boone alluded to the possibility WP had made some changes here?

Unit test are suggested as a means of demonstrating the issue but sadly with zero experience of writing tests that will have to wait.

My current tests are in a Archive post loop running a new wp_query:
bp_create_excerpt( $this_cat_news->post->post_content, 300, array('ending' => '', 'html' => true ) )

With the post content having a simple set of basic anchor links in the post body.

Attachments (5)

6517.diff (2.6 KB) - added by boonebgorges 9 years ago.
excerpt issue.jpg (37.5 KB) - added by danbp 9 years ago.
excerpt 1.jpg (57.2 KB) - added by danbp 9 years ago.
excerpt 2.jpg (78.9 KB) - added by danbp 9 years ago.
6517.2.diff (4.6 KB) - added by boonebgorges 9 years ago.

Download all attachments as: .zip

Change History (39)

#1 @hnla
9 years ago

Additionally we acknowledge Cake as source for function but phpdocs link appears outdated:

http://book.cakephp.org/view/1469/Text#truncate-1625

Best match I could find was:

http://book.cakephp.org/3.0/en/core-libraries/text.html#truncating-text

#2 @boonebgorges
9 years ago

hnla - Can you please share some sample content that demonstrates the problem?

#3 @hnla
9 years ago

I'll try and write a small function to demonstrate the issue in isolation.

#4 @hnla
9 years ago

This should demonstrate the issue, I've been trying out various combinations of markup, links are commented as showing linked characters that shouldn't be e.g '/a' but they are simply a consequence of this malformed angle tag re-introduction after stripping from the character count, I have though seen this linking extend to a whole paragraph or end of returned excerpt so is an issue when ideal conditions encountered.

function bp_excerpt_truncation_issue_trac6517() {

$test_content = '<p>This test content demonstrates an issue 
with truncating a long string  run through the bp_create_excerpt function  and which has pcdata</p>
<p>The issue manifests when we have tags to process (note the html arg in the function that removes the characters
representing parsed data tags from the truncated string length return) <i>at this point  
it would be noted that we now have stray closing angle brackets at the end of the text </i></p>
If we now add link markup into the content body we will see mal-formed tags causing  incorrect wrapping of the text.
We now link to one of the <a href="http://buddypress.org/"> really awsome sites on the net </a> and observe that the text link
is now looking a little odd <a href="http://google.com/">when I add another link into the mix</a> observe how we now 
include in the link the partial closing anchor tag but without the less than angle bracket. 
<p>The issue ? clearly appears to be one of incorrectly returning the tag characters 
after stripping out our loop counting is off? a link right at the end 
<a href="http://https://buddypress.trac.wordpress.org/ticket/6517#comment:3">to the trac ticket</a> 
breaks including partial linked tag characters</p>';

$excerpt_test =  bp_create_excerpt( $test_content , 350, array('ending' => '', 'html' => true ) );

echo $excerpt_test;
}

#5 @hnla
9 years ago

If in fact I change the html arg to false i.e do calculate html tags in excerpt length then we avoid the issue by avoiding the need to strip_tags via the mb_strlen function and return a correct excerpt.

On a side note it's evident this script is from the dark ages of the internet as highlighted by this line:
!preg_match( '/img|br|input|hr|area|base|basefont|col|frame|isindex|link|meta|param/s

HTML 3.0 :s

So dumping $openTags results in:
array(7) { [0]=> string(1) "p" [1]=> string(2) "a " [2]=> string(2) "a>" [3]=> string(2) "a>" [4]=> string(2) "a " [5]=> string(2) "p>" [6]=> string(2) "p>" }

we then foreach those and do:
'</' . $tag . '>'

Clearly that's wrong and why we're doubling up the closing angle bracket.

[2]=> string(2) "a>"

'</' . 'a>' . '>' effectively.

Can't see where in that yet we're messing up the $openTag and it's hard to fathom preg_match stuff.

Last edited 9 years ago by hnla (previous) (diff)

#6 @hnla
9 years ago

@boone the issue is with revision 9523

In bp_create_excerpt(), recognize HTML comments when exclude HTML from counts.

This prevents problems with, among other things, the <!--more--> tag in
activity streams.

Fixes #3680

Reverting the changes to the preg_match_all line corrects the additional closing brackets.

#7 @boonebgorges
9 years ago

  • Milestone changed from Awaiting Review to 2.4

Thanks for tracking this down a bit, hnla. There seem to be two problems. The first is that the regex change in [9523] made the tag check too liberal, with the result that some larger-than-desired chunks were being detected. The second is that the exact=false logic - which ensures that strings are only truncated on word boundaries - was not distinguishing between a space that is a word boundary and a space that appears inside of an HTML tag. That is, when truncating Foo <a href="http://example.com">Bar</a> Baz with exact=false, we should never break on the space between a and href.

Please test 6517.diff to see if it's solving your problems. Thanks!

@boonebgorges
9 years ago

#8 @hnla
9 years ago

Yep the word boundary aspect I suspected had a part to play in things but couldn't get to testing further, I have issues with other activity excerpt media that I must track down, then I'll test this patch out and report back on it.

On a sidenote I was wondering what we actually did with the more tag or html comments, again haven't actually tested to see if we strip them out or, as other markup, return them which in the case of comments would be pointless?

#9 @hnla
9 years ago

@Boone:
Tested patch, tested character counts with html true/false and that seems fine, changing count to include markup tags or exclude them from $length.

However it looks like we've lost $exact as an array option it should be default false yet seems stuck on true.

#10 @boonebgorges
9 years ago

Thanks for testing, hnla. Unit tests for exact are still passing - can you show me what you're doing to test this?

#11 @danbp
9 years ago

A similar issue was recently posted on the forum.

Tried to replicate on a fresh instal and issue is confirmed. See screenshot.

I applied 6517.diff which solved the issue.

@danbp
9 years ago

#12 @boonebgorges
9 years ago

  • Milestone changed from 2.4 to 2.3.3

Thanks, danbp. As this was introduced in 2.3.0, I'm going to fix this on the 2.3 branch.

#13 @hnla
9 years ago

@Boone

can you show me what you're doing to test this?

Having determined count and html option work as expected and having a truncation returned that ends mid word adding in the option 'exact' => false ($text will not be cut mid-word) I expect to see that truncation expand to include the word up to the next white space point?

Wondering if that is a valid expectation though now, however what ever combo of $html true/false and $exact true/false I can not demonstrate words shown in full when I flip flop between say 20 characters or 21/22.

#14 @boonebgorges
9 years ago

bp_create_excerpt() rounds down. That is:

bp_create_excerpt( 'foo bar baz', 5, array( 'exact' => true ) );
// 'foo ba'

bp_create_excerpt( 'foo bar baz', 5, array( 'exact' => false ) );
// 'foo'

The one exception is when the first word is so long that it'd result in a zero string. See #6254.

Does that square with what you're seeing?

#15 @hnla
9 years ago

Yes your test case works (ok so rounds down, would have to really )

But now add an anchor into the mix as currently that test case doesn't cover all the perms
$totalLength == 18

bp_create_excerpt( 'foo <a>bar</a> baz', 5, array( 'exact' => true, 'html' => false ) );

//  'html' is false so include tag count
// 'exact' is true so return literal string length

//  returns 'foo <'

// Result returns parsed markup as character data.
// Result not as expected?
// We need to strip tags from an literal count of characters if including markup in count
// lest we land on a portion of that markup inadvertently?
bp_create_excerpt( 'foo <a>bar</a> baz', 5, array( 'exact' => true, 'html' => true ) );

//  returns 'foo b'

// Result as expected; 5 characters wanted 'exact' is true, markup is removed from count 
// so we have first character of second word rendered (wrapped in it's anchor though!)

bp_create_excerpt( 'foo <a>bar</a> baz', 6, array( 'exact' => false, 'html' => false ) );

//  returns 'foo'

//  'html' is false so include tag count
// 'exact' is false so don't cut mid-word

// Result as expected; 6 characters counted up to '<a' result rounded down to first word boundary end.


#16 @hnla
9 years ago

The greatest trick the devil ever pulled was convincing the world that reg expressions were cool

#17 @danbp
9 years ago

As always, the devil IS in the detail ! (learned after some hours face to face with regex)

#18 @boonebgorges
9 years ago

The greatest trick the devil ever pulled was convincing the world that reg expressions were cool

Save your scorn for a time when regular expressions are actually part of the problem :)

bp_create_excerpt( 'foo <a>bar</a> baz', 5, array( 'exact' => true, 'html' => false ) );

//  'html' is false so include tag count
// 'exact' is true so return literal string length

//  returns 'foo <'

// Result returns parsed markup as character data.
// Result not as expected?
// We need to strip tags from an literal count of characters if including markup in count
// lest we land on a portion of that markup inadvertently?

No, I think this is incorrect. html=false implies that the input should be treated as a simple string - we should not be doing any parsing of it. If a developer chooses to combine exact=true with html=false, that's their prerogative. I can imagine situations where someone might want to use this function to get an exact-length excerpt (for reasons other than display within a web page).

#19 @boonebgorges
9 years ago

  • Owner set to boonebgorges
  • Resolution set to fixed
  • Status changed from new to closed

In 9962:

Improve the HTML-handling logic of bp_create_excerpt().

  • The regular expression that detects tags should be less generous, to avoid false matches.
  • Word boundary detection, as used when exact=false, should never return a space character inside of an HTML tag (such as the space in <a href="http://example.com">).

These changes fix some tag-parsing issues introduced in [9523].

Fixes #6517.

#20 @hnla
9 years ago

If a developer chooses to combine exact=true with html=false that's their prerogative

Had a feeling you might say that, and sort of have to agree, but that's normally my harsh line with people, when I get tired of making something idiot proof :)

However not really sure that 'implies' is correct why on earth do we have this option to either count markup or not the string is the only important count the markup if exists should always be treated as parsed data not to be displayed and therefore not part of the count and if that were true the function then becomes a whole heap simpler?

I can imagine situations where someone might want to use this function to get an exact-length excerpt (for reasons other than display within a web page

surely there are simpler php functions to count a string if you needed to but hey ho, a long as we're happy it works we'll expect devs/users not to pass those two option together to count all and display exact - works for me!

Last edited 9 years ago by hnla (previous) (diff)

#21 @boonebgorges
9 years ago

The purpose of the function is to create excerpts. If someone wants to create an excerpt of the form foo <a [&hellip;], there's no reason we should prevent it from happening. While it's true that someone could simply use substr() to create an "exact" excerpt, it may be that they're using bp_create_excerpt() as part of a larger system, and want to have consistent interfaces.

#22 @boonebgorges
9 years ago

In 9963:

Improve the HTML-handling logic of bp_create_excerpt().

  • The regular expression that detects tags should be less generous, to avoid false matches.
  • Word boundary detection, as used when exact=false, should never return a space character inside of an HTML tag (such as the space in <a href="http://example.com">).

These changes fix some tag-parsing issues introduced in [9523].

Merges [9962] to the 2.3 branch for 2.2.3.

Fixes #6517.

#23 @danbp
9 years ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

A new issue started with excerpt after applying the patch.

when you publish a text which goes over excerpt limit in update, topic or comment, no text appears except 3dots and read more. You have to click read more to see all. Annoying, if the excerpt is not initially showed.

If text is under the limit, everything goes well.

This ticket was mentioned in Slack in #buddypress by danbp. View the logs.


9 years ago

#25 @boonebgorges
9 years ago

danbp - Can you please give specific steps to reproduce, including examples? My testing, and the automated tests, suggest that the function is not completely broken in the way that you allege :)

#26 @danbp
9 years ago

@boonebgorges,

took my a while to sort this out, sorry for late answer.

When you insert a long text, you should see an excerpt and [read more] button.

when you insert Lorem Ipsum text, you got a correct excerpt output.

When you insert a accented text like french, the whole excerpt is stripped and you see only ... [read more]

See attached images.

Pic 1 shows 3 activities.
1) a correct excerpt in latin text.
2) a correct text under the excert limit, in french
3) the same text, a bit longer, with the empty excerpt

Pic 2 is the same activity, with the third activity deployed.

You can reproduce that by inserting a long latin text.

Est @dan et velit dolores eos quasi expedita sed. Assumenda asperiores suscipit corrupti ut provident distinctio quia. 

Dolores enim qui repellendus natus. Earum voluptas est explicabo et. Neque at dolorum cumque. Beatae ipsa ratione quibusdam voluptates quibusdam, vel est necessitatibus voluptatem id. Porro deleniti doloremque quaerat 

in aut accusamus. Eligendi possimus similique nesciunt voluptas odio sed. Ab aut fuga voluptas quis illo. Eos ipsum est et eos amet tempore.

and a long french text

Toutes les connaissances que les hommes avaient mis sur Internet lui étaient accessible. Les grandes bibliothèques du monde entier n’avaient plus de secret pour lui. Il pouvait apprendre très vite, beaucoup plus vite que n’importe quel humain. 
Il avait appris toutes les connaissances du monde entier, visiter tout les pays. C’est lui qui avait fait en sorte qu’Internet se déploie ainsi.
Last edited 9 years ago by danbp (previous) (diff)

@danbp
9 years ago

@danbp
9 years ago

@boonebgorges
9 years ago

#27 @boonebgorges
9 years ago

  • Keywords has-patch needs-testing added

Thanks very much for the details, danbp. 6517.2.diff is an improvement on the previous logic, with a unit test to show that it works as expected with accented characters. Could I ask you (and others) to give it a try?

#28 @hnla
9 years ago

Was seeing an issue with block of text posted, (excerpt function running with default args) and excerpts returning for just the first word along with $ending.

6517.2 patch has cleared that up excerpts appear to work correctly and accept and retun accented characters.

Not sure if danbp still sees issues though.

#29 follow-up: @boonebgorges
9 years ago

Was seeing an issue with block of text posted, (excerpt function running with default args) and excerpts returning for just the first word along with $ending.

Yeah. There was an error in [9963] that sorta double-flipped the string. (I've gotta look backward from the end of the string to find the last word boundary, but I was forgetting to change the direction before truncating.) My tests weren't written in a way that accounted properly for this. I think this is pretty much danbp's issue - it's not really about accented characters, but about the fact that the error only emerged when the string contains markup. Anyway, danbp, please have a look if you can.

#30 in reply to: ↑ 29 @danbp
9 years ago

Replying to boonebgorges:

it's not really about accented characters, but about the fact that the error only emerged when the string contains markup. Anyway, danbp, please have a look if you can.

FYI, when you enter my previous french text and add `some text` (with code button) to part of it, the excerpt comes up !

I applied 6517.2 and the issue is gone. Seems it is working correclty for now.

Thank you !

#31 @boonebgorges
9 years ago

  • Resolution set to fixed
  • Status changed from reopened to closed

In 9967:

The continuing adventures of bp_create_excerpt().

This changeset fixes a bug introduced in [9963] that caused excerpts to be
truncated too much in some cases. It also fixes some potential issues with
multibyte strings when html=true.

Fixes #6517.

#32 @boonebgorges
9 years ago

In 9968:

The continuing adventures of bp_create_excerpt().

This changeset fixes a bug introduced in [9963] that caused excerpts to be
truncated too much in some cases. It also fixes some potential issues with
multibyte strings when html=true.

Merges [9967] to the 2.3 branch for 2.3.3.

Fixes #6517.

#33 @danbp
9 years ago

Merci for the quick fix, Boone !

#34 @boonebgorges
9 years ago

pas du tout!

Note: See TracTickets for help on using tickets.