Context Navigation

← Previous Ticket
Next Ticket →

#6517 closed defect (bug) (fixed)

bp_create_excerpt returning mall-formed markup - mb_strlen issues?

Reported by:	hnla	Owned by:	boonebgorges
Milestone:	2.3.3	Priority:	normal
Severity:	normal	Version:
Component:	Core	Keywords:	has-patch needs-testing
Cc:

Description ¶

Until very recently this BP function worked happily to take a piece of post-post_content wrapped in wp_ksses in turn wrapped in wpautop along with a truncation length and BP array of args to remove ellipses. This was to gain more control os a specific content piece with additional allowed_tags.

This arrangement has suddenly resulted in the $length being ignored and full content returned as well as stray closing tags appearing at the end of the content (resultant when markup elements present i.e anchors)

I re-factored my approach removing wp_ksses as we have recently upgraded BP allowed tags so it wasn't necessary and removing wpautop to use later on the returned excerpt (works fine).

Issue still remains that links are mall-formed and looking at the function where we attempt to gather / strip the tags to remove from the character/string length count we are failing to correctly return them, possibly failing on out final foreach closing tags loop.

This feels like it may be an issue with mb_strlen as I've seen it mentioned in another ticket and Boone alluded to the possibility WP had made some changes here?

Unit test are suggested as a means of demonstrating the issue but sadly with zero experience of writing tests that will have to wait.

My current tests are in a Archive post loop running a new wp_query:
bp_create_excerpt( $this_cat_news->post->post_content, 300, array('ending' => '', 'html' => true ) )

With the post content having a simple set of basic anchor links in the post body.

Commits (4)

[9962] Improve the HTML-handling logic of bp_create_excerpt().… by @boonebgorges 10 years ago
[9963] Improve the HTML-handling logic of bp_create_excerpt().… by @boonebgorges 10 years ago
[9967] The continuing adventures of bp_create_excerpt().… by @boonebgorges 10 years ago
[9968] The continuing adventures of bp_create_excerpt().… by @boonebgorges 10 years ago

Attachments (5)

6517.diff (2.6 KB) - added by boonebgorges 10 years ago.
excerpt issue.jpg (37.5 KB) - added by danbp 10 years ago.
excerpt 1.jpg (57.2 KB) - added by danbp 10 years ago.
excerpt 2.jpg (78.9 KB) - added by danbp 10 years ago.
6517.2.diff (4.6 KB) - added by boonebgorges 10 years ago.

Download all attachments as: .zip

Pull Requests ()

Loading…

Change History (39)

#1 @hnla
10 years ago

Additionally we acknowledge Cake as source for function but phpdocs link appears outdated:

http://book.cakephp.org/view/1469/Text#truncate-1625

Best match I could find was:

http://book.cakephp.org/3.0/en/core-libraries/text.html#truncating-text

#2 @boonebgorges
10 years ago

hnla - Can you please share some sample content that demonstrates the problem?

#3 @hnla
10 years ago

I'll try and write a small function to demonstrate the issue in isolation.

#4 @hnla
10 years ago

This should demonstrate the issue, I've been trying out various combinations of markup, links are commented as showing linked characters that shouldn't be e.g '/a' but they are simply a consequence of this malformed angle tag re-introduction after stripping from the character count, I have though seen this linking extend to a whole paragraph or end of returned excerpt so is an issue when ideal conditions encountered.

function bp_excerpt_truncation_issue_trac6517() {

$test_content = '<p>This test content demonstrates an issue 
with truncating a long string  run through the bp_create_excerpt function  and which has pcdata</p>
<p>The issue manifests when we have tags to process (note the html arg in the function that removes the characters
representing parsed data tags from the truncated string length return) <i>at this point  
it would be noted that we now have stray closing angle brackets at the end of the text </i></p>
If we now add link markup into the content body we will see mal-formed tags causing  incorrect wrapping of the text.
We now link to one of the <a href="http://buddypress.org/"> really awsome sites on the net </a> and observe that the text link
is now looking a little odd <a href="http://google.com/">when I add another link into the mix</a> observe how we now 
include in the link the partial closing anchor tag but without the less than angle bracket. 
<p>The issue ? clearly appears to be one of incorrectly returning the tag characters 
after stripping out our loop counting is off? a link right at the end 
<a href="http://https://buddypress.trac.wordpress.org/ticket/6517#comment:3">to the trac ticket</a> 
breaks including partial linked tag characters</p>';

$excerpt_test =  bp_create_excerpt( $test_content , 350, array('ending' => '', 'html' => true ) );

echo $excerpt_test;
}

#5 @hnla
10 years ago

If in fact I change the html arg to false i.e do calculate html tags in excerpt length then we avoid the issue by avoiding the need to strip_tags via the mb_strlen function and return a correct excerpt.

HTML 3.0 :s

So dumping $openTags results in:
array(7) { [0]=> string(1) "p" [1]=> string(2) "a " [2]=> string(2) "a>" [3]=> string(2) "a>" [4]=> string(2) "a " [5]=> string(2) "p>" [6]=> string(2) "p>" }

we then foreach those and do:
'</' . $tag . '>'

Clearly that's wrong and why we're doubling up the closing angle bracket.

[2]=> string(2) "a>"

'</' . 'a>' . '>' effectively.

Can't see where in that yet we're messing up the $openTag and it's hard to fathom preg_match stuff.

Last edited 10 years ago by hnla (previous) (diff)

#6 @hnla
10 years ago

@boone the issue is with revision 9523

In bp_create_excerpt(), recognize HTML comments when exclude HTML from counts.

This prevents problems with, among other things, the  tag in
activity streams.

Fixes #3680

Reverting the changes to the preg_match_all line corrects the additional closing brackets.

#7 @boonebgorges
10 years ago

Milestone changed from Awaiting Review to 2.4

Thanks for tracking this down a bit, hnla. There seem to be two problems. The first is that the regex change in [9523] made the tag check too liberal, with the result that some larger-than-desired chunks were being detected. The second is that the exact=false logic - which ensures that strings are only truncated on word boundaries - was not distinguishing between a space that is a word boundary and a space that appears inside of an HTML tag. That is, when truncating Foo <a href="http://example.com">Bar</a> Baz with exact=false, we should never break on the space between a and href.

Please test 6517.diff to see if it's solving your problems. Thanks!

@boonebgorges
10 years ago

Attachment 6517.diff added

#8 @hnla
10 years ago

Yep the word boundary aspect I suspected had a part to play in things but couldn't get to testing further, I have issues with other activity excerpt media that I must track down, then I'll test this patch out and report back on it.

On a sidenote I was wondering what we actually did with the more tag or html comments, again haven't actually tested to see if we strip them out or, as other markup, return them which in the case of comments would be pointless?

#9 @hnla
10 years ago

@Boone:
Tested patch, tested character counts with html true/false and that seems fine, changing count to include markup tags or exclude them from $length.

However it looks like we've lost $exact as an array option it should be default false yet seems stuck on true.

#10 @boonebgorges
10 years ago

Thanks for testing, hnla. Unit tests for exact are still passing - can you show me what you're doing to test this?

#11 @danbp
10 years ago

A similar issue was recently posted on the forum.

Tried to replicate on a fresh instal and issue is confirmed. See screenshot.

I applied 6517.diff which solved the issue.

@danbp
10 years ago

Attachment excerpt issue.jpg added

#12 @boonebgorges
10 years ago

Milestone changed from 2.4 to 2.3.3

Thanks, danbp. As this was introduced in 2.3.0, I'm going to fix this on the 2.3 branch.

#13 @hnla
10 years ago

@Boone

can you show me what you're doing to test this?

Having determined count and html option work as expected and having a truncation returned that ends mid word adding in the option 'exact' => false ($text will not be cut mid-word) I expect to see that truncation expand to include the word up to the next white space point?

Wondering if that is a valid expectation though now, however what ever combo of $html true/false and $exact true/false I can not demonstrate words shown in full when I flip flop between say 20 characters or 21/22.

#14 @boonebgorges
10 years ago

bp_create_excerpt() rounds down. That is:

bp_create_excerpt( 'foo bar baz', 5, array( 'exact' => true ) );
// 'foo ba'

bp_create_excerpt( 'foo bar baz', 5, array( 'exact' => false ) );
// 'foo'

The one exception is when the first word is so long that it'd result in a zero string. See #6254.

Does that square with what you're seeing?

#15 @hnla
10 years ago

Yes your test case works (ok so rounds down, would have to really )

But now add an anchor into the mix as currently that test case doesn't cover all the perms
$totalLength == 18

bp_create_excerpt( 'foo <a>bar</a> baz', 5, array( 'exact' => true, 'html' => false ) );

//  'html' is false so include tag count
// 'exact' is true so return literal string length

//  returns 'foo <'

// Result returns parsed markup as character data.
// Result not as expected?
// We need to strip tags from an literal count of characters if including markup in count
// lest we land on a portion of that markup inadvertently?

bp_create_excerpt( 'foo <a>bar</a> baz', 5, array( 'exact' => true, 'html' => true ) );

//  returns 'foo b'

// Result as expected; 5 characters wanted 'exact' is true, markup is removed from count 
// so we have first character of second word rendered (wrapped in it's anchor though!)

bp_create_excerpt( 'foo <a>bar</a> baz', 6, array( 'exact' => false, 'html' => false ) );

//  returns 'foo'

//  'html' is false so include tag count
// 'exact' is false so don't cut mid-word

// Result as expected; 6 characters counted up to '<a' result rounded down to first word boundary end.

#16 @hnla
10 years ago

The greatest trick the devil ever pulled was convincing the world that reg expressions were cool

#17 @danbp
10 years ago

As always, the devil IS in the detail ! (learned after some hours face to face with regex)

#18 @boonebgorges
10 years ago

The greatest trick the devil ever pulled was convincing the world that reg expressions were cool

Save your scorn for a time when regular expressions are actually part of the problem :)

bp_create_excerpt( 'foo <a>bar</a> baz', 5, array( 'exact' => true, 'html' => false ) );

//  'html' is false so include tag count
// 'exact' is true so return literal string length

//  returns 'foo <'

// Result returns parsed markup as character data.
// Result not as expected?
// We need to strip tags from an literal count of characters if including markup in count
// lest we land on a portion of that markup inadvertently?

No, I think this is incorrect. html=false implies that the input should be treated as a simple string - we should not be doing any parsing of it. If a developer chooses to combine exact=true with html=false, that's their prerogative. I can imagine situations where someone might want to use this function to get an exact-length excerpt (for reasons other than display within a web page).

#19 @boonebgorges
10 years ago

Owner set to boonebgorges
Resolution set to fixed
Status changed from new to closed

In 9962:

Improve the HTML-handling logic of bp_create_excerpt().

The regular expression that detects tags should be less generous, to avoid false matches.
Word boundary detection, as used when exact=false, should never return a space character inside of an HTML tag (such as the space in <a href="http://example.com">).

These changes fix some tag-parsing issues introduced in [9523].

Fixes #6517.

#20 @hnla
10 years ago

If a developer chooses to combine exact=true with html=false that's their prerogative

Had a feeling you might say that, and sort of have to agree, but that's normally my harsh line with people, when I get tired of making something idiot proof :)

However not really sure that 'implies' is correct why on earth do we have this option to either count markup or not the string is the only important count the markup if exists should always be treated as parsed data not to be displayed and therefore not part of the count and if that were true the function then becomes a whole heap simpler?

I can imagine situations where someone might want to use this function to get an exact-length excerpt (for reasons other than display within a web page

surely there are simpler php functions to count a string if you needed to but hey ho, a long as we're happy it works we'll expect devs/users not to pass those two option together to count all and display exact - works for me!

Last edited 10 years ago by hnla (previous) (diff)

#21 @boonebgorges
10 years ago

The purpose of the function is to create excerpts. If someone wants to create an excerpt of the form foo <a […], there's no reason we should prevent it from happening. While it's true that someone could simply use substr() to create an "exact" excerpt, it may be that they're using bp_create_excerpt() as part of a larger system, and want to have consistent interfaces.

#22 @boonebgorges
10 years ago

In 9963:

Improve the HTML-handling logic of bp_create_excerpt().

The regular expression that detects tags should be less generous, to avoid false matches.
Word boundary detection, as used when exact=false, should never return a space character inside of an HTML tag (such as the space in <a href="http://example.com">).

These changes fix some tag-parsing issues introduced in [9523].

Merges [9962] to the 2.3 branch for 2.2.3.

Fixes #6517.

#23 @danbp
10 years ago

Resolution fixed deleted
Status changed from closed to reopened

A new issue started with excerpt after applying the patch.

when you publish a text which goes over excerpt limit in update, topic or comment, no text appears except 3dots and read more. You have to click read more to see all. Annoying, if the excerpt is not initially showed.

If text is under the limit, everything goes well.

This ticket was mentioned in Slack in #buddypress by danbp. View the logs.

10 years ago

#25 @boonebgorges
10 years ago

danbp - Can you please give specific steps to reproduce, including examples? My testing, and the automated tests, suggest that the function is not completely broken in the way that you allege :)

#26 @danbp
10 years ago

@boonebgorges,

took my a while to sort this out, sorry for late answer.

When you insert a long text, you should see an excerpt and [read more] button.

when you insert Lorem Ipsum text, you got a correct excerpt output.

When you insert a accented text like french, the whole excerpt is stripped and you see only ... [read more]

See attached images.

Pic 1 shows 3 activities.
1) a correct excerpt in latin text.
2) a correct text under the excert limit, in french
3) the same text, a bit longer, with the empty excerpt

Pic 2 is the same activity, with the third activity deployed.

You can reproduce that by inserting a long latin text.

Est @dan et velit dolores eos quasi expedita sed. Assumenda asperiores suscipit corrupti ut provident distinctio quia. 

Dolores enim qui repellendus natus. Earum voluptas est explicabo et. Neque at dolorum cumque. Beatae ipsa ratione quibusdam voluptates quibusdam, vel est necessitatibus voluptatem id. Porro deleniti doloremque quaerat 

in aut accusamus. Eligendi possimus similique nesciunt voluptas odio sed. Ab aut fuga voluptas quis illo. Eos ipsum est et eos amet tempore.

and a long french text

Toutes les connaissances que les hommes avaient mis sur Internet lui étaient accessible. Les grandes bibliothèques du monde entier n’avaient plus de secret pour lui. Il pouvait apprendre très vite, beaucoup plus vite que n’importe quel humain. 
Il avait appris toutes les connaissances du monde entier, visiter tout les pays. C’est lui qui avait fait en sorte qu’Internet se déploie ainsi.

Last edited 10 years ago by danbp (previous) (diff)

@danbp
10 years ago

Attachment excerpt 1.jpg added

@danbp
10 years ago

Attachment excerpt 2.jpg added

@boonebgorges
10 years ago

Attachment 6517.2.diff added

#27 @boonebgorges
10 years ago

Keywords has-patch needs-testing added

Thanks very much for the details, danbp. 6517.2.diff is an improvement on the previous logic, with a unit test to show that it works as expected with accented characters. Could I ask you (and others) to give it a try?

#28 @hnla
10 years ago

Was seeing an issue with block of text posted, (excerpt function running with default args) and excerpts returning for just the first word along with $ending.

6517.2 patch has cleared that up excerpts appear to work correctly and accept and retun accented characters.

Not sure if danbp still sees issues though.

#29 follow-up: ↓ 30 @boonebgorges
10 years ago

Was seeing an issue with block of text posted, (excerpt function running with default args) and excerpts returning for just the first word along with $ending.

Yeah. There was an error in [9963] that sorta double-flipped the string. (I've gotta look backward from the end of the string to find the last word boundary, but I was forgetting to change the direction before truncating.) My tests weren't written in a way that accounted properly for this. I think this is pretty much danbp's issue - it's not really about accented characters, but about the fact that the error only emerged when the string contains markup. Anyway, danbp, please have a look if you can.

#30 in reply to: ↑ 29 @danbp
10 years ago

Replying to boonebgorges:

it's not really about accented characters, but about the fact that the error only emerged when the string contains markup. Anyway, danbp, please have a look if you can.

FYI, when you enter my previous french text and add `some text` (with code button) to part of it, the excerpt comes up !

I applied 6517.2 and the issue is gone. Seems it is working correclty for now.

Thank you !

#31 @boonebgorges
10 years ago

Resolution set to fixed
Status changed from reopened to closed

In 9967:

The continuing adventures of bp_create_excerpt().

This changeset fixes a bug introduced in [9963] that caused excerpts to be
truncated too much in some cases. It also fixes some potential issues with
multibyte strings when html=true.

Fixes #6517.

#32 @boonebgorges
10 years ago

In 9968:

The continuing adventures of bp_create_excerpt().

This changeset fixes a bug introduced in [9963] that caused excerpts to be
truncated too much in some cases. It also fixes some potential issues with
multibyte strings when html=true.

Merges [9967] to the 2.3 branch for 2.3.3.

Fixes #6517.

#33 @danbp
10 years ago

Merci for the quick fix, Boone !

#34 @boonebgorges
10 years ago

pas du tout!

Note: See TracTickets for help on using tickets.

Trac UI Preferences

Download in other formats:

BuddyPress.org

Context Navigation

#6517 closed defect (bug) (fixed)

bp_create_excerpt returning mall-formed markup - mb_strlen issues?

Description ¶

Commits (4)

Attachments (5)

Pull Requests ()

Change History (39)

#1 @hnla 10 years ago

#2 @boonebgorges 10 years ago

#3 @hnla 10 years ago

#4 @hnla 10 years ago

#5 @hnla 10 years ago

#6 @hnla 10 years ago

#7 @boonebgorges 10 years ago

@boonebgorges 10 years ago

#8 @hnla 10 years ago

#9 @hnla 10 years ago

#10 @boonebgorges 10 years ago

#11 @danbp 10 years ago

@danbp 10 years ago

#12 @boonebgorges 10 years ago

#13 @hnla 10 years ago

#14 @boonebgorges 10 years ago

#15 @hnla 10 years ago

#16 @hnla 10 years ago

#17 @danbp 10 years ago

#18 @boonebgorges 10 years ago

#19 @boonebgorges 10 years ago

#20 @hnla 10 years ago

#21 @boonebgorges 10 years ago

#22 @boonebgorges 10 years ago

#23 @danbp 10 years ago

This ticket was mentioned in ​Slack in #buddypress by danbp. ​View the logs. 10 years ago

#25 @boonebgorges 10 years ago

#26 @danbp 10 years ago

@danbp 10 years ago

@danbp 10 years ago

@boonebgorges 10 years ago

#27 @boonebgorges 10 years ago

#28 @hnla 10 years ago

#29 follow-up: ↓ 30 @boonebgorges 10 years ago

#30 in reply to: ↑ 29 @danbp 10 years ago

#31 @boonebgorges 10 years ago

#32 @boonebgorges 10 years ago

#33 @danbp 10 years ago

#34 @boonebgorges 10 years ago

Download in other formats:

#1 @hnla
10 years ago

#2 @boonebgorges
10 years ago

#3 @hnla
10 years ago

#4 @hnla
10 years ago

#5 @hnla
10 years ago

#6 @hnla
10 years ago

#7 @boonebgorges
10 years ago

@boonebgorges
10 years ago

#8 @hnla
10 years ago

#9 @hnla
10 years ago

#10 @boonebgorges
10 years ago

#11 @danbp
10 years ago

@danbp
10 years ago

#12 @boonebgorges
10 years ago

#13 @hnla
10 years ago

#14 @boonebgorges
10 years ago

#15 @hnla
10 years ago

#16 @hnla
10 years ago

#17 @danbp
10 years ago

#18 @boonebgorges
10 years ago

#19 @boonebgorges
10 years ago

#20 @hnla
10 years ago

#21 @boonebgorges
10 years ago

#22 @boonebgorges
10 years ago

#23 @danbp
10 years ago

This ticket was mentioned in Slack in #buddypress by danbp. View the logs.

10 years ago

#25 @boonebgorges
10 years ago

#26 @danbp
10 years ago

@danbp
10 years ago

@danbp
10 years ago

@boonebgorges
10 years ago

#27 @boonebgorges
10 years ago

#28 @hnla
10 years ago

#29 follow-up: ↓ 30 @boonebgorges
10 years ago

#30 in reply to: ↑ 29 @danbp
10 years ago

#31 @boonebgorges
10 years ago

#32 @boonebgorges
10 years ago

#33 @danbp
10 years ago

#34 @boonebgorges
10 years ago