Skip to:
Content

Opened 2 years ago

Last modified 4 months ago

#6789 new enhancement

XProfile: do not store serialized arrays for multi-value profile field data

Reported by: Offereins Owned by:
Milestone: Awaiting Contributions Priority: normal
Severity: normal Version:
Component: Extended Profile Keywords:
Cc: slaFFik, dcavins, espellcaste@…, lmoffereins@…

Description

At the moment, data from multi-value profile fields (selectboxes/multiselect) are stored as serialized arrays in $bp->profile->table_name_data. This prohibits querying users by this data using BP_Xprofile_Query, which is obviously something BP needs to continue to grow. So let us do something about it (with love)!

At the moment, backpat is the greatest concern, but that should not stop us. Please join the discussion on how to make this happen.

Attachments (9)

6789.01.diff (2.2 KB) - added by Offereins 12 months ago.
profile type supports_multiple_values class property
6789.02.diff (3.5 KB) - added by Offereins 12 months ago.
.01 + bp_xprofile_field_supports_multiple_values
6789.docs.01.patch (1009 bytes) - added by Offereins 12 months ago.
Docfix for xprofile_sanitize_data_value_before_save
6789.docs.02.patch (2.2 KB) - added by Offereins 12 months ago.
Filter docs for BP_XProfile_ProfileData::save
6789.03.patch (18.4 KB) - added by Offereins 12 months ago.
.01 + .02 + first pass at multiple row data
6789.04.patch (22.5 KB) - added by Offereins 12 months ago.
.03 + improvements
6789.05.patch (22.6 KB) - added by Offereins 12 months ago.
.04 + improvements
6789.06.patch (22.5 KB) - added by boonebgorges 4 months ago.
6789.07.patch (21.7 KB) - added by boonebgorges 4 months ago.

Download all attachments as: .zip

Change History (30)

#1 @slaFFik
2 years ago

  • Cc slaFFik added

#2 @dcavins
2 years ago

  • Cc dcavins added

#3 @espellcaste
2 years ago

  • Cc espellcaste@… added

#4 @boonebgorges
2 years ago

  • Milestone changed from Awaiting Review to Future Release

First, <3. It would be great to do away with this kind of serialized data.

Second, some initial thoughts.

  • The most obvious way to store the data is to split individual entries into their own rows in wp_bp_xprofile_data.
  • But this will mean that user_id+field_id will no longer be a de facto unique identifier for xprofile_data. That is, a given user may have more than one row in xprofile_data corresponding to a given field.
  • Various parts of our query API expect to pull only a single row from the table. Eg, BP_XProfile_ProfileData::populate(), which does $wpdb->get_row(). In this case we'll probably need to do get_results() and then figure out how far up the stack we need to add support for arrays. Another example is BP_XProfile_ProfileData::get_data_for_user(), which keys data by field_id. Here, we'll either need to do the array conversion right in the query method (so the data structure can remain the same in, eg, BP_XProfile_Group::get(), or we'll have to change the way data is keyed and then make necessary modifications to consuming methods.
  • In all of these cases, we're going to have to weigh backward compatibility against design. Returning arrays from functions that didn't previous return arrays is almost certainly going to break something somewhere.
  • We should do a scan for plugins that interact directly with the xprofile_data table, especially those that expect serialized data. If we find that lots of folks are doing this, we may decide to provide legacy support for it. We did something similar with user last_activity data (you can continue to access/modify the data in usermeta, though BP throws a bunch of _doing_it_wrong() notices). Ideally, we would simply drop support for this kind of direct access; but this will take some good documentation and publicity.
  • The migration/update routine is going to be pretty large, and will probably need to be broken into chunks. We can do some background migration (eg, fire off a cron job during update that migrates data in batches), but since this is asynchronous, we'd need to maintain backward compatibility for the old data format, at least for a version or two.

Would love to hear more thoughts on any of the above, or anything I am overlooking :)

#5 @Offereins
2 years ago

  • Cc lmoffereins@… added

I have nothing to add at the moment. I'm not that much invested in the current way of how it works, so perhaps the other guys can weigh in. Maybe we should start with a proper set of tests having in place before we rock it - though I'm not really into that game.

This ticket was mentioned in Slack in #buddypress by offereins. View the logs.


2 years ago

#7 @thomaslhotta
23 months ago

It's possible to do this using "LIKE". This probably does not offer the best performance, but it also doesn't require any of the changes listed above.

Right now this only works with "OR" relations if one wants to search for multiple values. When using "AND" every value causes a new JOIN, what really kills performance. The reason for this is found in BP_XProfile_Query on line 519:

 } elseif ( isset( $sibling['field_id'] ) && isset( $clause['field_id'] ) && $sibling['field_id'] === $clause['field_id'] ) {
     $compatible_compares = array( '!=', 'NOT IN', 'NOT LIKE' );
 }

Adding "LIKE" her allows "AND" queries to work fine.

This ticket was mentioned in Slack in #buddypress by offereins. View the logs.


20 months ago

This ticket was mentioned in Slack in #buddypress by offereins. View the logs.


19 months ago

@Offereins
12 months ago

profile type supports_multiple_values class property

#10 @Offereins
12 months ago

So, to get a start somewhere, I decided to first identify whether fields (and their types) would accept multiple values. This is to find a way to still support fields that want to save their values as serialized arrays versus fields that would have the data as separate value rows.

By default this is the case for the Multiselect and Checkbox field types. While they already are uniquely identified by their supports_multiple_defaults class property, that prop actually signals a different function than saving multiple values. That is why I introduce the supports_multiple_values property/flag - also just for these types.

This ticket was mentioned in Slack in #buddypress by dcavins. View the logs.


12 months ago

#12 @Offereins
12 months ago

Just stumbled upon the (hypothetical) issue of xprofile meta, which is equally possible for data(!) as for fields and groups. The unique id of a profile data object currently equals the data row's id. This apparently gets messed up when multiple rows of data exist for a single data object. In case of meta, this triggers the question of which id to use to store meta for the data object.

For now I'm suggesting to use the last row's item id. For storing profile data meta, this is probably not so much a problem. Querying profile data by meta is a different story perhaps, as you get a single data row where multiple may apply for the field - though it's probably advised to use the default API functions anyway. You know, filtering.

@Offereins
12 months ago

.01 + bp_xprofile_field_supports_multiple_values

#13 @Offereins
12 months ago

Latest patch adds a function to check the field's support for multiple values. It follows the same logic as bp_xprofile_is_richtext_enabled_for_field().

@Offereins
12 months ago

Docfix for xprofile_sanitize_data_value_before_save

@Offereins
12 months ago

Filter docs for BP_XProfile_ProfileData::save

#14 @Offereins
12 months ago

In the process of working on this, I stumbled on a few errors/shortcomings in the XProfile ProfileData inline documentation. Two patches are added to cover this.

@Offereins
12 months ago

.01 + .02 + first pass at multiple row data

#15 @Offereins
12 months ago

Some notes on the WIP changes presented in the latest patch.

In BP_XProfile_ProfileData

  • populate() - multiple values are returned as an array per field.
  • save() - multiple values are stored as multiple rows. The previous rows are deleted through calling $this->delete(). The new id of the data object is the one from the first created data row.
  • get_data_for_user() - multiple values are returned as an array per field.
  • get_fielddataid_byid() - no changes, but the main data id is returned in case of multiple-value fields.
  • get_value_by_id() - multiple values are returned as an array per field.
  • get_value_byfieldname() - multiple values are returned as an array per field.
  • get_random() - a subquery is used to find the random field id, then all related data is fetched.

In xprofile functions

  • xprofile_get_field_data() - multiple values are intercepted through bp_xprofile_field_supports_multiple_values(), but further logic stands as before.
  • xprofile_set_field_data() - multiple values are sent in as an array of serialized values (in case of multiple array values).
  • bp_unserialize_profile_field() - now also parses arrays into comma-separated lists.

Further checking on template functions parsing field data is TBD.

Considerations

Since multiple values previously were stored and returned as serialized arrays, should we now return non-serialized row data also as a (re)serialized array - to keep backpat in order? Then data would be stored in the new fashion, but returned as is done previously. Any ideas about that?

@Offereins
12 months ago

.03 + improvements

@Offereins
12 months ago

.04 + improvements

#16 @Offereins
12 months ago

Reiterating on this, the latest uploaded patch (.05) contains the following fixes to .03:

In BP_XProfile_ProfileData

Empty lists for multiple-values are stored as a serialized empty array. This is to signal that the field is deliberately empty, ignoring the need to suggest defaults in the editing interface. The current patch suggests to continue to do so. The logic is implemented for the following methods comparing array() === maybe_unserialize( $value ):

  • populate()
  • save()
  • get_data_for_user()
  • get_value_by_id()
  • get_value_byfieldname()
  • get_random()

In xprofile functions

  • xprofile_sanitize_data_value_before_save() - don't bail the filtering when the field value is actually valid: 0 in case of a number field. See similar checks in xprofile_set_field_data().

This ticket was mentioned in Slack in #buddypress by offereins. View the logs.


4 months ago

#18 @boonebgorges
4 months ago

@Offereins Thanks for your work on this. It is exciting to see it taking shape. Overall, the decisions you've made seem smart to me. The handling of empty/falsey fields tested well for me, and the supports_multiple_values flag feels correct.

I'm attaching a new patch 6789.06.patch that includes updates to documentation, as well as some changes to work after recent changes to the treatment of serialized data throughout BP. See [11692].

I have some random feedback. In no particular order.

  • The serialization dances are complicated, perhaps too complicated. For example: Data coming from multiple-value fields is being passed through array_map( 'maybe_unserialize' ) in multiple places. In what situation is it possible for a multiselect or checkbox field to store serializable data for its individual options? Certainly it's not possible in BP itself, and I wonder if it's even possible to do it now with a custom field type. Likewise: if a field does *not* support multiple values, what is the value in running it through maybe_unserialize()? Have you seen examples of custom field types that are storing structured data in the profile_data table in this way? (Related: we will need to perform a review to ensure that maybe_unserialize() calls are always symmetrical; for security reasons, we should never run any user-provided data through maybe_unserialize() that didn't go through maybe_serialize() on the way in.)
  • You've added some missing filter documentation in BP_XProfile_Data::save(). It's unrelated to the ticket at hand, so I'd go ahead and commit it separately. No need to wait for this ticket.
  • When saving multiple values in BP_XProfile_Data::save(), you call the delete() method. This will trigger the xprofile_data_before_delete hook, which feels incorrect to me. As much as I like the idea of not duplicating code, I think we should make a direct DB query here (or break the existing DELETE query into its own hookless method, which would be called from both places).
  • You suggested earlier to "use the last row's item id" for storing xprofile meta, but you're returning the *first* row's data ID from the save() method. There's no technical issue here, especially since your patch doesn't actually use xprofile data meta at all (nor does BP itself? though it would perhaps be nice to move 'visibility' here; see #6350). It might be wise to standardize here. The docs for the save() method should probably indicate this fact about return values as well.

Since multiple values previously were stored and returned as serialized arrays, should we now return non-serialized row data also as a (re)serialized array - to keep backpat in order? Then data would be stored in the new fashion, but returned as is done previously. Any ideas about that?

Ugh - I assume here you mean the data property of BP_XProfile_ProfileData objects, because xprofile_get_field_data() and other places do the necessary unserialization. I guess the answer is yes, we should reserialize so as to maintain backward compatibility with this specific field. But it suggests to me that perhaps we could set the data property with the serialized value, but then stop using the property internally. Instead, introduce something like a get_data() method, which would return structured data, and then use get_data() throughout BP. What do you think of an idea like that?

#19 @Offereins
4 months ago

@boonebgorges Thanks for taking a look at this. I'm hoping we can work this into a solid solution for the issue of serialized data.

In reply to your random feedback, the following.

  1. I find it plausible that a custom profile field would store its data as a serialized array. One of such examples would be an address field that comes with a preformatted layout and stores data for street, city, postal code, etcetera. I'm unsure how such a field would store its data otherwise within the current table structure of BuddyPress - or is this a potential case for field data metdata? A second example would be a column type field which allows admins to define their own subfields. I'm basically talking about nested fields. Following from this, one can imagine a multi-type field which supports listing multiple nested fields of the aforementioned types. For this case I'm mainly thinking about a variation on my BP XProfile Relationship Field plugin, which would enable you to make a listed version of a single-type field. Whether it is plausible enough to support this in core is up for debate. I might be biased.
  2. Handled in r11866.
  3. Agreed. I'm leaning towards making a hookless method for the delete logic.
  4. That should be fixed. Do you prefer using the first value's key or the second? I haven't applied field data metadata before, but we should consider it since it is technically supported.

Concerning your last point, I'm not quite sure I understand your meaning of an optional get_data() method. Would it serve as a default data getter where the data would always be unserialized/sanitized if applicable - so we can remove the unserializing from any logic outside of the profiledata class?

Last edited 4 months ago by Offereins (previous) (diff)

#20 @boonebgorges
4 months ago

I've attached a new patch, with the following changes:

  1. Updated after [11866].
  2. Adds a unit test for the behavior of caching multi-value fields in get_data_for_user() - the test helped me understand what you did, so let's throw it in there
  3. Adds a fix for a bug that came up during this test: I've removed the $this->exists() check from the if block in the save() method (see line 260 of 6789.06.patch). When setting the data via xprofile_set_field_data(), the multi-value logic in that block was not kicking in because the generated user in the test didn't yet have the data set (ie, $this->exists() failed). It seems to me that the only important part of the exists check is to delete the existing value, so I moved it down a couple of lines. This could use a review from you, though, as you may understand all these if/else statements better than I do.

Some responses to your comments

I find it plausible that a custom profile field would store its data as a serialized array.

Gotcha. I think this seems fine, especially if you yourself have done something close to this. I just wanted to hear someone else confirm that we weren't overengineering :)

Do you prefer using the first value's key or the second?

Standardizing on the first seems to make sense to me, though it's mostly arbitrary. One very small argument: if you ever did a $wpdb->get_row( "SELECT * FROM ..." ) query to fetch this data, you'd get the row with the lowest id.

Concerning your last point, I'm not quite sure I understand your meaning of an optional get_data() method.

The way I see it, it was a mistake that serialized data was ever available. I understand why serialization was/is important for *storage*, but that should've been totally invisible to external functions and/or other classes. However, as a matter of historical fact, the $value property (I said $data above but I meant $value, sorry!!) might contain serialized data. So I think we need to keep this behavior for backward compatibility. At the same time, this would be a good time for us to compartmentalize serialization in our standard treatment. So, something like this:

// In `BP_XProfile_Data::populate()`:
...
$raw_value = is_array( $profiledata->value ) ? array_map( 'stripslashes', $profiledata->value ) : stripslashes( $profiledata->value );

$this->raw_value = $raw_value;
$this->value = maybe_serialize( $raw_value );
...

// A new method in `BP_XProfile_Data`:
public function get_value() {
    return $this->raw_value;
}

Then, anywhere in BP, we would no longer reference the $value property directly, only get_value(). (Probably something parallel for set_value()?) But, now that I'm looking through the codebase, I don't actually see anywhere where we ever internally reference the $value property of a BP_XProfile_ProfileData object; pretty much anywhere where we fetch user profile data, it's done through a direct SQL query that is never actually turned into a BP_XProfile_ProfileData object.

So, maybe there's no point in doing what I've suggested here, at least not until we come to a point where we rethink some of the internals. If you agree, you can ignore my comment :-D

Next steps

To move forward, I think the next step is to think about data migration. We don't currently have a framework for migration; see #6841. Setting aside the details of that framework for a moment, we should think about whether we need some sort of internal backward compatibility layer for unmigrated data as part of an initial release. Say this change is part of v3.1. Sites running v3.0 will have serialized data for multi-value fields. 3.1 will include a migration routine that'll move the data to the new non-serialized format. But what if I upgrade to 3.1 but have not yet run the 3.1 migration routine (either because it's asynchronous, or because it requires manual intervention, or both)? Queries like the one in get_data_for_user() will fail for multi-value fields - they'll return a value like array( '{a:some serialized array}' ).

What can we do to mitigate this? It seems to me that we can't have fallback support for non-migrated, serialized data, because - as you note in your comment above - multi-value fields might have serialized data for *one* of the multi-values (so is_serialized() will not be a reliable check for non-migrated data). It might be that there's nothing we can do about it, and we'll just need to inform admins that such fields may not work correctly until the migration has been run - but it's something we should think about a bit.

(Sorry for the long comment - trying to be thorough!!)

#21 @DJPaul
4 months ago

I just wanted to pipe in that this change is very exciting, and thanks for both working on it! The patch looks high-quality so far.

Note: See TracTickets for help on using tickets.