Skip to:
Content

BuddyPress.org

Opened 11 years ago

Closed 11 years ago

#5210 closed enhancement (duplicate)

Using DISTINCT in the select statement when retrieving activity data slows down the query

Reported by: kwerri's profile kwerri Owned by:
Milestone: Priority: high
Severity: major Version: 1.8.1
Component: Activity Keywords: reporter-feedback
Cc:

Description

In this line in bp-activity/bp-activity-classes.php:

$select_sql = "SELECT DISTINCT a.*, u.user_email, u.user_nicename, u.user_login, u.display_name";

The inclusion of the distinct causes the creation of a temporary table. The creation and removal of this temporary table can dramatically impact how long it takes for the query to run, especially for wp_bp_activity tables with a lot of data.

Is this needed, and would you guys consider removing it?

Change History (7)

#1 @boonebgorges
11 years ago

  • Keywords reporter-feedback added

The DISTINCT keyword was added in r7318. See #5118 for background.

I'm surprised to hear that merely adding DISTINCT is forcing the creation of temp tables. The activity tables are heavily indexed. Can you share some more details about the queries?

  • What do the full queries look like? Are there lots of WHERE clauses?
  • Can you run an EXPLAIN on the query and share it?
  • How big is your activity table? What's your server environment like? (mysql version, RAM)

I wonder if we could avoid the temp table problem if we did SELECT DISTINCT a.id, a.*, limiting the scope of the DISTINCT. I'm seeing conflicting data around the web. See, eg, https://dev.mysql.com/doc/refman/5.0/en/distinct-optimization.html, which says that the problem may stem from coupling with ORDER BY; yet http://stackoverflow.com/questions/11746062/performance-and-sorting-and-distinct-unique-between-mysql-and-php insists that the DISTINCT logic is far more efficient in MySQL than processing in PHP.

Any more info or thoughts you could share would be helpful. Thanks for the report.

#2 @kwerri
11 years ago

Hi,

Thanks for the response.

I'll need to set up a separate server with the DISTINCT keyword added back to the code, and then I'll post some of the queries that are being run, as well as the result of profiling such queries, and the result of running an EXPLAIN on it.

The activity table is pretty big - about close to 200k records.

#3 @kwerri
11 years ago

OK, here is more info:

I ran this query:

EXPLAIN SELECT DISTINCT a.*, u.user_email, u.user_nicename, u.user_login, u.display_name  FROM wp_bp_activity a LEFT JOIN wp_users u ON a.user_id = u.ID  WHERE a.is_spam = 0 AND a.hide_sitewide = 0 AND a.type != 'activity_comment' ORDER BY a.date_recorded DESC LIMIT 0, 20;

I got this:

+----+-------------+-------+--------+--------------------+---------------+---------+--------------------------+-------+----------------------------------------------+
| id | select_type | table | type   | possible_keys      | key           | key_len | ref                      | rows  | Extra                                        |
+----+-------------+-------+--------+--------------------+---------------+---------+--------------------------+-------+----------------------------------------------+
|  1 | SIMPLE      | a     | ref    | type,hide_sitewide,| hide_sitewide | 2       | const                    | 97582 | Using where; Using temporary; Using filesort |
|    |             |       |        | is_spam            |               |         |                          |       |                                              |
|  1 | SIMPLE      | u     | eq_ref | PRIMARY            | PRIMARY       | 8       | mydb.a.user_id           |     1 |                                              |
+----+-------------+-------+--------+--------------------+---------------+---------+--------------------------+-------+----------------------------------------------+

I also did a profile of the query - this is what I got:

+----------------------+-----------+
| Status               | Duration  |
+----------------------+-----------+
| starting             |  0.000099 |
| checking permissions |  0.000007 |
| checking permissions |  0.000006 |
| Opening tables       |  0.000023 |
| System lock          |  0.000011 |
| init                 |  0.000049 |
| optimizing           |  0.000018 |
| statistics           |  0.000153 |
| preparing            |  0.000027 |
| Creating tmp table   |  0.000386 |
| executing            |  0.000006 |
| Copying to tmp table | 17.143788 |
| Sorting result       |  0.516267 |
| Sending data         |  0.000156 |
| end                  |  0.000005 |
| removing tmp table   |  0.374855 |
| end                  |  0.000015 |
| query end            |  0.000007 |
| closing tables       |  0.000014 |
| freeing items        |  0.000033 |
| logging slow query   |  0.000003 |
| logging slow query   |  1.447473 |
| cleaning up          |  0.000023 |
+----------------------+-----------+

Note the 17-second duration to takes to copy to the tmp table.

I have 2GB of RAM on my server, and I'm using mysql 5.5.32.

Lastly, there's this:

http://dev.mysql.com/doc/refman/5.1/en/internal-temporary-tables.html

The relevant section is here:

"Temporary tables can be created under conditions such as these:

If there is an ORDER BY clause and a different GROUP BY clause, or if the ORDER BY or GROUP BY contains columns from tables other than the first table in the join queue, a temporary table is created.

DISTINCT combined with ORDER BY may require a temporary table."

#4 @boonebgorges
11 years ago

Thanks very much for this info. I'm going to do some experimentation of my own, but in the meantime, perhaps you could run a similar EXPLAIN for the following query:

EXPLAIN SELECT DISTINCT id FROM wp_bp_activity WHERE is_spam = 0 AND hide_sitewide = 0 AND type != 'activity_comment' ORDER BY date_recorded DESC LIMIT 0, 20;

Basically, I want to see whether querying only for activity ids (rather than fetching all kinds of data at once) will have any appreciable effect.

#5 @kwerri
11 years ago

Here's the result, boonebgorges:

+----+-------------+----------------+-------+----------------------------+---------------+---------+------+------+-------------+
| id | select_type | table          | type  | possible_keys              | key           | key_len | ref  | rows | Extra       |
+----+-------------+----------------+-------+----------------------------+---------------+---------+------+------+-------------+
|  1 | SIMPLE      | wp_bp_activity | index | type,hide_sitewide,is_spam | date_recorded | 8       | NULL |   40 | Using where |
+----+-------------+----------------+-------+----------------------------+---------------+---------+------+------+-------------+

#6 @boonebgorges
11 years ago

  • Milestone changed from Awaiting Review to 2.0
  • Priority changed from normal to high
  • Severity changed from normal to major

Thanks very much for the feedback, kwern. As I suspected, the real issue here is not so much the DISTINCT + ORDER BY (there's not much we can do to avoid this), but the fact that our query is not optimized in other ways. As we did in the case of users and BP_User_Query (and as WP did with WP_Query), we should split our query into two parts: one that fetches activity IDs, and another that separately fetches data about the located items. This will allow our table indexes to function properly.

I've been wanting to do something like this for a while, and your ticket is a good impetus to work on it for 2.0.

#7 @boonebgorges
11 years ago

  • Milestone 2.0 deleted
  • Resolution set to duplicate
  • Status changed from new to closed

kwerri - Thanks again for your help with this. I'm closing this ticket in favor of #5349, which has a patch that should solve both of our problems. It continues to use DISTINCT, but other aspects of the query have been modified to reduce overhead by several orders of magnitude. It'd be great to get your testing/feedback on that ticket.

Note: See TracTickets for help on using tickets.