Profiling your slow queries using pt-query-digest and some love from Percona Server

Overview

Profiling, analyzing and then fixing slow queries is probably the most often-repeated part of a DBA's job, yet there are not many tools out there that make it easy by analyzing queries with the kind of data points that let you attack the right queries in the right way. One such tool that I have always found myself using is pt-query-digest (formerly known as mk-query-digest).

Now let's walk through using this very nice tool.

Before We Start!

But before we start, make sure you have enabled slow query logging and set a low enough long_query_time. The right value of long_query_time depends on your application's requirements; a long_query_time of 1 or 2 seconds might be sufficient for most users. It is also fairly typical to set long_query_time to 0 for a short period of time in order to log all queries.
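
As a quick reference, slow query logging can be enabled at runtime on MySQL 5.1 and later, where these variables are dynamic (the log file path below is just an example; adjust it for your setup, and put the same settings in my.cnf if you want them to survive a restart):

SET GLOBAL slow_query_log = 1;
SET GLOBAL slow_query_log_file = '/var/log/mysql/slow-query.log';
SET GLOBAL long_query_time = 1;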

Note that logging all queries in this fashion, as opposed to using the general query log, gives us statistics collected after each query has actually been executed; no such statistics are available for queries logged via the general query log.

There might also be cases where you want to log queries taking less than 1 second, since long_query_time has microsecond resolution; for that you can specify a fractional value, for example long_query_time=0.5 to log queries taking more than half a second.

Note that logging queries that take a fraction of a second is not possible on versions of MySQL < 5.1 unless you use the microslow patch developed by Percona. You can follow the guide here if you are still running MySQL < 5.1 and would like to install this patch. You should also note that on versions of MySQL < 5.1, setting long_query_time=0 would actually disable slow query logging.

Installing the pt-query-digest tool (as well as the other tools in Percona Toolkit) is very easy, and is explained at this link.
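
For example, on a system that already has the Percona software repository configured, installation typically comes down to a single package named percona-toolkit (package names and repository setup can vary by distribution, so treat this as a sketch):

yum install percona-toolkit        # RHEL/CentOS
apt-get install percona-toolkit    # Debian/Ubuntu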

Using pt-query-digest

Using pt-query-digest is pretty straightforward:

pt-query-digest /path/to/slow-query.log

Note that running pt-query-digest can be quite CPU- and memory-intensive, so ideally you should copy the slow query log over to another machine and run the analysis there.
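
If the log covers a long period, you can also narrow the analysis down; for instance, pt-query-digest accepts a --since option so that only recent events are considered (the date below is just a placeholder):

pt-query-digest --since '2011-12-26 00:00:00' /path/to/slow-query.log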

Analyzing pt-query-digest Output

Now let’s see what output it returns. The first part of the output is an overall summary:

# 250ms user time, 20ms system time, 17.38M rss, 53.62M vsz
# Current date: Wed Dec 28 08:16:13 2011
# Hostname: somehost.net
# Files: ./slow-query.log
# Overall: 296 total, 12 unique, 0.00 QPS, 0.00x concurrency _____________
# Time range: 2011-11-26 17:44:58 to 2011-12-27 13:01:44
# Attribute          total     min     max     avg     95%  stddev  median
# ============     ======= ======= ======= ======= ======= ======= =======
# Exec time           736s      1s     23s      2s      6s      2s      1s
# Lock time            5ms       0   290us    17us   103us    43us       0
# Rows sent          8.31M       0   8.31M  28.75k  202.40 477.30k       0
# Rows examine      53.69k       0   3.51k  185.75  964.41  497.06       0
# Rows affecte          38       0       1    0.13    0.99    0.33       0
# Rows read          8.31M       0   8.31M  28.75k  202.40 477.30k       0
# Bytes sent       294.94M       0 275.07M 1020.33k  79.83k  15.63M  79.83k
# Tmp tables            16       0       4    0.05       0    0.45       0
# Tmp disk tbl           0       0       0       0       0       0       0
# Tmp tbl size       1.94M       0 496.08k   6.70k       0  56.06k       0
# Query size        36.89k      44     886  127.63  420.77  144.69   69.19

It tells you that there are 296 slow queries in total, which are invocations of 12 distinct queries. Following that are summaries of various other data points, such as the total and average query execution times. One suggestion here: give more weight to the values reported in the 95% (95th percentile) column, as they give a more accurate picture than averages, which can be skewed by a few outliers. For example, from this summary you can easily see whether you need to bump up the tmp_table_size variable: if you see a big "Tmp disk tbl" number, you can size tmp_table_size by looking at the 95% column of the "Tmp tbl size" row. Pretty nifty!
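
As an illustration (the 64M figure is arbitrary and should be sized from your own 95% numbers), if "Tmp disk tbl" were non-zero you might raise the in-memory temporary table limit; MySQL uses the smaller of tmp_table_size and max_heap_table_size, so the two are usually raised together:

SET GLOBAL tmp_table_size      = 64 * 1024 * 1024;
SET GLOBAL max_heap_table_size = 64 * 1024 * 1024;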

Let's analyze the next part of the output produced by pt-query-digest.

# Profile
# Rank Query ID           Response time  Calls R/Call  Apdx V/M   Item
# ==== ================== ============== ===== ======= ==== ===== ========
#    1 0x92F3B1B361FB0E5B 644.9895 87.6%   244  2.6434 0.44  1.26 SELECT wp_options
#    2 0x555191621979A464  33.6349  4.6%    30  1.1212 0.65  0.03 REPLACE SELECT test.checksum test.sbtest_myisam
#    3 0x8354260420CBD34B  22.6124  3.1%     1 22.6124 0.00  0.00 SELECT customer address category

The above part of the output ranks the queries by their total response time and shows the slowest ones at the top. As we can see, the slowest item here is SELECT wp_options; that is simply a compact way of identifying the query fingerprint and tells us it is a SELECT executed against the wp_options table.
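
A side note: by default the profile only includes the queries responsible for the bulk of the total response time, which is why each per-query section below mentions that the item "matches --limit". If you want every fingerprint listed, you can relax that option, for example (a sketch):

pt-query-digest --limit=100% /path/to/slow-query.log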

Now let’s take a look at the most important part of the output:

# Query 1: 0.00 QPS, 0.00x concurrency, ID 0x92F3B1B361FB0E5B at byte 119442
# This item is included in the report because it matches --limit.
# Scores: Apdex = 0.44 [1.0], V/M = 1.26
# Query_time sparkline: |      ^ |
# Time range: 2011-12-08 17:48:20 to 2011-12-27 13:01:44
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count         82     244
# Exec time     87    645s      1s     10s      3s      6s      2s      1s
# Lock time     37     2ms       0   201us     8us    60us    30us       0
# Rows sent      0   3.40k       0     211   14.28  202.40   51.53       0
# Rows examine   7   4.01k       0     252   16.84  234.30   60.55       0
# Rows affecte   0       0       0       0       0       0       0       0
# Rows read      0   3.40k       0     211   14.28  202.40   51.53       0
# Bytes sent     6  19.03M  41.35k  83.95k  79.88k  79.83k   6.92k  79.83k
# Tmp tables     0       0       0       0       0       0       0       0
# Tmp disk tbl   0       0       0       0       0       0       0       0
# Tmp tbl size   0       0       0       0       0       0       0       0
# Query size    45  16.92k      71      71      71      71       0      71
# String:
# Databases    wp_blog_one (154/63%), wp_blog_tw... (81/33%)... 1 more
# Hosts
# InnoDB trxID 7F1910 (1/0%), 7F3860 (1/0%), 7F9F74 (1/0%)... 14 more
# Last errno   0
# Users        wp_blog_one (154/63%), wp_blog_two (81/33%)... 1 more
# Query_time distribution
#   1us
#  10us
# 100us
#   1ms
#  10ms
# 100ms
#    1s  ################################################################
#  10s+
# Tables
#    SHOW TABLE STATUS FROM `wp_blog_two` LIKE 'wp_options'\G
#    SHOW CREATE TABLE `wp_blog_two`.`wp_options`\G
# EXPLAIN /*!50100 PARTITIONS*/
SELECT option_name, option_value FROM wp_options WHERE autoload = 'yes'\G

This is the part of the output dealing with the analysis of the slowest query, ranked #1. The first row of the table shows, under Count, the number of times this query was executed. Looking at the 95% column, we can clearly see that this query takes a lot of execution time (6s), sends a lot of rows (202) and sends a lot of data (79.83k). The "Databases" section shows the names of the databases against which this query was executed. Next, the "Query_time distribution" section shows how the execution time of this query is distributed; as you can see, it falls in the 1s to 10s bucket every single time. Finally, the "Tables" section lists the statements you can use to gather more data about the underlying tables, together with an EXPLAIN to inspect the execution plan chosen by MySQL.
The end result might be that you end up limiting the number of rows returned by the query, either with a LIMIT clause or by filtering on the option_name column.
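
As a rough illustration of both approaches (the row cap and the option names below are made up; what you filter on depends entirely on your application):

-- cap the number of rows returned
SELECT option_name, option_value FROM wp_options WHERE autoload = 'yes' LIMIT 500;

-- or fetch only the options the page actually needs
SELECT option_name, option_value FROM wp_options WHERE option_name IN ('siteurl', 'blogname');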

Let's analyze another slow query, this time the one ranked #3 by pt-query-digest.

# Query 3: 0 QPS, 0x concurrency, ID 0x8354260420CBD34B at byte 132619 ___
# This item is included in the report because it matches --limit.
# Scores: Apdex = 0.00 [1.0]*, V/M = 0.00
# Query_time sparkline: |       ^|
# Time range: all events occurred at 2011-12-23 17:07:16
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count          0       1
# Exec time      3     23s     23s     23s     23s     23s       0     23s
# Lock time      1   101us   101us   101us   101us   101us       0   101us
# Rows sent     99   8.31M   8.31M   8.31M   8.31M   8.31M       0   8.31M
# Rows examine   0      24      24      24      24      24       0      24
# Rows affecte   0       0       0       0       0       0       0       0
# Rows read     99   8.31M   8.31M   8.31M   8.31M   8.31M       0   8.31M
# Bytes sent    93 275.07M 275.07M 275.07M 275.07M 275.07M       0 275.07M
# Tmp tables     0       0       0       0       0       0       0       0
# Tmp disk tbl   0       0       0       0       0       0       0       0
# Tmp tbl size   0       0       0       0       0       0       0       0
# Query size     0      92      92      92      92      92       0      92
# String:
# Databases    another_db
# Hosts
# InnoDB trxID 83E808
# Last errno   0
# Users        another_db
# Query_time distribution
#   1us
#  10us
# 100us
#   1ms
#  10ms
# 100ms
#    1s
#  10s+  ################################################################
# Tables
#    SHOW TABLE STATUS FROM `another_db` LIKE 'customer'\G
#    SHOW CREATE TABLE `another_db`.`customer`\G
#    SHOW TABLE STATUS FROM `another_db` LIKE 'address'\G
#    SHOW CREATE TABLE `another_db`.`address`\G
#    SHOW TABLE STATUS FROM `another_db` LIKE 'category'\G
#    SHOW CREATE TABLE `another_db`.`category`\G
# EXPLAIN /*!50100 PARTITIONS*/
select customer.store_id, address.address, category.name
  from customer, address, category\G

Let's again take a look at the 95% column in the above output. You can clearly see that the problem with this query is that it is reading and sending far too many rows (8.31M) and sending far too much data (275M). Jumping down to the "Tables" section and the query beneath it, you can see why: three tables are joined without any join condition, which means a Cartesian product of all the rows in all three tables. See how easily you can pinpoint the cause of the slowness.
This query was actually executed on a demo of a MySQL web client running on one of the MySQL servers I manage, which gave me a good reason to turn on the safe-updates option.
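
The fix is sketched below only as an illustration, since the proper join columns depend on the schema (the address_id/category_id relationships shown are hypothetical); the point is to join on explicit keys and to rein in ad-hoc clients:

SELECT customer.store_id, address.address, category.name
  FROM customer
  JOIN address  ON address.address_id   = customer.address_id    -- hypothetical key
  JOIN category ON category.category_id = customer.category_id;  -- hypothetical key

SET SESSION sql_safe_updates = 1;  -- the mysql client's --safe-updates option sets this and also caps sql_select_limit and max_join_size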

Gather Even More Stats with some extra love from Percona Server!

My colleagues at Percona have built a lot more verbosity into what gets written to the slow query log, which gives you even more insight into what is actually going on behind the scenes during the execution of a particular slow query. This comes through the addition of a new variable, available in Percona Server 5.1 and above, called log_slow_verbosity. You can read more about this variable and some of the other diagnostics added to Percona Server at this link. The variable log_slow_verbosity can take the values 'microtime', 'query_plan', 'innodb' and 'full'. Let's turn on log_slow_verbosity as follows:

set session log_slow_verbosity='microtime,query_plan,innodb';
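
If you want this level of detail for every connection rather than just the current session, the same setting can also live in the server configuration file (a sketch, using the 'full' value for illustration):

[mysqld]
log_slow_verbosity = full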

Now let's see how verbose the new entry in the slow query log is:

# Time: 111228 11:52:30
# User@Host: root[root] @ localhost []
# Thread_id: 57  Schema: mywebsql_demo  Last_errno: 0  Killed: 0
# Query_time: 204.981516  Lock_time: 0.000133  Rows_sent: 10000  Rows_examined: 8721904  Rows_affected: 0  Rows_read: 10000
# Bytes_sent: 164952  Tmp_tables: 2  Tmp_disk_tables: 1  Tmp_table_sizes: 1995142008
# InnoDB_trx_id: 856C30
# QC_Hit: No  Full_scan: Yes  Full_join: Yes  Tmp_table: Yes  Tmp_table_on_disk: Yes
# Filesort: Yes  Filesort_on_disk: Yes  Merge_passes: 662
#   InnoDB_IO_r_ops: 1  InnoDB_IO_r_bytes: 16384  InnoDB_IO_r_wait: 0.034391
#   InnoDB_rec_lock_wait: 0.000000  InnoDB_queue_wait: 0.000000
#   InnoDB_pages_distinct: 11
SET timestamp=1325073150;
select customer.store_id, address.address, category.name   from customer, address, (select * from category) as category order by customer.store_id limit 10000;

See how much more information the slow query log entry contains now; it reports everything from InnoDB I/O and lock-wait statistics to whether an on-disk filesort was needed for the query. This is clearly much, much more helpful than the regular slow query statistics reported by vanilla MySQL.

Conclusion

The only conclusion I can draw is this: get yourself Percona Server, turn on log_slow_verbosity and start using pt-query-digest. Your job of identifying bottleneck queries will be that much simpler.