Performance tuning using vertical partitioning.

Does having small data-sets really help? Of course it does!

Are memory lookups faster that disk lookups. Of course !

So many times I have seen people complain about queries taking too long now, while they were not taking that long earlier. There is one big reason for this, earlier the size of data-set was small so it could fit into memory. Now that the data-set has grown large enough that it cannot fit entirely into memory, the disk seeks really have slowed down the queries significantly.

What to do now? Vertical partitioning.

Divide the data-set into separate data-sets vertically. That is, divide the table vertically, separating columns that are used more frequently, from other blob, text and varchar columns that are used infrequently.

There is a big difference between table access times for long tables and table access times for short tables. Short tables will almost certainly fit into memory, there by ensuring that lookups are very fast, in contrast to long tables with blob, text and varchar columns which do not fit into memory.

So next time before you de-normalize, keep in mind that if there are long tables, then almost certainly performance is going to suffer.

Example:

Let’s take a look at an example. I have two tables:

Long Table (long_table):

CREATE TABLE `long_table` (
  `ts` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  `user_type` tinyint(3) unsigned NOT NULL DEFAULT '1',
  `post_id` bigint(20) unsigned NOT NULL,
  `title` text
) ENGINE=MyISAM DEFAULT CHARSET=latin1

Short Table (short_table):

CREATE TABLE `short_table` (
 `ts` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
 `user_type` tinyint(3) unsigned NOT NULL DEFAULT '1',
 `post_id` bigint(20) unsigned NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1

Both the tables are identical with one exception. The long table has one extra column, title of type text. Both the tables have identical number of rows, i.e., 10K.

Now lets run identical queries on both the tables and see how much time the queries take:

SELECT COUNT(ts)
FROM short_table
WHERE ts > DATE_SUB(CURRENT_TIMESTAMP(), INTERVAL 62 DAY);
1 row in set (0.01 sec)
SELECT COUNT(ts)
FROM long_table
WHERE ts > DATE_SUB(CURRENT_TIMESTAMP(), INTERVAL 62 DAY);
1 row in set (0.11 sec)

See the time difference between the queries. This difference could increase easily for a table with million records.

Lets run both the queries on tables with 1371526 records.

SELECT COUNT(ts)
FROM short_table
WHERE ts > DATE_SUB(CURRENT_TIMESTAMP(), INTERVAL 10 DAY);
1 row in set (1.09 sec)
SELECT COUNT(ts)
FROM long_table
WHERE ts > DATE_SUB(CURRENT_TIMESTAMP(), INTERVAL 10 DAY);
1 row in set (9.13 sec)

The results speak for themselves.

  • http://www.apheliondynamics.com Fadi Chalfoun

    Thanks for reading my own post about vertical partitioning here http://apheliondynamics.com/blog/2010/02/11/database-optimization-vertical-partitioning-in-mysql it’s good to see you use examples and data as well to back up your claims.

    How did you fill up your tables with data to time your queries?

  • http://www.apheliondynamics.com Fadi Chalfoun

    Thanks for reading my own post about vertical partitioning here http://apheliondynamics.com/blog/2010/02/11/database-optimization-vertical-partitioning-in-mysql it’s good to see you use examples and data as well to back up your claims.

    How did you fill up your tables with data to time your queries?

  • http://tech.ovaistariq.net/ Ovais Tariq

    Well I had a hits table in a wordpress based project that I am working on that records hits recieved on posts, I just used the data from that table

  • http://tech.ovaistariq.net/ Ovais Tariq

    Well I had a hits table in a wordpress based project that I am working on that records hits recieved on posts, I just used the data from that table

  • http://bzzuqggjezlu.com/ xfpoyclsjxw

    IBM87d haurfsszmoud, [url=http://qflzcbxggytv.com/]qflzcbxggytv[/url], [link=http://ayhpmdizgjdp.com/]ayhpmdizgjdp[/link], http://dbhjjzdzajoo.com/

  • http://bzzuqggjezlu.com/ xfpoyclsjxw

    IBM87d haurfsszmoud, [url=http://qflzcbxggytv.com/]qflzcbxggytv[/url], [link=http://ayhpmdizgjdp.com/]ayhpmdizgjdp[/link], http://dbhjjzdzajoo.com/

  • http://faisalferoz.wordpress.com Faisal Feroz

    Vertical Partitioning is a technique for tables with Long/Big Row Sizes. Lets say you don’t have any Blob/Text column (more detail is available in MySQL documentation about which datatypes fit into the memory space and which not) than Vertical Partitioning is not going to be of much help. Infact it might even slow down the performance of queries for the added join required to retrieve the whole dataset. In such scenarios Horizontal Partitioning is a much better option.

  • http://faisalferoz.wordpress.com Faisal Feroz

    Vertical Partitioning is a technique for tables with Long/Big Row Sizes. Lets say you don’t have any Blob/Text column (more detail is available in MySQL documentation about which datatypes fit into the memory space and which not) than Vertical Partitioning is not going to be of much help. Infact it might even slow down the performance of queries for the added join required to retrieve the whole dataset. In such scenarios Horizontal Partitioning is a much better option.

  • http://www.ovaistariq.net/ ovais.tariq

    That’s exactly what I have mentioned in my post, I have mentioned text / varchar columns. I feel you probably didn’t read my entire post :)

  • http://www.ovaistariq.net/ ovais.tariq

    That’s exactly what I have mentioned in my post, I have mentioned text / varchar columns. I feel you probably didn’t read my entire post :)

  • http://www.pharmacytechnicianblog.com/ pharmacy technician

    found your site on del.icio.us today and really liked it.. i bookmarked it and will be back to check it out some more later

  • http://www.pharmacytechnicianblog.com/ pharmacy technician

    found your site on del.icio.us today and really liked it.. i bookmarked it and will be back to check it out some more later

  • http://asdfjieieie@asdf.com physical therapist

    found your site on del.icio.us today and really liked it.. i bookmarked it and will be back to check it out some more later

  • andras

    Thanks for this post. If I do need to retrieve the “title” column for a range of “post_id”, I have to join with the new “title” table (if I follow your advice). Is there still a gain (with Mysql) to have this vertical partitionning or not ?

  • andras

    Thanks for this post. If I do need to retrieve the “title” column for a range of “post_id”, I have to join with the new “title” table (if I follow your advice). Is there still a gain (with Mysql) to have this vertical partitionning or not ?

  • http://www.ovaistariq.net/ ovais.tariq

    Considering the scenario, yeah there would be a separate table called details which would have columns like title, content, etc.
    But this table is going to have the post_id as the foreign key in it as well, otherwise how would you relate the titles to the actual post.

    Hence there would be no join needed, since you have the post_id column in the details table, you would run a simple range scan query as follows:

    SELECT title FROM details WHERE post_id IN (1, 2, 500, 23, 44);

    Do remember to have the post_id column indexed.

  • http://www.ovaistariq.net/ ovais.tariq

    Considering the scenario, yeah there would be a separate table called details which would have columns like title, content, etc.
    But this table is going to have the post_id as the foreign key in it as well, otherwise how would you relate the titles to the actual post.

    Hence there would be no join needed, since you have the post_id column in the details table, you would run a simple range scan query as follows:

    SELECT title FROM details WHERE post_id IN (1, 2, 500, 23, 44);
    

    Do remember to have the post_id column indexed.

  • andras

    Thanks for your answer.
    Then do you have benchmarked whether an imbricated request :
    SELECT title FROM details WHERE post_id IN (SELECT post_id FROM short_table WHERE ts > DATE_SUB(CURRENT_TIMESTAMP(), INTERVAL 10 DAY)) t;

    is (much) quicker than :
    SELECT title
    FROM long_table
    WHERE ts > DATE_SUB(CURRENT_TIMESTAMP(), INTERVAL 10 DAY);
    ?

  • andras

    Thanks for your answer.
    Then do you have benchmarked whether an imbricated request :
    SELECT title FROM details WHERE post_id IN (SELECT post_id FROM short_table WHERE ts > DATE_SUB(CURRENT_TIMESTAMP(), INTERVAL 10 DAY)) t;

    is (much) quicker than :
    SELECT title
    FROM long_table
    WHERE ts > DATE_SUB(CURRENT_TIMESTAMP(), INTERVAL 10 DAY);
    ?

  • http://www.ovaistariq.net/ ovais.tariq

    There are few things to consider when talking about the performance difference between the two queries that you have mentioned:

    What is the write load on the database?
    What is the read load on the database?
    Is the title field a really long blob or text field?
    Is the title field referenced very frequently?

    Depending on the answers to the above questions, we would come to the conclusion is to which query should be used in the long run.

    Of course if the title field is not really that big a field, i.e. if its a varchar field, then keeping it in the long_table and creating an index (ts, title) would have this query perform well under read load.

    But in the case where title is a blob field then you would benefit keeping it in the details table and joining it with the table that holds the post_id as follows:

    SELECT title FROM details INNER JOIN short_table USING (post_id)
    WHERE ts > DATE_SUB(CURRENT_TIMESTAMP(), INTERVAL 10 DAY);

    You will notice that I have changed your original query from using a subquery to a JOIN, the reason is that JOINs perform better than subquery.

  • http://www.ovaistariq.net/ ovais.tariq

    There are few things to consider when talking about the performance difference between the two queries that you have mentioned:

    What is the write load on the database?
    What is the read load on the database?
    Is the title field a really long blob or text field?
    Is the title field referenced very frequently?

    Depending on the answers to the above questions, we would come to the conclusion is to which query should be used in the long run.

    Of course if the title field is not really that big a field, i.e. if its a varchar field, then keeping it in the long_table and creating an index (ts, title) would have this query perform well under read load.

    But in the case where title is a blob field then you would benefit keeping it in the details table and joining it with the table that holds the post_id as follows:

    SELECT title FROM details INNER JOIN short_table USING (post_id)
    WHERE ts > DATE_SUB(CURRENT_TIMESTAMP(), INTERVAL 10 DAY);

    You will notice that I have changed your original query from using a subquery to a JOIN, the reason is that JOINs perform better than subquery.

  • andras

    Thanks for these precisions !

  • andras

    Thanks for these precisions !

  • http://www.ovaistariq.net/ ovais.tariq

    You are welcome :)

    I suggest you also read another post of mine on a similar topic
    Databases: Normalization or Denormalization. Which is the better technique?/

    Thanks.

  • http://www.ovaistariq.net/ ovais.tariq

    You are welcome :)

    I suggest you also read another post of mine on a similar topic
    Databases: Normalization or Denormalization. Which is the better technique?/

    Thanks.

  • http://www.topxxxlist.com Polworunt

    Thanks for article. Everytime like to read you.

  • http://www.topxxxlist.com Polworunt

    Thanks for article. Everytime like to read you.

blog comments powered by Disqus