Does having small data-sets really help? Of course it does!

Are memory lookups faster than disk lookups? Of course!

I have often seen people complain that queries which used to run quickly are now taking too long. There is one big reason for this: earlier, the data-set was small enough to fit into memory. Now that it has grown too large to fit entirely into memory, disk seeks have slowed the queries down significantly.

What to do now? Vertical partitioning.

Divide the data-set into separate data-sets vertically. That is, split the table vertically, separating the frequently used columns from the infrequently used blob, text and varchar columns.
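As a rough sketch of what such a split might look like (the table and column names here are hypothetical, not from a real schema), a wide table of posts can be divided into a narrow table of frequently queried columns and a companion table holding the rarely read text:

```sql
-- Hypothetical original table: frequently queried metadata
-- mixed with a rarely read text column.
CREATE TABLE posts (
  post_id bigint unsigned NOT NULL,
  ts      timestamp NOT NULL,
  title   text,
  PRIMARY KEY (post_id)
);

-- After the vertical split: the narrow table stays small
-- enough to fit in memory, so the common queries are fast ...
CREATE TABLE posts_hot (
  post_id bigint unsigned NOT NULL,
  ts      timestamp NOT NULL,
  PRIMARY KEY (post_id)
);

-- ... while the bulky text column lives in its own table,
-- read only when it is actually needed.
CREATE TABLE posts_text (
  post_id bigint unsigned NOT NULL,
  title   text,
  PRIMARY KEY (post_id)
);
```

Both halves share the same primary key, so a row can always be reassembled when required.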

There is a big difference between access times for long tables and access times for short tables. A short table will almost certainly fit into memory, thereby ensuring that lookups are very fast, in contrast to a long table with blob, text and varchar columns, which does not fit into memory.

So next time, before you de-normalize, keep in mind that if there are long tables, performance is almost certainly going to suffer.

Example:

I have two tables:

Long Table (long_table):

CREATE TABLE `long_table` (
  `ts` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  `user_type` tinyint(3) unsigned NOT NULL DEFAULT '1',
  `post_id` bigint(20) unsigned NOT NULL,
  `title` text
) ENGINE=MyISAM DEFAULT CHARSET=latin1

Short Table (short_table):

CREATE TABLE `short_table` (
 `ts` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
 `user_type` tinyint(3) unsigned NOT NULL DEFAULT '1',
 `post_id` bigint(20) unsigned NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1

The two tables are identical with one exception: the long table has one extra column, title, of type text. Both tables have the same number of rows, 10K.

Now let's run identical queries on both tables and see how long they take:

SELECT COUNT(ts) 
FROM short_table 
WHERE ts > DATE_SUB(CURRENT_TIMESTAMP(), INTERVAL 62 DAY);
1 row in set (0.01 sec)

SELECT COUNT(ts) 
FROM long_table 
WHERE ts > DATE_SUB(CURRENT_TIMESTAMP(), INTERVAL 62 DAY);
1 row in set (0.11 sec)

See the time difference between the queries. This difference can easily grow for a table with millions of records.

Let's run both queries on tables with 1,371,526 records.

SELECT COUNT(ts) 
FROM short_table 
WHERE ts > DATE_SUB(CURRENT_TIMESTAMP(), INTERVAL 10 DAY);
1 row in set (1.09 sec)

SELECT COUNT(ts) 
FROM long_table 
WHERE ts > DATE_SUB(CURRENT_TIMESTAMP(), INTERVAL 10 DAY);
1 row in set (9.13 sec)

The results speak for themselves.
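When a query does need the infrequently used columns, the two halves of a vertical split can be recombined with a join on the shared key. A sketch, assuming the split produced a narrow table posts_hot(post_id, ts) and a companion posts_text(post_id, title), both hypothetical names:

```sql
-- Only the queries that actually need the text column pay
-- the cost of touching the wide table.
SELECT h.post_id, h.ts, t.title
FROM posts_hot h
JOIN posts_text t ON t.post_id = h.post_id
WHERE h.ts > DATE_SUB(CURRENT_TIMESTAMP(), INTERVAL 10 DAY);
```

Everything else keeps hitting only the narrow, memory-resident table.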