A few notes on InnoDB PRIMARY KEY

InnoDB is very different from MyISAM in how it stores data and how the indexes works. InnoDB uses an index-organized data storage technique, wherein the primary key acts as the clustered index and this clustered index holds the data. There is no separate storage of data, its stored within the clustered index in the leaf nodes.

Its for this reason that understanding the basics of InnoDB primary key is very important, and hence the need for these notes. The purpose of these notes is to help you with selecting the correct primary keys.

Now let’s start off!

Understanding the primary key

Following are a few things that you should know about the primary key:

  • If you don’t create a primary key, InnoDB will first try to use a unique non-null key if present. And if there is no such key present then InnoDB will by itself generate a hidden 6-byte integer primary key and cluster on that. This value may or may not be a feasible choice for your data access patterns.
  • With InnoDB, the columns making up the primary key are automatically appended to the secondary indexes. This has two side-effects:
    1. if you have a long primary key, then all of your secondary indexes will become long as well,
    2. the secondary indexes can act as covering indexes, for data retrievals that need secondary indexed column values as well as the primary key column value.
  • InnoDB does not compress indexes as MyISAM does, which means that you will need comparatively large amount of disk space.

Selecting the right primary key

Following are a few consequences from the above mentioned points about the primary key:

  • You should strive for a small primary key if you’ll have many indexes on a table. That’s going to save you disk space as well as will allow for much better caching.
  • The easiest way to select a primary key is to use an AUTO_INCREMENT column as the primary key. This will ensure that rows are inserted in sequential order and will allow for very efficient clustering.
  • Avoid using random non-sequential values as primary keys, because such type of values spread out the data making the index insertion random and do not give any helpful clustering.

I hope after reading this post you will have a better understanding of how to select the right primary key, and I hope this forms a part of your handy reference.

  • SomeGuyNamedPaul

    …and no using a varchar for your primary key won’t save you any space at all. The full possible width of the column is what’s stored in the PK as well as the secondaries.

  • http://www.pythian.com/news/22511/log-buffer-217-a-carnival-of-the-vanities-for-dbas/ Log Buffer #217, A Carnival of the Vanities for DBAs | The Pythian Blog

    [...] his open source tech rumblings, Ovais Tariq makes few notes on InnoDB PRIMARY [...]

  • http://www.ovaistariq.net/ Ovais Tariq

    Well mostly char is a better choice instead of varchar, if you have a lots of updates happening and you only have a few very large string values,. also it helps to avoid from fragmentation,.

    I would love to have you explain varchar’s effect on primary key,.

  • http://www.ovaistariq.net/ Ovais Tariq

    Also, although I have mentioned that InnoDB does not compress indexes, but the newer version of InnoDB plugin,. compresses both the data and the indexes.,

  • nav

    Yeah thats correct Ovais.. as i was going to mention it.. not only the innodb pluggin the newer version of MySQL 5.5 which has innodb pluggin built in as a default engine also having the same feature.

    Cheers!!!!

  • http://www.ovaistariq.net/ Ovais Tariq

    Yeah that’s a great feature but still I haven’t seen any benchmarks proving the performance benefits of compressed InnoDB tables.,

  • nav
  • http://www.ovaistariq.net/ Ovais Tariq

    Thanks for the links., but I guess I will have to have a more thorough look at the compression myself,.

  • nav

    No problem.. look at it by yourself and update it here too, i would be interesdted in output you get.. :)

    Thanks..!!!!

  • http://javarevisited.blogspot.com/ Javin Paul

    with my experience numeric primary keys are best performer than non numeric and easy to handle search and sort at both database and application level.

    Javin
    10 basic mysql commands to remember

  • http://www.ovaistariq.net/ Ovais Tariq

    sure

  • http://www.ovaistariq.net/ Ovais Tariq

    yes you are right, numeric primary keys are almost always better for following reasons:
    - they consume less space
    - they are fast
    - they work with AUTO_INCREMENT which ensures that rows are inserted in sequential order

  • http://maxnim.wordpress.com/2011/04/30/mysql-innodb%ec%9d%98-primary-key-%ed%8a%b9%ec%84%b1/ MySQL InnoDB의 primary key 특성 « turtle9
  • Hsinghjatt

     confusion to take long primary key which one is adoptable such as  feedback_detail_id  and create the primary key on this filed is this is long primary key or  if we create primary on many field suchas (emp_id,name,emp_code)is this long which consider ???????????

  • http://www.ovaistariq.net/ Ovais Tariq

    Long primary key means a primary key on many fields, read my post here to understand why that would impact performance.

  • dharmendra patel

    thanks for sharing this article… please allow me to share a useful article of mine on InnoDB Disabled problem

    Thanks
    Dharmendra

  • gopal

    Hello Tariq,

    Thanks for nice article. I have following question:

    Saving in the space due to short primary key is reflected in ‘SHOW TABLE STATUS’ or not?

    E.g I have a table with following structure:
    CREATE TABLE `ABC` (
    `Datafrom` date NOT NULL DEFAULT ’0000-00-00′,
    `Variable` mediumint(8) unsigned NOT NULL DEFAULT ’0′,
    `Value` decimal(10,2) NOT NULL DEFAULT ’0.00′,
    `Company` enum(‘x’,'y’) COLLATE latin1_german1_ci NOT NULL DEFAULT ‘GmbH’,
    `Language` mediumint(8) unsigned NOT NULL DEFAULT ’0′,
    `Country` smallint(5) unsigned NOT NULL DEFAULT ’0′,
    `account_status` enum(‘a’,'b’,'c’) COLLATE latin1_german1_ci NOT NULL DEFAULT ‘active’,
    UNIQUE KEY `Datafrom` (`Datafrom`,`Variable`,`Company`,`Language`,`Country`,`account_status`),
    KEY `Inserted` (`Datafrom`),
    KEY `Id` (`Variable`)
    ) ENGINE=InnoDB

    and I have changed by adding a ‘auto increment’ primary key:
    CREATE TABLE `ABC` (
    `id` int(11) NOT NULL AUTO_INCREMENT,
    `Datafrom` date NOT NULL DEFAULT ’0000-00-00′,
    `Variable` mediumint(8) unsigned NOT NULL DEFAULT ’0′,
    `Value` decimal(10,2) NOT NULL DEFAULT ’0.00′,
    `Company` enum(‘x’,'y’) COLLATE latin1_german1_ci NOT NULL DEFAULT ‘GmbH’,
    `Language` mediumint(8) unsigned NOT NULL DEFAULT ’0′,
    `Country` smallint(5) unsigned NOT NULL DEFAULT ’0′,
    `account_status` enum(‘a’,'b’,'c’) COLLATE latin1_german1_ci NOT NULL DEFAULT ‘active’,
    PRIMARY KEY (`id`),
    KEY `Inserted` (`Datafrom`),
    KEY `Id` (`Variable`),
    UNIQUE KEY `Datafrom` (`Datafrom`,`Variable`,`Company`,`Language`,`Country`,`account_status`)
    ) ENGINE=InnoDB

    This table has more than 16 million records.

    ‘Show Table Status’ gives following result:
    Before:
    Rows: 16397283
    Avg_row_length: 39
    Data_length: 645922816
    Max_data_length: 0
    Index_length: 640614400

    After:
    Rows: 16396993
    Avg_row_length: 43
    Data_length: 716177408
    Max_data_length: 0
    Index_length: 818479104

    Here we can see that Index size is increased considerably.

    In such a scenario, is it OK to change the primary key or not?

    Thanks in advance,
    Gopal

  • http://www.ovaistariq.net/ Ovais Tariq

    In the first table definition you do not have an explicit primary key defined which means that InnoDB should use the first non-nullable unique key available as the Primary Key or an implicit 6-byte key. Since you do have a unique key which is created on non-nullable columns so InnoDB should just use that.
    There are couple of considerations with respect to primary key, first of all you should make sure that the primary key is defined in such as way that inserts are in increasing order otherwise writes will suffer. Secondly keep the primary key small, because primary key columns get appended to all the secondary keys. So by adding the new primary key column “id” it will definitely give you the advantage of improved write performance and smaller index sizes.

    However, the SHOW TABLE STATUS does not reflect that in the output that you have shared. Please note that table statistics reported are only approximate and can vary. Other then you might be hitting some bug that is reporting wrong numbers or isn’t selecting the unique key as the primary key. In any case its a bad idea to have an implicit primary key.

  • gopal

    Hello Tariq,

    As per my further analysis, SHOW TABLE STATUS output was correct, as same was reflected in dump size of 2 tables.

    Another thing I did to verify the point by creating additional indexes in both the table.

    alter table ABC add key (country), add key (company), add key (language), add key (account_status);

    after adding additional indexes difference in Index size of the 2 versions of the table is as expected.

    e.g.
    # Without Explicit Primary Key
    Index_length: 1932361728

    # with Auto Increment Primary Key
    Index_length: 1612201984

    As per my this analysis, it really depends on how many secondary indexes you have in a table, and performance should be measured accordingly.

  • http://www.ovaistariq.net/ Ovais Tariq

    Well efficient space utilization does indeed depend on the number of secondary indexes. We cannot simply measure performance by the number of secondary indexes. As I have explained in the post its important for a primary key to be defined in such a way that writes are always ordered by the primary key, and such that you are not writing in the primary key in random places. This really impacts performance a lot. That is why in your case clearly a surrogate primary key created on an auto incremented column is much better.
    Of course performance also depends on the number of secondary indexes, but as long as the secondary indexes are non-unique InnoDB can offset the performance problem by buffering changes to the secondary index in memory.

blog comments powered by Disqus