Databases: Normalization or Denormalization. Which is the better technique?

This has really been a long debate as to which approach is more performance orientated, normalized databases or denormalized databases. So this article is a step on my part to figure out the right strategy, because neither one of these approaches can be rejected outright. I will start of by discussing the pros and cons of both the approaches.

Pros and Cons of a Normalized database design.

Normalized databases fair very well under conditions where the applications are write-intensive and the write-load is more than the read-load. This is because of the following reasons:

  • Normalized tables are usually smaller and have a smaller foot-print because the data is divided vertically among many tables. This allows them to perform better as they are small enough to get fit into the buffer.
  • The updates are very fast because the data to be updated is located at a single place and there are no duplicates.
  • Similarly the inserts are very fast because the data has to be inserted at a single place and does not have to be duplicated.
  • The selects are fast in cases where data has to be fetched from a single table, because normally normalized tables are small enough to get fit into the buffer.
  • Because the data is not duplicated so there is less need for heavy duty group by or distinct queries.

Although there seems to be much in favor of normalized tables, with all the pros outlined above, but the main cause of concern with fully normalized tables is that normalized data means joins between tables. And this joining means that read operations have to suffer because indexing strategies do not go well with table joins.

Now lets have a look at the pros and cons of a denormalized database design.

Pros and cons of denormalized database design.

Denormalized databases fair well under heavy read-load and when the application is read intensive. This is because of the following reasons:

  • The data is present in the same table so there is no need for any joins, hence the selects are very fast.
  • A single table with all the required data allows much more efficient index usage. If the columns are indexed properly, then results can be filtered and sorted by utilizing the same index. While in the case of a normalized table, since the data would be spread out in different tables, this would not be possible.

Although for reasons mentioned above selects can be very fast on denormalized tables, but because the data is duplicated, the updates and inserts become complex and costly.

Having said that neither one of the approach can be entirely neglected, because a real world application is going to have both read-loads and write-loads. Hence the correct way would be to utilize both the normalized and denormalized approaches depending on situations.

Using normalized and denormalized approaches together.

The most common way of mixing denormalized and normalized approaches is to duplicate related columns from one table into another table. Let me show you by example:

Suppose you have a products table and an orders table.
The normalized approach would be to only have the product_id in the orders table and all the other product related information in the products table.

But that would make the query that filters by product_name and sorts by order_date inefficient because both are stored in different tables.

In a fully normalized schema, such a query would be performed in the following manner:

SELECT product_name, order_date
FROM orders INNER JOIN products USING(product_id)
WHERE product_name like 'A%'
ORDER by order_date DESC 

As you can see MySQL here will have to scan the order_date index on the orders table and then compare the corresponding product_name in the products table to see if the name starts with A.

The above query can be drastically improved by denormalizing the schema a little bit, so that the orders table now includes the product_name column as well.

SELECT product_name, order_date
FROM orders
WHERE product_name like 'A%'
ORDER by order_date DESC 

See how the query has become much simpler, there is no join now and a single index on columns product_name, order_date can be used to do the filtering as well as the sorting.

So can both the techniques be used together? Yes they can be, because real word applications have a mix of read and write loads.

Final words.

Although, denormalized schema can greatly improve performance under extreme read-loads but the updates and inserts become complex as the data is duplicate and hence has to be updated/inserted in more than one places.

One clean way to go about solving this problem is through the use of triggers. For example in our case where the orders table has the product_name column as well, when the value of product_name has to be updated, then it can simply be done in the following way:

  • Have a trigger setup on the products table that updates the product_name on any update to the products table.
  • Execute the update query on the products table. The data would automatically be updated in the orders table because of the trigger.

However, when denormalizing the schema, do take into consideration, the number of times you would be updating records compared to the number of times you would be executing SELECTs. When mixing normalization and denormalization, focus on denormalizing tables that are read intensive, while tables that are write intensive keep them normalized.

  • thanks mate

  • Articles with many useful information. Thank you very much

  • Articles with useful content, I like it and read it regularly

  • wadesworld

    One facet you fail to point out is the data anomalies which can arise from denormalization. While denormalization sometimes cannot be prevented for performance reasons, it’s important for young developers to realize there are possible significant detrimental affects to their data as a result. They should not optimize prematurely and should only denormalize if they’ve done the testing to prove their performance concern is valid.

  • That’s a good point. As in every solution you choose there is a trade-off and it really depends on the use case. Should you use denormalization in everyday projects, I don’t think so. But then again when you hit big and have lots of queries that need to order data, you are probably better off by maintaining denormalized tables by either having the actual table denormalized or by maintaining materialized views. Sadly MySQL does not have materialized views, so it comes down to having separate tables that are denormalized and that duplicate the data from other tables.

  • ECM ECM

    First thing that came to my mind when reading Ovais topic is the point you raise here. However, the critical question should be how or why does writing complex queries create data load issues. In other words how can we correctly/scientifically measure the rate of loss of efficiency in data load using these queries in normalized and un-normalized databases.

    We should not confuse what is clear and simple versus long and complex queries for human understanding (when writing those queries) versus their efficiency when the computer reads and loads using these queries.

    I myself strongly support very normalized database, one of the reasons being the actual normalization resembles and helps better control the business rules and application logic that can be created for it (that way you less complicated application overall).

  • Duplicating data into denormalized tables is a general strategy used by read-heavy and large-scale web applications because its costly to join tables when you are joining thousands of thousands of rows. If you add sorting into the mix, then that’s another reason to denormalize because MySQL can only choose an index for sorting from the first table in the join. That’s not an issue if you denormalize because then you only fetch data from a single table.

  • Every technology has limitations and strong points, same is the case with MySQL. Joining many tables is a complex operation and is time consuming because you are traversing many trees and sometimes its not feasible for example when you have requirements such as “95% of the queries should finish in 5 milliseconds which is not uncommon for some of the web shops that I work with”. In such cases you have to be creative and employ other techniques. And denormalization is one. Materialized views is a sort of denormalization. Unfortunately MySQL does not have a notion of materialized views, hence you create denormalized tables together with normalized tables so that you can use them for queries that involve querying multiple tables and ordering data.

  • This is really a great blog.

    Best skin lightening cream

  • Zahid hussain

    this is agreat source to learn about DBMS

  • planecrash

    I agree completely. Start with the normal forms until you run into a reason not to.

buy antibiotics online