Solid State Drives (SSDs) have made it big and have made their way not only into desktop computing but also into mission-critical servers. SSDs have proved to be a breakthrough in IO performance and leave HDDs far behind in terms of random IO performance. Random IO is what most database administrators are concerned about, as it makes up roughly 90% of the IO pattern visible on database servers like MySQL. I have found the Intel 520-series and Intel 910-series to be quite popular, and they do give very good numbers in terms of random IOPS. However, it’s not just performance that you should be concerned about; failure prediction and health gauges are also very important, as loss of data is a big no-no. There is a great deal of misconception about the endurance level of SSDs, as they are mostly compared to rotating disks even when measuring endurance. However, there is a big difference in how SSDs and HDDs work, and that has a direct impact on the endurance level of an SSD.

I will mostly be talking about MLC SSDs. Now let’s start off with an SSD primer.

SSD Primer

The smallest unit of SSD storage that can be read or written is a page, which is typically 4KB or 8KB in size. These pages are organized into blocks that are typically between 256KB and 1MB in size. SSDs have no mechanical parts, no heads, and no seeks as in conventional rotating disks. Reads simply involve reading pages from the SSD; it’s the writes that are more tricky. Once you write to a page on an SSD, you cannot simply overwrite it (if you want to write new data) in the same way you do with an HDD. Instead, you must erase the contents and then write again. However, an SSD can only do erasures at the block level, not the page level. What this means is that the SSD must relocate any valid data in the block to be erased before the block can be erased and have new data written to it. To summarize, writes mean erase+write. Nowadays, SSD controllers are intelligent and do erasures in the background so that the latency of the write operation is not affected. These background erasures are typically done within a process known as garbage collection. You can imagine that if these erasures were not done in the background, writes would be too slow.
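The NAND page and erase-block sizes are not usually exposed directly by the operating system, but as a rough illustration of the erase granularity you can look at the discard (TRIM) parameters a drive advertises to the Linux kernel (assuming here that the SSD is visible as /dev/sda):

# Discard (TRIM) capabilities the kernel sees for the drive: DISC-GRAN and DISC-MAX columns
lsblk --discard /dev/sda
# The discard granularity in bytes, straight from sysfs
cat /sys/block/sda/queue/discard_granularity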

Of course every SSD has a lifespan after which it can be seen as unusable; let’s see what factors matter here.

SSD Lifespans

The lifespan of the blocks that make up an SSD is really the number of times erasures and writes can be performed on those blocks. The lifespan is measured in terms of erase/write cycles. Typically, enterprise-grade MLC SSDs have a lifespan of about 30000 erase/write cycles, while consumer-grade MLC SSDs have a lifespan of 5000 to 10000 erase/write cycles. This makes it clear that the lifespan of an SSD depends on how much it is written to. If you have a write-intensive workload, then you should expect the SSD to fail much more quickly than with a read-heavy workload. This is by design.
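As a back-of-the-envelope sketch of what these cycle counts mean (every number below is an assumption for illustration, including the write amplification factor), you can estimate how long a drive might last under a given write load:

# Rough endurance estimate; all values are assumptions, adjust for your drive and workload
capacity_gb=80            # usable capacity of the SSD
rated_cycles=5000         # consumer-grade MLC erase/write cycles
write_amplification=2     # assumed average write amplification
daily_writes_gb=50        # assumed host writes per day
total_writes_gb=$(( capacity_gb * rated_cycles / write_amplification ))
echo "Estimated lifetime: $(( total_writes_gb / daily_writes_gb )) days"

With these assumed numbers the drive could absorb roughly 200TB of host writes, or around 4000 days at 50GB/day; the point is simply that write volume, not age, drives the estimate.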
To offset this behaviour of writes reducing the life of an SSD, engineers use two techniques: wear-levelling and over-provisioning. Wear-levelling works by making sure that all the blocks in an SSD are erased and written to in an evenly distributed fashion, which ensures that some blocks do not die more quickly than others. Over-provisioning SSD capacity is another technique that increases SSD endurance. This is accomplished by having a large population of blocks to distribute erases and writes over time (a bigger-capacity SSD), and by providing a large spare area. Many SSD models over-provision the space; for example, an 80GB SSD could have 10GB of over-provisioned space, so that while it is actually 90GB in size it is reported as an 80GB SSD. While this over-provisioning is done by the SSD manufacturers, it can also be done by not utilising the entire SSD, for example by partitioning the SSD in such a way that you only partition about 75% to 80% of it and leave the rest as raw space that is not visible to the OS/filesystem. So while over-provisioning takes away some part of the disk capacity, it gives back in terms of increased endurance and performance.
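As a minimal sketch of that manual over-provisioning approach (assuming a new, blank SSD at /dev/sdb), you could simply leave the last ~20% of the drive unpartitioned:

# Partition only the first 80% of the drive and leave the rest as raw, unused space
parted /dev/sdb mklabel gpt
parted /dev/sdb mkpart primary 0% 80%

Note that the unused area only acts as spare area if those blocks are actually free from the drive’s point of view, so this works best on a fresh or secure-erased drive.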

Now comes the important part of the post that I would like to discuss.

Health Measurement and Failure Predictability

As you may have noticed after reading the above part of this post, it’s all the more important to be able to predict when an SSD will fail and to be able to see health-related information about it. Yet I haven’t found much written about how to gauge the health of an SSD. RAID controllers employed with SSDs tend to be very limited in terms of the amount of information they provide about an SSD that could allow predicting when it might fail. However, most SSDs provide a lot of information via S.M.A.R.T., and this can be leveraged to good effect.
Let’s consider the example of Intel SSDs; these SSDs have two S.M.A.R.T. attributes that can be leveraged to predict when the SSD will fail. These attributes are:

  • Available_Reservd_Space: This attribute reports the number of reserve blocks remaining. The value of the attribute starts at 100, which means that the reserved space is 100 percent available. The threshold value for this attribute is 10, meaning 10 percent availability, which indicates that the drive is close to its end of life.
  • Media_Wearout_Indicator: This attribute reports the number of erase/write cycles the NAND media has performed. The value of the attribute decreases from 100 to 1 as the average erase cycle count increases from 0 to the maximum rated cycles. Once the value of this attribute reaches 1, the number will not decrease, although it is likely that significant additional wear can be put on the device. A value of 1 should be thought of as the threshold value for this attribute.

Using the smartctl tool (part of the smartmontools package) we can very easily read the values of these attributes and then use them to predict failures. For example, for SATA SSDs attached to an LSI MegaRAID controller, we could read the values of those attributes using the following bash snippet:

# Normalised values of the two attributes for the SSD behind ${device_id} on the MegaRAID controller
Available_Reservd_Space_current=$(smartctl -d sat+megaraid,${device_id} -a /dev/sda | grep "Available_Reservd_Space" | awk '{print $4}')
Media_Wearout_Indicator_current=$(smartctl -d sat+megaraid,${device_id} -a /dev/sda | grep "Media_Wearout_Indicator" | awk '{print $4}')

The above information can then be used in different ways: we could raise an alert if a value is nearing its threshold, or measure how quickly the values decrease and use that rate of decrease to estimate when the drive could fail.
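As a minimal sketch of the alerting approach (the thresholds, the MegaRAID device id and the alerting style are all assumptions here), a cron-friendly check could look like this:

#!/bin/bash
# Warn when either Intel S.M.A.R.T. attribute gets close to its end-of-life threshold
device_id=0     # assumed MegaRAID device id; adjust for your controller
# The '+ 0' in awk simply strips any leading zeroes from the normalised VALUE column
reserved=$(smartctl -d sat+megaraid,${device_id} -a /dev/sda | grep "Available_Reservd_Space" | awk '{print $4 + 0}')
wearout=$(smartctl -d sat+megaraid,${device_id} -a /dev/sda | grep "Media_Wearout_Indicator" | awk '{print $4 + 0}')
if [ "${reserved}" -le 15 ] || [ "${wearout}" -le 10 ]; then
    echo "WARNING: /dev/sda is nearing its end of life (reserved=${reserved}, wearout=${wearout})"
    exit 1
fi
echo "OK: /dev/sda (reserved=${reserved}, wearout=${wearout})"

Logging these values over time, for example once a day, also gives you the rate of decrease needed to estimate when the drive could fail.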

SSDs and RAID levels

RAID has typically been used with HDDs for data protection via redundancy and for increased performance, and it has found its use with SSDs as well. It’s common to see RAID level 5 or 6 being used with SSDs on mixed read/write workloads, because the write penalty seen when using these levels with rotating disks is not of the same extent with SSDs: there is no disk seek involved, so the read-modify-write cycle typically involved with parity-based RAID levels does not cause a big performance hit. On the other hand, striping and mirroring do improve the read performance of SSDs a lot, and redundant arrays using SSDs deliver far better performance compared to HDD arrays.
But what about data protection? Do the parity-based RAID levels and mirroring provide the same level of data protection for SSDs as they are thought to? I am skeptical about that, because as I have mentioned above, the endurance of an SSD depends a lot on how much it has been written to. In parity-based RAID configurations, a lot of extra writes are generated because of parity changes, and these of course decrease the lifespan of the SSDs. Similarly, in the case of mirroring, I am not sure it can provide any benefit against wear-out if both SSDs in the mirror are the same age. Why? Because in mirroring both SSDs in the array receive the same amount of writes, and hence their lifespans decrease at the same rate.
I would think that some drastic changes are needed to the thought process around data protection and RAID levels, because for me a parity-based or mirroring configuration is not going to provide any extra data protection in cases where the SSDs used are of similar ages. It might actually be a good idea to periodically replace drives with younger ones, so as to make sure that all the drives do not age together.

I would like to know what my readers think!