On SSDs – Lifespans, Health Measurement and RAID

Solid State Drives (SSDs) have made it big and have made their way not only into desktop computing but also into mission-critical servers. SSDs have proved to be a breakthrough in IO performance and leave HDDs far behind in terms of random IO performance. Random IO is what most database administrators are concerned about, as it makes up around 90% of the IO pattern visible on database servers like MySQL. I have found the Intel 520-series and Intel 910-series to be quite popular, and they do give very good numbers in terms of random IOPS. However, it's not just performance that you should be concerned about; failure prediction and health gauges are also very important, as loss of data is a big no-no. There is a great deal of misconception about the endurance level of SSDs, as they are mostly compared to rotating disks even when measuring endurance. However, there is a big difference in how SSDs and HDDs work, and that has a direct impact on the endurance level of an SSD.

I will mostly be talking about MLC SSDs. Now let's start off with an SSD primer.

SSD Primer

The smallest unit of SSD storage that can be read or written is a page, which is typically 4KB or 8KB in size. These pages are organized into blocks, which are typically between 256KB and 1MB in size. SSDs have no mechanical parts and no heads, and there are no seeks needed as in conventional rotating disks. Reads simply involve reading pages from the SSD; it is the writes that are trickier. Once you write to a page on an SSD, you cannot simply overwrite it with new data the way you do with an HDD. Instead, you must erase the contents and then write again. However, an SSD can only perform erasures at the block level, not the page level. What this means is that the SSD must relocate any valid data in a block before that block can be erased and have new data written to it. To summarize, a write means erase+write. Nowadays, SSD controllers are intelligent and perform erasures in the background so that the latency of the write operation is not affected; these background erasures are typically done within a process known as garbage collection. You can imagine that if these erasures were not done in the background, writes would be far too slow.
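To make the cost of the erase-before-write behaviour concrete, here is a toy worst-case calculation. The page and block sizes are the typical values mentioned above, not figures for any specific drive:

```shell
# Toy illustration (assumed sizes): overwriting one 4KB page in a 256KB
# block forces the SSD to relocate every still-valid page before erasing.
page_kb=4
block_kb=256
pages_per_block=$(( block_kb / page_kb ))   # 64 pages per block
valid_pages=$(( pages_per_block - 1 ))      # 63 live pages to relocate
physical_writes=$(( valid_pages + 1 ))      # relocations + the new page
echo "worst case: ${physical_writes} page writes for a single page update"
```

In practice garbage collection and spare area keep the real cost far below this worst case, but the asymmetry between logical and physical writes is exactly why background erasure matters.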

Of course every SSD has a lifespan after which it can be seen as unusable, let’s see what factors matter here.

SSD Lifespans

The lifespan of the blocks that make up an SSD is really the number of times erasures and writes can be performed on those blocks. The lifespan is measured in erase/write cycles. Typically, enterprise-grade MLC SSDs have a lifespan of about 30,000 erase/write cycles, while consumer-grade MLC SSDs have a lifespan of 5,000 to 10,000 erase/write cycles. This makes it clear that the lifespan of an SSD depends on how much it is written to. If you have a write-intensive workload, you should expect the SSD to fail much more quickly than under a read-heavy workload. This is by design.
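A quick back-of-the-envelope estimate shows what these cycle ratings mean in terms of total data written. The figures below use the consumer-grade numbers above and ignore write amplification, so real-world endurance will be lower:

```shell
# Rough endurance estimate: total writable data ~= capacity x rated cycles.
# 80GB drive and 5000 cycles are the consumer-grade MLC figures from the text.
capacity_gb=80
rated_cycles=5000
total_tb=$(( capacity_gb * rated_cycles / 1000 ))
echo "~${total_tb} TB of host writes before rated wearout"
```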
To offset this behaviour of writes reducing the life of an SSD, engineers use two techniques: wear-levelling and over-provisioning. Wear-levelling works by making sure that all the blocks in an SSD are erased and written to in an evenly distributed fashion, so that some blocks do not die more quickly than others. Over-provisioning SSD capacity is another technique that increases SSD endurance. This is accomplished by having a large population of blocks to distribute erases and writes over time (a bigger-capacity SSD), and by providing a large spare area. Many SSD models over-provision space at the factory; for example, an SSD sold as 80GB could actually be 90GB in size, with 10GB kept as over-provisioned space invisible to the host. While this over-provisioning is done by the SSD manufacturer, you can also do it yourself by not utilising the entire SSD, for example by partitioning only about 75% to 80% of the drive and leaving the rest as raw space that is not visible to the OS/filesystem. So while over-provisioning takes away some of the disk capacity, it gives back increased endurance and performance.
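As a sketch of the manual over-provisioning approach, the helper below computes how many sectors to partition so that roughly 20% of the drive stays raw. The `usable_sectors` function and the 80% ratio are illustrative assumptions, not a vendor recommendation:

```shell
# usable_sectors: given a drive's total 512-byte sector count (e.g. from
# `blockdev --getsz /dev/sdX`), print how many sectors to partition so
# that ~20% of the drive is left as raw, unpartitioned space.
usable_sectors() {
    echo $(( $1 * 80 / 100 ))
}
# A drive with 312500000 sectors (~160GB) could then be partitioned as:
#   parted -s /dev/sdX mklabel gpt mkpart primary 2048s "$(usable_sectors 312500000)s"
```

Note that for the controller to actually use the untouched area as spare space, the drive should be secure-erased (or fully TRIMmed) before partitioning, so the firmware knows those blocks hold no valid data.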

Now comes the important part of the post that I would like to discuss.

Health Measurement and Failure Predictability

As you may have noticed from the above part of this post, it is all the more important to be able to predict when an SSD will fail and to be able to see health-related information about it. Yet I haven't found much written about how to gauge the health of an SSD. RAID controllers used with SSDs tend to be very limited in the amount of information they provide about an SSD that could allow predicting when it might fail. However, most SSDs provide a lot of information via S.M.A.R.T., and this can be leveraged to good effect.
Let’s consider the example of Intel SSDs: these drives have two S.M.A.R.T. attributes that can be leveraged to predict when the SSD will fail. These attributes are:

  • Available_Reservd_Space: This attribute reports the number of reserved blocks remaining. The value of the attribute starts at 100, which means that the reserved space is 100 percent available. The threshold value for this attribute is 10, meaning 10 percent availability, which indicates that the drive is close to its end of life.
  • Media_Wearout_Indicator: This attribute reports the number of erase/write cycles the NAND media has performed. The value of the attribute decreases from 100 to 1, as the average erase cycle count increases from 0 to the maximum rated cycles. Once the value of this attribute reaches 1, the number will not decrease, although it is likely that significant additional wear can be put on the device. A value of 1 should be thought of as the threshold value for this attribute.

Using the smartctl tool (part of the smartmontools package), we can very easily read the values of these attributes and use them to predict failures. For example, for SATA SSDs attached to an LSI MegaRAID controller, we could read the normalized values of those attributes with the following bash snippet:

# ${device_id} is the controller-assigned device ID of the SSD (see
# `smartctl --scan`); column 4 of the attribute table is the normalized VALUE.
Available_Reservd_Space_current=$(smartctl -d sat+megaraid,${device_id} -a /dev/sda | grep "Available_Reservd_Space" | awk '{print $4}')
Media_Wearout_Indicator_current=$(smartctl -d sat+megaraid,${device_id} -a /dev/sda | grep "Media_Wearout_Indicator" | awk '{print $4}')

Then the above information can be used in different ways: we could raise an alert if a value is nearing its threshold, or measure how quickly the values decrease and use the rate of decrease to estimate when the drive could fail.
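As a sketch of that idea, the function below (a hypothetical helper, not part of smartmontools) alerts when the normalized value has reached a threshold, and otherwise extrapolates the observed rate of decrease into an estimated number of days remaining:

```shell
# check_wearout: alert when the normalized attribute value reaches the
# threshold; otherwise estimate days remaining from the rate of decrease.
# All arguments are normalized S.M.A.R.T. values sampled by the caller.
check_wearout() {
    local current=$1 previous=$2 days_between=$3 threshold=$4
    if [ "$current" -le "$threshold" ]; then
        echo "ALERT: value ${current} has reached threshold ${threshold}"
        return 1
    fi
    local drop=$(( previous - current ))
    if [ "$drop" -gt 0 ]; then
        # days remaining = distance to threshold / drop per day
        echo $(( (current - threshold) * days_between / drop ))
    fi
}
# e.g. Media_Wearout_Indicator fell from 97 to 95 over 30 days, threshold 10:
check_wearout 95 97 30 10    # prints 1275 (about 3.5 years at this rate)
```

A linear extrapolation like this is only a first approximation; the write rate on a database server varies, so in practice you would recompute the estimate over a rolling window.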

SSDs and RAID Levels

RAID has typically been used with HDDs for data protection via redundancy and for increased performance, and it has found its use with SSDs as well. It is common to see RAID level 5 or 6 used with SSDs on mixed read/write workloads, because the write penalty seen when using these levels with rotating disks is much smaller with SSDs: there is no disk seek involved, so the read-modify-write cycle typically involved with parity-based RAID levels does not cause as big a performance hit. On the other hand, striping and mirroring do improve the read performance of SSDs a lot, and redundant arrays of SSDs deliver far better performance than HDD arrays.
But what about data protection? Do the parity-based RAID levels and mirroring provide the same level of data protection for SSDs as they are thought to? I am skeptical about that, because, as I have mentioned above, the endurance of an SSD depends a lot on how much it has been written to. In parity-based RAID configurations, a lot of extra writes are generated because of parity changes, and these of course decrease the lifespan of the SSDs. Similarly, in the case of mirroring, I am not sure it can provide any benefit against SSD wear-out if both SSDs in the mirror are the same age. Why? Because in mirroring, both SSDs in the array receive the same amount of writes, and hence their lifespans decrease at the same rate.
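To put rough numbers on those extra writes, here is an illustrative count of device-level writes generated by a single small (partial-stripe) logical write under different RAID levels. These are the standard textbook figures, not measurements:

```shell
# Device-level writes per small logical write; each extra write consumes
# erase/write cycles on the SSDs in the array.
raid5_writes=2   # new data block + updated parity block
raid6_writes=3   # new data block + two updated parity blocks
raid1_writes=2   # same data written to both mirror members
echo "RAID5=${raid5_writes} RAID6=${raid6_writes} RAID1=${raid1_writes} device writes per logical write"
```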
I would think that some drastic changes are needed to the thought process around data protection and RAID levels, because to me a parity-based or mirrored configuration is not going to provide any extra data protection in cases where the SSDs used are of similar ages. It might actually be a good idea to periodically replace drives with younger ones, to make sure that all the drives do not age together.

I would like to know what my readers think!

Ovais is a storage architect with keen focus on reliability, efficiency and performance of OLTP databases, more specifically MySQL. He currently works at Uber on ensuring storage reliability and scalability. Previously, he helped different web properties scale their MySQL infrastructure. He also worked at Percona as a Senior MySQL consultant and at a few other startups in various capacities.

  • PFY

    The claim that you can under-partition to increase over-provisioning seems dubious to me; can you cite a reference for this claim?

  • PFY: Short-stroking the disk will certainly help on TRIM-backed installations (ext4 and a modern Linux kernel, for example) because the space will never be marked as being used and the disk will be free to wear-level over it without problems.

  • PFY

    Justin: Thanks for the info, those are new terms to me that leave me with a lot of reading to do. For others:
    TRIM :http://en.wikipedia.org/wiki/TRIM
    Short-stroking: http://en.wikipedia.org/wiki/Short_stroking

  • Andy

    SSD & RAID – is there any way to get TRIM to work for SSDs in RAID?

  • Baruch Even

    You assume that all the disks will fail at the same time; however, even at end of life the drives are more likely to fail in a staggered fashion, since the cycle counts are only minimum guarantees, not the real-life upper bound. A page/block is considered failed when that page, or a certain number of pages in the block, fails to be read, written or erased. That doesn’t happen all at once, due to various considerations: some are environmental (writing when the SSD is too hot or too cold, for example) and some are manufacturing-related, due to defects or inconsistencies across the wafer. That’s also why you see some blocks failing before others in the first place; it’s not just the raw write/erase cycles, it’s the inherent attributes of that block.
    It should also be noted that the page is considered failed not when it can’t program some bit but when the ECC can’t correct enough bits so you may go around with some failed bits for a long time before dropping that page.
    On top of that there are earlier failures that happen in the guaranteed period, these do happen and without RAID you’d lose the data very early on.
    All that said, there is room to consider what happens to a RAID group at its end of life, the monitoring of SSDs should definitely be constructed so that if too many SSDs show signs of nearing their end of life then proactive maintenance is in order and some of them should be replaced with newer ones.
    One thing to add to the monitoring logic that you present: SMART itself is not sufficient; the real metric is disk latency measurements. If you are not monitoring that, you are far more likely to have an unexpected disk failure. I’m not aware of an open study on this with regard to SSDs, but the Google study on HDDs showed that SMART was not very useful. And my gut feeling and experience so far show few failures from SSDs reaching their end of life, and more failing due to other reasons, and most of these were not predicted by SMART.

  • RAID with a hot spare should provide a reliable solution.

  • Baruch,

    You have raised some very interesting points. However, I still think that SSD lifespan can be gauged by measuring the number of write and erase cycles. Of course this will not be 100% accurate, but it will still provide a good approximation. Every SSD has a lifespan that can be defined as the number of writes it can sustain, so that should be a good number for approximating when the drive could fail. Of course there may be other causes of drive failure, such as unexpected hardware failure, and in that case RAID would certainly help. But my point is that considering that the SSDs attached to a controller are all being accessed and written to at the same time, it is more than likely that they will fail at nearly the same time. And that to me is the concerning factor when using RAID levels such as RAID 1. You are right that there are other environmental factors that should be taken into account. But when considering disks attached to the same controller, I do not think the environmental factors will come into play separately for each disk, as they would all be operating in the same environment.

    Regarding SMART: it is by no means a perfect way to monitor the health of a disk, but when you do not have other reliable information, SMART gives enough information to predict when the lifespan of an SSD is nearing its end.

  • Steve Toth

    In my experience, SSDs are much more unreliable than spinning disks, and when they fail you get little warning from SMART. Mirroring should help even if the drives are the same age, as when one fails, the others in the array should still function long enough to swap in a good drive.

    One thing you are not mentioning is that SSDs also need to be refreshed every so often even due to reads. http://en.wikipedia.org/wiki/Read_disturb#Read_disturb

    This is rare though.

  • Of course SMART is not supposed to strictly warn you about a failure; what I am suggesting is that you use it for predicting failure. The way SSDs function is very different from HDDs: HDDs, unlike SSDs, do not have any limit on the number of erasures or writes that can be done.
    SSDs do not typically age the same way as rotating disks, and hence you cannot really rely on mirroring all the time. The life of an SSD depends on how many writes/erasures are done, and in mirroring you are basically doing the same amount of writes and erasures on both drives. So unless you make sure that the mirror consists of drives that will age out at different times, you cannot really rely on mirroring the same way you do for HDDs.

  • Can this problem not be solved? I’m worried about this. Could my data be gone? Then I would lose my job.