Latent Sector Error Modeling and Detection for NAND Flash-based SSDs

Guanying Wu, Chentao Wu, Xubin He
2011-01-01
Abstract: Latent Sector Errors (LSEs) are a well-known problem in HDD-based storage systems. Because LSEs occur silently, they may result in data loss during RAID recovery from a disk failure. LSEs in HDDs have various causes, such as write errors or media imperfections [1], which produce bit/symbol errors that cannot be corrected with ECC. Disk scrubbing detects LSEs by scanning the disk in the background. As pointed out in [2], optimizing the scrubbing strategy requires a good model of LSE development, i.e., when and where LSEs are likely to occur. Inspired by previous work [1] [2], we are investigating the problem of modeling and detecting LSEs in NAND flash-based SSDs. Due to increased density, NAND flash memory is becoming increasingly prone to bit errors [3], which raises the probability of LSEs in SSDs. For example, a reduced feature size potentially shrinks the volume of the floating gate, which stores the electric charge. As a result, the threshold voltage differences (determined by the amount of charge in the cell) among the cell levels are reduced. In addition, with the MLC technique, the more bits stored per cell, the more cell levels must share the threshold window. Because of feature scaling and MLC, the reduced charge differences between cell levels are more vulnerable to errors caused by noise or disturb effects. Moreover, by thinning the oxide layer that isolates the floating gate, feature size scaling amplifies the impact of P/E cycling, which introduces bit errors and reduces the lifetime of the flash. The high bit error rate may be addressed by stronger ECC. However, when dealing with large sectors (due to MLC), most ECC schemes require more extra bits to store the ECC. In addition, the ECC decoding latency grows with increased codeword length and ECC strength [4]. The model of LSE development in HDDs is built upon real field data, which are not available for SSDs due to the limited population.
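The threshold-window argument in the abstract can be illustrated with a minimal sketch. It assumes evenly spaced levels across a fixed threshold window. The 4.0 V window and the helper names `cell_levels` and `level_spacing` are hypothetical, chosen only for illustration; the paper does not give these figures, and real devices do not place levels uniformly.

```python
def cell_levels(bits_per_cell: int) -> int:
    """Number of distinct threshold-voltage levels needed for the given bits per cell."""
    return 2 ** bits_per_cell


def level_spacing(window_volts: float, bits_per_cell: int) -> float:
    """Spacing between adjacent levels, assuming they are spread evenly over the window."""
    return window_volts / (cell_levels(bits_per_cell) - 1)


# Hypothetical threshold window, used only to show the trend (an assumption).
WINDOW_VOLTS = 4.0

for bits in (1, 2, 3):
    print(
        f"{bits} bit(s)/cell: {cell_levels(bits)} levels, "
        f"{level_spacing(WINDOW_VOLTS, bits):.2f} V between adjacent levels"
    )
```

Under these assumptions, going from 1 to 2 bits per cell cuts the spacing between adjacent levels to a third, and 3 bits per cell cuts it to a seventh. This shrinking margin is what makes cells more sensitive to noise and disturb effects.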
For SSDs, we may instead model LSEs according to the underlying mechanisms of NAND flash bit errors. Specifically, the model we are currently working on considers the following factors: