Intermittent hardware errors and recovery: modelling and evaluation

Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan
International Conference on Quantitative Evaluation of Systems (QEST)
September 2012

The frequency of hardware errors is increasing due to shrinking feature sizes, higher levels of integration, and increasing design complexity. Intermittent errors are those that occur non-deterministically at the same location. It has been shown that intermittent hardware errors account for about 39% of all hardware failures. Recovery from intermittent hardware errors has been a challenge because their characteristics differ from those of transient and permanent errors. In this paper, we evaluate the impact of different intermittent error recovery scenarios on processor performance. To achieve this, we model a system that consists of (1) a model of a fault-tolerant processor and (2) several models of intermittent hardware faults. Because the exact characteristics of intermittent faults are not well documented, our fault models are based on insights from related work at the physical level. We find that the frequency of the intermittent error and the relative importance of the error location play an important role in choosing the recovery action that maximizes the processor’s performance.

