How to find disk failures

There are a couple of ways to find disk errors or disk failures on Linux. Two of them are described below.

1. dmesg
dmesg ("display message" or "driver message") is a command that prints the kernel ring buffer. These messages contain valuable information about the device drivers loaded into the kernel at boot time, as well as about hardware that is connected to the system on the fly. In other words, dmesg gives us details about hardware being connected to or disconnected from a machine, and about any errors raised when a hardware driver is loaded into the kernel. These messages are helpful when diagnosing or debugging hardware and device driver issues.

dmesg -T

In the above command, "-T" makes dmesg print human-readable timestamps. These timestamps can be inaccurate if the system has been suspended and resumed.

[Sun Jan  7 20:37:48 2018] blk_update_request: critical medium error, dev sdh, sector 3766300328
[Sun Jan  7 20:37:54 2018] sd 0:0:7:0: [sdh] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Sun Jan  7 20:37:54 2018] sd 0:0:7:0: [sdh] tag#1 Sense Key : Medium Error [current]
[Sun Jan  7 20:37:54 2018] sd 0:0:7:0: [sdh] tag#1 Add. Sense: Read retries exhausted
[Sun Jan  7 20:37:54 2018] sd 0:0:7:0: [sdh] tag#1 CDB: Read(10) 28 00 e0 7d 2e a8 00 00 08 00
[Sun Jan  7 20:37:54 2018] blk_update_request: critical medium error, dev sdh, sector 3766300328
[Sun Jan  7 20:47:35 2018] sd 0:0:7:0: [sdh] tag#20 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Sun Jan  7 20:47:35 2018] sd 0:0:7:0: [sdh] tag#20 Sense Key : Medium Error [current]
[Sun Jan  7 20:47:35 2018] sd 0:0:7:0: [sdh] tag#20 Add. Sense: Read retries exhausted
[Sun Jan  7 20:47:35 2018] sd 0:0:7:0: [sdh] tag#20 CDB: Read(10) 28 00 e0 7d 2e a8 00 00 08 00
[Sun Jan  7 20:47:35 2018] blk_update_request: critical medium error, dev sdh, sector 3766300328
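On a busy system the ring buffer holds far more than disk messages, so it can help to count the block-layer error lines per device. The following is a minimal sketch, not a definitive tool: the function name count_disk_errors is hypothetical, and it assumes the "... error, dev <name>, sector ..." message format shown above, which other kernel versions may word differently.

```shell
# Hypothetical helper: reads dmesg-style text on stdin and prints
# "<count> <device>" for every device named in a block-layer error line.
count_disk_errors() {
    awk '
        / error, dev / {
            # The device name is the field right after the word "dev".
            for (i = 1; i < NF; i++) {
                if ($i == "dev") {
                    dev = $(i + 1)
                    sub(/,$/, "", dev)   # drop the trailing comma
                    count[dev]++
                    break
                }
            }
        }
        END { for (d in count) print count[d], d }
    ' | sort -rn
}

# Usage (reading the ring buffer may require root):
#   dmesg -T | count_disk_errors
```

The device with the highest count is listed first and is usually the one to inspect next.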

In the dmesg output above, drive "sdh" repeatedly reports a critical medium error: reads of sector 3766300328 fail with "Read retries exhausted". Now we will dig deeper to find out what is wrong with this drive.

smartctl -a /dev/sdh

Smartmontools is a set of utility programs to control and monitor computer storage systems using the Self-Monitoring, Analysis and Reporting Technology (SMART) system built into most modern ATA, Serial ATA, SCSI/SAS and NVMe drives.

In the above command, "-a" prints all SMART information about the disk.

[root@hostname ~]# smartctl -a /dev/sdh
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-327.el7.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               TOSHIBA
Product:              MG03SCA200
Revision:             DG09
User Capacity:        2,000,398,934,016 bytes [2.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x50000396a8c9a909
Serial number:        Y550A0XMFVM4
Device type:          disk
Transport protocol:   SAS
Local Time is:        Mon Jan  8 05:29:41 2018 GMT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: HARDWARE IMPENDING FAILURE DATA ERROR RATE TOO HIGH [asc=5d, ascq=12]

Current Drive Temperature:     25 C
Drive Trip Temperature:        65 C

Manufactured in week 45 of year 2015
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  374
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  22523
Elements in grown defect list: 2580

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0   512895    532266    512895   10347396     651641.519       19371
write:         0        5         5         5          5      45684.586           0
verify:        0        0         0         0          0         10.401           0

Non-medium error count:      451

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                  64       5                 - [-   -    -]
# 2  Reserved(7)       Completed                  64       3                 - [-   -    -]
Long (extended) Self Test duration: 18430 seconds [307.2 minutes]
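Besides the health status, this output carries a few counters worth tracking over time, such as the grown defect list size and the uncorrected error totals. The following is a minimal sketch for pulling them out. The function name smart_error_summary and the key names it prints are hypothetical, and it assumes the SCSI/SAS layout shown above (ATA drives print an attribute table instead).

```shell
# Hypothetical helper: reads SCSI/SAS-style `smartctl -a` output on stdin
# and prints selected counters as key=value lines, in input order.
smart_error_summary() {
    awk '
        /^Elements in grown defect list:/ { print "grown_defects=" $NF }
        /^(read|write|verify):/ {
            # The last column of the error counter log is "Total uncorrected errors".
            sub(/:$/, "", $1)
            print $1 "_uncorrected=" $NF
        }
        /^Non-medium error count:/ { print "non_medium_errors=" $NF }
    '
}

# Usage (smartctl normally needs root):
#   smartctl -a /dev/sdh | smart_error_summary
```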

In the above output of the "smartctl" command, the disk failure status is reported in the health status line, as shown below.

=== START OF READ SMART DATA SECTION ===
SMART Health Status: HARDWARE IMPENDING FAILURE DATA ERROR RATE TOO HIGH [asc=5d, ascq=12]
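This health check can be scripted so that a cron job or monitoring agent flags the drive automatically. The following is a minimal sketch under stated assumptions: the function name smart_health_check and its exit codes are hypothetical, and it only recognizes the two health line formats smartctl prints for SCSI/SAS ("SMART Health Status: OK") and ATA ("SMART overall-health self-assessment test result: PASSED") drives.

```shell
# Hypothetical helper: reads `smartctl -H` or `smartctl -a` output on stdin.
# Returns 0 if the drive reports healthy, 1 (printing the status) if it
# does not, and 2 if no health status line was found.
smart_health_check() {
    health=$(awk '
        !found && /^SMART (Health Status|overall-health self-assessment test result):/ {
            sub(/^[^:]*: */, "")   # keep only the text after the label
            print
            found = 1
        }
    ')
    case "$health" in
        OK|PASSED) return 0 ;;
        "")        printf '%s\n' "no SMART health status line found" >&2; return 2 ;;
        *)         printf '%s\n' "$health"; return 1 ;;
    esac
}

# Usage (smartctl normally needs root; -H prints only the health section):
#   smartctl -H /dev/sdh | smart_health_check || echo "check /dev/sdh"
```

For the drive above, the function would print the "HARDWARE IMPENDING FAILURE" status text and return 1.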
