Johann Schmitz

For the past half year I have had nasty breakdowns of my HDDs. I did a lot of debugging but didn't find a solution until today. I will briefly describe what happened and what I did. I have two Samsung HD403LJ drives in my box. They aren't the same age, so the firmware differs, but physically they should be the same. From time to time, at random, one of the two would stop working. The following error shows up:

[  719.004117] ata1: EH in SWNCQ mode,QC:qc_active 0x7FFFFFFF sactive 0x7FFFFFFF
[  719.004329] ata1: SWNCQ:qc_active 0x1 defer_bits 0x7FFFFFFE last_issue_tag 0x0
[  719.004331]   dhfis 0x0 dmafis 0x0 sdbfis 0x0
[  719.005861] ata1: ATA_REG 0x40 ERR_REG 0x0
[  719.007651] ata1: tag : dhfis dmafis sdbfis sacitve
[  719.007654] ata1: tag 0x0: 0 0 0 1  
[  719.007667] ata1.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x6 frozen
[  719.007673] ata1.00: cmd 61/48:00:cc:1d:e2/01:00:00:00:00/40 tag 0 ncq 167936 out
[  719.007675]          res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[  719.007678] ata1.00: status: { DRDY }
[  719.007682] ata1.00: cmd 61/c8:08:4c:1f:e2/01:00:00:00:00/40 tag 1 ncq 233472 out
[  719.007684]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  719.007686] ata1.00: status: { DRDY }
[  719.007690] ata1.00: cmd 61/08:10:4c:21:e2/00:00:00:00:00/40 tag 2 ncq 4096 out
[  719.007691]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  719.007694] ata1.00: status: { DRDY }
[  719.007698] ata1.00: cmd 61/08:18:dc:15:e0/00:00:00:00:00/40 tag 3 ncq 4096 out
[  719.007699]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  719.007702] ata1.00: status: { DRDY }
[  719.007706] ata1.00: cmd 61/c0:20:54:21:e2/01:00:00:00:00/40 tag 4 ncq 229376 out
[  719.007707]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  719.007709] ata1.00: status: { DRDY }

....

[  719.007882] ata1.00: cmd 61/a8:d8:ec:77:dc/03:00:00:00:00/40 tag 27 ncq 479232 out
[  719.007884]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  719.007886] ata1.00: status: { DRDY }
[  719.007890] ata1.00: cmd 61/30:e0:54:55:e2/00:00:00:00:00/40 tag 28 ncq 24576 out
[  719.007891]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  719.007894] ata1.00: status: { DRDY }
[  719.007898] ata1.00: cmd 61/80:e8:ac:7b:dc/00:00:00:00:00/40 tag 29 ncq 65536 out
[  719.007899]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  719.007901] ata1.00: status: { DRDY }
[  719.007905] ata1.00: cmd 61/60:f0:34:7c:dc/00:00:00:00:00/40 tag 30 ncq 49152 out
[  719.007907]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  719.007909] ata1.00: status: { DRDY }
[  719.320779] ata1: soft resetting link
[  719.473300] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[  724.496619] ata1.00: qc timeout (cmd 0x27)
[  724.496820] ata1.00: failed to read native max address (err_mask=0x4)
[  724.496987] ata1.00: HPA support seems broken, skipping HPA handling
[  724.500109] ata1.00: revalidation failed (errno=-5)
[  724.809942] ata1: soft resetting link
[  724.963301] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[  724.977604] ata1.00: configured for UDMA/133
[  724.977663] ata1: EH complete
[  754.976619] ata1: EH in SWNCQ mode,QC:qc_active 0x7FFFFFFF sactive 0x7FFFFFFF
[  754.976640] ata1: SWNCQ:qc_active 0x1 defer_bits 0x7FFFFFFE last_issue_tag 0x0
[  754.976642]   dhfis 0x0 dmafis 0x0 sdbfis 0x0
[  754.976659] ata1: ATA_REG 0x40 ERR_REG 0x0
[  754.976667] ata1: tag : dhfis dmafis sdbfis sacitve
[  754.976677] ata1: tag 0x0: 0 0 0 1  
[  754.976695] ata1.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x6 frozen
[  754.976710] ata1.00: cmd 61/60:00:34:7c:dc/00:00:00:00:00/40 tag 0 ncq 49152 out
[  754.976711]          res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[  754.976730] ata1.00: status: { DRDY }
[  754.976741] ata1.00: cmd 61/80:08:ac:7b:dc/00:00:00:00:00/40 tag 1 ncq 65536 out
[  754.976742]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  754.976760] ata1.00: status: { DRDY }
[  754.976770] ata1.00: cmd 61/30:10:54:55:e2/00:00:00:00:00/40 tag 2 ncq 24576 out
[  754.976772]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  754.976790] ata1.00: status: { DRDY }
[  754.976801] ata1.00: cmd 61/a8:18:ec:77:dc/03:00:00:00:00/40 tag 3 ncq 479232 out
[  754.976802]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  754.976820] ata1.00: status: { DRDY }
[  754.976831] ata1.00: cmd 61/e8:20:2c:75:dc/01:00:00:00:00/40 tag 4 ncq 249856 out
[  754.976832]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

And so on. I googled a lot but didn't find any solution. The problem was that I had no way to trigger this error on demand, so I couldn't do any real bug hunting. Interestingly, two days ago I ran

dd if=/dev/sda of=/dev/null
dd if=/dev/sdb of=/dev/null

in two windows of a screen session, and after around 20 seconds the disks crashed. From there I tried to reproduce it with different live CDs running different kernels, but without success. Only my latest working kernel (gentoo-sources-2.6.27-gentoo-r8) triggered it. So I started over: I took the kernel from the latest Gentoo minimal CD and, round by round, modified it towards the configuration I had running. Long story short (I built 11 different kernels: reboot, dd, reboot, build kernel, reboot...), it was the in-kernel IRQ balancing in combination with my nForce chipset. Disabling IRQ balancing dropped the throughput by 10%, but my disks are running stably again! (Apart from the fact that the internal speed of the two disks differs by 25%. So if anyone can give me new firmware for Samsung disks, please drop me a line.) Happy day!!!
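
For anyone who wants to try the same reproduction without juggling two screen windows, here is a minimal sketch of how the two parallel reads could be scripted. The function name `parallel_read` and the block size are my own choices for illustration, not anything I used back then; it simply starts one `dd` per argument in the background and waits for all of them.

```shell
#!/bin/sh
# Hypothetical helper (name and block size are assumptions):
# read every given device or file to /dev/null at the same time,
# so that all disks are under load simultaneously.
parallel_read() {
    pids=""
    for dev in "$@"; do
        # One background reader per device; dd's statistics are discarded.
        dd if="$dev" of=/dev/null bs=1048576 2>/dev/null &
        pids="$pids $!"
    done

    # Wait for every reader and remember whether any of them failed.
    status=0
    for pid in $pids; do
        wait "$pid" || status=1
    done
    return "$status"
}

# Example call (reads both disks completely, needs root):
# parallel_read /dev/sda /dev/sdb
```

While it runs, the kernel log in a second terminal should show whether the timeouts appear. A non-zero return value only says that at least one `dd` ended with an error; it does not tell you which disk dropped out.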