Odroid H3 - 2nd NVME in 8 weeks to break

nielsl1985
Posts: 5
Joined: Sat Feb 03, 2024 3:29 am
languages_spoken: english
ODROIDs: H3
Has thanked: 0
Been thanked: 0
Contact:

Post by nielsl1985 »

Hi,

I am running an H3 with 8GB of RAM and a WD SN570 1TB NVMe SSD. I installed Ubuntu 22.04 LTS in minimal mode and run a few Docker containers, all on the latest package versions. For the second time, and with a second brand-new SSD, the drive has dropped into read-only mode. I have added the output of smartctl --all /dev/nvme0n1 below. The disk has clearly failed, but it doesn't say why. It is not the temperature, since my H3 has a fan on top, and it is not the number of hours the disk has been powered on. I had 2 unsafe shutdowns, but that alone shouldn't kill the drive, and after the last unsafe shutdown it ran fine for 2 weeks.

With the first disk I thought I just had bad luck with the NVMe; with this second one, I think something in my NAS is causing it to break.

Does anybody know why this is happening?

Code:

root@apollonas:~# smartctl --all /dev/nvme0n1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-92-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       WD Blue SN570 1TB
Serial Number:                      23252X802342
Firmware Version:                   234110WD
PCI Vendor/Subsystem ID:            0x15b7
IEEE OUI Identifier:                0x001b44
Total NVM Capacity:                 1,000,204,886,016 [1.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      0
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,000,204,886,016 [1.00 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            001b44 8b4ae10114
Local Time is:                      Fri Feb  2 19:33:06 2024 CET
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x1e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
Maximum Data Transfer Size:         128 Pages
Warning  Comp. Temp. Threshold:     80 Celsius
Critical Comp. Temp. Threshold:     85 Celsius
Namespace 1 Features (0x02):        NA_Fields

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     4.20W    3.70W       -    0  0  0  0        0       0
 1 +     2.70W    2.30W       -    0  0  0  0        0       0
 2 +     1.90W    1.80W       -    0  0  0  0        0       0
 3 -   0.0250W       -        -    3  3  3  3     3900   11000
 4 -   0.0050W       -        -    4  4  4  4     5000   44000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- NVM subsystem reliability has been degraded
- media has been placed in read only mode

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x0c
Temperature:                        37 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    1%
Data Units Read:                    629,680 [322 GB]
Data Units Written:                 1,189,943 [609 GB]
Host Read Commands:                 4,346,292
Host Write Commands:                12,976,285
Controller Busy Time:               97
Power Cycles:                       7
Power On Hours:                     583
Unsafe Shutdowns:                   2
Media and Data Integrity Errors:    4,287
Error Information Log Entries:      4,288
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 256 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0       4288     2  0xb067  0xc502      -   1279382504     1     -
  1       4287     4  0xb0b4  0xc502      -    404922648     1     -
  2       4286     1  0x8263  0xc502      -      2295504     1     -
  3       4285     1  0x7262  0xc502      -      2295504     1     -
  4       4284     1  0x7260  0xc502      -    404922656     1     -
  5       4283     1  0x825e  0xc502      -    404922656     1     -
  6       4282     1  0x825c  0xc502      -    404922648     1     -
  7       4281     1  0x625a  0xc502      -    404922648     1     -
  8       4280     1  0x824f  0xc502      -   1279382512     1     -
  9       4279     1  0x824e  0xc502      -   1279382504     1     -
 10       4278     1  0x4258  0xc502      -   1264492592     1     -
 11       4277     1  0x5257  0xc502      -   1264492592     1     -
 12       4276     3  0xe114  0xc502      -   1260560672     1     -
 13       4275     3  0xe112  0xc502      -   1260560672     1     -
 14       4274     3  0xe110  0xc502      -   1260560672     1     -
 15       4273     1  0x3254  0xc502      -   1264492592     1     -
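
For anyone reading the hex values in the output above: the sketch below decodes the Critical Warning byte (0x0c) and the Status column of the error log (0xc502). The bit positions are my reading of the NVMe base specification, and the function names are made up for illustration, so treat it as a rough aid rather than a reference decoder.

```python
# Decode two raw hex fields from the smartctl output above.
# Bit layouts are assumptions based on the NVMe base specification.

CRITICAL_WARNING_BITS = {
    0: "available spare below threshold",
    1: "temperature outside threshold",
    2: "NVM subsystem reliability degraded",
    3: "media placed in read-only mode",
    4: "volatile memory backup device failed",
}


def decode_critical_warning(value: int) -> list:
    """Return the description of every bit set in the Critical Warning byte."""
    return [text for bit, text in CRITICAL_WARNING_BITS.items() if value & (1 << bit)]


def decode_status(raw: int) -> dict:
    """Split the 16-bit Status value printed in the error log.

    Assumes bit 0 is the phase tag and bits 15:1 hold the status field.
    """
    return {
        "phase": raw & 0x1,
        "status_code": (raw >> 1) & 0xFF,
        "status_code_type": (raw >> 9) & 0x7,
        "more": (raw >> 14) & 0x1,
        "do_not_retry": (raw >> 15) & 0x1,
    }


print(decode_critical_warning(0x0C))
print(decode_status(0xC502))
```

For 0x0c this gives the same two conditions smartctl prints under FAILED. For 0xc502 it gives status code type 2 with status code 0x81 and the do-not-retry bit set. If I read the specification correctly, that combination is an Unrecovered Read Error in the Media and Data Integrity Errors group, which fits the Media and Data Integrity Errors counter of 4,287.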

odroid
Site Admin
Posts: 42177
Joined: Fri Feb 22, 2013 11:14 pm
languages_spoken: English, Korean
ODROIDs: ODROID
Has thanked: 3609 times
Been thanked: 2004 times
Contact:

Re: Odroid H3 - 2nd NVME in 8 weeks to break

Post by odroid »

I have been using a Samsung PMA981 512GB NVMe for nearly 15 months since installing it in my H3.
After various tests, its Unsafe Shutdowns count has reached 151.
However, no critical problems have shown up in use so far.

Code:

smartctl --all /dev/nvme0n1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-6.5.0-14-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SAMSUNG MZVLB512HAJQ-00000
Serial Number:                      S3W8NX0N710098
Firmware Version:                   EXA7301Q
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 512,110,190,592 [512 GB]
Unallocated NVM Capacity:           0
Controller ID:                      4
NVMe Version:                       1.2
Number of Namespaces:               1
Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
Namespace 1 Utilization:            317,682,741,248 [317 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 8701b0fe50
Local Time is:                      Mon Feb  5 10:21:42 2024 KST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     81 Celsius
Critical Comp. Temp. Threshold:     82 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     7.02W       -        -    0  0  0  0        0       0
 1 +     6.30W       -        -    1  1  1  1        0       0
 2 +     3.50W       -        -    2  2  2  2        0       0
 3 -   0.0760W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        29 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    1%
Data Units Read:                    1,962,326,540 [1.00 PB]
Data Units Written:                 30,797,954 [15.7 TB]
Host Read Commands:                 2,787,566,892
Host Write Commands:                599,624,626
Controller Busy Time:               9,833
Power Cycles:                       483
Power On Hours:                     2,529
Unsafe Shutdowns:                   151
Media and Data Integrity Errors:    0
Error Information Log Entries:      880
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               29 Celsius
Temperature Sensor 2:               31 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

Have you used our official 15V/4A PSU with your H3?
Which file system do you use? I'm using plain EXT4.

Re: Odroid H3 - 2nd NVME in 8 weeks to break

Post by nielsl1985 »

Hi,

Yes, I use the official 15V/4A power supply, all ordered directly from Hardkernel, so that should be OK, I guess. The filesystem has indeed been EXT4, the default for Ubuntu 22.04 LTS.

Are there any BIOS settings I should take into account or validate? And are Western Digital NVMe drives supported? They're an A-brand like Samsung, so I wouldn't expect many issues.

Thanks
Niels

Re: Odroid H3 - 2nd NVME in 8 weeks to break

Post by odroid »

There are no special setting menus related to NVMe/PCIe in the BIOS.
We have also tested a few different WD NVMe models internally and have not encountered any problems.

How about running memtest86 for a day to check system stability?
Also try searching the Ubuntu system logs on your read-only device for clues about NVMe-related issues.
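
One possible way to do that log search, as a minimal sketch: save the kernel messages to a text file (for example with `journalctl -k -b -1 > kernel.log` for the previous boot, which only works if the journal is persistent) and filter for storage-related lines. The keyword list and the function name `filter_nvme_lines` are assumptions of this example, not something Ubuntu or Hardkernel prescribes.

```python
import re

# Keywords are a guess at what is worth looking at, not an exhaustive list.
NVME_PATTERN = re.compile(r"nvme|pcie|\baer\b|i/o error|read-only", re.IGNORECASE)


def filter_nvme_lines(lines):
    """Return only the lines that mention NVMe- or PCIe-related keywords."""
    return [line.rstrip("\n") for line in lines if NVME_PATTERN.search(line)]


def filter_log_file(path):
    """Apply the filter to a saved log file (the path is whatever you saved it as)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return filter_nvme_lines(handle)
```

Calling `filter_log_file("kernel.log")` returns the matching lines in their original order, so the first hit before the remount to read-only is usually the most interesting one to post here.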

Re: Odroid H3 - 2nd NVME in 8 weeks to break

Post by nielsl1985 »

Sure, I will do that over the weekend and come back to you.

Re: Odroid H3 - 2nd NVME in 8 weeks to break

Post by nielsl1985 »

Memtest ran for a day and found nothing wrong. dmesg didn't show anything from the last time the system booted in RW mode. I installed Ubuntu completely out of the box. The first time my SSD failed, it held an LVM volume with EXT4; the second SSD was set up without LVM.

I'm using the latest BIOS version.

Any other suggestions for what I can check?

Attachment: 20240210_181636.jpg (467.92 KiB)

Re: Odroid H3 - 2nd NVME in 8 weeks to break

Post by odroid »

Your system looks quite stable. Mine has no problems in actual use, yet it showed 2 to 3 bit errors when I ran memtest for 24 hours.
But I still don't know why your second SSD also died after two months.

Check whether the ASPM value in BIOS settings -> Chipset -> PCI Express Configurations -> PCI Express Root Port 5 is set to Disabled.
For stability/compatibility with some SSDs, the default value was set to Disabled. However, if it is set to Auto, please try changing it to "Disabled".
If it's already set to "Disabled", I honestly don't think there's anything left for you to try.
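
ASPM is Active State Power Management, the PCIe feature that puts an idle link into a low-power state. Besides the BIOS menu, the Linux side can be checked from the running system. The sketch below, under the assumption that the kernel exposes `/sys/module/pcie_aspm/parameters/policy`, reads which ASPM policy is active; the helper names are hypothetical.

```python
from pathlib import Path

# Usual sysfs location on Linux kernels built with PCIe ASPM support (assumed present).
ASPM_POLICY_PATH = Path("/sys/module/pcie_aspm/parameters/policy")


def parse_aspm_policy(text: str) -> str:
    """Return the active policy, which the kernel marks with square brackets."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    raise ValueError(f"no active policy marked in {text!r}")


def read_aspm_policy(path: Path = ASPM_POLICY_PATH) -> str:
    """Read and parse the policy file; raises OSError if the file is absent."""
    return parse_aspm_policy(path.read_text())
```

Running `print(read_aspm_policy())` on the H3 shows the kernel policy. A result of `default` means the kernel keeps whatever the BIOS configured, so the BIOS setting above is what decides. The per-link state can also be seen in the LnkCtl lines of `lspci -vv`.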

GioStyle
Posts: 3
Joined: Fri Dec 16, 2022 3:50 am
languages_spoken: english
ODROIDs: Odroid H2+, Odroid H3 (3x)
Has thanked: 1 time
Been thanked: 0
Contact:

Re: Odroid H3 - 2nd NVME in 8 weeks to break

Post by GioStyle »

Any update on this issue? I have the same problem. It has already happened to me three times. I now have two replacement 2TB SN570s and one 2TB SN580, but I'm afraid to use those SSDs with Odroid H3s.

Other types and brands of NVMe SSD all work fine with the Odroid H2+ or H3, but specifically the Western Digital SN570 and SN580 have issues with the H3. Over time something triggers the SSD to go into read-only mode.

Re: Odroid H3 - 2nd NVME in 8 weeks to break

Post by nielsl1985 »

It was indeed set to automatic. I have now put it on disabled for port 5. Can you explain what this setting does and how it could affect this issue?

@GioStyle: I have also had this issue with a WD Green SN350. That was the first SSD that went into R/O mode after a few weeks.

Re: Odroid H3 - 2nd NVME in 8 weeks to break

Post by odroid »

nielsl1985 wrote:
Mon Feb 19, 2024 1:49 am
It was indeed set to automatic. I have now put it on disabled for port 5. Can you explain what this setting does and how it could affect this issue?
I have no hard evidence for this, but I have heard that some NVMe drives have firmware bugs in their ASPM handling and must be run with ASPM disabled.

Re: Odroid H3 - 2nd NVME in 8 weeks to break

Post by GioStyle »

odroid wrote:
Mon Feb 19, 2024 10:35 am
nielsl1985 wrote:
Mon Feb 19, 2024 1:49 am
It was indeed set to automatic. I have now put it on disabled for port 5. Can you explain what this setting does and how it could affect this issue?
I have no hard evidence for this, but I have heard that some NVMe drives have firmware bugs in their ASPM handling and must be run with ASPM disabled.
Interesting, but unfortunate. I use my H3s as small home servers, and I prefer those PCs to use as little power as possible.

xnd
Posts: 83
Joined: Sun Dec 04, 2022 7:48 pm
languages_spoken: english, czech
ODROIDs: H3
Location: Slovakia
Has thanked: 33 times
Been thanked: 27 times
Contact:

Re: Odroid H3 - 2nd NVME in 8 weeks to break

Post by xnd »

In my H3 I used a 1TB Samsung 980 Pro, and I have now switched to the 2TB version. I chose the Pro again specifically because of its great power management and very low idle power consumption.

Code:

root@unr:~# smartctl --all /dev/nvme0
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.1.74-Unraid] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 980 PRO 2TB
Serial Number:                      S69ENF0WB19769D
Firmware Version:                   5B2QGXA7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 2,000,398,934,016 [2.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      6
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,000,398,934,016 [2.00 TB]
Namespace 1 Utilization:            8,655,663,104 [8.65 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 bb31b2d70f
Local Time is:                      Wed Feb 21 15:36:51 2024 CET
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0057):     Comp Wr_Unc DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x0f):         S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size:         128 Pages
Warning  Comp. Temp. Threshold:     82 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     8.49W       -        -    0  0  0  0        0       0
 1 +     4.48W       -        -    1  1  1  1        0     200
 2 +     3.18W       -        -    2  2  2  2        0    1000
 3 -   0.0400W       -        -    3  3  3  3     2000    1200
 4 -   0.0050W       -        -    4  4  4  4      500    9500

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        36 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    24,043 [12.3 GB]
Data Units Written:                 37,874 [19.3 GB]
Host Read Commands:                 239,620
Host Write Commands:                311,782
Controller Busy Time:               19
Power Cycles:                       40
Power On Hours:                     7
Unsafe Shutdowns:                   18
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               36 Celsius
Temperature Sensor 2:               35 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
Num  Test_Description  Status                       Power_on_Hours  Failing_LBA  NSID Seg SCT Code
 0   Extended          Completed without error                   4            -     -   -   -    -
 1   Short             Completed without error                   3            -     -   -   -    -
 2   Short             Completed without error                   3            -     -   -   -    -
 3   Extended          Completed without error                   2            -     -   -   -    -
 4   Short             Completed without error                   2            -     -   -   -    -

- Hmm, but Power On Hours: 7? 7 hours doesn't seem right, it should be 20+ :roll:
proud owner of Odroid H3 ( + 48 GB RAM: 16GB+32GB Crucial CT32G4SFD832A & Samsung SSD 980 PRO 1TB M.2; OS: DietPi - Debian 12; idle power consumption: ~ 1.1—2W)

fvolk
Posts: 899
Joined: Sun Jun 05, 2016 11:04 pm
languages_spoken: english
ODROIDs: C4, H3, M1S
Has thanked: 0
Been thanked: 142 times
Contact:

Re: Odroid H3 - 2nd NVME in 8 weeks to break

Post by fvolk »

xnd wrote:
Wed Feb 21, 2024 11:41 pm
- Hmm, but Power On Hours: 7? 7 hours doesn't seem right, it should be 20+ :roll:
When NVMe drives sleep, that time doesn't count as powered on.

Re: Odroid H3 - 2nd NVME in 8 weeks to break

Post by xnd »

Oh, I didn't know that, but I had started to suspect it. Thanks.
