[Odroid HC2 - Ubuntu 20.04 - 5.4] Seemingly random unavailability/crashing

barelycompetent
Posts: 15
Joined: Sat Jun 26, 2021 12:05 am
languages_spoken: English, Dutch
Has thanked: 13 times
Been thanked: 0

[Odroid HC2 - Ubuntu 20.04 - 5.4] Seemingly random unavailability/crashing

Post by barelycompetent »

Hi everyone,

I come to you in a time of great need. I was hoping someone could help me out or point me in the right direction, if at all possible.
Any input is valuable and very much appreciated!

To give some backstory, we have hundreds of Odroid HC2 units currently in the field functioning as small network video recorders.
Unfortunately, we've recently been having issues with units seemingly becoming unresponsive and not automatically powering back up.
Oddly enough, only a fraction of the units is experiencing the issue and it has been impossible to reproduce.
At this moment we have at least 7 sites with identical problems, which is making things increasingly problematic.

Despite many hours of Google-fu, trial and error, and long nights of reading through log files at the office, I see no other option but to turn to those more knowledgeable than myself.

Problem description:
  • Seemingly at random, the unit loses connection on every observable level - unable to ping, ssh, ...
  • Only the red power LED will be lit up when inspected.
  • Power cycle seems to temporarily bring the unit back online, but the issue has a tendency to repeat itself within 24-48 hours.
  • Nothing in the system logs immediately points me to a possible cause but I am by no means an expert and thus could be missing something.
  • When the unit comes back online after a power cycle, it seems that it did not record any video during the period it was unreachable.
Attempted solutions:
  • When we tried swapping affected units along with their PSU, the issue still seemed to persist.
  • We tried switching from the official power supply to a slightly more powerful one (2A to 3A), but this didn't appear to help.
  • I wrote a simple script to log the system load and core temperatures every 5 minutes, but the units don't appear to be overheating when the issue presents itself.
  • I've attempted disabling the hibernation and suspend settings in /etc/systemd/logind.conf.d/ and /etc/systemd/sleep.conf.d/ without any result - see attachments nosleep.conf and altlogind.conf.
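
For reference, a minimal sketch of the kind of load and temperature logger described above. This is not the script that runs on the units. The function name `log_health_line` is hypothetical, and it assumes the kernel exposes temperatures in millidegrees Celsius under /sys/class/thermal/thermal_zone*/temp and load averages in /proc/loadavg.

```shell
# Minimal sketch of a periodic health logger (hypothetical helper, not the
# script actually deployed on the units).
# Assumption: each thermal_zone*/temp file holds millidegrees Celsius.

# Print one log line: UTC timestamp, 1-minute load, then each zone temperature.
# $1 = loadavg file (normally /proc/loadavg)
# $2 = directory containing thermal_zone*/temp (normally /sys/class/thermal)
log_health_line() {
    loadavg_file="$1"
    thermal_dir="$2"
    # The first field of /proc/loadavg is the 1-minute load average.
    load1="$(cut -d ' ' -f 1 "$loadavg_file")"
    line="$(date -u +%Y-%m-%dT%H:%M:%SZ) load1=$load1"
    for zone in "$thermal_dir"/thermal_zone*/temp; do
        # Skip the unexpanded pattern if no thermal zone exists.
        [ -r "$zone" ] || continue
        millideg="$(cat "$zone")"
        name="$(basename "$(dirname "$zone")")"
        line="$line $name=$((millideg / 1000))C"
    done
    printf '%s\n' "$line"
}
```

It could be called from cron (for example with a `*/5 * * * *` schedule) as `log_health_line /proc/loadavg /sys/class/thermal >> /var/log/hc2-health.log`, where the log path is only an example.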


Other observations:
  • Issue occurs both in dynamic and static IP-configurations.
  • On one occasion we were able to use a serial connection to gain access to one such unit without power cycling it. Not much was learned from this, other than the fact that the unit doesn't seem to actually be dead.
    We've distributed USB-to-UART cables to some of the affected sites, but to date only one site has been able to give me access through a serial connection.
    For some reason, the other sites that we've sent these to haven't had the issue appear anymore.
    Due to the single unit sample size, I recognize that this might have been an unrelated issue, but I thought I'd mention it regardless.
  • In one instance, a unit stopped experiencing the issue for about 2 months, only for it to return with a vengeance afterwards.
  • We can't establish any environmental commonalities between the locations that might cause the issue (yet).
  • We've asked for some of these units to be returned to our office for testing, but we are unable to reproduce the issue even though we run them exactly as they would run at their original sites.


System information:
Distribution: Ubuntu Server 20.04, without desktop environment
Kernel version: 5.4.58-215
SD card: Samsung EVO Plus 32GB (MB-MC32G)
HDD: Toshiba S300 (Pro)/WD Purple, capacity generally ranging between 1TB and 8TB

Important manually installed packages:
  1. webmin - Web-based system administration panel
  2. networkoptix-mediaserver - Video management software
Additional information:
  • These systems run a script on startup that checks for a storage drive, partitions it and copies the root folder to one of two partitions on the attached (SATA) storage drive. After this, the system runs off said partition on the storage drive, effectively making the SD card a tool for system recovery.
  • The units are shipped with a 3D-printed PLA cover, printed from the model files that Hardkernel provides on their website.
  • We currently have a few pending information requests for the affected sites, requesting more information about the conditions in which the units are being used as well as the information necessary for us to determine affected batches. Once this is available, I'll make sure to edit the post to reflect the new information gained.
  • Please find attached some log files from the latest unit to experience the issue (kern.log, kern.log.1, syslog, syslog.1, syslog.2, dmesg and dmesg.0).
Bonus points:
These are just some of my speculations on what could cause the issue.
Unfortunately I lack the expertise required to really understand parts of the subjects involved, making it nothing more than sheer, hollow speculation.
  • My first thought was that it could be a power management/suspend/hibernation issue as the observed behavior could be interpreted as the machine being unable to wake from suspending - explaining the lack of LED activity aside from the static power LED and why the machine doesn't respond to network interactions, but seeing as my changes to sleep.conf and logind.conf haven't resolved anything it would appear that this isn't the case. Perhaps there's another power management module that I am not aware of that may cause this?
  • At some point one of the people on-site suggested it could have something to do with a difference in power phases, which would explain the issue occurring at some sites even when the unit (and its PSU) was replaced. However, I am not far from a complete dummy when it comes to this subject, so I can't say whether or not this could have a significant impact.
  • I recently noticed that, despite using networkd as the network renderer (through netplan), NetworkManager is also active on the system. I thought this a bit weird, as I'm used to NetworkManager only being installed by default on Ubuntu Desktop, and I don't recall manually installing it on the images we use. This leads me to believe that it ships with the distro's image. More importantly, I'm not sure whether this could conflict with networkd in any way.
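
As a rough sketch of how the renderer question could be checked: the helper below lists the `renderer:` value declared in each netplan file. The function name `netplan_renderers` is hypothetical, and it assumes the configuration lives in *.yaml files under /etc/netplan. When no renderer key is present, netplan falls back to networkd.

```shell
# List the renderer declared in each netplan file (hypothetical helper).
# $1 = netplan directory (normally /etc/netplan)
netplan_renderers() {
    for f in "$1"/*.yaml; do
        # Skip the unexpanded pattern if the directory has no YAML files.
        [ -r "$f" ] || continue
        # Match lines such as "  renderer: networkd" and print the value.
        awk -v file="$f" '$1 == "renderer:" { print file ": " $2 }' "$f"
    done
}
```

Running `netplan_renderers /etc/netplan` and comparing the result with `systemctl is-active NetworkManager systemd-networkd` should show whether both services are running while only one is configured as the renderer.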
If I can provide any more information that could be of use, be sure to let me know and I'll get on it as soon as possible. And if you've made it this far into the post: my sincere appreciation for taking a chunk out of your time to consider helping me out!

With kind regards,

:ugeek: barelycompetent
Attachments
altlogind.conf.txt (139 Bytes)
nosleep.conf.txt (93 Bytes)
syslog.txt (1.57 MiB)
syslog.2.txt (3.63 MiB)
syslog.1.txt (1.86 MiB)
kern.log.txt (38.95 KiB)
kern.log.1.txt (78.69 KiB)
dmesg.txt (32.46 KiB)
dmesg.0.txt (33.5 KiB)

odroid
Site Admin
Posts: 38033
Joined: Fri Feb 22, 2013 11:14 pm
languages_spoken: English, Korean
ODROIDs: ODROID
Has thanked: 1999 times
Been thanked: 1206 times

Re: [Odroid HC2 - Ubuntu 20.04 - 5.4] Seemingly random unavailability/crashing

Post by odroid »

7 out of hundreds of units seems to be a high failure rate. :o

We have a few ODROID-HC1/HC2 units running in our office, and their uptime is more than one year.
But we've only used them internally as SMB/NFS servers.

I've looked into the log files but I couldn't find any clue.
If the blue LED doesn't blink, the kernel has completely crashed or the system shut down for an unknown reason.

I have a couple of questions.
What is the longest uptime among your hundreds of ODROID-HC2 units?
What is the maximum/average CPU temperature in the log captured by your own script?
Can you also monitor the amount of available RAM by editing your script? I've heard that a few users reported out-of-memory situations that could crash the kernel due to an imperfect low-memory-kill function.
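
A minimal sketch of what such a memory check might look like, assuming the standard MemTotal and MemAvailable fields of /proc/meminfo. The function names are hypothetical and not part of any existing script in this thread.

```shell
# Print MemAvailable in KiB from a meminfo-style file (hypothetical helper).
# $1 = path to the file (normally /proc/meminfo)
mem_available_kib() {
    awk '$1 == "MemAvailable:" { print $2 }' "$1"
}

# Print available memory as a whole percentage of total memory.
# $1 = path to the file (normally /proc/meminfo)
mem_available_percent() {
    awk '$1 == "MemTotal:"     { total = $2 }
         $1 == "MemAvailable:" { avail = $2 }
         END { if (total > 0) printf "%d\n", avail * 100 / total }' "$1"
}
```

Logging `mem_available_percent /proc/meminfo` alongside the temperature readings would show whether available memory trends downwards before a unit becomes unreachable.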
These users thanked the author odroid for the post:
barelycompetent (Wed Aug 04, 2021 5:35 pm)

rooted
Posts: 8723
Joined: Fri Dec 19, 2014 9:12 am
languages_spoken: english
Location: Gulf of Mexico, US
Has thanked: 743 times
Been thanked: 375 times

Re: [Odroid HC2 - Ubuntu 20.04 - 5.4] Seemingly random unavailability/crashing

Post by rooted »

You may want to try a small UPS at one of the affected areas as a test, perhaps it's something related to "dirty power".

https://blog.chron.com/techblog/2011/05 ... rty-power/
These users thanked the author rooted for the post:
barelycompetent (Wed Aug 04, 2021 5:35 pm)

Re: [Odroid HC2 - Ubuntu 20.04 - 5.4] Seemingly random unavailability/crashing

Post by barelycompetent »

odroid wrote:
Wed Aug 04, 2021 9:47 am
7 out of hundreds of units seems to be a high failure rate. :o

We have a few ODROID-HC1/HC2 units running in our office, and their uptime is more than one year.
But we've only used them internally as SMB/NFS servers.

I've looked into the log files but I couldn't find any clue.
If the blue LED doesn't blink, the kernel has completely crashed or the system shut down for an unknown reason.

I have a couple of questions.
Hi there, thank you for the swift response!
Considering we've rolled out around 600 units up to this point, I'd say it's relatively limited but definitely far from ideal.
What is the longest uptime among your hundreds of ODROID-HC2 units?
Unfortunately, it's hard to say what the exact longest uptime is on these but most of them don't experience any issues.
We've had units run without issue for weeks on end at our office when testing, and we have been rolling out these devices continuously since May of 2020.
(In the last months of 2020, we switched from 18.04 to 20.04 but I can't say for certain if another kernel version was used under 18.04)
What is the maximum/average CPU temperature in the log captured by your own script?
I don't seem to have the logs of that script saved (I tried this some time ago), but I seem to recall that the maximum core temperature was around 75-80°C.
If I recall correctly, a thread on the forums here mentioned that this kernel makes the CPU start throttling at 80°C, but to my knowledge that shouldn't cause a crash.
This was on a system with what we consider to be the supported maximum amount of IP cameras, being four times 1080p@30fps.
I'd expect that they heat up significantly during the processing of eight video streams, of course - but again, this doesn't seem to happen when we test the units at the office with even higher resolution streams (4000x3000, 2688x1920, ...) to try and put them under a significant amount of load.
Can you also monitor the amount of available RAM by editing your script?
The VMS does something similar every 30 minutes where it logs system load parameters.
I've attached the filtered output of the VMS log that shows CPU, memory, disk and network usage in %.

In the meantime I'll start running an updated variant of the script to gather temperature data as well as other values and post results here once a decent sample size is available.
I take it data from a day or two should suffice?

If it helps, I could send you the image that we use through a private message so that you can flash it to one of your units and have a look?
This is pretty much just a preconfigured system image that we use for easy rollout.
rooted wrote:
Wed Aug 04, 2021 2:23 pm
You may want to try a small UPS at one of the affected areas as a test, perhaps it's something related to "dirty power".

https://blog.chron.com/techblog/2011/05 ... rty-power/
Hi, thank you very much for your input!

I'm afraid I forgot to mention this, but a few sites have tried putting these on UPSes and this unfortunately didn't improve things.
It seems the blog you linked can't be displayed in my region, but I googled the phenomenon you mentioned and it does seem like it would explain a few of the things we're encountering.
That said, my colleagues have assured me that this is unlikely to be an issue in Belgium (where most systems are active), but we could be mistaken.
Might still be something to have a look at, though!
Attachments
log_file.log (398.61 KiB)

Re: [Odroid HC2 - Ubuntu 20.04 - 5.4] Seemingly random unavailability/crashing

Post by odroid »

Thank you for the clarification and log file.
According to the log file, CPU and memory usage seem to be considerably lower than I guessed.
In the past 15 days (350 hours approx.), the memory usage has been only around 35% in steady state.
Therefore, I don't think a memory leak or an out-of-memory condition is the root cause.

One or two days of temperature log file should be sufficient.
I think it will be hard to reproduce the issue on our side even if you share an OS image, because we don't have any IP cameras to build a comparable test environment.

BTW, could the watchdog timer (WDT) be a workaround? Once the kernel crashes, the system will reboot automatically.
https://wiki.odroid.com/odroid-xu4/appl ... x_watchdog
After enabling the WDT, you need to trigger a test kernel panic with this command to verify the WDT functionality.
echo c > /proc/sysrq-trigger
As far as I remember, we've not tested the WDT feature on kernel 5.4 yet, but it should probably work.
If it doesn't work, we will try implementing the WDT driver for kernel 5.4 soon.
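
Before triggering the test panic, it may be worth confirming that a watchdog device is registered at all. The sketch below is one possible check, not the wiki's procedure. The function name `watchdog_status` is hypothetical, and the `identity` and `timeout` attributes are assumed to be present only when the kernel was built with CONFIG_WATCHDOG_SYSFS.

```shell
# Report registered hardware watchdogs (hypothetical helper).
# $1 = sysfs class directory (normally /sys/class/watchdog)
# Returns 0 if at least one watchdog device was found, non-zero otherwise.
watchdog_status() {
    class_dir="$1"
    found=0
    for wd in "$class_dir"/watchdog*; do
        # Skip the unexpanded pattern if no watchdog is registered.
        [ -d "$wd" ] || continue
        found=1
        # Assumption: these attributes need CONFIG_WATCHDOG_SYSFS;
        # report "unknown" when they are missing.
        identity="unknown"
        timeout="unknown"
        [ -r "$wd/identity" ] && identity="$(cat "$wd/identity")"
        [ -r "$wd/timeout" ] && timeout="$(cat "$wd/timeout")"
        printf '%s identity=%s timeout_s=%s\n' "$(basename "$wd")" "$identity" "$timeout"
    done
    [ "$found" -eq 1 ]
}
```

If `watchdog_status /sys/class/watchdog` prints nothing and returns non-zero, the reboot after `echo c > /proc/sysrq-trigger` should not be expected to happen.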

I also have doubts about the power source stability, as @rooted mentioned.
Other than that, you might want to try an old Ubuntu 18.04 + kernel 4.14 image (what you used last year) to narrow down the root cause.
These users thanked the author odroid for the post:
barelycompetent (Wed Aug 04, 2021 10:07 pm)

Re: [Odroid HC2 - Ubuntu 20.04 - 5.4] Seemingly random unavailability/crashing

Post by barelycompetent »

odroid wrote:
Wed Aug 04, 2021 6:39 pm
Thank you for the clarification and log file.
According to the log file, CPU and memory usage seem to be considerably lower than I guessed.
In the past 15 days (350 hours approx.), the memory usage has been only around 35% in steady state.
Therefore, I don't think a memory leak or an out-of-memory condition is the root cause.

One or two days of temperature log file should be sufficient.
I think it will be hard to reproduce the issue on our side even if you share an OS image, because we don't have any IP cameras to build a comparable test environment.

BTW, could the watchdog timer (WDT) be a workaround? Once the kernel crashes, the system will reboot automatically.
https://wiki.odroid.com/odroid-xu4/appl ... x_watchdog
After enabling the WDT, you need to trigger a test kernel panic with this command to verify the WDT functionality.
echo c > /proc/sysrq-trigger
As far as I remember, we've not tested the WDT feature on kernel 5.4 yet, but it should probably work.
If it doesn't work, we will try implementing the WDT driver for kernel 5.4 soon.

I also have doubts about the power source stability, as @rooted mentioned.
Other than that, you might want to try an old Ubuntu 18.04 + kernel 4.14 image (what you used last year) to narrow down the root cause.
Thank you for suggesting the WDT workaround! I've been able to confirm that it works by triggering a kernel panic as instructed.
It comes back online without intervention, which is already a really useful way of mitigating the issue.
I hope I can catch this unit triggering WDT soon so that we have full confirmation that it works under the desired circumstances.

The adjusted script is now also running every minute using a simple cron job. So I'll get back to you with the logfile either tomorrow or this weekend.

I'll create a configuration based on the 18.04 (4.14) image as you suggested and provide it to the next site that has the issue to see if it provides a solution.

That all said, the fact that you don't have any IP-cameras on-hand isn't necessarily an issue.
If there's an interest in trying to replicate the configuration, I can provide a utility for creating simulated IP cameras on your local network (the source would be a video file).
This would make it so that you can simulate a similar system load to what most of our units would be experiencing.

I'll make sure to follow up once more information becomes available. Again, thank you very much for your time!

Re: [Odroid HC2 - Ubuntu 20.04 - 5.4] Seemingly random unavailability/crashing

Post by odroid »

I see. We will try to set up a test bed early next week, since it is already Friday here in Korea.
Meanwhile, please prepare the IP camera emulator utility and simple installation instructions.
These users thanked the author odroid for the post:
barelycompetent (Mon Aug 09, 2021 7:18 pm)

mad_ady
Posts: 9689
Joined: Wed Jul 15, 2015 5:00 pm
languages_spoken: english
ODROIDs: XU4, C1+, C2, C4, N1, N2, H2, Go, Go Advance
Location: Bucharest, Romania
Has thanked: 609 times
Been thanked: 721 times

Re: [Odroid HC2 - Ubuntu 20.04 - 5.4] Seemingly random unavailability/crashing

Post by mad_ady »

Some more suggestions:
- consider installing a monitoring tool on all devices. I recommend going with grafana+prometheus on a central server + node_exporter on the hc2s. You can deploy it through ansible, for instance. This would allow you to monitor cpu/memory/temperature and other parameters for all devices. You can get detailed information - such as kernel slab usage - which was an issue for me on a different board.
- if possible, try enabling netconsole for problematic units and send the logs to a centralized syslog location. That would send you kernel messages as the unit dies of a kernel panic. Test it with the kernel crash command above. Here are some details: https://magazine.odroid.com/wp-content/ ... pdf#page=8. You will need to use the collector's IP (most likely it needs to be the public ip, not sure if it can send those messages over a vpn terminated on the hc2), and the mac address of your router.
- the watchdog can run an external script that can do health check or log extra info. Here's mine, but may need tweaking: https://paste.ubuntu.com/p/nPp5yYS7jD/

You might get a post-mortem, or trigger the watchdog when something unexpected happens (like rootfs remounts readonly)
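
For the netconsole suggestion above, a small sketch of how the module parameter string could be assembled. The layout follows the kernel's documented `netconsole=src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac` form with its default ports 6665 and 6666. The function name `netconsole_param` and all addresses in the usage line are placeholders, not values from this setup.

```shell
# Build a netconsole module parameter string (hypothetical helper).
# $1 = source IP of the HC2      $2 = network device (e.g. eth0)
# $3 = IP of the log collector   $4 = MAC of the collector or next-hop router
# $5 = source port (optional, default 6665)
# $6 = target port (optional, default 6666)
netconsole_param() {
    src_ip="$1"
    dev="$2"
    tgt_ip="$3"
    tgt_mac="$4"
    src_port="${5:-6665}"
    tgt_port="${6:-6666}"
    printf 'netconsole=%s@%s/%s,%s@%s/%s\n' \
        "$src_port" "$src_ip" "$dev" "$tgt_port" "$tgt_ip" "$tgt_mac"
}
```

The result could then be passed to the module, for example `modprobe netconsole "$(netconsole_param 192.0.2.10 eth0 192.0.2.1 00:11:22:33:44:55)"`, with a UDP listener on the collector receiving the kernel messages.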

Good luck!
These users thanked the author mad_ady for the post:
barelycompetent (Mon Aug 09, 2021 7:18 pm)

Re: [Odroid HC2 - Ubuntu 20.04 - 5.4] Seemingly random unavailability/crashing

Post by barelycompetent »

odroid wrote:
Fri Aug 06, 2021 9:38 am
I see. We will try to set up a test bed early next week, since it is already Friday here in Korea.
Meanwhile, please prepare the IP camera emulator utility and simple installation instructions.
Not to worry, I've just sent an e-mail containing all relevant information - as I can't send a private message through the forums.
Unfortunately, I can't share some of this information publicly.
mad_ady wrote:
Fri Aug 06, 2021 2:39 pm
Some more suggestions:
- consider installing a monitoring tool on all devices. I recommend going with grafana+prometheus on a central server + node_exporter on the hc2s. You can deploy it through ansible, for instance. This would allow you to monitor cpu/memory/temperature and other parameters for all devices. You can get detailed information - such as kernel slab usage - which was an issue for me on a different board.
- if possible, try enabling netconsole for problematic units and send the logs to a centralized syslog location. That would send you kernel messages as the unit dies of a kernel panic. Test it with the kernel crash command above. Here are some details: https://magazine.odroid.com/wp-content/ ... pdf#page=8. You will need to use the collector's IP (most likely it needs to be the public ip, not sure if it can send those messages over a vpn terminated on the hc2), and the mac address of your router.
- the watchdog can run an external script that can do health check or log extra info. Here's mine, but may need tweaking: https://paste.ubuntu.com/p/nPp5yYS7jD/

You might get a post-mortem, or trigger the watchdog when something unexpected happens (like rootfs remounts readonly)

Good luck!
Thank you for your suggestions!
Unfortunately, a centralized monitoring tool isn't something we can provide these machines with, at least not in every case.
Seeing as we don't own most of these sites, and not every site has an active internet connection, we're quite limited in this regard.
I'll see if there's a way we can set up something akin to this, but I'm not too optimistic as there's a lot of red tape.

Re: [Odroid HC2 - Ubuntu 20.04 - 5.4] Seemingly random unavailability/crashing

Post by odroid »

You need to write 5 posts to enable the PM feature. It is a countermeasure against PM spammers on this forum.
Anyway, we've got your email, and we will use an HC2 and an XU4 (for camera emulation).
If we have any issues while following your instructions, we will contact you first.
These users thanked the author odroid for the post:
barelycompetent (Wed Aug 11, 2021 3:53 pm)

Re: [Odroid HC2 - Ubuntu 20.04 - 5.4] Seemingly random unavailability/crashing

Post by barelycompetent »

odroid wrote:
Tue Aug 10, 2021 9:40 am
You need to write 5 posts to enable the PM feature. It is a countermeasure against PM spammers on this forum.
Anyway, we've got your email, and we will use an HC2 and an XU4 (for camera emulation).
If we have any issues while following your instructions, we will contact you first.
Ah, I suspected that'd be a potential reason.
Thank you very much for taking the time to look into this!

Also, I've attached the temperature logs, as requested.
It doesn't seem like this unit is suffering from high temperatures, as far as I can see.
Attachments
opt.log (1.04 MiB)

Re: [Odroid HC2 - Ubuntu 20.04 - 5.4] Seemingly random unavailability/crashing

Post by odroid »

Thank you for the temperature log. The temperature has stayed below 60°C.
I think the CPU temperature is not the root cause, since it is very far from the critical point of 85°C.

BTW, the instructions look more complicated than we expected.
I hope we can start a long term stability test before this weekend.
These users thanked the author odroid for the post:
barelycompetent (Wed Aug 11, 2021 3:53 pm)

Re: [Odroid HC2 - Ubuntu 20.04 - 5.4] Seemingly random unavailability/crashing

Post by barelycompetent »

odroid wrote:
Wed Aug 11, 2021 10:21 am
Thank you for the temperature log. The temperature has stayed below 60°C.
I think the CPU temperature is not the root cause, since it is very far from the critical point of 85°C.

BTW, the instructions look more complicated than we expected.
I hope we can start a long term stability test before this weekend.
Yeah, it does seem like it's not getting too toasty for comfort, so that's something!

And the instructions are pretty detailed, but I personally think it's quite straightforward once you get to it.
I like to describe the steps in detail, just in case it's not immediately clear. The client software should have localization options for Korean, if that helps.
Again, if something isn't clear or if I can help with the setup, I really don't mind lending some support to speed things up for you folks (least I can do).
I'm pretty much always available at the office from 8:30 AM to 6:00 PM CEST, usually beyond that point as well.

mctom
Posts: 438
Joined: Wed Nov 11, 2020 4:44 am
languages_spoken: english, polish
ODROIDs: N2+, Game Advance, a few XU4
Location: Gdansk, Poland
Has thanked: 48 times
Been thanked: 41 times

Re: [Odroid HC2 - Ubuntu 20.04 - 5.4] Seemingly random unavailability/crashing

Post by mctom »

barelycompetent wrote:
Wed Aug 04, 2021 2:13 am
When we tried swapping affected units along with their PSU, the issue still seemed to persist.
That is the key information to me - the problem seems to be site related, and most probably impossible to spontaneously appear outside that environment.

Off the top of my head, here are a few things to try to recreate the problem in the office.
- What would happen if you unplug the ethernet cable for a minute or two? Will it recover on its own?
- What if you overheat the unit (disconnect a fan if available, wrap in a blanket or something)?
- What if you cool down the unit (freezer?)
(In one of the companies I worked with, the outdoor units worked fine in the lab but not on site - turned out the problem was -25 deg C ambient somewhere in Siberia)
(The CPU might not be freezing, but I'm not so sure about the rest of the board!)
- Are you sure there is no condensing moisture at night?
These users thanked the author mctom for the post:
barelycompetent (Fri Aug 13, 2021 12:15 am)
Punk ain't no religious cult, punk means thinking for yourself!

Maintainer of PiStackMon

Re: [Odroid HC2 - Ubuntu 20.04 - 5.4] Seemingly random unavailability/crashing

Post by barelycompetent »

mctom wrote:
Wed Aug 11, 2021 5:59 pm
barelycompetent wrote:
Wed Aug 04, 2021 2:13 am
When we tried swapping affected units along with their PSU, the issue still seemed to persist.
That is the key information to me - the problem seems to be site related, and most probably impossible to spontaneously appear outside that environment.

Off the top of my head, here are a few things to try to recreate the problem in the office.
- What would happen if you unplug the ethernet cable for a minute or two? Will it recover on its own?
- What if you overheat the unit (disconnect a fan if available, wrap in a blanket or something)?
- What if you cool down the unit (freezer?)
(In one of the companies I worked with, the outdoor units worked fine in the lab but not on site - turned out the problem was -25 deg C ambient somewhere in Siberia)
(The CPU might not be freezing, but I'm not so sure about the rest of the board!)
- Are you sure there is no condensing moisture at night?
Hi!

Thank you for sharing your experiences! I must admit the Siberia part gave me a chuckle as I visualized it.
I can at least offer one answer, being to the first bullet-point.
Unplugging the network cable and replugging it was tried on the board that we had a serial connection with at the time, but it didn't manage to come back online.

Just to make sure that none of the suggested environmental variables could be triggering this, I'll be creating a test environment for a few of these in the following week.
Lucky for me, I just got a message about another unit displaying the same (or a similar) issue. So, I'm hoping that I can get access to more details about this one soon so that I can investigate.
I'll be asking for information about its direct environment as well, hopefully that gets us closer to a resolution!

Re: [Odroid HC2 - Ubuntu 20.04 - 5.4] Seemingly random unavailability/crashing

Post by mctom »

Fortunately I was never a part of a team working in the field, but indeed, at the very end the company had to offer units with self-heating capabilities for the Russian market. And with much bigger varistors :roll:

I also have a childhood memory of my dad fixing a computer for someone. He had a machine that didn't work properly until it had been heated up for 1 hour and then reset. He sprayed some sort of cooling spray on the motherboard chips, and he found one that, when cooled down, froze the system :D

Anyway, that is a very rude way to treat an Odroid, but perhaps instead of sticking it in the freezer, this is something that could be tried. Just make sure you don't cool it down too fast. :)
barelycompetent wrote:
Fri Aug 13, 2021 1:09 am
Unplugging the network cable and replugging it was tried on the board that we had a serial connection with at the time, but it didn't manage to come back online.
This is a problem in its own right, but I was thinking. What happens with the camera footage in this case? Is it lost or does it fill up the storage? ;)
These users thanked the author mctom for the post:
barelycompetent (Fri Aug 13, 2021 2:52 am)

Re: [Odroid HC2 - Ubuntu 20.04 - 5.4] Seemingly random unavailability/crashing

Post by barelycompetent »

mctom wrote:
Fri Aug 13, 2021 1:54 am
Fortunately I was never a part of a team working in the field, but indeed, at the very end the company had to offer units with self-heating capabilities for the Russian market. And with much bigger varistors :roll:

I also have a childhood memory of my dad fixing a computer for someone. He had a machine that didn't work properly until it had been heated up for 1 hour and then reset. He sprayed some sort of cooling spray on the motherboard chips, and he found one that, when cooled down, froze the system :D

Anyway, that is a very rude way to treat an Odroid, but perhaps instead of sticking it in the freezer, this is something that could be tried. Just make sure you don't cool it down too fast. :)
Image

In all seriousness though, it seems unlikely that the units are exposed to a sufficiently cold climate considering their locations and use-cases.
Seeing as they're completely fanless and usually installed indoors at office- or home locations, I'd expect them to overheat if anything.
Nevertheless, it's good to know how they would respond to such a climate - so I'll be putting it on my to-test list anyway!
mctom wrote: This is a problem in its own right, but I was thinking. What happens with the camera footage in this case? Is it lost or does it fill up the storage? ;)
Well, as one would expect, the already recorded footage is preserved. But seeing as these are all IP cameras, even if the mediaserver itself were to continue running (which I'm not sure it actually does), there are no network devices to receive a video stream from, so no recording happens from that point until the device is power cycled. This is part of the reason why the issue is slowly becoming critical for us, as the units are supposed to be used as part of a surveillance system. You can imagine it's particularly annoying when it happens at some remote location that requires the on-site technician to drive to another timezone. Then again, I'm pretty sure that they've never had to roll in on a Siberian peninsula at -25°C to go and power cycle one of these! :lol:

mctom
Posts: 438
Joined: Wed Nov 11, 2020 4:44 am
languages_spoken: english, polish
ODROIDs: N2+, Game Advance, a few XU4
Location: Gdansk, Poland
Has thanked: 48 times
Been thanked: 41 times
Contact:

Post by mctom »

You never said explicitly where these Odroids are installed, so I assumed some sites can be chilly.
On the other hand, if cold were the cause of the failures, I guess you'd observe a pattern: it would happen mostly at night, or when nobody is around (and the heating is turned off).
And finally, it's summer after all.
Punk ain't no religious cult, punk means thinking for yourself!

Maintainer of PiStackMon

Post by mctom »

Code: Select all

mctom@Tomusiomat-ARM:/tmp/mozilla_mctom0$ cat dmesg.txt | cut -c 16-1000 > dmesg.mct
mctom@Tomusiomat-ARM:/tmp/mozilla_mctom0$ cat dmesg.0.txt | cut -c 16-1000 > dmesg.0.mct
mctom@Tomusiomat-ARM:/tmp/mozilla_mctom0$ diff dmesg.mct dmesg.0.mct
2c2
< kernel: Linux version 5.4.134-224 (root@1604_builder_armhf) (gcc version 9.3.0 (Ubuntu 9.3.0-10ubuntu2)) #1 SMP PREEMPT Tue Jul 27 15:39:41 EDT 2021
---
> kernel: Linux version 5.4.58-215 (root@1604_builder_armhf) (gcc version 9.3.0 (Ubuntu 9.3.0-10ubuntu2)) #1 SMP PREEMPT Wed Sep 2 09:59:59 EDT 2020
25c25
< kernel: Memory: 1899044K/2074624K available (9216K kernel code, 772K rwdata, 2640K rodata, 1024K init, 323K bss, 44508K reserved, 131072K cma-reserved, 1157120K highmem)
---
> kernel: Memory: 1899060K/2074624K available (9216K kernel code, 777K rwdata, 2748K rodata, 1024K init, 322K bss, 44492K reserved, 131072K cma-reserved, 1157120K highmem)
71c71
< kernel: VFP support v0.3: implementor 41 architecture 4 part 30 variant f rev 0
---
> kernel: VFP support v0.3: implementor 41 architecture 2 part 30 variant 7 rev 3
78c78
< kernel: audit: type=2000 audit(0.048:1): state=initialized audit_enabled=0 res=1
---
> kernel: audit: type=2000 audit(0.320:1): state=initialized audit_enabled=0 res=1
82a83
> kernel: random: fast init done
103d103
< kernel: IP idents hash table entries: 16384 (order: 5, 131072 bytes, linear)
116,117c116
< kernel: random: fast init done
< kernel: Freeing initrd memory: 10080K
---
> kernel: Freeing initrd memory: 10060K
127d125
< kernel: nfs4flexfilelayout_init: NFSv4 Flexfile Layout Driver Registering...
147a146
> kernel: samsung-uart 12c00000.serial: IRQ index 1 not found
148a148
> kernel: samsung-uart 12c20000.serial: IRQ index 1 not found
154a155,156
> kernel: mali 11800000.gpu: Failed to get regulator
> kernel: mali 11800000.gpu: Power control initialization failed
169c171
< kernel: usb usb1: Manufacturer: Linux 5.4.134-224 ehci_hcd
---
> kernel: usb usb1: Manufacturer: Linux 5.4.58-215 ehci_hcd
181c183
< kernel: usb usb2: Manufacturer: Linux 5.4.134-224 ohci_hcd
---
> kernel: usb usb2: Manufacturer: Linux 5.4.58-215 ohci_hcd
218a221
> kernel: mmc_host mmc0: card is non-removable.
224a228
> kernel: mmc_host mmc0: Bus speed (slot 0) = 50000000Hz (slot req 300000Hz, actual 297619HZ div = 84)
231a236
> kernel: mmc_host mmc0: Bus speed (slot 0) = 50000000Hz (slot req 200000Hz, actual 200000HZ div = 125)
235d239
< kernel: mmc1: new ultra high speed SDR104 SDHC card at address 0001
237d240
< kernel: mmcblk1: mmc1:0001 EB1QT 29.8 GiB 
239c242
< kernel:  mmcblk1: p1 p2
---
> kernel: exynos5-dmc 10c20000.memory-controller: DMC initialized, in irq mode: 0
244a248
> kernel: mmc_host mmc0: Bus speed (slot 0) = 50000000Hz (slot req 100000Hz, actual 100000HZ div = 250)
254,255c258,259
< kernel: exynos-drm exynos-drm: bound 14450000.mixer (ops 0xc0a6d000)
< kernel: exynos-drm exynos-drm: bound 14530000.hdmi (ops 0xc0a6d684)
---
> kernel: exynos-drm exynos-drm: bound 14450000.mixer (ops 0xc0a6cfe0)
> kernel: exynos-drm exynos-drm: bound 14530000.hdmi (ops 0xc0a6d664)
257c261,263
< kernel: exynos-drm exynos-drm: bound 10850000.g2d (ops 0xc0a6e5cc)
---
> kernel: mmc1: new ultra high speed SDR104 SDHC card at address 0001
> kernel: exynos-drm exynos-drm: bound 10850000.g2d (ops 0xc0a6e5ac)
> kernel: mmcblk1: mmc1:0001 EB1QT 29.8 GiB 
259c265,266
< kernel: exynos-drm exynos-drm: bound 11c00000.rotator (ops 0xc0a6ee58)
---
> kernel:  mmcblk1: p1 p2
> kernel: exynos-drm exynos-drm: bound 11c00000.rotator (ops 0xc0a6ee38)
264a272,274
> kernel: mali 11800000.gpu: GPU identified as 0x0620 r0p1 status 0
> kernel: mali 11800000.gpu: Protected mode not available
> kernel: mali 11800000.gpu: Probed as mali0
273c283
< kernel: usb usb3: Manufacturer: Linux 5.4.134-224 xhci-hcd
---
> kernel: usb usb3: Manufacturer: Linux 5.4.58-215 xhci-hcd
281a292
> kernel: devfreq 11800000.gpu: Couldn't update frequency transition information.
284c295
< kernel: usb usb4: Manufacturer: Linux 5.4.134-224 xhci-hcd
---
> kernel: usb usb4: Manufacturer: Linux 5.4.58-215 xhci-hcd
296c307
< kernel: usb usb5: Manufacturer: Linux 5.4.134-224 xhci-hcd
---
> kernel: usb usb5: Manufacturer: Linux 5.4.58-215 xhci-hcd
307c318
< kernel: usb usb6: Manufacturer: Linux 5.4.134-224 xhci-hcd
---
> kernel: usb usb6: Manufacturer: Linux 5.4.58-215 xhci-hcd
311c322
< kernel: rtc rtc1: invalid alarm value: 1900-01-15T00:00:00
---
> kernel: rtc rtc1: invalid alarm value: 1900-01-11T00:00:00
326d336
< kernel: usb 3-1: new high-speed USB device number 2 using xhci-hcd
330c340
< kernel: s5m-rtc s2mps14-rtc: setting system clock to 2021-08-03T14:56:06 UTC (1628002566)
---
> kernel: s5m-rtc s2mps14-rtc: setting system clock to 2021-07-31T09:07:54 UTC (1627722474)
334a345
> kernel: usb 3-1: new high-speed USB device number 2 using xhci-hcd
372d382
< kernel: EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
373a384
> kernel: EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
375c386
< systemd[1]: systemd 245.4-4ubuntu3.11 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=hybrid)
---
> systemd[1]: systemd 245.4-4ubuntu3.2 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=hybrid)
377a389
> systemd[1]: /lib/systemd/system/dbus.socket:5: ListenStream= references a path below legacy directory /var/run/, updating /var/run/dbus/system_bus_socket → /run/dbus/system_bus_socket; please update the unit file accordingly.
406d417
< systemd[1]: Condition check resulted in OpenVSwitch configuration for cleanup being skipped.
412a424,425
> systemd[1]: Mounted POSIX Message Queue File System.
> systemd[1]: Mounted Kernel Debug File System.
420d432
< kernel: EXT4-fs (sda2): mounted filesystem with ordered data mode. Opts: errors=remount-ro
Did you know it did an update in the meantime? The kernel's changed.

mad_ady
Posts: 9689
Joined: Wed Jul 15, 2015 5:00 pm
languages_spoken: english
ODROIDs: XU4, C1+, C2, C4, N1, N2, H2, Go, Go Advance
Location: Bucharest, Romania
Has thanked: 609 times
Been thanked: 721 times
Contact:

Post by mad_ady »

there's no network devices to receive a videostream from so there's no recording happening from that point until the device is power cycled. This is part of the reason why the issue is slowly becoming critical for us, as they're supposed to be used as part of a surveillance system
Now, I don't know enough about the network environment your Odroids are in to speculate (or troubleshoot), but if you have serial access to one of the isolated boxes, this is what I'd try to collect:

Code: Select all

ip addr show
ip link show
ip route show
ethtool eth0
ping -c4 $gateway
arp -an
tcpdump -n -i eth0 -c 100
In any case, you should work on an automatic recovery plan. For instance, you could have a script run by cron that checks it's still receiving IP camera traffic, either by checking recording size on disk (which is preferable, since it's end-to-end) or by checking network traffic (tcpdump -c 10 port $ip_camera_stream). If you find issues, write diagnostic data (dmesg output at least) and reboot.
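A minimal sketch of such a cron-driven check, under a few stated assumptions: the recording path, the diagnostics path and the 15-minute threshold are all hypothetical placeholders. Instead of comparing total size, this variant looks for recently modified files, since a recorder that rotates old footage on a full disk may not grow in size even while it is recording.

```shell
#!/bin/sh
# Hypothetical cron watchdog sketch. REC_DIR, DIAG_DIR and MAX_AGE_MIN are
# assumptions for illustration, not values taken from this thread.
REC_DIR="${REC_DIR:-/mnt/recordings}"
DIAG_DIR="${DIAG_DIR:-/var/log/rec-watchdog}"
MAX_AGE_MIN="${MAX_AGE_MIN:-15}"

# Succeeds if any regular file under directory $1 was modified
# within the last $2 minutes (find -mmin is a GNU/BSD extension).
recently_written() {
    [ -n "$(find "$1" -type f -mmin "-$2" 2>/dev/null | head -n 1)" ]
}

# Saves the kernel log to a timestamped file in $1 and flushes it to disk.
save_diagnostics() {
    mkdir -p "$1" || return 1
    dmesg > "$1/dmesg.$(date +%Y%m%dT%H%M%S).txt" 2>&1
    sync
}

main() {
    if ! recently_written "$REC_DIR" "$MAX_AGE_MIN"; then
        save_diagnostics "$DIAG_DIR"
        /sbin/reboot
    fi
}

# Only act when called with --run, so the functions can be sourced safely.
if [ "${1:-}" = "--run" ]; then
    main
fi
```

It would be run as root from cron with the `--run` argument. Note that a wrong REC_DIR, or cameras that are legitimately offline, would make it reboot on every run, so some rate limiting would be wise before deploying anything like this.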

Post by barelycompetent »

mctom wrote:
Fri Aug 13, 2021 4:37 am
You never said explicitly where these Odroids are installed, so I assumed some sites can be chilly.
On the other hand, if cold were the cause of the failures, I guess you'd observe a pattern: it would happen mostly at night, or when nobody is around (and the heating is turned off).
And finally, it's summer after all.
Hi,

No, you're right - I apologize for being so all over the place!
Most of these are installed in pretty moderate climates, anyway.
But still it would be something to investigate, after all you never know what people subject these things to.
mctom wrote:
Fri Aug 13, 2021 4:54 am

Did you know it did an update in the meantime? The kernel's changed.
Now that you mention it, I may have executed a kernel update on this particular device...
I can't believe I forgot to bring this up. Please forgive my continued ineptitude, this is a bit of a first for me. Could it perhaps be as simple as a kernel update to fix the issue?
mad_ady wrote:
Fri Aug 13, 2021 1:38 pm
there's no network devices to receive a videostream from so there's no recording happening from that point until the device is power cycled. This is part of the reason why the issue is slowly becoming critical for us, as they're supposed to be used as part of a surveillance system
Now, I don't know enough about the network environment your Odroids are in to speculate (or troubleshoot), but if you have serial access to one of the isolated boxes, this is what I'd try to collect:

Code: Select all

ip addr show
ip link show
ip route show
ethtool eth0
ping -c4 $gateway
arp -an
tcpdump -n -i eth0 -c 100
In any case, you should work on an automatic recovery plan. For instance, you could have a script run by cron that checks it's still receiving IP camera traffic, either by checking recording size on disk (which is preferable, since it's end-to-end) or by checking network traffic (tcpdump -c 10 port $ip_camera_stream). If you find issues, write diagnostic data (dmesg output at least) and reboot.
Well, most of the setups are pretty generic, honestly. Usually they're connected to a relatively small local network with a home router or something along those lines.

I'll be keeping that list of commands on hand for when I get the chance to connect to one of these units and report back. I've only got a limited number of these serial cables that I can send out, so I've shipped a few to the sites with the most issues. Hopefully this will result in one of them being available for me to poke at pretty soon.

lsc1117
Posts: 230
Joined: Thu Aug 22, 2013 12:46 am
languages_spoken: english
Location: South Korea
Has thanked: 3 times
Been thanked: 28 times
Contact:

Post by lsc1117 »

@barelycompetent,

Hi, we installed the VMS following your instructions last week.
We are using an HC2 with a 4TB NAS HDD as the media server and an N2+ for the client and camera emulation.
Video recording works well on the HC2. It has about two days of recordings so far.

If we run into any issues during the test, we will let you know.

Post by barelycompetent »

lsc1117 wrote:
Tue Aug 17, 2021 3:49 pm
@barelycompetent,

Hi, we installed the VMS following your instructions last week.
We are using an HC2 with a 4TB NAS HDD as the media server and an N2+ for the client and camera emulation.
Video recording works well on the HC2. It has about two days of recordings so far.

If we run into any issues during the test, we will let you know.
Hi lsc1117,

That's great to hear, I hope the issue presents itself and we can further investigate.
Until that time, if anything else can be done on my part - be sure to let me know.
Again, the effort is very much appreciated!

Post by lsc1117 »

@barelycompetent,

We've been running the VMS and have had no issues so far; current uptime is 12 days.
Here is the htop output from my HC2.
hc2_vms.png
We will keep running this test bed a few more weeks if you don't mind.

Post by barelycompetent »

lsc1117 wrote:
Tue Aug 24, 2021 4:52 pm
@barelycompetent,

We've been running the VMS and have had no issues so far; current uptime is 12 days.
Here is the htop output from my HC2.

hc2_vms.png

We will keep running this test bed a few more weeks if you don't mind.
Hi lsc1117!

Apologies for the delayed response there.
No problem at all, test for as long as you deem necessary!

I've now also been able to confirm that at least one machine with the latest kernel update and the WDT active has experienced the same issue again.
This unit was also connected to a UPS, installed in a dry and cool environment, and placed on top of an aluminium surface inside a server cabinet.
I'll soon receive this unit back from the site in question to check whether or not it's actually related, but from the sounds of it, it's likely the same problem occurring again.

In the meantime, I'm creating new images from scratch. I'll be attempting to create one with Ubuntu 18.04 (Linux 4.9) and 20.04 (Linux 5.4), but it may take some time before these are able to see deployment to affected sites.
On top of that I'll try and implement netconsole to make an attempt at receiving a post-mortem, as mad_ady suggested earlier in the thread.

To give a bit of an overview, the issue was confirmed to impact units with the kernel versions stated below:
  • 5.4.58-215
  • 5.4.109-220
  • 5.4.134-225
Once again, I want to express my sincere gratitude to all of you who've taken the time to try and help with this issue - it cannot be overstated.

Post by barelycompetent »

Hi everyone,

Sorry for the lengthy delay in updates, I've been a bit busy over the past few weeks.
A few things have happened since the last time I posted here, but I'll try to keep it brief.

I've been able to visit a few of the sites that were having issues, and the working conditions of the devices appear to be about as favorable as one can hope.
Despite this, and even though we swapped the units at 2 out of 3 sites, one of the swapped units went offline again shortly after the weekend.

Many thanks to mad_ady for the guide on Netconsole, which I was able to set up in combination with a VPN and rsyslog.
This has been implemented on a few of the devices, and though some don't appear to do exactly as we'd expect, we do have a few devices that output to our remote logging server.
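
For anyone following along, here is a rough sketch of how a netconsole parameter string can be put together. It is not taken from the guide mentioned above; the helper name, interface, target address and port are placeholders, and the receiving side (rsyslog or anything else) is assumed to listen on that UDP port.

```shell
#!/bin/sh
# Builds a netconsole= parameter string following the kernel's documented
# format (optional prefixes omitted):
#   netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]
# $1 = local interface, $2 = target IP, $3 = target UDP port,
# $4 = target MAC (optional; the kernel falls back to broadcast when empty).
netconsole_param() {
    printf 'netconsole=@/%s,%s@%s/%s\n' "$1" "$3" "$2" "${4:-}"
}

# Example with placeholder values (192.0.2.10 is a documentation address):
#   modprobe netconsole "$(netconsole_param eth0 192.0.2.10 6666)"
```

Leaving the source port and source IP empty lets the kernel pick its defaults for the given interface.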

Now, I've been trying to get an idea of what's happening, and I noticed that the device that failed was outputting some errors related to the VPN connection.
As a result, I looked further into the network side of things and, since the device works with DHCP, started suspecting an issue with our image.
It turns out that both NetworkManager and networkd were enabled in systemctl, which I've read could cause issues.

Has anyone got an idea as to what kind of impact it could have when both of these are active?
I've now turned off NetworkManager on all these devices and specified networkd as the renderer in the respective netplan configuration files.

Post by lsc1117 »

@barelycompetent,

Sorry, I have no experience looking into the NetworkManager or networkd.

And we have kept running the VMS on our HC2 and N2 (camera emulator). Uptime is close to one month. We will keep it running for a couple more weeks.
htop_hc2_20210908.png

Post by mctom »

It's hard to imagine that the machine would freeze because of that, but losing Ethernet connectivity is my specialty lately. With recently gained experience with networkd and networkmanager I can say it's definitely a good idea to have the unused renderer turned off (masked in systemctl). Probably not the best idea to purge it completely, because it may take some useful components along, like wpa-supplicant.

NetworkManager is supposed to be controlled by netplan, but may also source its own config files from somewhere else. In my case, it had a bright idea of turning off wlan adapters when bored (albeit usually at boot time, but I never keep my lappie on for days)

If it is possible, post your netplan config files. If you happen to use bridges or bonds, I bet that is the problem. Otherwise, probably masking NetworkManager may be the solution.

So I think at this point you have established that the problem is just Ethernet connection and nothing suspicious going on with SBC itself?

Post by barelycompetent »

lsc1117 wrote:
Wed Sep 08, 2021 6:55 pm
@barelycompetent,

Sorry, I have no experience looking into the NetworkManager or networkd.

And we have kept running the VMS on our HC2 and N2(camera emulator).

htop_hc2_20210908.png
Seems like the system on your end isn't experiencing any of the issues we seem to be encountering, which is sadly not too surprising considering the relatively low number of affected units.
Regardless, thank you for your efforts!
mctom wrote:
Wed Sep 08, 2021 7:02 pm
It's hard to imagine that the machine would freeze because of that, but losing Ethernet connectivity is my specialty lately. With recently gained experience with networkd and networkmanager I can say it's definitely a good idea to have the unused renderer turned off (masked in systemctl). Probably not the best idea to purge it completely, because it may take some useful components along, like wpa-supplicant.

NetworkManager is supposed to be controlled by netplan, but may also source its own config files from somewhere else. In my case, it had a bright idea of turning off wlan adapters when bored.

If it is possible, post your netplan config files. If you happen to use bridges or bonds, I bet that is the problem. Otherwise, probably masking NetworkManager may be the solution.

So I think at this point you have established that the problem is just Ethernet connection and nothing suspicious going on with SBC itself?
Yeah, it doesn't make much sense to me in general at this point - if I'm being honest.
I can't say that I've established that the issue is solely with the Ethernet connection. But when the issue presents itself, the unit is not reachable over the network at all, and on that one occasion we were still able to connect using a serial connector, so I thought it would be worth looking into.

Original netplan.yaml

Code: Select all

network:
  renderer: NetworkManager
  ethernets:
    eth0:
      optional: true
      dhcp4: true
      addresses: ['172.31.255.250/28']
Recently adjusted netplan.yaml

Code: Select all

network:
  renderer: networkd
  ethernets:
    eth0:
      optional: true
      dhcp4: true
      dhcp6: false
      addresses: ['172.31.255.250/28']
Note: we use 172.31.255.250/28 as a static fallback IP address for direct connections, in case a DHCP-address isn't/can't be assigned.

Thank you for the suggestion to mask the services. It made me realize that I had previously only prevented them from starting at boot.
I've now masked both NetworkManager and ModemManager, as well as stopped their services outright.
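
For reference, a small sketch of how the renderer state can be checked on a unit. The unit names are the stock Ubuntu ones and are an assumption about our image; the script only reports and never changes anything.

```shell
#!/bin/sh
# Sketch: report whether both netplan renderers are enabled at once.
# Unit names are the stock Ubuntu ones; adjust if your image differs.

# Succeeds when both `systemctl is-enabled` states passed in are "enabled".
both_enabled() {
    [ "$1" = "enabled" ] && [ "$2" = "enabled" ]
}

check_renderers() {
    nm_state=$(systemctl is-enabled NetworkManager.service 2>/dev/null || true)
    nd_state=$(systemctl is-enabled systemd-networkd.service 2>/dev/null || true)
    echo "NetworkManager: ${nm_state:-unknown}, systemd-networkd: ${nd_state:-unknown}"
    if both_enabled "$nm_state" "$nd_state"; then
        echo "Both renderers are enabled; consider masking the unused one, e.g.:"
        echo "  systemctl mask --now NetworkManager.service"
    fi
}

# Only query systemd when called with --run.
if [ "${1:-}" = "--run" ]; then
    check_renderers
fi
```

With `mask --now` the unit is stopped as well as masked, which covers both steps described above in one command.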

That being said, while I was arranging this on the aforementioned device that encountered the issue over the weekend, it output an I/O error and had mounted the filesystems as read-only. I ran fsck and had it fix any issues it encountered, after which a reboot made it possible to write to the filesystem again and I was able to finish what I was doing.

Because of this I also ran a S.M.A.R.T. test and got the following results.
Keep in mind that this is the drive on which both the boot and storage partitions are present, so if it were to fail periodically and mount as read-only as a result, that may explain why no logs were present while it was experiencing issues - as it can no longer write to the filesystem? I'm just spitballing here, of course.
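
One way to test that read-only theory the next time a unit misbehaves would be to look at the mount options directly. A minimal sketch, assuming the partitions are ext4 (as the boot log suggests) and that `/proc/mounts` is readable; the function name is made up for illustration.

```shell
#!/bin/sh
# Reads /proc/mounts-style lines on stdin and prints the mount point of every
# ext4 filesystem whose option list contains a standalone "ro".
# An option such as errors=remount-ro does not count as a match.
readonly_ext4_mounts() {
    awk '$3 == "ext4" && $4 ~ /(^|,)ro(,|$)/ { print $2 }'
}

# Typical use on a live system:
#   readonly_ext4_mounts < /proc/mounts
```

If this prints anything, the result would have to be reported somewhere other than the affected disk, since nothing can be written to it at that point.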

It looks like there's a device fault that was registered at one point?

S.M.A.R.T. Output

Code: Select all

smartctl 7.1 2019-12-30 r5022 [armv7l-linux-5.4.142-228] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD40PURZ-85AKKY0
Serial Number:    WD-WX32D80EUCTJ
LU WWN Device Id: 5 0014ee 2be0e1bca
Firmware Version: 80.00A80
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Sep  8 15:57:12 2021 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 121) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                (43260) seconds.
Offline data collection
capabilities:                    (0x11) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 460) minutes.
SCT capabilities:              (0x303d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   224   221   021    Pre-fail  Always       -       3783
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       218
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       3685
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       218
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       104
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       113
194 Temperature_Celsius     0x0022   105   103   000    Old_age   Always       -       45
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 45252 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 45252 occurred at disk power-on lifetime: 2608 hours (108 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 46 00 00 00 00  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 03 46 00 00 00 00 00      00:13:51.645  SET FEATURES [Set transfer mode]
  ef 02 00 00 00 00 00 00      00:13:51.645  SET FEATURES [Enable write cache]
  e1 00 02 00 00 00 00 00      00:13:51.644  IDLE IMMEDIATE
  ec 00 01 00 00 00 00 00      00:13:51.644  IDENTIFY DEVICE
  2f 00 01 10 00 00 00 00      00:13:51.149  READ LOG EXT

Error 45251 occurred at disk power-on lifetime: 2608 hours (108 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 00 00 00 00 00  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 02 00 00 00 00 00 00      00:13:51.645  SET FEATURES [Enable write cache]
  e1 00 02 00 00 00 00 00      00:13:51.644  IDLE IMMEDIATE
  ec 00 01 00 00 00 00 00      00:13:51.644  IDENTIFY DEVICE
  2f 00 01 10 00 00 00 00      00:13:51.149  READ LOG EXT

Error 45250 occurred at disk power-on lifetime: 2608 hours (108 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 02 00 00 00 00

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  e1 00 02 00 00 00 00 00      00:13:51.644  IDLE IMMEDIATE
  ec 00 01 00 00 00 00 00      00:13:51.644  IDENTIFY DEVICE
  2f 00 01 10 00 00 00 00      00:13:51.149  READ LOG EXT

Error 45249 occurred at disk power-on lifetime: 2608 hours (108 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 46 00 00 00 00  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 03 46 00 00 00 00 00      00:13:50.886  SET FEATURES [Set transfer mode]
  ef 02 00 00 00 00 00 00      00:13:50.886  SET FEATURES [Enable write cache]
  e1 00 02 00 00 00 00 00      00:13:50.885  IDLE IMMEDIATE
  ec 00 01 00 00 00 00 00      00:13:50.885  IDENTIFY DEVICE
  e1 00 0f 00 00 00 00 00      00:13:50.390  IDLE IMMEDIATE

Error 45248 occurred at disk power-on lifetime: 2608 hours (108 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 00 00 00 00 00  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 02 00 00 00 00 00 00      00:13:50.886  SET FEATURES [Enable write cache]
  e1 00 02 00 00 00 00 00      00:13:50.885  IDLE IMMEDIATE
  ec 00 01 00 00 00 00 00      00:13:50.885  IDENTIFY DEVICE
  e1 00 0f 00 00 00 00 00      00:13:50.390  IDLE IMMEDIATE
  ef 03 46 00 00 00 00 00      00:13:50.390  SET FEATURES [Set transfer mode]

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      3683         75337176
However, another one of the units doesn't seem to be experiencing the same issue, though it uses a different HDD entirely. Below is the output from that unit.

S.M.A.R.T. Output

Code: Select all

smartctl 7.1 2019-12-30 r5022 [armv7l-linux-5.4.109-220] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     HUS722T1TALA600
Serial Number:    WCC6M6PN423C
LU WWN Device Id: 5 0014ee 2bdebebf4
Add. Product Id:  DELL(tm)
Firmware Version: RADEMU03
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Sep  8 16:05:37 2021 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (   90) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 106) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x603d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   143   142   021    Pre-fail  Always       -       3825
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       58
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       1719
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       52
 16 Unknown_Attribute       0x0022   000   200   000    Old_age   Always       -       2473195239
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       28
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       63
194 Temperature_Celsius     0x0022   090   088   000    Old_age   Always       -       53 (Min/Max 22/55)
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0
241 Total_LBAs_Written      0x0032   200   200   000    Old_age   Always       -       2410044236
242 Total_LBAs_Read         0x0032   200   200   000    Old_age   Always       -       63151003

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Vendor (0xdf)       Completed without error       00%         2         -
# 2  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Finally, the last unit I had access to:

S.M.A.R.T. Output

Code: Select all

smartctl 7.1 2019-12-30 r5022 [armv7l-linux-5.4.142-228] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ST4000VX007-2DT166
Serial Number:    ZM40QNLV
LU WWN Device Id: 5 000c50 0c4e24186
Firmware Version: CV11
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5980 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Sep  8 16:11:36 2021 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  581) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 651) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x50bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   083   064   044    Pre-fail  Always       -       209426096
  3 Spin_Up_Time            0x0003   096   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       19
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   087   060   045    Pre-fail  Always       -       503352193
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       1938
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       18
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   060   057   040    Old_age   Always       -       40 (Min/Max 40/41)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       10
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       25
194 Temperature_Celsius     0x0022   040   043   000    Old_age   Always       -       40 (0 26 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       1937 (227 27 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       7737217192
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       61193140

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
P.S.: I now have remote access to all three of these units, making it a bit easier to get/provide information as I go along.

mctom
Posts: 438
Joined: Wed Nov 11, 2020 4:44 am
languages_spoken: english, polish
ODROIDs: N2+, Game Advance, a few XU4
Location: Gdansk, Poland
Has thanked: 48 times
Been thanked: 41 times
Contact:

Re: [Odroid HC2 - Ubuntu 20.04 - 5.4] Seemingly random unavailability/crashing

Post by mctom »

barelycompetent wrote:
Wed Sep 08, 2021 11:13 pm
Note: we use 172.31.255.250/28 as a static fallback IP address for direct connections, in case a DHCP-address isn't/can't be assigned.
I'm not entirely sure if this is working that way. It's not a fallback address, but an additional one.
https://netplan.io/reference/ wrote:
addresses (sequence of scalars and mappings)

Add static addresses to the interface in addition to the ones received
through DHCP or RA. Each sequence entry is in CIDR notation, i. e. of the
form addr/prefixlen. addr is an IPv4 or IPv6 address as recognized
by inet_pton(3) and prefixlen the number of bits of the subnet.
Hopefully some strange address from outside of the DHCP pool didn't anger the network overlords! After all, the issues only happen in some specific network configurations, not all (and apparently not the network in HardKernel lab :( )
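
To see this on a live unit, a minimal sketch like the one below lists every IPv4 address on the interface and labels it. It assumes the one-line output format of `ip -o -4 addr show`, where DHCP leases normally carry the `dynamic` flag; the function name `classify_addrs` and the interface name `eth0` are placeholders, not something taken from the original setup.

```shell
# classify_addrs: read `ip -o -4 addr show` output on stdin and print
# "<address> dhcp" or "<address> static" for each IPv4 address found.
# Assumption: addresses with a lease lifetime are flagged "dynamic" by ip.
classify_addrs() {
    awk '{
        addr = ""; kind = "static"
        for (i = 1; i <= NF; i++) {
            if ($i == "inet") addr = $(i + 1)
            if ($i == "dynamic") kind = "dhcp"
        }
        if (addr != "") print addr, kind
    }'
}

# Usage on a unit (eth0 is an assumed interface name):
#   ip -o -4 addr show dev eth0 | classify_addrs
```

If the netplan `addresses` entry really is additive, both the DHCP lease and 172.31.255.250/28 should show up at the same time.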


Also, looking again at syslog from your original post, it is interesting to search for keyword "Network". Normally it should spit out some information on startup and be quiet later on. On my machine cat /var/log/syslog | grep Network returns nothing (the boot time messages have been rotated out long ago).
In your syslog, we see NetworkManager restarting periodically and systemd flexing its Network Service (systemd-networkd, if I'm not mistaken).
I think that either these two are in a power struggle, or both are reacting to some external event. Either way, only one of them should be minding the networking.

Once you've gotten rid of NetworkManager for good, check your syslog to see whether that is still happening.
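
A rough way to do that check is to count how many syslog lines each daemon wrote. This is only a sketch: the function name `count_net_daemons` is made up, and it assumes the usual syslog tag format (`name[pid]:`) seen elsewhere in this thread.

```shell
# count_net_daemons FILE: count syslog lines logged by each network daemon.
count_net_daemons() {
    for daemon in NetworkManager systemd-networkd; do
        # Match the syslog process tag, e.g. "NetworkManager[312]:".
        # grep -c exits non-zero on a count of 0, hence the "|| true".
        n=$(grep -c " ${daemon}\[[0-9]*]:" "$1" || true)
        printf '%s %s\n' "$daemon" "$n"
    done
}

# Usage:
#   count_net_daemons /var/log/syslog
```

After masking NetworkManager, its count should stay at 0 in fresh logs; if it keeps growing, something is still starting it.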
barelycompetent wrote:
Wed Sep 08, 2021 11:13 pm
Keep in mind that this is the drive on which both the boot and storage partitions are present, so if it were to fail periodically and mount as read-only as a result, that may explain why no logs were present while it was experiencing issues - as it can no longer write to the filesystem? I'm just spitballing here, of course.
I don't think Linux can remount its root partition as read only on the fly. I know that mounting as read only in case of errors is a default option in fstab, that would only be applied on boot time. I think :roll:
Punk ain't no religious cult, punk means thinking for yourself!

Maintainer of PiStackMon

mad_ady
Posts: 9689
Joined: Wed Jul 15, 2015 5:00 pm
languages_spoken: english
ODROIDs: XU4, C1+, C2, C4, N1, N2, H2, Go, Go Advance
Location: Bucharest, Romania
Has thanked: 609 times
Been thanked: 721 times
Contact:

Re: [Odroid HC2 - Ubuntu 20.04 - 5.4] Seemingly random unavailability/crashing

Post by mad_ady »

I don't think Linux can remount its root partition as read only on the fly. I know that mounting as read only in case of errors is a default option in fstab, that would only be applied on boot time. I think :roll:
Sure it can! The kernel does it if it finds inconsistencies in the rootfs, in order not to destroy data. You should see messages about it in dmesg/netconsole when it happens.
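
If you can still reach a unit when it misbehaves, a quick sketch like this shows whether the rootfs has flipped to read-only. It parses `/proc/mounts` (fields: device, mount point, type, options); the function name `root_mount_mode` is hypothetical.

```shell
# root_mount_mode [FILE]: print "ro" or "rw" for the filesystem mounted at /.
# FILE defaults to /proc/mounts. If / appears more than once, the last entry wins.
root_mount_mode() {
    awk '$2 == "/" {
        n = split($4, opts, ",")
        mode = "rw"
        # Compare whole options so "errors=remount-ro" is not mistaken for "ro".
        for (i = 1; i <= n; i++) if (opts[i] == "ro") mode = "ro"
    }
    END { if (mode != "") print mode }' "${1:-/proc/mounts}"
}

# Usage:
#   root_mount_mode
```

On ext4 the kernel usually also logs a line mentioning that it is remounting the filesystem read-only, so searching dmesg for "remount" is worth a try as well.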

Regarding the error of the first disk - it looks like an isolated/transient event. Could have been caused by a problem with the data cable.
These users thanked the author mad_ady for the post:
mctom (Thu Sep 09, 2021 2:12 am)

mctom
Posts: 438
Joined: Wed Nov 11, 2020 4:44 am
languages_spoken: english, polish
ODROIDs: N2+, Game Advance, a few XU4
Location: Gdansk, Poland
Has thanked: 48 times
Been thanked: 41 times
Contact:

Re: [Odroid HC2 - Ubuntu 20.04 - 5.4] Seemingly random unavailability/crashing

Post by mctom »

mad_ady wrote:
Thu Sep 09, 2021 12:24 am
Regarding the error of the first disk - it looks like an isolated/transient event. Could have been caused by a problem with the data cable.
I was about to write a full rant about how bad some USB-SATA dongles are, but then I realized we're talking about HC2. ;)

Well, fortunately SMART data can always be read and compared between machines that did "freeze" and those that didn't. I highly doubt this has anything to do with the original problem, but this hypothesis is at least testable.
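
To make that comparison less tedious, here is a minimal sketch that pulls a few attributes out of the `smartctl -A` table so they can be diffed across units. The choice of attribute IDs (5, 197, 198, 199) and the function name `smart_key_attrs` are my own assumptions, not an established checklist.

```shell
# smart_key_attrs: read `smartctl -A` output on stdin and print
# "<attribute name> <raw value>" for a handful of attributes.
smart_key_attrs() {
    # Attribute rows have a numeric ID in column 1 and a 0x.... flag in
    # column 3; the flag check skips the selective self-test table, whose
    # SPAN column would otherwise also match ID 5.
    awk '$1 ~ /^[0-9]+$/ && $3 ~ /^0x[0-9a-f]+$/ &&
         ($1 == 5 || $1 == 197 || $1 == 198 || $1 == 199) {
        print $2, $10
    }'
}

# Usage (device path is an assumption):
#   smartctl -A /dev/sda | smart_key_attrs
```

Running this on a "frozen" and a healthy unit and comparing the output side by side would show quickly whether the affected disks have anything in common.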

But let's remember, if the problem seems to be site-specific, I'd rather believe that some specific local network quirks got in the way than voodoo magic breaking HDDs in certain spots of the world.
UNLESS someone placed HDDs next to a substantial magnetic field!

barelycompetent
Posts: 15
Joined: Sat Jun 26, 2021 12:05 am
languages_spoken: English, Dutch
Has thanked: 13 times
Been thanked: 0
Contact:

Re: [Odroid HC2 - Ubuntu 20.04 - 5.4] Seemingly random unavailability/crashing

Post by barelycompetent »

mctom wrote:
Wed Sep 08, 2021 11:49 pm
barelycompetent wrote:
Wed Sep 08, 2021 11:13 pm
Note: we use 172.31.255.250/28 as a static fallback IP address for direct connections, in case a DHCP-address isn't/can't be assigned.
I'm not entirely sure if this is working that way. It's not a fallback address, but an additional one.
Fair enough, I guess the term fallback address is reserved for something functionally different - my bad!
mad_ady wrote:
Thu Sep 09, 2021 12:24 am
I don't think Linux can remount its root partition as read only on the fly. I know that mounting as read only in case of errors is a default option in fstab, that would only be applied on boot time. I think :roll:
Sure it can! The kernel does it if it finds inconsistencies in the rootfs, in order not to destroy data. You should see messages about it in dmesg/netconsole when it happens.

Regarding the error of the first disk - it looks like an isolated/transient event. Could have been caused by a problem with the data cable.
Glad to know I'm not entirely crazy just yet!
mctom wrote:
Thu Sep 09, 2021 2:24 am
mad_ady wrote:
Thu Sep 09, 2021 12:24 am
Regarding the error of the first disk - it looks like an isolated/transient event. Could have been caused by a problem with the data cable.
I was about to write a full rant about how bad some USB-SATA dongles are, but then I realized we're talking about HC2. ;)

Well, fortunately SMART data can always be read and compared between machines that did "freeze" and those that didn't. I highly doubt this has anything to do with the original problem, but this hypothesis is at least testable.

But let's remember, if the problem seems to be site-specific, I'd rather believe that some specific local network quirks got in the way than voodoo magic breaking HDDs in certain spots of the world.
UNLESS someone placed HDDs next to a substantial magnetic field!
Fair enough. Unfortunately, we haven't been able to find any real similarities between the S.M.A.R.T. readouts of affected devices yet, though one of the devices definitely seems to be running into some bad blocks during long self-tests.
I'll look into this (hopefully) later today or tomorrow if I get the time.

Of course, it's not unthinkable that the issue is unrelated in the case of this specific unit.
If bad block reallocation doesn't work and the disk has to be replaced, I guess we'll see if it brings an improvement.

That being said, despite all our efforts, we haven't gotten much closer to fixing it so far.
I've made sure that systemd-networkd and NetworkManager don't conflict with each other by masking NetworkManager and ModemManager. Unfortunately, there are still devices that are experiencing issues.

I've had a look and noticed that a shutdown seems to have been started by systemd/Unattended Upgrades?
I'm not sure if this is what could be causing the system to simply shut down, but I haven't found any way to prevent it either, aside from removing unattended-upgrades entirely. Of course, being able to install security updates, or at least check for them regularly, isn't something I'd give up lightly.

Code: Select all

Sep 15 08:10:32 odroid systemd[1]: Started Unattended Upgrades Shutdown
I also noticed that the following happened after the device came back online. Does anyone have an idea whether this could impact the device's functionality at all?

Code: Select all

Sep 15 08:10:27 odroid kernel: s5p-mfc 11000000.codec: Direct firmware load for s5p-mfc-v8.fw failed with error -2
It appears that this device has also started logging a lot of h264-related errors now.

Code: Select all

Sep 12 07:35:17 odroid mediaserver[408]: [h264 @ 0xa9272220] error while decoding MB 148 86
Sep 12 07:35:17 odroid mediaserver[408]: [h264 @ 0xa9272220] concealing 1413 DC, 1413 AC, 1413 MV errors in P frame
Sep 12 14:48:54 odroid mediaserver[408]: [h264 @ 0xa9272220] cbp too large (49) at 25 47
Sep 12 14:48:54 odroid mediaserver[408]: [h264 @ 0xa9272220] error while decoding MB 25 47
Sep 12 14:48:54 odroid mediaserver[408]: [h264 @ 0xa9272220] concealing 8088 DC, 8088 AC, 8088 MV errors in P frame
Sep 12 19:28:43 odroid mediaserver[408]: [h264 @ 0xa9272220] Invalid level prefix
Sep 12 19:28:43 odroid mediaserver[408]: [h264 @ 0xa9272220] error while decoding MB 44 34
Sep 12 19:28:43 odroid mediaserver[408]: [h264 @ 0xa9272220] concealing 10253 DC, 10253 AC, 10253 MV errors in P frame
Sep 13 06:53:40 odroid mediaserver[408]: [h264 @ 0xa9272220] mb_type 119 in P slice too large at 15 0
Sep 13 06:53:40 odroid mediaserver[408]: [h264 @ 0xa9272220] error while decoding MB 15 0
Sep 13 06:53:40 odroid mediaserver[408]: [h264 @ 0xa9272220] concealing 15960 DC, 15960 AC, 15960 MV errors in P frame
Sep 13 07:29:48 odroid mediaserver[408]: [h264 @ 0xa9272220] P sub_mb_type 32 out of range at 134 88
Sep 13 07:29:48 odroid mediaserver[408]: [h264 @ 0xa9272220] error while decoding MB 134 88
Sep 13 07:29:48 odroid mediaserver[408]: [h264 @ 0xa9272220] concealing 1091 DC, 1091 AC, 1091 MV errors in P frame
Sep 13 09:28:34 odroid mediaserver[408]: [h264 @ 0xa9272220] out of range intra chroma pred mode
Sep 13 09:28:34 odroid mediaserver[408]: [h264 @ 0xa9272220] error while decoding MB 26 17
Sep 13 09:28:34 odroid mediaserver[408]: [h264 @ 0xa9272220] concealing 13127 DC, 13127 AC, 13127 MV errors in P frame
Sep 13 10:30:54 odroid mediaserver[408]: [h264 @ 0xa9272220] corrupted macroblock 32 24 (total_coeff=-1)
Sep 13 10:30:54 odroid mediaserver[408]: [h264 @ 0xa9272220] error while decoding MB 32 24
Sep 13 10:30:54 odroid mediaserver[408]: [h264 @ 0xa9272220] concealing 11945 DC, 11945 AC, 11945 MV errors in P frame
Sep 13 13:59:14 odroid mediaserver[408]: [h264 @ 0xa74a61f0] negative number of zero coeffs at 135 58
Sep 13 13:59:14 odroid mediaserver[408]: [h264 @ 0xa74a61f0] error while decoding MB 135 58
Sep 13 13:59:14 odroid mediaserver[408]: [h264 @ 0xa74a61f0] concealing 6130 DC, 6130 AC, 6130 MV errors in P frame
Sep 13 15:01:07 odroid mediaserver[408]: [h264 @ 0xa74a61f0] corrupted macroblock 27 11 (total_coeff=-1)
Sep 13 15:01:07 odroid mediaserver[408]: [h264 @ 0xa74a61f0] error while decoding MB 27 11
Sep 13 15:01:07 odroid mediaserver[408]: [h264 @ 0xa74a61f0] concealing 14134 DC, 14134 AC, 14134 MV errors in P frame
Sep 13 15:01:43 odroid mediaserver[408]: [h264 @ 0xa74a61f0] corrupted macroblock 166 29 (total_coeff=-1)
Sep 13 15:01:43 odroid mediaserver[408]: [h264 @ 0xa74a61f0] error while decoding MB 166 29
Sep 13 15:01:43 odroid mediaserver[408]: [h264 @ 0xa74a61f0] concealing 10971 DC, 10971 AC, 10971 MV errors in P frame
Sep 13 15:02:07 odroid mediaserver[408]: [h264 @ 0xa74a61f0] mb_type 47 in P slice too large at 36 2
Sep 13 15:02:07 odroid mediaserver[408]: [h264 @ 0xa74a61f0] error while decoding MB 36 2
Sep 13 15:02:07 odroid mediaserver[408]: [h264 @ 0xa74a61f0] concealing 15637 DC, 15637 AC, 15637 MV errors in P frame
Sep 13 15:33:02 odroid mediaserver[408]: [h264 @ 0xa74a61f0] negative number of zero coeffs at 154 5
Sep 13 15:33:02 odroid mediaserver[408]: [h264 @ 0xa74a61f0] error while decoding MB 154 5
Sep 13 15:33:02 odroid mediaserver[408]: [h264 @ 0xa74a61f0] concealing 15015 DC, 15015 AC, 15015 MV errors in P frame
Sep 13 16:03:12 odroid mediaserver[408]: [h264 @ 0xa74a61f0] Invalid level prefix
Sep 13 16:03:12 odroid mediaserver[408]: [h264 @ 0xa74a61f0] error while decoding MB 25 24
Sep 13 16:03:12 odroid mediaserver[408]: [h264 @ 0xa74a61f0] concealing 11952 DC, 11952 AC, 11952 MV errors in P frame
Sep 13 18:49:07 odroid mediaserver[408]: [h264 @ 0xa9272220] corrupted macroblock 35 36 (total_coeff=-1)
Sep 13 18:49:07 odroid mediaserver[408]: [h264 @ 0xa9272220] error while decoding MB 35 36
Sep 13 18:49:07 odroid mediaserver[408]: [h264 @ 0xa9272220] concealing 9926 DC, 9926 AC, 9926 MV errors in P frame
Sep 13 19:09:10 odroid mediaserver[408]: [h264 @ 0xa74a61f0] cbp too large (3199971767) at 66 86
Sep 13 19:09:10 odroid mediaserver[408]: [h264 @ 0xa74a61f0] error while decoding MB 66 86
Sep 13 19:09:10 odroid mediaserver[408]: [h264 @ 0xa74a61f0] concealing 1495 DC, 1495 AC, 1495 MV errors in P frame
Sep 15 13:21:57 odroid mediaserver[404]: [h264 @ 0x8b45f9a0] corrupted macroblock 74 50 (total_coeff=-1)
Sep 15 13:21:57 odroid mediaserver[404]: [h264 @ 0x8b45f9a0] error while decoding MB 74 50
Sep 15 13:21:57 odroid mediaserver[404]: [h264 @ 0x8b45f9a0] concealing 7535 DC, 7535 AC, 7535 MV errors in P frame
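
To see whether these decode errors cluster around the times a unit drops off the network, a small sketch like the following tallies them per day. It assumes the standard syslog timestamp layout shown above (month and day in the first two fields); `decode_errors_per_day` is a made-up name.

```shell
# decode_errors_per_day: read syslog lines on stdin and print
# "<month> <day> <count>" for lines reporting a macroblock decode error,
# in the order the days first appear.
decode_errors_per_day() {
    awk '/error while decoding MB/ {
        day = $1 " " $2
        if (!(day in count)) order[++n] = day
        count[day]++
    }
    END { for (i = 1; i <= n; i++) print order[i], count[order[i]] }'
}

# Usage:
#   decode_errors_per_day < /var/log/syslog
```
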
I'm really trying to look at every potential cause, but like you said - for all we know it could be tied to something specific to the local network/environment that we're not taking note of.

At the moment I've got a few other priorities that I have to take care of within a short timeframe as well, which is causing progress in troubleshooting this issue to be extremely slow. Please accept my apologies for the delayed responses, I'm hoping to be able to allocate more time to this issue as soon as it becomes available.

mctom
Posts: 438
Joined: Wed Nov 11, 2020 4:44 am
languages_spoken: english, polish
ODROIDs: N2+, Game Advance, a few XU4
Location: Gdansk, Poland
Has thanked: 48 times
Been thanked: 41 times
Contact:

Re: [Odroid HC2 - Ubuntu 20.04 - 5.4] Seemingly random unavailability/crashing

Post by mctom »

So wrapping up, the HDD issues are probably not related to network issues, at least there's no evidence for it.
NetworkManager masking didn't resolve the problem.

unattended-upgrades is a daemon that does exactly that: it runs apt update/upgrade automatically from time to time. I always remove this component after installing Ubuntu; it's even the first thing that my magic script does:

Code: Select all

echo "------Disabling unattended upgrades..."
systemctl stop unattended-upgrades
systemctl disable unattended-upgrades
systemctl mask unattended-upgrades
# The following pops up an interactive menu, not sure if this is a necessary step
dpkg-reconfigure unattended-upgrades
apt remove -y unattended-upgrades
The reason why I remove it is that it tends to do upgrades when I want to use apt for other purposes, like installing software. Now the log entry you showed indicates that it may also restart your system (or itself?), which is a good reason to remove this pest on its own.

Of course apt will still work as usual, you'll just have to run update/upgrade manually.
But then again, if something works, don't fix it!
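
If you'd rather check first whether unattended-upgrades is actually allowed to reboot the box, the setting to look for is Unattended-Upgrade::Automatic-Reboot in the apt configuration. Below is a minimal sketch, assuming the stock config location /etc/apt/apt.conf.d. The helper name find_auto_reboot is made up for illustration.

```shell
# List active (non-commented) Automatic-Reboot settings in an apt config directory.
# find_auto_reboot is a hypothetical helper name; the default path is an assumption.
find_auto_reboot() {
    conf_dir="${1:-/etc/apt/apt.conf.d}"
    # -r: recurse, -h: no file names, -s: stay quiet about unreadable files.
    # apt config comments start with //, so drop those lines.
    grep -rhs 'Unattended-Upgrade::Automatic-Reboot' "$conf_dir" \
        | grep -v '^[[:space:]]*//'
}
```

Run find_auto_reboot with no arguments. No output means the option is not explicitly set in that directory; a line ending in "true" means automatic reboots are switched on.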

I'm not sure if this is a standard component of Ubuntu, but the images for Raspberry Pi also have fwupd, which does a similar task. It should be removed as well.

Code: Select all

echo "------Disabling fwupd..."
systemctl stop fwupd
systemctl disable fwupd
systemctl mask fwupd
barelycompetent wrote:
Thu Sep 16, 2021 1:21 am
I also noticed the following happened after the device came back online, anyone have an idea if this could impact the device functionality at all?

Code: Select all

Sep 15 08:10:27 odroid kernel: s5p-mfc 11000000.codec: Direct firmware load for s5p-mfc-v8.fw failed with error -2
Yep, as far as I understood from my googling, this is Samsung firmware for the hardware video codec. The codec is not expected to work if firmware loading failed.
I can't find information on what exactly error code "-2" means, but there are two things to consider:
- Normally firmware is read when the appropriate kernel module is loaded (I think this should normally happen only at boot time, but maybe not?)
- If it sometimes loads and sometimes gives an error, I can't think of any other reason than the indicated file being available or not available in certain circumstances (so back to the HDD problems hypothesis).

You said you keep your OS on HDD rather than SD card? Perhaps that is a problem after all.
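
For what it's worth, kernel messages like this report errors as negated errno values, and errno 2 is ENOENT ("No such file or directory"). A small lookup sketch follows. describe_errno is a made-up helper that only covers a handful of common codes, with descriptions taken from the kernel's errno headers.

```shell
# Translate a (possibly negative) kernel error code into its errno name.
# describe_errno is a hypothetical helper; it only knows a few common codes.
describe_errno() {
    code="${1#-}"   # strip the leading minus sign that the kernel log prints
    case "$code" in
        2)   echo "ENOENT: No such file or directory" ;;
        5)   echo "EIO: I/O error" ;;
        12)  echo "ENOMEM: Out of memory" ;;
        110) echo "ETIMEDOUT: Connection timed out" ;;
        *)   echo "unknown: look it up in the kernel errno headers" ;;
    esac
}
```

Calling describe_errno -2 prints the ENOENT line, meaning the firmware loader could not find the file at the moment it looked for it.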
Punk ain't no religious cult, punk means thinking for yourself!

Maintainer of PiStackMon

mad_ady
Posts: 9689
Joined: Wed Jul 15, 2015 5:00 pm
languages_spoken: english
ODROIDs: XU4, C1+, C2, C4, N1, N2, H2, Go, Go Advance
Location: Bucharest, Romania
Has thanked: 609 times
Been thanked: 721 times
Contact:

Re: [Odroid HC2 - Ubuntu 20.04 - 5.4] Seemingly random unavailability/crashing

Post by mad_ady »

The mfc load error is normal and is caused by a missing firmware file in /lib/firmware in the initrd. The firmware is loaded correctly once the rootfs is mounted, which can be tested by the existence of /dev/video*.


Re: [Odroid HC2 - Ubuntu 20.04 - 5.4] Seemingly random unavailability/crashing

Post by mctom »

Why is that normal? Last time I saw a firmware load fail, the underlying device didn't work (touchscreen on a tablet PC).
Why would it even try to load firmware when rootfs is not mounted, and what triggers the reattempt?
I'm not saying you're wrong, I'm taking the opportunity to learn around here. :)


Re: [Odroid HC2 - Ubuntu 20.04 - 5.4] Seemingly random unavailability/crashing

Post by mad_ady »

I'm not sure what triggers the reattempt, but the firmware tries to get loaded when the driver is loaded or initialized. I too would like to know what triggers the reload.


Re: [Odroid HC2 - Ubuntu 20.04 - 5.4] Seemingly random unavailability/crashing

Post by barelycompetent »

mad_ady wrote:
Sun Sep 19, 2021 11:35 pm
I'm not sure what triggers the reattempt, but the firmware tries to get loaded when the driver is loaded or initialized. I too would like to know what triggers the reload.
Well, seeing as these function as video surveillance servers, they're pretty much decoding video constantly.
Though I think I should mention that this is a kernel message originating at boot and not in the middle of the unit's uptime.

Unfortunately I don't have access to this specific unit anymore, but I've been able to confirm that both other devices have the same kernel message at boot and both seem to log tons of h264 related errors.

Here's the dmesg output of one of them to give you the necessary context:

Code: Select all

Sep 16 09:43:58 odroid kernel: Kernel command line: console=tty1 console=ttySAC2,115200n8 root=UUID=ec0a4c5f-b87b-44fb-b97e-d8aee6fc4e3b rootwait ro fsck.repair=yes net.ifnames=0  HPD=true vout=hdmi usbhid.quirks=0x0eef:0x0005:0x0004 smsc95xx.macaddr=00:1e:06:61:7a:39 false ipv6.disable=1 s5p_mfc.mem=16M
Sep 16 09:43:58 odroid kernel: s5p-jpeg 11f50000.jpeg: Adding to iommu group 3
Sep 16 09:43:58 odroid kernel: s5p-jpeg 11f50000.jpeg: encoder device registered as /dev/video20
Sep 16 09:43:58 odroid kernel: s5p-jpeg 11f50000.jpeg: decoder device registered as /dev/video21
Sep 16 09:43:58 odroid kernel: s5p-jpeg 11f50000.jpeg: Samsung S5P JPEG codec
Sep 16 09:43:58 odroid kernel: s5p-jpeg 11f60000.jpeg: Adding to iommu group 4
Sep 16 09:43:58 odroid kernel: s5p-jpeg 11f60000.jpeg: encoder device registered as /dev/video22
Sep 16 09:43:58 odroid kernel: s5p-jpeg 11f60000.jpeg: decoder device registered as /dev/video23
Sep 16 09:43:58 odroid kernel: s5p-jpeg 11f60000.jpeg: Samsung S5P JPEG codec
Sep 16 09:43:58 odroid kernel: s5p-mfc 11000000.codec: Adding to iommu group 5
Sep 16 09:43:58 odroid kernel: s5p-mfc 11000000.codec: preallocated 16 MiB buffer for the firmware and context buffers
Sep 16 09:43:58 odroid kernel: s5p-mfc 11000000.codec: Direct firmware load for s5p-mfc-v8.fw failed with error -2
Sep 16 09:43:58 odroid kernel: s5p_mfc_load_firmware:69: Firmware is not present in the /lib/firmware directory nor compiled in kernel
Sep 16 09:43:58 odroid kernel: s5p-mfc 11000000.codec: decoder registered as /dev/video10
Sep 16 09:43:58 odroid kernel: s5p-mfc 11000000.codec: encoder registered as /dev/video11
Sep 16 09:43:58 odroid kernel: s5p-secss 10830000.sss: s5p-sss driver registered
I've been trying to find more information about the s5p-mfc firmware message, but haven't had any success in getting it resolved.
I checked and it appears that the firmware files that it's referring to are available in /lib/firmware, but won't load for some reason.
From what I can tell, this is not exclusive to the affected units and occurs on all units with our image, indicating that it may not be fatal to the functionality of the device.
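
To make that check repeatable across units, here is a minimal sketch that looks for the firmware file on the mounted rootfs and for the codec device nodes. check_mfc is a hypothetical helper name. The file name s5p-mfc-v8.fw and the node numbers 10 and 11 are taken from the dmesg output above and may differ on other images.

```shell
# check_mfc [firmware_dir] [dev_dir]
# Reports whether the MFC firmware file and the codec device nodes exist.
# check_mfc is a hypothetical helper; file and node names come from the dmesg above.
check_mfc() {
    fw_dir="${1:-/lib/firmware}"
    dev_dir="${2:-/dev}"
    status=0
    if [ -e "$fw_dir/s5p-mfc-v8.fw" ]; then
        echo "firmware file: present"
    else
        echo "firmware file: missing"
        status=1
    fi
    for node in video10 video11; do
        if [ -e "$dev_dir/$node" ]; then
            echo "$node: present"
        else
            echo "$node: missing"
            status=1
        fi
    done
    # Non-zero return if anything was missing.
    return "$status"
}
```

Note that in the dmesg above the nodes are registered right after the failed load, so their presence alone may only show that the driver probed, not that the firmware was loaded later.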

That said, here's the last netconsole log generated by the device that is now offline:

Code: Select all

Sep 16 00:01:17 172.29.12.12  [   27.493590] printk: console [netcon0] enabled
Sep 16 00:01:17 172.29.12.12  [   27.497846] netconsole: network logging started
Sep 16 00:01:23 172.29.12.12  [   33.758163] vdd_ldo12: disabling
Sep 16 00:01:26 172.29.12.12  [   35.768732] EXT4-fs (sda2): mounted filesystem with ordered data mode. Opts: errors=remount-ro
Sep 16 06:51:01 172.29.12.12  [24610.903773] EXT4-fs (mmcblk1p2): mounted filesystem without journal. Opts: (null)
Sep 16 10:48:18 172.29.12.12  [38847.420509] sd 0:0:0:0: [sda] tag#27 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Sep 16 10:48:18 172.29.12.12  [38847.427392] sd 0:0:0:0: [sda] tag#27 Sense Key : 0x3 [current]
Sep 16 10:48:18 172.29.12.12  [38847.433293] sd 0:0:0:0: [sda] tag#27 ASC=0x11 ASCQ=0x0
Sep 16 10:48:18 172.29.12.12  [38847.438564] sd 0:0:0:0: [sda] tag#27 CDB: opcode=0x88 88 00 00 00 00 00 00 1f 0f 00 00 00 02 00 00 00
Sep 16 10:48:18 172.29.12.12  [38847.447684] blk_update_request: critical medium error, dev sda, sector 2035456 op 0x0:(READ) flags 0x80700 phys_seg 64 prio class 0
Sep 16 10:48:19 172.29.12.12  [38847.928837] sd 0:0:0:0: [sda] tag#24 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Sep 16 10:48:19 172.29.12.12  [38847.935788] sd 0:0:0:0: [sda] tag#24 Sense Key : 0x3 [current]
Sep 16 10:48:19 172.29.12.12  [38847.941824] sd 0:0:0:0: [sda] tag#24 ASC=0x11 ASCQ=0x0
Sep 16 10:48:19 172.29.12.12  [38847.946804] sd 0:0:0:0: [sda] tag#24 CDB: opcode=0x88 88 00 00 00 00 00 00 1f 11 00 00 00 01 00 00 00
Sep 16 10:48:19 172.29.12.12  [38847.956093] blk_update_request: I/O error, dev sda, sector 2035968 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 0
Sep 16 10:48:30 172.29.12.12  [38858.978937] sd 0:0:0:0: [sda] tag#12 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Sep 16 10:48:30 172.29.12.12  [38858.985798] sd 0:0:0:0: [sda] tag#12 Sense Key : 0x3 [current]
Sep 16 10:48:30 172.29.12.12  [38858.991698] sd 0:0:0:0: [sda] tag#12 ASC=0x11 ASCQ=0x0
Sep 16 10:48:30 172.29.12.12  [38858.997013] sd 0:0:0:0: [sda] tag#12 CDB: opcode=0x88 88 00 00 00 00 00 00 22 19 00 00 00 01 00 00 00
Sep 16 10:48:30 172.29.12.12  [38859.006104] blk_update_request: critical medium error, dev sda, sector 2234624 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 0
Sep 16 10:48:31 172.29.12.12  [38859.486704] sd 0:0:0:0: [sda] tag#13 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Sep 16 10:48:31 172.29.12.12  [38859.493604] sd 0:0:0:0: [sda] tag#13 Sense Key : 0x3 [current]
Sep 16 10:48:31 172.29.12.12  [38859.499568] sd 0:0:0:0: [sda] tag#13 ASC=0x11 ASCQ=0x0
Sep 16 10:48:31 172.29.12.12  [38859.504702] sd 0:0:0:0: [sda] tag#13 CDB: opcode=0x88 88 00 00 00 00 00 00 22 1a 00 00 00 01 00 00 00
Sep 16 10:48:31 172.29.12.12  [38859.513879] blk_update_request: I/O error, dev sda, sector 2234880 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 0
Sep 16 10:49:23 172.29.12.12  [38911.738248] sd 0:0:0:0: [sda] tag#26 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Sep 16 10:49:23 172.29.12.12  [38911.745135] sd 0:0:0:0: [sda] tag#26 Sense Key : 0x3 [current]
Sep 16 10:49:23 172.29.12.12  [38911.751020] sd 0:0:0:0: [sda] tag#26 ASC=0x11 ASCQ=0x0
Sep 16 10:49:23 172.29.12.12  [38911.756176] sd 0:0:0:0: [sda] tag#26 CDB: opcode=0x88 88 00 00 00 00 00 01 86 db c8 00 00 00 18 00 00
...skipping...
00 00 58 00 00
Sep 16 14:02:10 172.29.12.12  [10860.672578] sd 0:0:0:0: [sda] tag#12 uas_eh_abort_handler 0 uas-tag 6 inflight: CMD IN
Sep 16 14:02:10 172.29.12.12  [10860.679166] sd 0:0:0:0: [sda] tag#12 CDB: opcode=0x88 88 00 00 00 00 00 00 ce 28 e0 00 00 00 40 00 00
Sep 16 14:02:10 172.29.12.12  [10860.808580] sd 0:0:0:0: [sda] tag#26 uas_eh_abort_handler 0 uas-tag 7 inflight: CMD IN
Sep 16 14:02:10 172.29.12.12  [10860.815164] sd 0:0:0:0: [sda] tag#26 CDB: opcode=0x88 88 00 00 00 00 00 02 41 14 50 00 00 00 08 00 00
Sep 16 14:02:19 172.29.12.12  [10869.376337] sd 0:0:0:0: [sda] tag#3 uas_eh_abort_handler 0 uas-tag 11 inflight: CMD IN
Sep 16 14:02:19 172.29.12.12  [10869.382930] sd 0:0:0:0: [sda] tag#3 CDB: opcode=0x88 88 00 00 00 00 00 02 44 c3 d0 00 00 00 18 00 00
Sep 16 14:02:19 172.29.12.12  [10869.392551] sd 0:0:0:0: [sda] tag#2 uas_eh_abort_handler 0 uas-tag 10 inflight: CMD IN
Sep 16 14:02:19 172.29.12.12  [10869.400002] sd 0:0:0:0: [sda] tag#2 CDB: opcode=0x88 88 00 00 00 00 00 02 44 c3 c0 00 00 00 08 00 00
Sep 16 14:02:19 172.29.12.12  [10869.409872] sd 0:0:0:0: [sda] tag#1 uas_eh_abort_handler 0 uas-tag 9 inflight: CMD IN
Sep 16 14:02:19 172.29.12.12  [10869.416985] sd 0:0:0:0: [sda] tag#1 CDB: opcode=0x88 88 00 00 00 00 00 02 44 c3 98 00 00 00 08 00 00
Sep 16 14:02:19 172.29.12.12  [10869.427347] sd 0:0:0:0: [sda] tag#0 uas_eh_abort_handler 0 uas-tag 8 inflight: CMD IN
Sep 16 14:02:19 172.29.12.12  [10869.433965] sd 0:0:0:0: [sda] tag#0 CDB: opcode=0x88 88 00 00 00 00 00 02 44 c3 50 00 00 00 08 00 00
Sep 16 14:02:30 172.29.12.12  [10880.639942] sd 0:0:0:0: [sda] tag#23 uas_eh_abort_handler 0 uas-tag 3 inflight: CMD
Sep 16 14:02:30 172.29.12.12  [10880.646267] sd 0:0:0:0: [sda] tag#23 CDB: opcode=0x85 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00
Sep 16 14:02:30 172.29.12.12  [10880.667918] scsi host0: uas_eh_device_reset_handler start
Sep 16 14:02:30 172.29.12.12  [10880.800263] usb 4-1: reset SuperSpeed Gen 1 USB device number 2 using xhci-hcd
Sep 16 14:02:30 172.29.12.12  [10880.826429] scsi host0: uas_eh_device_reset_handler success
Sep 16 14:02:30 172.29.12.12  [10880.831134] sd 0:0:0:0: [sda] tag#23 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x06
Sep 16 14:02:30 172.29.12.12  [10880.838973] sd 0:0:0:0: [sda] tag#23 CDB: opcode=0x85 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00
My suspicion is that the drive actually gave up this time and it's no longer able to boot after this failure, but we'll find out soon enough.
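
For anyone reading the log: Sense Key 0x3 is "medium error" and ASC 0x11 / ASCQ 0x0 is "unrecovered read error", so the drive itself is reporting that it cannot read those sectors. Opcode 0x88 is READ(16), and its CDB can be decoded by hand to see which sectors were requested. Here is a small sketch; decode_read16 is a name made up for this example.

```shell
# decode_read16 "<16 hex bytes>"
# Decodes a READ(16) CDB as printed after "opcode=0x88" in the kernel log.
# Bytes 2-9 hold the starting LBA, bytes 10-13 the transfer length (big-endian).
# decode_read16 is a hypothetical helper name.
decode_read16() {
    # Split the argument into positional parameters: $1 is byte 0, $2 is byte 1, ...
    set -- $1
    if [ "$#" -ne 16 ] || [ "$1" != "88" ]; then
        echo "not a 16-byte READ(16) CDB" >&2
        return 1
    fi
    lba=$(( 0x$3$4$5$6$7$8$9${10} ))
    count=$(( 0x${11}${12}${13}${14} ))
    echo "lba=$lba sectors=$count"
}
```

Feeding it the first failing CDB, decode_read16 "88 00 00 00 00 00 00 1f 0f 00 00 00 02 00 00 00", prints lba=2035456 sectors=512, which matches the "sector 2035456" in the blk_update_request line that follows it.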
mctom wrote:
Thu Sep 16, 2021 2:13 am
You said you keep your OS on HDD rather than SD card? Perhaps that is a problem after all.
Yeah, I figured this is also something to look at.
I've since gotten my hands on the cards suggested by Hardkernel (those with the A1 classification) and suggested we step away from the original method of booting from the HDD.
Let's hope this has an overall positive impact.


Re: [Odroid HC2 - Ubuntu 20.04 - 5.4] Seemingly random unavailability/crashing

Post by mad_ady »

Yes... If the rootfs goes away, you're kind of stuck. You can prepare for it, though - maybe it helps?

The quick and dirty way: install busybox, copy (on boot) /bin/busybox to /dev/shm, and start a busybox telnet (or netcat) listener on some port. Make the rootfs go away and you should still be able to telnet to that port (I used netcat) and execute /dev/shm/busybox commands to see what happened.

A more long-term solution is to have the rootfs as some sort of squashfs that you can copy to RAM and mount from there.

Or, if a reboot is acceptable, use the busybox trick in /dev/shm with a self-contained script that runs /dev/shm/busybox dmesg | /dev/shm/busybox tail -30 | /dev/shm/busybox grep "sda ... error message" | /dev/shm/busybox wc -l. If the result is non-zero, do a /dev/shm/busybox reboot -f.
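
A rough sketch of such a self-contained check, under a few assumptions: the busybox copy in /dev/shm is a statically linked build (a dynamically linked one would need libraries from the rootfs that just went away), and the grep pattern is guessed from the blk_update_request lines in the netconsole log above. BB, PATTERN, THRESHOLD and the function names are all illustrative.

```shell
#!/bin/sh
# Sketch only: BB, PATTERN and THRESHOLD are illustrative names, not from the thread.
BB="${BB:-/dev/shm/busybox}"      # busybox copy living in tmpfs
PATTERN="${PATTERN:-blk_update_request: .*error, dev sda}"   # assumed match
THRESHOLD="${THRESHOLD:-1}"       # reboot once this many lines match

# Count matching lines among the last 30 kernel messages read from stdin.
count_disk_errors() {
    "$BB" tail -n 30 | "$BB" grep -c "$PATTERN"
}

# Succeed (exit 0) when the error count reaches the threshold.
should_reboot() {
    [ "$(count_disk_errors)" -ge "$THRESHOLD" ]
}
```

It could then be run periodically from a loop that was started before the rootfs disappeared, for example: "$BB" dmesg | should_reboot && "$BB" reboot -f. Keeping the reboot outside the functions makes it easy to dry-run the check against a saved log first.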
