Mono issue with big.LITTLE arch

Post by skuizy » Sun Jun 03, 2018 11:53 pm

Hello everyone,

First of all, forgive me if this isn't the right place to discuss this subject; I'm fairly new to this debug party.

Also, English isn't my native language, so tell me if you need more information.

Like some other users, I've had random crashes with Mono that prevent my apps from running correctly (THIS GitHub thread is a great example of the random crashes experienced). The problem seems to be related to JIT compilers more generally.

After a fair amount of searching, I've concluded that this issue seems to be caused by the big.LITTLE architecture the XU4 uses.

From my understanding, the crashes happen when the scheduler moves the process from a big core to a little one, as the cache lines are not the same size.
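
To make the suspected mechanism concrete, here is a toy Python model; it is not Mono's actual code. A JIT has to flush or invalidate cache lines after writing new machine code. If the loop steps through memory using a line size that was read on one core, but then runs on a core whose lines are smaller, some lines are never touched. The sizes 128 and 64 below are illustrative assumptions, not values measured on the XU4, and the function names are hypothetical.

```python
def lines_needed(start, end, actual_line):
    """Addresses of every real cache line overlapping [start, end)."""
    first = start - (start % actual_line)
    return set(range(first, end, actual_line))


def flushed_lines(start, end, assumed_line, actual_line):
    """Real cache lines touched by a flush loop that walks [start, end)
    in strides of `assumed_line` bytes while the hardware line size is
    `actual_line` bytes."""
    touched = set()
    addr = start - (start % assumed_line)
    while addr < end:
        # Each flush instruction only affects the one real line holding addr.
        touched.add(addr - (addr % actual_line))
        addr += assumed_line
    return touched


# Illustrative sizes only: stride learned on a "big" core (128 bytes),
# loop actually executing on a "little" core (64 bytes).
needed = lines_needed(0, 512, actual_line=64)
done = flushed_lines(0, 512, assumed_line=128, actual_line=64)
missed = sorted(needed - done)
print("lines never flushed:", missed)
```

In this model every second line is skipped, which would leave stale instructions behind for the core to execute.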

My question is: are the corrections made in THIS thread still relevant?

If so, have they been merged into our kernel?

It also seems that a workaround has been implemented in Mono, as explained in THIS blog post, but it doesn't seem to solve all the issues...

Thanks!
skuizy
 
Posts: 4
Joined: Sun Jun 03, 2018 11:25 pm
languages_spoken: english,french
ODROIDs: XU4

Re: Mono issue with big.LITTLE arch

Post by mad_ady » Mon Jun 04, 2018 1:22 am

That patch is apparently for arm64, while the XU4 has a 32-bit processor.

You can work around the issue in userland by pinning the mono process to either the big or the little cores. You can use taskset or cgroups (https://magazine.odroid.com/article/set ... rpose-nas/)
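
A minimal sketch of the pinning idea in Python, for illustration only. It assumes the usual XU4 numbering where CPUs 0-3 are the Cortex-A7 (little) cores and CPUs 4-7 are the Cortex-A15 (big) cores; that mapping and the helper name `pin_to_cluster` are assumptions, so check /proc/cpuinfo on your board first.

```python
import os

# Assumed XU4 layout (verify on your own board before relying on it).
LITTLE_CORES = {0, 1, 2, 3}   # Cortex-A7
BIG_CORES = {4, 5, 6, 7}      # Cortex-A15


def pin_to_cluster(pid, cluster):
    """Restrict thread `pid` (0 means the caller) to the CPUs in `cluster`.

    CPUs the calling process is not allowed to use are dropped first, so
    the sketch also runs on machines with fewer than eight cores.
    Returns the affinity set reported by the kernel afterwards.
    """
    usable = set(cluster) & os.sched_getaffinity(0)
    if not usable:
        raise ValueError("none of the requested CPUs are available")
    os.sched_setaffinity(pid, usable)
    return os.sched_getaffinity(pid)


# Example (not executed here): pin ourselves to the big cluster, then
# replace this process with mono; the affinity is inherited across exec.
#   pin_to_cluster(0, BIG_CORES)
#   os.execvp("mono", ["mono", "app.exe"])   # app.exe is a placeholder
```

From a shell, `taskset -c 4-7 mono app.exe` should have the same effect under the same CPU-numbering assumption. Note that `sched_setaffinity` acts on a single thread, so an already-running Mono process would need the call applied to each of its threads (taskset has the `-a` option for that).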
mad_ady
 
Posts: 3794
Joined: Wed Jul 15, 2015 5:00 pm
Location: Bucharest, Romania
languages_spoken: english
ODROIDs: XU4, C1+, C2

Re: Mono issue with big.LITTLE arch

Post by skuizy » Mon Jun 04, 2018 3:46 am

Thanks!

I'll give it a shot to confirm the origin of the issue!

But I think this is an ugly way of solving this issue. We use a clever implementation of the ARM architecture with a clever scheduler that maximises efficiency, so forcing the execution only on the big cores or the little cores doesn't make sense...

Despite the arm64 vs. arm32 difference, if the origin of the issue is the same (different cache sizes), couldn't a similar fix be integrated?
skuizy
 
Posts: 4
Joined: Sun Jun 03, 2018 11:25 pm
languages_spoken: english,french
ODROIDs: XU4

Re: Mono issue with big.LITTLE arch

Post by crashoverride » Mon Jun 04, 2018 5:01 am

The cache issue was patched in Mono a long time ago. Try using the Microsoft (Xamarin) provided repository to ensure you have the latest version.
https://www.mono-project.com/download/stable/#download-lin-ubuntu
crashoverride
 
Posts: 3511
Joined: Tue Dec 30, 2014 8:42 pm
languages_spoken: english
ODROIDs: C1

Re: Mono issue with big.LITTLE arch

Post by skuizy » Mon Jun 04, 2018 4:42 pm

I'm already on the latest stable version, which is 5.12.0.226.

From my understanding, the fix has been included since the 4.6.0.245 release...
skuizy
 
Posts: 4
Joined: Sun Jun 03, 2018 11:25 pm
languages_spoken: english,french
ODROIDs: XU4

Re: Mono issue with big.LITTLE arch

Post by crashoverride » Tue Jun 05, 2018 12:21 am

I have not encountered any cache issues with mono on XU4. If the issue is reproducible, you should report it to Microsoft.
crashoverride
 
Posts: 3511
Joined: Tue Dec 30, 2014 8:42 pm
languages_spoken: english
ODROIDs: C1

Re: Mono issue with big.LITTLE arch

Post by DarkBahamut » Tue Jun 05, 2018 4:42 am

skuizy wrote: But I think this is an ugly way of solving this issue. We use a clever implementation of the ARM architecture with a clever scheduler that maximises efficiency, so forcing the execution only on the big cores or the little cores doesn't make sense...


4.14 doesn't have a clever scheduler, unfortunately. big.LITTLE GTS isn't used, so I wouldn't be too worried about forcing cores manually. It's definitely your best option.

The current hperf hmp code by default tries to run all code on the A15 cluster anyway, then spills over onto the A7 cluster when it is full. That's the theory anyway; in practice it doesn't really work smoothly.
DarkBahamut
 
Posts: 305
Joined: Tue Jan 19, 2016 10:19 am
languages_spoken: english
ODROIDs: XU4

Re: Mono issue with big.LITTLE arch

Post by skuizy » Sat Jun 09, 2018 8:23 pm

Alright guys, I've just tried mad_ady's workaround, which seems to be working for now! :)

crashoverride wrote: I have not encountered any cache issues with mono on XU4. If the issue is reproducible, you should report it to Microsoft.

I'll let my XU4 run for another week with the cgroup method to confirm that the issue came from the mono process migrating from the big cores to the little cores.
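
For reference, the cgroup variant of the workaround can be sketched roughly as follows. This assumes a cgroup v1 `cpuset` hierarchy mounted at /sys/fs/cgroup/cpuset with the standard `cpuset.cpus`, `cpuset.mems` and `cgroup.procs` files, and that CPUs 4-7 are the big cores. The group name `mono_big` and the helper names are hypothetical, and root privileges are needed on a real system.

```python
import os


def make_cpuset(name, cpus, mems="0", root="/sys/fs/cgroup/cpuset"):
    """Create (or reuse) a cpuset group and restrict it to `cpus`.

    `cpus` is a CPU list string such as "4-7". In cgroup v1 the memory
    nodes must also be set before any process can be attached.
    Returns the path of the group directory.
    """
    path = os.path.join(root, name)
    os.makedirs(path, exist_ok=True)
    for fname, value in (("cpuset.cpus", cpus), ("cpuset.mems", mems)):
        with open(os.path.join(path, fname), "w") as f:
            f.write(value)
    return path


def add_process(cpuset_path, pid):
    """Move a whole process (all of its threads) into the cpuset."""
    with open(os.path.join(cpuset_path, "cgroup.procs"), "w") as f:
        f.write(str(pid))


# Example (not executed here; needs root and a mounted cpuset hierarchy):
#   group = make_cpuset("mono_big", "4-7")
#   add_process(group, mono_pid)   # mono_pid: PID of the running mono process
```

Writing to `cgroup.procs` rather than `tasks` is a deliberate choice in this sketch, because Mono is multi-threaded and `tasks` would only move a single thread.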

DarkBahamut wrote: 4.14 doesn't have a clever scheduler, unfortunately. big.LITTLE GTS isn't used, so I wouldn't be too worried about forcing cores manually. It's definitely your best option.

The current hperf hmp code by default tries to run all code on the A15 cluster anyway, then spills over onto the A7 cluster when it is full. That's the theory anyway; in practice it doesn't really work smoothly.

Well... That's disappointing... Is there any reason for not using it? Is it a stability issue? A compatibility issue?
Which scheduler is the author of the article at mad_ady's link referring to, then?
Adrian Popa wrote: The official kernel comes with a “magic” scheduler from Samsung which knows the processor’s true power, and can switch tasks from the little cores to the big cores when load is high.
skuizy
 
Posts: 4
Joined: Sun Jun 03, 2018 11:25 pm
languages_spoken: english,french
ODROIDs: XU4

Re: Mono issue with big.LITTLE arch

Post by mad_ady » Sun Jun 10, 2018 2:09 am

I'm not sure what it's called exactly (MFC?), but there are some Linux kernel mailing list messages suggesting it was not accepted in its current form.
mad_ady
 
Posts: 3794
Joined: Wed Jul 15, 2015 5:00 pm
Location: Bucharest, Romania
languages_spoken: english
ODROIDs: XU4, C1+, C2

Re: Mono issue with big.LITTLE arch

Post by crashoverride » Sun Jun 10, 2018 7:00 am

The cache issue occurs only when going from a little core to a big core. In this instance, the cache on the big core is larger and contains undefined contents in the expanded area. The symptom of the cache issue is SIGILL (illegal instruction), because it's due to the ICACHE. The symptom stated in the referenced GitHub issue is NullReferenceException, which indicates a "runtime" error. The difference is significant in that SIGILL is an instruction trap and NullReferenceException is a data trap.

Based on the logs in the GitHub issue, a more likely candidate for the issue is "futex()":
viewtopic.php?f=146&t=29501#p217699
Concerning my futex() issue ... turning off the little cores fixes the problem I was trying to fix.
crashoverride
 
Posts: 3511
Joined: Tue Dec 30, 2014 8:42 pm
languages_spoken: english
ODROIDs: C1

