Memory Bandwidth Tests


Memory Bandwidth Tests

Post by crashoverride »

Looking for comments, suggestions and test data to confirm memory performance. The expectation is DDR3-1866 performance (14933 MB/s).
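
For reference, 14933 MB/s is the theoretical peak of a 64-bit DDR3-1866 interface; no benchmark will actually reach it, but the arithmetic behind the figure is:

Code: Select all

DDR3-1866 : 933.33 MHz clock x 2 transfers/clock = 1866.67 MT/s
64-bit bus: 1866.67 MT/s x 8 bytes/transfer      = 14933 MB/s theoretical peak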

I am not sure which tool is best for measuring memory bandwidth in an AArch64 Linux environment.

The sysbench number is much lower:

Code: Select all

$ taskset -c 4-5 sysbench --test=memory --memory-block-size=1M --memory-total-size=100G --num-threads=1 run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Doing memory operations speed test
Memory block size: 1024K

Memory transfer size: 102400M

Memory operations type: write
Memory scope type: global
Threads started!
Done.

Operations performed: 102400 ( 3720.78 ops/sec)

102400.00 MB transferred (3720.78 MB/sec)


Test execution summary:
    total time:                          27.5211s
    total number of events:              102400
    total time taken by event execution: 27.4838
    per-request statistics:
         min:                                  0.26ms
         avg:                                  0.27ms
         max:                                  1.87ms
         approx.  95 percentile:               0.27ms

Threads fairness:
    events (avg/stddev):           102400.0000/0.00
    execution time (avg/stddev):   27.4838/0.00
The 3.7 GB/s number is nowhere near the expected 14.9 GB/s.
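
As an independent sanity check (a minimal sketch, not what sysbench itself does), a plain C loop that times repeated memset over a large buffer gives a comparable sequential write-bandwidth figure:

Code: Select all

/* membw_write.c - minimal sequential write-bandwidth check (illustrative only)
 * build: gcc -O2 -o membw_write membw_write.c
 * run:   taskset -c 4-5 ./membw_write
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE   (256UL * 1024 * 1024)   /* 256 MiB, well beyond the caches */
#define ITERATIONS 16

int main(void)
{
    char *buf = malloc(BUF_SIZE);
    if (!buf)
        return 1;
    memset(buf, 1, BUF_SIZE);              /* fault all pages in first */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERATIONS; i++)
        memset(buf, i, BUF_SIZE);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double mb   = (double)BUF_SIZE * ITERATIONS / 1e6;
    printf("write: %.1f MB/s (last byte %d)\n", mb / secs, buf[BUF_SIZE - 1]);

    free(buf);
    return 0;
}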

[edit]
The mbw result is also much lower than expected:

Code: Select all

$ taskset -c 4-5 mbw 1000
Long uses 8 bytes. Allocating 2*131072000 elements = 2097152000 bytes of memory.
Using 262144 bytes as blocks for memcpy block copy test.
Getting down to business... Doing 10 runs per test.
0       Method: MEMCPY  Elapsed: 0.37532        MiB: 1000.00000 Copy: 2664.365 MiB/s
1       Method: MEMCPY  Elapsed: 0.37517        MiB: 1000.00000 Copy: 2665.487 MiB/s
2       Method: MEMCPY  Elapsed: 0.37525        MiB: 1000.00000 Copy: 2664.911 MiB/s
3       Method: MEMCPY  Elapsed: 0.37537        MiB: 1000.00000 Copy: 2664.074 MiB/s
4       Method: MEMCPY  Elapsed: 0.37514        MiB: 1000.00000 Copy: 2665.650 MiB/s
5       Method: MEMCPY  Elapsed: 0.37511        MiB: 1000.00000 Copy: 2665.849 MiB/s
6       Method: MEMCPY  Elapsed: 0.37519        MiB: 1000.00000 Copy: 2665.302 MiB/s
7       Method: MEMCPY  Elapsed: 0.37496        MiB: 1000.00000 Copy: 2666.980 MiB/s
8       Method: MEMCPY  Elapsed: 0.37505        MiB: 1000.00000 Copy: 2666.304 MiB/s
9       Method: MEMCPY  Elapsed: 0.37523        MiB: 1000.00000 Copy: 2665.018 MiB/s
AVG     Method: MEMCPY  Elapsed: 0.37518        MiB: 1000.00000 Copy: 2665.394 MiB/s
0       Method: DUMB    Elapsed: 0.37721        MiB: 1000.00000 Copy: 2651.029 MiB/s
1       Method: DUMB    Elapsed: 0.37723        MiB: 1000.00000 Copy: 2650.896 MiB/s
2       Method: DUMB    Elapsed: 0.37648        MiB: 1000.00000 Copy: 2656.205 MiB/s
3       Method: DUMB    Elapsed: 0.37644        MiB: 1000.00000 Copy: 2656.494 MiB/s
4       Method: DUMB    Elapsed: 0.37718        MiB: 1000.00000 Copy: 2651.219 MiB/s
5       Method: DUMB    Elapsed: 0.37719        MiB: 1000.00000 Copy: 2651.163 MiB/s
6       Method: DUMB    Elapsed: 0.37731        MiB: 1000.00000 Copy: 2650.312 MiB/s
7       Method: DUMB    Elapsed: 0.37728        MiB: 1000.00000 Copy: 2650.523 MiB/s
8       Method: DUMB    Elapsed: 0.37723        MiB: 1000.00000 Copy: 2650.896 MiB/s
9       Method: DUMB    Elapsed: 0.37739        MiB: 1000.00000 Copy: 2649.779 MiB/s
AVG     Method: DUMB    Elapsed: 0.37710        MiB: 1000.00000 Copy: 2651.850 MiB/s
0       Method: MCBLOCK Elapsed: 0.22012        MiB: 1000.00000 Copy: 4542.997 MiB/s
1       Method: MCBLOCK Elapsed: 0.22007        MiB: 1000.00000 Copy: 4544.091 MiB/s
2       Method: MCBLOCK Elapsed: 0.22011        MiB: 1000.00000 Copy: 4543.224 MiB/s
3       Method: MCBLOCK Elapsed: 0.22005        MiB: 1000.00000 Copy: 4544.380 MiB/s
4       Method: MCBLOCK Elapsed: 0.22007        MiB: 1000.00000 Copy: 4544.009 MiB/s
5       Method: MCBLOCK Elapsed: 0.22003        MiB: 1000.00000 Copy: 4544.814 MiB/s
6       Method: MCBLOCK Elapsed: 0.22011        MiB: 1000.00000 Copy: 4543.204 MiB/s
7       Method: MCBLOCK Elapsed: 0.22010        MiB: 1000.00000 Copy: 4543.451 MiB/s
8       Method: MCBLOCK Elapsed: 0.22005        MiB: 1000.00000 Copy: 4544.401 MiB/s
9       Method: MCBLOCK Elapsed: 0.22006        MiB: 1000.00000 Copy: 4544.257 MiB/s
AVG     Method: MCBLOCK Elapsed: 0.22008        MiB: 1000.00000 Copy: 4543.883 MiB/s
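
Worth keeping in mind when reading the copy numbers: mbw (like most memcpy-style benchmarks) reports only the amount copied, while the DRAM sees both the read and the write, so the actual bus traffic is roughly double the reported figure (more still if the cache write-allocates):

Code: Select all

MEMCPY avg 2665 MiB/s copied  ->  ~2665 MiB/s read + ~2665 MiB/s written = ~5330 MiB/s of DRAM traffic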


Re: Memory Bandwidth Tests

Post by crashoverride »

@odroid

Can you confirm the DDR3 bootloader is setting the correct DDR timings? Normally, the boot chain will use "rk3399_ddr_933MHz_v1.08.bin" from here:
https://github.com/rockchip-linux/rkbin ... aster/rk33

I don't know whether the RK boot path or the U-Boot path is being used. If it's the latter, the U-Boot settings should be checked.

(Right now the RK terms for all these things elude me. I will need to brush up on the terminology.)


Re: Memory Bandwidth Tests

Post by DarkBahamut »

Don't forget that the CCI doesn't run at the same clock as the DRAM, so it's normally impossible to get full bandwidth in a unidirectional bandwidth test.

The most common arrangement is for the CCI to run at half the DRAM clock, but it is lower on some platforms. This is by design for power reasons, so it can't really be changed; it just needs to be accounted for when bandwidth testing :)
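
Purely as an illustration of that ceiling (the interconnect clock and width below are assumed numbers, not RK3399 datasheet values): if the CCI ran at half a 933 MHz DRAM clock and moved 16 bytes per cycle in each direction, a one-way stream would top out around:

Code: Select all

466 MHz x 16 bytes/cycle = ~7.5 GB/s one-way ceiling (vs. the 14.9 GB/s DRAM peak)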


Re: Memory Bandwidth Tests

Post by mlinuxguy »

I haven't seen any updates from HK on the GPIO toggle speed issue I posted in the HW forum. I did spend some time digging through the code trying to figure out what was running at what speed, without much luck. I was mostly comparing blocks I thought should run at a certain speed against their clocks; the DDR clocks were one of the things I looked at, and I couldn't really work out the correct clock rate. It made me think there was a clock divider somewhere I was missing.
All that said, I would not be surprised if some of the clocks are not what we think they are... GPIO or DDR.


Re: Memory Bandwidth Tests

Post by odroid »

Due to our limited resources, we had no time to work on the N1 this week.
We had to spend all of our resources fixing several critical issues on the XU4 and C2.
For example:
viewtopic.php?f=97&t=29638
viewtopic.php?f=97&t=29069
viewtopic.php?f=140&t=29735
viewtopic.php?f=137&t=29868
viewtopic.php?f=97&t=29587
viewtopic.php?f=141&t=29799

I hope we can join the N1 debug party again early next week.
Please understand our situation.


Re: Memory Bandwidth Tests

Post by rooted »

It's been a rough week, mostly due to the kernel. It's easy to see why so many SoCs run on outdated but well-tested kernels.


Re: Memory Bandwidth Tests

Post by crashoverride »

It does appear we are using a slower DRAM setting. I patched it, but I am not seeing any measurable improvement (yet).

Code: Select all

diff --git a/arch/arm/dts/rk3399-odroidn1.dts b/arch/arm/dts/rk3399-odroidn1.dts
index 3ebb8ec..d0e55ef 100644
--- a/arch/arm/dts/rk3399-odroidn1.dts
+++ b/arch/arm/dts/rk3399-odroidn1.dts
@@ -8,7 +8,7 @@
 #include <dt-bindings/pwm/pwm.h>
 #include <dt-bindings/pinctrl/rockchip.h>
 #include "rk3399.dtsi"
-#include "rk3399-sdram-ddr3-1600.dtsi"
+#include "rk3399-sdram-ddr3-1866.dtsi"
 
 / {
 	model = "Hardkernel ODROID-N1";
diff --git a/build.sh b/build.sh
index 3c201c6..5992f08 100755
--- a/build.sh
+++ b/build.sh
@@ -2,11 +2,11 @@
 
 make odroidn1_defconfig
 
-make
+make ARCH=arm CROSS_COMPILE=aarch64-linux-gnu-
 
 tools/rk_tools/bin/loaderimage --pack --uboot ./u-boot-dtb.bin uboot.img
 
-tools/mkimage -n rk3399 -T rksd -d tools/rk_tools/bin/rk33/rk3399_ddr_800MHz_v1.08.bin idbloader.img
+tools/mkimage -n rk3399 -T rksd -d tools/rk_tools/bin/rk33/rk3399_ddr_933MHz_v1.08.bin idbloader.img
 cat tools/rk_tools/bin/rk33/rk3399_miniloader_v1.06.bin >> idbloader.img
 
 cp tools/rk_tools/bin/rk33/rk3399_loader_v1.08.106.bin sd_fuse
[edit]
Ironically, my numbers from the two tests posted above are now lower. :shock:


Re: Memory Bandwidth Tests

Post by elatllat »

odroid wrote:...fix several critical issues...
Thanks for providing such good software support for your hardware.


Re: Memory Bandwidth Tests

Post by crashoverride »

"Stream" benchmark with DDR3 933Mhz

Code: Select all

$ ./stream_c.exe
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 6
Number of Threads counted = 6
-------------------------------------------------------------
Your clock granularity/precision appears to be 2 microseconds.
Each test below will take on the order of 41498 microseconds.
   (= 20749 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            6345.4     0.026320     0.025215     0.028407
Scale:           6353.0     0.026551     0.025185     0.028174
Add:             5689.5     0.043376     0.042183     0.044296
Triad:           5676.3     0.043872     0.042281     0.046206
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
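
For reference on how STREAM computes those rates: Copy and Scale count 16 bytes per array element (two arrays touched) while Add and Triad count 24 (three arrays), and the best of the timed runs is reported. A stripped-down sketch of the Copy and Triad loops (not the official STREAM code) looks like this:

Code: Select all

/* mini_stream.c - stripped-down sketch of the STREAM Copy and Triad kernels
 * (illustrative only, not the official benchmark; byte accounting follows the
 * STREAM convention of 16 bytes/element for Copy, 24 bytes/element for Triad)
 * build: gcc -O2 -o mini_stream mini_stream.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000UL    /* same array size as the run above */

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c)
        return 1;
    for (size_t j = 0; j < N; j++) { a[j] = 1.0; b[j] = 2.0; c[j] = 0.0; }

    double t = now();
    for (size_t j = 0; j < N; j++)          /* Copy: 8 bytes read + 8 written */
        c[j] = a[j];
    printf("Copy : %.1f MB/s\n", 16.0 * N / (now() - t) / 1e6);

    const double scalar = 3.0;
    t = now();
    for (size_t j = 0; j < N; j++)          /* Triad: 16 bytes read + 8 written */
        a[j] = b[j] + scalar * c[j];
    printf("Triad: %.1f MB/s\n", 24.0 * N / (now() - t) / 1e6);

    printf("check: %f\n", a[0]);            /* keep the stores live */
    free(a); free(b); free(c);
    return 0;
}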


Re: Memory Bandwidth Tests

Post by elatllat »

How does

Code: Select all

dd if=/dev/zero of=/dev/shm/test bs=10M count=10 && rm /dev/shm/test
do in comparison?


Re: Memory Bandwidth Tests

Post by crashoverride »

Code: Select all

$ dd if=/dev/zero of=/dev/shm/test bs=10M count=10 && rm /dev/shm/test
10+0 records in
10+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 0.121082 s, 866 MB/s
Also for reference:

Code: Select all

$ taskset -c 4-5 glmark2-es2                                    
=======================================================
    glmark2 2017.07
=======================================================
    OpenGL Information
    GL_VENDOR:     ARM
    GL_RENDERER:   Mali-T860
    GL_VERSION:    OpenGL ES 3.2 v1.r14p0-01rel0-git(966ed26).f44c85cb3d2ceb87e8be88e7592755c3
=======================================================
[build] use-vbo=false: FPS: 395 FrameTime: 2.532 ms
[build] use-vbo=true: FPS: 470 FrameTime: 2.128 ms
[texture] texture-filter=nearest: FPS: 503 FrameTime: 1.988 ms
[texture] texture-filter=linear: FPS: 502 FrameTime: 1.992 ms
[texture] texture-filter=mipmap: FPS: 504 FrameTime: 1.984 ms
[shading] shading=gouraud: FPS: 436 FrameTime: 2.294 ms
[shading] shading=blinn-phong-inf: FPS: 437 FrameTime: 2.288 ms
[shading] shading=phong: FPS: 414 FrameTime: 2.415 ms
[shading] shading=cel: FPS: 411 FrameTime: 2.433 ms
[bump] bump-render=high-poly: FPS: 264 FrameTime: 3.788 ms
[bump] bump-render=normals: FPS: 411 FrameTime: 2.433 ms
[bump] bump-render=height: FPS: 405 FrameTime: 2.469 ms
[effect2d] kernel=0,1,0;1,-4,1;0,1,0;: FPS: 341 FrameTime: 2.933 ms
[effect2d] kernel=1,1,1,1,1;1,1,1,1,1;1,1,1,1,1;: FPS: 211 FrameTime: 4.739 ms
[pulsar] light=false:quads=5:texture=false: FPS: 417 FrameTime: 2.398 ms
[desktop] blur-radius=5:effect=blur:passes=1:separable=true:windows=4: FPS: 190 FrameTime: 5.263 ms
[desktop] effect=shadow:windows=4: FPS: 317 FrameTime: 3.155 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 130 FrameTime: 7.692 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=subdata: FPS: 132 FrameTime: 7.576 ms
[buffer] columns=200:interleave=true:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 147 FrameTime: 6.803 ms
[ideas] speed=duration: FPS: 178 FrameTime: 5.618 ms
[jellyfish] <default>: FPS: 290 FrameTime: 3.448 ms
[terrain] <default>: FPS: 47 FrameTime: 21.277 ms
[shadow] <default>: FPS: 257 FrameTime: 3.891 ms
[refract] <default>: FPS: 94 FrameTime: 10.638 ms
[conditionals] fragment-steps=0:vertex-steps=0: FPS: 495 FrameTime: 2.020 ms
[conditionals] fragment-steps=5:vertex-steps=0: FPS: 415 FrameTime: 2.410 ms
[conditionals] fragment-steps=0:vertex-steps=5: FPS: 493 FrameTime: 2.028 ms
[function] fragment-complexity=low:fragment-steps=5: FPS: 456 FrameTime: 2.193 ms
[function] fragment-complexity=medium:fragment-steps=5: FPS: 398 FrameTime: 2.513 ms
[loop] fragment-loop=false:fragment-steps=5:vertex-steps=5: FPS: 462 FrameTime: 2.165 ms
[loop] fragment-steps=5:fragment-uniform=false:vertex-steps=5: FPS: 462 FrameTime: 2.165 ms
[loop] fragment-steps=5:fragment-uniform=true:vertex-steps=5: FPS: 411 FrameTime: 2.433 ms
=======================================================
                                  glmark2 Score: 348
=======================================================


Re: Memory Bandwidth Tests

Post by tkaiser »

Just as a reference: tinymembench numbers pinned to the big or the little cores, plus a comparison with other RK3399 devices, are here: https://forum.armbian.com/topic/6496-od ... ment=49414


Re: Memory Bandwidth Tests

Post by DarkBahamut »

Just a point of interest: the N1 doesn't have DDR3-1866 memory. The memory chips on board are Samsung K4B8G1646D-MYK0, which are 8 Gbit, 16-bit-wide chips rated at DDR3L-1600 11-11-11 @ 1.35 V (datasheet).

They might do 1866, but we'd effectively be overclocking the RAM to do so. I guess this would explain why U-Boot is loading the 1600 config ;)
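
For comparison, the theoretical ceilings of the two speed grades:

Code: Select all

DDR3L-1600: 800 MHz x 2 x 8 bytes = 12800 MB/s theoretical peak
DDR3-1866 : 933 MHz x 2 x 8 bytes = 14933 MB/s theoretical peak (about 17% more)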


Re: Memory Bandwidth Tests

Post by mlinuxguy »

On my custom kernel, using the Debian image:

Code: Select all

root@odroid:/sys/kernel/debug/clk# cat clk_summary | grep dpll
    pll_dpll                              1            1   792000000          0 0
       dpll                               1            1   792000000          0 0
          clk_ddrc_dpll_src               1            1   792000000          0 0
          clk_core_b_dpll_src             0            0   792000000          0 0
          clk_core_l_dpll_src             0            0   792000000          0 0

Code: Select all

root@odroid:/sys/kernel/debug/clk# cat clk_enabled_list | grep ddr
        sclk_ddrc:1:1 [792000000] -> clk_ddrc_dpll_src:1:1 [792000000] -> dpll:1:1 [792000000] -> pll_dpll:1:1 [792000000] -> xin24m:23:23 [24000000]
        clk_ddrc_dpll_src:1:1 [792000000] -> dpll:1:1 [792000000] -> pll_dpll:1:1 [792000000] -> xin24m:23:23 [24000000]
        pclk_center_main_noc:1:1 [200000000] -> pclk_ddr:1:1 [200000000] -> gpll:29:27 [800000000] -> pll_gpll:1:1 [800000000] -> xin24m:23:23 [24000000]
        pclk_ddr:1:1 [200000000] -> gpll:29:27 [800000000] -> pll_gpll:1:1 [800000000] -> xin24m:23:23 [24000000]


Re: Memory Bandwidth Tests

Post by crashoverride »

DarkBahamut wrote:The memory chips on board are Samsung K4B8G1646D-MYK0 which are 8Gbit 16bit chips rated at DDR3L-1600 11-11-11
I confirmed through visual inspection that this is what is physically mounted on my N1 board. The N1 announcement stated DDR3-1866, which is what I was previously going off of.

[Photo: the Samsung DRAM chips mounted on the N1 board]


Re: Memory Bandwidth Tests

Post by crashoverride »

DDR3-1866 operation seems legitimate:

Code: Select all

# cat /sys/kernel/debug/clk/clk_summary | grep dpll                
    pll_dpll                              1            1   912000000          0 
       dpll                               1            1   912000000          0 
          clk_ddrc_dpll_src               1            1   912000000          0 
          clk_core_b_dpll_src             0            0   912000000          0 
          clk_core_l_dpll_src             0            0   912000000          0 
                               
# cat /sys/kernel/debug/clk/clk_enabled_list | grep ddr            
        sclk_ddrc:1:1 [912000000] -> clk_ddrc_dpll_src:1:1 [912000000] -> dpll:]
        clk_ddrc_dpll_src:1:1 [912000000] -> dpll:1:1 [912000000] -> pll_dpll:1]
        pclk_center_main_noc:1:1 [200000000] -> pclk_ddr:1:1 [200000000] -> gpl]
        pclk_ddr:1:1 [200000000] -> gpll:29:26 [800000000] -> pll_gpll:1:1 [800]
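
Assuming the dpll rate shown above is the actual DRAM clock, 912 MHz works out slightly below the nominal DDR3-1866 figure:

Code: Select all

912 MHz x 2 = 1824 MT/s  ->  1824 MT/s x 8 bytes = 14592 MB/s theoretical peak
(nominal DDR3-1866 would be 933 MHz / 14933 MB/s)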


Re: Memory Bandwidth Tests

Post by odroid »

Right. The actual DRAM components on the engineering sample boards are 1600 MHz grade.
We will check the availability of 1866 MHz parts with DRAM vendors for mass production.
If it is not easy to source the higher-speed components reliably, we will change the N1 specification and block diagram.


Re: Memory Bandwidth Tests

Post by crashoverride »

DDR3-1866 tinymembench results (https://github.com/ssvb/tinymembench)

Code: Select all

odroid@odroid:~/tinymembench$ taskset -c 5 ./tinymembench 
tinymembench v0.4.9 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

 C copy backwards                                     :   2949.7 MB/s
 C copy backwards (32 byte blocks)                    :   2944.1 MB/s
 C copy backwards (64 byte blocks)                    :   2887.3 MB/s
 C copy                                               :   2900.8 MB/s
 C copy prefetched (32 bytes step)                    :   2870.9 MB/s
 C copy prefetched (64 bytes step)                    :   2871.7 MB/s
 C 2-pass copy                                        :   2595.9 MB/s
 C 2-pass copy prefetched (32 bytes step)             :   2658.3 MB/s
 C 2-pass copy prefetched (64 bytes step)             :   2650.4 MB/s
 C fill                                               :   4877.0 MB/s (0.4%)
 C fill (shuffle within 16 byte blocks)               :   4876.1 MB/s
 C fill (shuffle within 32 byte blocks)               :   4877.1 MB/s
 C fill (shuffle within 64 byte blocks)               :   4875.6 MB/s
 ---
 standard memcpy                                      :   2945.8 MB/s
 standard memset                                      :   4877.4 MB/s (0.4%)
 ---
 NEON LDP/STP copy                                    :   2937.4 MB/s
 NEON LDP/STP copy pldl2strm (32 bytes step)          :   2989.7 MB/s (0.1%)
 NEON LDP/STP copy pldl2strm (64 bytes step)          :   2982.0 MB/s
 NEON LDP/STP copy pldl1keep (32 bytes step)          :   2878.5 MB/s (0.1%)
 NEON LDP/STP copy pldl1keep (64 bytes step)          :   2875.8 MB/s
 NEON LD1/ST1 copy                                    :   2943.0 MB/s (0.1%)
 NEON STP fill                                        :   4875.5 MB/s (0.4%)
 NEON STNP fill                                       :   4842.2 MB/s (0.1%)
 ARM LDP/STP copy                                     :   2936.7 MB/s
 ARM STP fill                                         :   4874.6 MB/s (0.4%)
 ARM STNP fill                                        :   4843.9 MB/s (0.1%)

==========================================================================
== Framebuffer read tests.                                              ==
==                                                                      ==
== Many ARM devices use a part of the system memory as the framebuffer, ==
== typically mapped as uncached but with write-combining enabled.       ==
== Writes to such framebuffers are quite fast, but reads are much       ==
== slower and very sensitive to the alignment and the selection of      ==
== CPU instructions which are used for accessing memory.                ==
==                                                                      ==
== Many x86 systems allocate the framebuffer in the GPU memory,         ==
== accessible for the CPU via a relatively slow PCI-E bus. Moreover,    ==
== PCI-E is asymmetric and handles reads a lot worse than writes.       ==
==                                                                      ==
== If uncached framebuffer reads are reasonably fast (at least 100 MB/s ==
== or preferably >300 MB/s), then using the shadow framebuffer layer    ==
== is not necessary in Xorg DDX drivers, resulting in a nice overall    ==
== performance improvement. For example, the xf86-video-fbturbo DDX     ==
== uses this trick.                                                     ==
==========================================================================

 NEON LDP/STP copy (from framebuffer)                 :    658.1 MB/s
 NEON LDP/STP 2-pass copy (from framebuffer)          :    602.5 MB/s
 NEON LD1/ST1 copy (from framebuffer)                 :    702.4 MB/s
 NEON LD1/ST1 2-pass copy (from framebuffer)          :    656.3 MB/s
 ARM LDP/STP copy (from framebuffer)                  :    483.3 MB/s
 ARM LDP/STP 2-pass copy (from framebuffer)           :    471.7 MB/s

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers   ==
== of different sizes. The larger is the buffer, the more significant   ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
== accesses. For extremely large buffer sizes we are expecting to see   ==
== page table walk with several requests to SDRAM for almost every      ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest).                                         ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================

block size : single random read / dual random read
      1024 :    0.0 ns          /     0.0 ns 
      2048 :    0.0 ns          /     0.0 ns 
      4096 :    0.0 ns          /     0.0 ns 
      8192 :    0.0 ns          /     0.0 ns 
     16384 :    0.0 ns          /     0.0 ns 
     32768 :    0.0 ns          /     0.0 ns 
     65536 :    4.1 ns          /     6.5 ns 
    131072 :    6.2 ns          /     8.7 ns 
    262144 :    8.9 ns          /    11.6 ns 
    524288 :   10.3 ns          /    13.3 ns 
   1048576 :   14.6 ns          /    20.3 ns 
   2097152 :  100.1 ns          /   153.2 ns 
   4194304 :  141.9 ns          /   192.2 ns 
   8388608 :  168.2 ns          /   212.8 ns 
  16777216 :  181.2 ns          /   222.2 ns 
  33554432 :  188.3 ns          /   228.5 ns 
  67108864 :  197.7 ns          /   241.9 ns 


Re: Memory Bandwidth Tests

Post by odroid »

I didn't know that tinymembench supported the ARM-specific NEON instructions.

BTW, I have no idea why the standard memset speed is so similar to the NEON-accelerated fill. :?:

Code: Select all

standard memset                                      :   4877.4 MB/s (0.4%)
vs

Code: Select all

NEON STP fill                                        :   4875.5 MB/s (0.4%)


Re: Memory Bandwidth Tests

Post by crashoverride »

By now, there are probably AArch64-optimized versions of memset and memcpy provided. I have not verified this; it's just a guess.
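
If that guess is right, the near-identical numbers are not surprising: a large sequential fill is limited by DRAM write bandwidth, so any reasonable store loop hits roughly the same ceiling whether or not it uses NEON. A minimal sketch that illustrates this (not tinymembench code) would be:

Code: Select all

/* fill_compare.c - compare glibc memset against a plain 64-bit store loop
 * (illustrative sketch; on a large buffer both should end up limited by
 * DRAM write bandwidth rather than by instruction selection)
 * build: gcc -O2 -o fill_compare fill_compare.c
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE (128UL * 1024 * 1024)      /* 128 MiB, far beyond the caches */

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    uint64_t *buf = malloc(BUF_SIZE);
    if (!buf)
        return 1;
    memset(buf, 0, BUF_SIZE);               /* fault all pages in first */

    double t = now();
    memset(buf, 0xAB, BUF_SIZE);
    printf("memset     : %.1f MB/s\n", BUF_SIZE / (now() - t) / 1e6);

    t = now();
    for (size_t i = 0; i < BUF_SIZE / sizeof(uint64_t); i++)
        buf[i] = 0xABABABABABABABABULL;     /* plain 64-bit stores in the C source */
    printf("store loop : %.1f MB/s\n", BUF_SIZE / (now() - t) / 1e6);

    printf("check: %llx\n", (unsigned long long)buf[1]);   /* keep stores live */
    free(buf);
    return 0;
}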

Where I think the memory speed (bandwidth) is going to make the most difference is with 4K@60 video. We saw on the C2 that, without AFBC, 4K video tops out around 40 fps. While the RK3399 does support AFBC, it's unclear whether it's usable in Linux. Since Kodi is moving to remove all platform-specific optimizations, it's highly likely that AFBC will not be used on the RK3399.

I have not tested video at all yet. I need to conclude my SATA tests so I can get my HDDs back for use first. ;)


Re: Memory Bandwidth Tests

Post by DarkBahamut »

Interesting results. I tested the A15s on the XU4 with the 925 MHz memory clock. The A15s actually get slightly more bandwidth in the memset test (5375 MB/s), but the other results are typically lower. There could be lots of reasons for that, though, given ARMv7 vs. ARMv8 AArch64.

One thing I noticed was the memory latency on the A72 cores. @crashoverride, were your results run with the A72s locked to 2 GHz?


Re: Memory Bandwidth Tests

Post by crashoverride »

DarkBahamut wrote:were your results run with the A72 locked to 2GHz?
Yes.

Code: Select all

$ cpufreq-info
cpufrequtils 008: cpufreq-info (C) Dominik Brodowski 2004-2009
Report errors and bugs to cpufreq@vger.kernel.org, please.
analyzing CPU 0:
  driver: cpufreq-dt
  CPUs which run at the same hardware frequency: 0 1 2 3
  CPUs which need to have their frequency coordinated by software: 0 1 2 3
  maximum transition latency: 40.0 us.
  hardware limits: 408 MHz - 1.51 GHz
  available frequency steps: 408 MHz, 600 MHz, 816 MHz, 1.01 GHz, 1.20 GHz, 1.42 GHz, 1.51 GHz
  available cpufreq governors: conservative, ondemand, userspace, powersave, interactive, performance
  current policy: frequency should be within 408 MHz and 1.51 GHz.
                  The governor "performance" may decide which speed to use
                  within this range.
  current CPU frequency is 1.51 GHz.
  cpufreq stats: 408 MHz:2.79%, 600 MHz:0.34%, 816 MHz:0.10%, 1.01 GHz:0.07%, 1.20 GHz:0.04%, 1.42 GHz:0.03%, 1.51 GHz:96.63%  (7281)
analyzing CPU 1:
  driver: cpufreq-dt
  CPUs which run at the same hardware frequency: 0 1 2 3
  CPUs which need to have their frequency coordinated by software: 0 1 2 3
  maximum transition latency: 40.0 us.
  hardware limits: 408 MHz - 1.51 GHz
  available frequency steps: 408 MHz, 600 MHz, 816 MHz, 1.01 GHz, 1.20 GHz, 1.42 GHz, 1.51 GHz
  available cpufreq governors: conservative, ondemand, userspace, powersave, interactive, performance
  current policy: frequency should be within 408 MHz and 1.51 GHz.
                  The governor "performance" may decide which speed to use
                  within this range.
  current CPU frequency is 1.51 GHz.
  cpufreq stats: 408 MHz:2.79%, 600 MHz:0.34%, 816 MHz:0.10%, 1.01 GHz:0.07%, 1.20 GHz:0.04%, 1.42 GHz:0.03%, 1.51 GHz:96.63%  (7281)
analyzing CPU 2:
  driver: cpufreq-dt
  CPUs which run at the same hardware frequency: 0 1 2 3
  CPUs which need to have their frequency coordinated by software: 0 1 2 3
  maximum transition latency: 40.0 us.
  hardware limits: 408 MHz - 1.51 GHz
  available frequency steps: 408 MHz, 600 MHz, 816 MHz, 1.01 GHz, 1.20 GHz, 1.42 GHz, 1.51 GHz
  available cpufreq governors: conservative, ondemand, userspace, powersave, interactive, performance
  current policy: frequency should be within 408 MHz and 1.51 GHz.
                  The governor "performance" may decide which speed to use
                  within this range.
  current CPU frequency is 1.51 GHz.
  cpufreq stats: 408 MHz:2.79%, 600 MHz:0.34%, 816 MHz:0.10%, 1.01 GHz:0.07%, 1.20 GHz:0.04%, 1.42 GHz:0.03%, 1.51 GHz:96.63%  (7281)
analyzing CPU 3:
  driver: cpufreq-dt
  CPUs which run at the same hardware frequency: 0 1 2 3
  CPUs which need to have their frequency coordinated by software: 0 1 2 3
  maximum transition latency: 40.0 us.
  hardware limits: 408 MHz - 1.51 GHz
  available frequency steps: 408 MHz, 600 MHz, 816 MHz, 1.01 GHz, 1.20 GHz, 1.42 GHz, 1.51 GHz
  available cpufreq governors: conservative, ondemand, userspace, powersave, interactive, performance
  current policy: frequency should be within 408 MHz and 1.51 GHz.
                  The governor "performance" may decide which speed to use
                  within this range.
  current CPU frequency is 1.51 GHz.
  cpufreq stats: 408 MHz:2.79%, 600 MHz:0.34%, 816 MHz:0.10%, 1.01 GHz:0.07%, 1.20 GHz:0.04%, 1.42 GHz:0.03%, 1.51 GHz:96.63%  (7281)
analyzing CPU 4:
  driver: cpufreq-dt
  CPUs which run at the same hardware frequency: 4 5
  CPUs which need to have their frequency coordinated by software: 4 5
  maximum transition latency: 540 us.
  hardware limits: 408 MHz - 1.99 GHz
  available frequency steps: 408 MHz, 600 MHz, 816 MHz, 1.01 GHz, 1.20 GHz, 1.42 GHz, 1.61 GHz, 1.80 GHz, 1.99 GHz
  available cpufreq governors: conservative, ondemand, userspace, powersave, interactive, performance
  current policy: frequency should be within 408 MHz and 1.99 GHz.
                  The governor "performance" may decide which speed to use
                  within this range.
  current CPU frequency is 1.99 GHz.
  cpufreq stats: 408 MHz:10.20%, 600 MHz:0.29%, 816 MHz:0.17%, 1.01 GHz:0.30%, 1.20 GHz:0.27%, 1.42 GHz:0.09%, 1.61 GHz:0.13%, 1.80 GHz:0.21%, 1.99 GHz:88.33%  (2021)
analyzing CPU 5:
  driver: cpufreq-dt
  CPUs which run at the same hardware frequency: 4 5
  CPUs which need to have their frequency coordinated by software: 4 5
  maximum transition latency: 540 us.
  hardware limits: 408 MHz - 1.99 GHz
  available frequency steps: 408 MHz, 600 MHz, 816 MHz, 1.01 GHz, 1.20 GHz, 1.42 GHz, 1.61 GHz, 1.80 GHz, 1.99 GHz
  available cpufreq governors: conservative, ondemand, userspace, powersave, interactive, performance
  current policy: frequency should be within 408 MHz and 1.99 GHz.
                  The governor "performance" may decide which speed to use
                  within this range.
  current CPU frequency is 1.99 GHz.
  cpufreq stats: 408 MHz:10.20%, 600 MHz:0.29%, 816 MHz:0.17%, 1.01 GHz:0.30%, 1.20 GHz:0.27%, 1.42 GHz:0.09%, 1.61 GHz:0.13%, 1.80 GHz:0.21%, 1.99 GHz:88.33%  (2021)


Re: Memory Bandwidth Tests

Post by tkaiser »

RK3399 Chromebook Plus running 4.16-rc1: http://ix.io/Gun
RK3399 Sapphire board running 4.15: https://pastebin.com/raw/RYASmY0D
RK3399-Q7 running 4.4 and rk3399-sdram-lpddr3-4GB-1600.dtsi: https://gist.githubusercontent.com/anon ... tfile1.txt

Beeble reported that with 4.15 on the RK3399-Q7 he got results similar to the Sapphire board -- all details: https://irclog.whitequark.org/linux-roc ... 2#21298744;


Re: Memory Bandwidth Tests

Post by crashoverride »

So it seems kernel 4.15+ has some kind of optimization that is not in 4.4.


Re: Memory Bandwidth Tests

Post by DarkBahamut »

Thanks for the other benchmarks. There seems to be quite a bit of variance in the tests depending on the kernel used. The memset results go up quite a lot :o


Re: Memory Bandwidth Tests

Post by mlinuxguy »

crashoverride wrote:Where I think the memory speed (bandwidth) is going to make the most difference is with 4k@60 video. We saw on C2, without AFBC, 4k video tops out around 40fps. I have not tested video at all yet. ;)
FYI, for what it's worth, I tested es2gears on my 60 Hz 4K monitor and got 60 fps.
