Odroid XU4 + GPU Deep Learning / Tensorflow

User avatar
memeka
Posts: 4420
Joined: Mon May 20, 2013 10:22 am
languages_spoken: english
ODROIDs: XU rev2 + eMMC + UART
U3 + eMMC + IO Shield + UART
Has thanked: 2 times
Been thanked: 58 times
Contact:

Odroid XU4 + GPU Deep Learning / Tensorflow

Post by memeka » Tue Sep 05, 2017 3:05 pm

Not sure anyone is interested in this, but:

Code: Select all

odroid@odroid:~/src/Theano$ THEANO_FLAGS=device=cpu,floatX=float32 python theano.py
[Elemwise{exp,no_inplace}(<TensorType(float32, vector)>)]
Looping 1000 times took 11.920728 seconds
Result is [ 1.23178029  1.61879337  1.52278066 ...,  2.20771813  2.29967761
  1.62323284]
Used the cpu

odroid@odroid:~/src/Theano$ THEANO_FLAGS=device=opencl0:0,floatX=float32 python theano.py
Mapped name None to device opencl0:0: Mali-T628
[GpuElemwise{exp,no_inplace}(<GpuArrayType<None>(float32, vector)>), HostFromGpu(gpuarray)(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 2.553153 seconds
Result is [ 1.23178029  1.61879337  1.52278066 ...,  2.20771813  2.29967761
  1.62323284]
Used the gpu
Last edited by memeka on Fri Sep 15, 2017 9:47 am, edited 1 time in total.

User avatar
odroid
Site Admin
Posts: 34586
Joined: Fri Feb 22, 2013 11:14 pm
languages_spoken: English, Korean, Japanese
ODROIDs: ODROID
Has thanked: 812 times
Been thanked: 704 times
Contact:

Re: Odroid XU4 + Theano Deep Learning

Post by odroid » Tue Sep 05, 2017 4:57 pm

Ummm... GPU computing is 4~5 times faster than CPU computing. :o
But I have no idea about Deep Learning itself. ;)

Re: Odroid XU4 + Theano Deep Learning

Post by memeka » Tue Sep 05, 2017 6:52 pm

I will be testing Keras + Theano deep learning frameworks to see if xu4 can use GPU to accelerate object detections and/or classification in images.
It would be cool to use webcam + motion for security camera, then send alert only when people are detected (ignore animals and other false positives)..

Re: Odroid XU4 + Theano Deep Learning

Post by memeka » Wed Sep 06, 2017 3:40 pm

OK, so with a few modifications in the Keras framework, I was able to use the Mali GPU to do image classification:

Code: Select all

odroid@odroid:~/src/keras$ THEANO_FLAGS=mode=FAST_RUN,device=opencl0:0,floatX=float32 python classify.py --image images/beer.png
Using Theano backend.
Mapped name None to device opencl0:0: Mali-T628
[INFO] loading network...
[INFO] loading and preprocessing image...
[INFO] classifying image...
Classifying the image took 490.016743 seconds, here are the results:
1. beer_glass: 100.00%
2. pop_bottle: 99.56%
3. goblet: 97.35%
For some reason, it throws errors when I try to use the CPU to do the same (some linear algebra libraries are missing symbols), so I cannot compare it with the CPU.
If somebody knows how long it takes Keras to do image classification on other systems (code: http://www.pyimagesearch.com/2017/03/20 ... ion-keras/), please let me know.
As you can see, on the Mali (using only 4 cores, since it can use only one GPU cluster) it takes ~8 minutes... not as fast as I hoped :( -- certainly not good enough for a security system!

EDIT: for comparison, it takes ~1 second on my 2 x 8-core Xeon server, with the TensorFlow backend :D

EDIT2: there is definitely something wrong; here are the TensorFlow results (compiled for RPi, using the CPU):

Code: Select all

odroid@odroid:~/src/keras$ python classify.py --image images/beer.png
Using TensorFlow backend.
[INFO] loading network...
[INFO] loading and preprocessing image...
[INFO] classifying image...
Classifying the image took 3.346749 seconds, here are the results:
1. beer_glass: 98.62%
2. pop_bottle: 0.18%
3. goblet: 0.05%
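A quick way to see that the Theano numbers above cannot be right, assuming the network ends in a softmax layer (as the ImageNet classifiers shipped with Keras normally do): the class probabilities must sum to at most 100%. The helper name plausible_topk and the tolerance are made up for this sketch; the scores are the ones printed in the two runs.

```python
def plausible_topk(probabilities, tolerance=1e-3):
    """Return True if the scores could all come from one softmax output."""
    in_range = all(0.0 <= p <= 1.0 for p in probabilities)
    return in_range and sum(probabilities) <= 1.0 + tolerance


# Top-3 scores copied from the two runs above, as fractions.
theano_top3 = [1.0000, 0.9956, 0.9735]
tensorflow_top3 = [0.9862, 0.0018, 0.0005]

print(plausible_topk(theano_top3))      # False: the three scores sum to ~297%
print(plausible_topk(tensorflow_top3))  # True: the three scores sum to ~98.9%
```

So the Theano/OpenCL run got the top label right, but its scores are not a valid probability distribution, which points at a numerical problem somewhere in that path rather than just slowness.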

Re: Odroid XU4 + Theano Deep Learning

Post by memeka » Thu Sep 07, 2017 11:38 am

update: it's possible to run CUDA applications on Mali GPU using the coriander OpenCL translator.
I am using CUDA 6.5 from the Jetson TK1 SDK, since newer CUDA versions are released only for the arm64 or Intel architectures.

Code: Select all

odroid@odroid:~/src/cuda$ ./cuda_sample
OpenCL platform: ARM Platform
OpenCL device: Mali-T628
hostFloats[2] 123
hostFloats[2] 222
hostFloats[2] 444

elatllat
Posts: 1763
Joined: Tue Sep 01, 2015 8:54 am
languages_spoken: english
ODROIDs: XU4, N1, N2, C4
Has thanked: 45 times
Been thanked: 113 times
Contact:

Re: Odroid XU4 + Theano Deep Learning

Post by elatllat » Fri Sep 08, 2017 1:50 am

memeka wrote:it's possible to run CUDA applications on Mali GPU using the coriander OpenCL translator...
memeka wrote:...1. beer_glass: 100.00%...
That's really cool.
memeka wrote:...Classifying the image took 490.016743 seconds...
That's really limiting; I thought I saw real time (~1s) object identification on similar hardware, but maybe it was only for simple stuff like "face = 2 eyes", still...

User avatar
mad_ady
Posts: 8151
Joined: Wed Jul 15, 2015 5:00 pm
languages_spoken: english
ODROIDs: XU4, C1+, C2, C4, N1, N2, H2, Go, Go Advance
Location: Bucharest, Romania
Has thanked: 567 times
Been thanked: 403 times
Contact:

Re: Odroid XU4 + Theano Deep Learning

Post by mad_ady » Fri Sep 08, 2017 2:27 am

Maybe you were showing it bottles of American beer and it was having a hard time deciding whether it was beer or water
Try with German beer...

Re: Odroid XU4 + Theano Deep Learning

Post by memeka » Fri Sep 08, 2017 6:16 am

@elatllat - it was 490 seconds when using the Theano framework. The same test with TensorFlow on the CPU took 3 seconds. I am checking to see if I can get TensorFlow to work on the GPU, to compare (Theano on the CPU crashes).

3 seconds is not realtime, but at least it's possible to use it for a security camera by analyzing the images from Motion (see https://www.bouvet.no/bouvet-deler/utbr ... ces-part-1 but with local analysis instead of cloud services).
obviously, it would be great if we can get <1s, and get it to work with the video feed :)

Re: Odroid XU4 + Theano Deep Learning

Post by elatllat » Fri Sep 08, 2017 6:51 am

nice; 3s on CPU is promising.

Re: Odroid XU4 + Theano Deep Learning

Post by mad_ady » Fri Sep 08, 2017 1:40 pm

@memeka: perhaps when you are done you can submit a nice article about it

Re: Odroid XU4 + GPU Deep Learning / Tensorflow

Post by memeka » Fri Sep 15, 2017 9:50 am

The GOOD news:
I got tensorflow running with GPU acceleration on the XU4:

Code: Select all

odroid@odroid:~/src/tensorflow/example$ TF_MIN_GPU_MULTIPROCESSOR_COUNT=2 python helloworld.py
OpenCL platform: ARM Platform
OpenCL device: Mali-T628
I tensorflow/core/common_runtime/gpu/gpu_device.cc:989] Found device 0 with properties:
name: Mali-T628
major: -1 minor: -1 memoryClockRate (GHz) 600
pciBusID 0000.0000
Total memory: 1.95GiB
Free memory: 498.49MiB
W tensorflow/stream_executor/cl/cl_driver.cc:587] creating context when one is currently active; existing: [non-printable bytes]
OpenCL platform: ARM Platform
OpenCL device: Mali-T628
I tensorflow/core/common_runtime/gpu/gpu_device.cc:989] Found device 1 with properties:
name: Mali-T628
major: -1 minor: -1 memoryClockRate (GHz) 600
pciBusID 0000.0000
Total memory: 1.95GiB
Free memory: 498.49MiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:877] cannot enable peer access from device ordinal 0 to device ordinal 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:877] cannot enable peer access from device ordinal 0 to device ordinal 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:877] cannot enable peer access from device ordinal 1 to device ordinal 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:877] cannot enable peer access from device ordinal 1 to device ordinal 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1011] DMA: 0 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1021] 0:   N N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1021] 1:   N N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1083] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Mali-T628, pci bus id: 0000.0000)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1083] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Mali-T628, pci bus id: 0000.0000)
cl_driver DeviceAllocate 312985600
cl_driver DeviceAllocate 312985600
Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: Mali-T628, pci bus id: 0000.0000
/job:localhost/replica:0/task:0/gpu:1 -> device: 1, name: Mali-T628, pci bus id: 0000.0000
I tensorflow/core/common_runtime/direct_session.cc:252] Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: Mali-T628, pci bus id: 0000.0000
/job:localhost/replica:0/task:0/gpu:1 -> device: 1, name: Mali-T628, pci bus id: 0000.0000

Const: /job:localhost/replica:0/task:0/cpu:0
I tensorflow/core/common_runtime/simple_placer.cc:819] Const: /job:localhost/replica:0/task:0/cpu:0
b'Hello, TensorFlow!'
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1083] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Mali-T628, pci bus id: 0000.0000)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1083] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Mali-T628, pci bus id: 0000.0000)
[name: "/cpu:0"
device_type: "CPU"
memory_limit: 268435456
bus_adjacency: BUS_ANY
incarnation: 15260331923906990807
, name: "/gpu:0"
device_type: "GPU"
memory_limit: 312985600
incarnation: 195841583316312646
physical_device_desc: "device: 0, name: Mali-T628, pci bus id: 0000.0000"
, name: "/gpu:1"
device_type: "GPU"
memory_limit: 312985600
incarnation: 10576007618008246268
physical_device_desc: "device: 1, name: Mali-T628, pci bus id: 0000.0000"
]
The BAD news:
The CUDA->OpenCL translator for some reason cannot allocate memory on the GPU, so in practice all operations that use the GPU fail for lack of memory :(
Still work to do :(

Re: Odroid XU4 + GPU Deep Learning / Tensorflow

Post by memeka » Sat Sep 16, 2017 12:43 am

more updates :)
after another memory allocation patch, I can now run Tensorflow on Mali GPU successfully!
For example, this 2x2 matrix addition:

Code: Select all

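import tensorflow as tf  # TF 1.x API
a = tf.constant([1, 3, 5, 2], dtype=tf.float32, shape=[2, 2], name='a')
b = tf.constant([3, 4, 4, 6], dtype=tf.float32, shape=[2, 2], name='b')
c = tf.add(a, b, name="c")
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:  # assumed session setup, not shown in the original post
    print(sess.run(c))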
results in:

Code: Select all

odroid@odroid:~/src/tensorflow/examples$ python add.py 
OpenCL platform: ARM Platform
OpenCL device: Mali-T628
I tensorflow/core/common_runtime/gpu/gpu_device.cc:989] Found device 0 with properties: 
name: Mali-T628
major: -1 minor: -1 memoryClockRate (GHz) 600
pciBusID 0000.0000
Total memory: 1.95GiB
Free memory: 498.49MiB
W tensorflow/stream_executor/cl/cl_driver.cc:587] creating context when one is currently active; existing: [non-printable bytes]
OpenCL platform: ARM Platform
OpenCL device: Mali-T628
I tensorflow/core/common_runtime/gpu/gpu_device.cc:989] Found device 1 with properties: 
name: Mali-T628
major: -1 minor: -1 memoryClockRate (GHz) 600
pciBusID 0000.0000
Total memory: 1.95GiB
Free memory: 498.49MiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:877] cannot enable peer access from device ordinal 0 to device ordinal 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:877] cannot enable peer access from device ordinal 0 to device ordinal 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:877] cannot enable peer access from device ordinal 1 to device ordinal 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:877] cannot enable peer access from device ordinal 1 to device ordinal 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1011] DMA: 0 1 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1021] 0:   N N 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1021] 1:   N N 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1083] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Mali-T628, pci bus id: 0000.0000)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1070] Ignoring gpu device (device: 1, name: Mali-T628, pci bus id: 0000.0000) with Cuda multiprocessor count: 2. The minimum required count is 4. You can adjust this requirement with the env var TF_MIN_GPU_MULTIPROCESSOR_COUNT.
cl_driver DeviceAllocate 312985600
Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: Mali-T628, pci bus id: 0000.0000
I tensorflow/core/common_runtime/direct_session.cc:252] Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: Mali-T628, pci bus id: 0000.0000

running matrix addition on GPU
c: /job:localhost/replica:0/task:0/gpu:0
I tensorflow/core/common_runtime/simple_placer.cc:819] c: /job:localhost/replica:0/task:0/gpu:0
b: /job:localhost/replica:0/task:0/gpu:0
I tensorflow/core/common_runtime/simple_placer.cc:819] b: /job:localhost/replica:0/task:0/gpu:0
a: /job:localhost/replica:0/task:0/gpu:0
I tensorflow/core/common_runtime/simple_placer.cc:819] a: /job:localhost/replica:0/task:0/gpu:0
+++ kCudaHostMemoryUseBFC calling BFC allocator for 1LL << 36 memory (replaced with 312985600)
[[ 4.  7.]
 [ 9.  8.]]
success
HOWEVER, most TensorFlow programs fail because of memory alignment issues in the Eigen library; this is not something I think I can fix alone... :(
Follow issue here: https://github.com/benoitsteiner/tensor ... /issues/49

Re: Odroid XU4 + GPU Deep Learning / Tensorflow

Post by memeka » Tue Sep 19, 2017 8:57 am

OK, so I managed to fix the alignment issue, and was able to run some TensorFlow programs. There are still issues with some complicated models, but at least I was able to run some tests and compare them with the CPU TensorFlow (the ARM-optimized version built for RPi).
Unfortunately, the CPU version is much faster (10-300 times).
In my linear regression example:
GPU:

Code: Select all

average_epoch_times= 6.146 kernel_compile_time 0.319
CPU:

Code: Select all

average_epoch_times= 0.018482208252 kernel_compile_time 0.0319726467133
I can upload the OpenCL TensorFlow package, if anyone is interested, but I don't really see the point. Looks like having LLVM translate CUDA -> OpenCL is way slower than using the NEON-optimized CPU program.
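For reference, a minimal sketch that turns the two timing lines above into ratios. The dictionary keys and the slowdown helper are illustrative names; only the four numbers come from the runs above.

```python
# Timings in seconds, copied from the GPU and CPU runs above.
gpu = {"average_epoch_time": 6.146, "kernel_compile_time": 0.319}
cpu = {"average_epoch_time": 0.018482208252, "kernel_compile_time": 0.0319726467133}


def slowdown(gpu_seconds, cpu_seconds):
    """How many times slower the GPU run is than the CPU run."""
    return gpu_seconds / cpu_seconds


ratios = {name: slowdown(gpu[name], cpu[name]) for name in gpu}
for name, ratio in sorted(ratios.items()):
    print("%s: GPU is %.1fx slower" % (name, ratio))
```

The epoch-time ratio comes out a little above 300x and the compile-time ratio close to 10x, which matches the "10-300 times" range.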

Re: Odroid XU4 + GPU Deep Learning / Tensorflow

Post by elatllat » Wed Sep 20, 2017 10:27 am

What was the data size on your linear regression?
Maybe matrices larger than 2x2 (like 99x99) will perform differently?
Are you sure "CUDA -> OpenCL" is the slow part?
Maybe there is a way to make prepared statements (like SQL) so execution time is reduced for the repetitions of same question with different values?
Looks like direct OpenCL support has been open for a long time https://github.com/tensorflow/tensorflow/issues/22
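One way to reason about the data-size question is a simple fixed-overhead model: each GPU call pays a constant cost (kernel launch, buffer transfer) plus a per-element cost, while the CPU pays only a per-element cost. The sketch below is purely illustrative: the three constants are hypothetical placeholders, not measurements from the XU4, and all function names are made up.

```python
def gpu_time(n_items, overhead_s, per_item_s):
    """Fixed per-call overhead plus per-element work."""
    return overhead_s + n_items * per_item_s


def cpu_time(n_items, per_item_s):
    """Per-element work only."""
    return n_items * per_item_s


def break_even_items(overhead_s, cpu_per_item_s, gpu_per_item_s):
    """Problem size above which the GPU wins, or None if it never does."""
    if gpu_per_item_s >= cpu_per_item_s:
        return None
    return overhead_s / (cpu_per_item_s - gpu_per_item_s)


# HYPOTHETICAL constants, chosen only to show the shape of the trade-off.
OVERHEAD_S = 5e-3
CPU_PER_ITEM_S = 1e-7
GPU_PER_ITEM_S = 2e-8

n_star = break_even_items(OVERHEAD_S, CPU_PER_ITEM_S, GPU_PER_ITEM_S)
print("break-even at about %d elements" % round(n_star))
for n in (2 * 2, 99 * 99, 10 ** 6):
    faster = "GPU" if gpu_time(n, OVERHEAD_S, GPU_PER_ITEM_S) < cpu_time(n, CPU_PER_ITEM_S) else "CPU"
    print("%d elements: %s faster" % (n, faster))
```

Under a model like this, tiny inputs such as a 2x2 matrix are dominated by the fixed overhead, so measuring with real per-call and per-element costs would show whether larger matrices change the picture.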

Re: Odroid XU4 + GPU Deep Learning / Tensorflow

Post by memeka » Wed Sep 20, 2017 11:00 am

yeah, direct OpenCL support probably won't happen... so other solutions have to be found.
I tried both Theano with gpuarray and TensorFlow with coriander, and both (using different test workloads) were noticeably slower on the GPU... so I am a bit disappointed :(
Plus, really large workloads won't work because of lack of memory (currently in TensorFlow I fixed it to 300MB). The upshot is that it's not useful for training, and for deployment I suppose the CPU is good enough ...
The last thing I'm trying for computer vision is OpenCV 3.3 (compiled with ARM optimizations and OpenCL support) - does anyone know how I can test OpenCL in OpenCV? :D
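A minimal sketch for checking OpenCL from Python, assuming the OpenCV build includes the Python bindings. cv2.ocl.haveOpenCL(), cv2.ocl.setUseOpenCL() and cv2.ocl.useOpenCL() are part of OpenCV's ocl module; the opencl_status wrapper is a made-up name. It only reports whether OpenCL is available and enabled; it does not prove that a given function actually runs on the GPU.

```python
def opencl_status():
    """Report OpenCL availability in OpenCV, or None if cv2 is not installed."""
    try:
        import cv2
    except ImportError:
        return None
    available = cv2.ocl.haveOpenCL()
    if available:
        # Ask OpenCV to use its OpenCL code paths where they exist.
        cv2.ocl.setUseOpenCL(True)
    return {"available": bool(available), "in_use": bool(cv2.ocl.useOpenCL())}


print(opencl_status())
```

To see whether it makes a difference in practice, timing the same operation with setUseOpenCL(True) and setUseOpenCL(False) is probably the most direct test.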

Re: Odroid XU4 + GPU Deep Learning / Tensorflow

Post by memeka » Wed Sep 20, 2017 4:16 pm

OK, so I compiled OpenCV with DNN and OpenCL support, and I tested realtime object recognition from the webcam.
I was able to get 3fps, w/o optimizations. Also, OpenCV is compiled with gstreamer support, so it's very easy to stream the annotated video to the web with gst.

Say hello to memeka:

Re: Odroid XU4 + GPU Deep Learning / Tensorflow

Post by odroid » Wed Sep 20, 2017 4:22 pm

Wow! You made it. :o :D

Re: Odroid XU4 + GPU Deep Learning / Tensorflow

Post by elatllat » Wed Sep 20, 2017 4:30 pm

Nice!

0.3s is the leap from 3s that broadens the uses for object recognition.
got a deb or git somewhere?

Re: Odroid XU4 + GPU Deep Learning / Tensorflow

Post by memeka » Wed Sep 20, 2017 4:48 pm

Well, it’s using a different detection method, faster but less accurate.
I’ve had to use taskset to bind it to the big cores to get 3fps (2fps otherwise), but my kernel has the cores at 1.7GHz max.
Also, the cores were at ~70% each, so with some optimisations I hope to get over 5fps.
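If the detection program is a Python script, the same pinning can be done from inside the process instead of with taskset. This is a sketch assuming Linux and assuming the big (A15) cores are CPUs 4-7, which is the usual numbering on the XU4 but worth checking on your kernel. pin_to_cores is a hypothetical helper name.

```python
import os

# ASSUMPTION: on the XU4 the big cores are usually CPUs 4-7; verify on your kernel.
BIG_CORES = {4, 5, 6, 7}


def pin_to_cores(wanted):
    """Restrict this process to the wanted CPUs that are actually available.

    If none of the wanted CPUs are available, the affinity is left unchanged.
    Returns the CPU set the process is allowed to run on afterwards.
    """
    available = os.sched_getaffinity(0)  # Linux-only API
    target = set(wanted) & available
    if target:
        os.sched_setaffinity(0, target)
    return os.sched_getaffinity(0)


print(sorted(pin_to_cores(BIG_CORES)))
```

The shell equivalent for an arbitrary program would be something like taskset -c 4-7 followed by the command.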

Re: Odroid XU4 + GPU Deep Learning / Tensorflow

Post by memeka » Fri Sep 29, 2017 9:20 am

So, after some more work, I managed to make some improvements:

* 10fps video (although detection happens every 4th frame)
* gstreamer compatibility
* h264 encoding
* 2 concurrent output methods:
1) video output of continuous detection (e.g. http video stream, or output to file)
2) on detecting objects you are interested in (e.g. people), take a screenshot and execute external script (e.g. send email/notification)
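The "detect every 4th frame" and "run a script on detection" ideas can be sketched as follows. This is not the actual program, just a minimal illustration under assumptions: detect is a stand-in for the real DNN call, detections are plain dicts with hypothetical label and confidence keys, and on_alert is where a screenshot or external script would be triggered.

```python
def process(frames, detect, on_alert, stride=4, wanted=("person",), min_confidence=0.5):
    """Run detect() on every stride-th frame and reuse the result in between."""
    annotated = []
    detections = []
    for index, frame in enumerate(frames):
        if index % stride == 0:
            detections = detect(frame)
            hits = [d for d in detections
                    if d["label"] in wanted and d["confidence"] >= min_confidence]
            if hits:
                on_alert(index, hits)
        # Frames between detections are drawn with the most recent result.
        annotated.append((frame, detections))
    return annotated


def fake_detect(frame):
    """Stand-in detector: frames are strings naming what is in view."""
    if "person" in frame:
        return [{"label": "person", "confidence": 0.9}]
    if "cat" in frame:
        return [{"label": "cat", "confidence": 0.8}]
    return []


frames = ["empty", "empty", "cat", "cat", "person", "person", "person", "empty", "empty"]
alerts = []
annotated = process(frames, fake_detect, lambda index, hits: alerts.append(index))
print(alerts)  # [4]
```

Note the trade-off this makes visible: the cat in frames 2 and 3 is never seen because no detection ran on those frames, and frame 7 is still labelled with the person found in frame 4.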

The clip below was created using this gstreamer input pipeline:

Code: Select all

v4l2src device=/dev/video2 ! video/x-raw, width=800, height=600, framerate=10/1 ! videoconvert
and this gstreamer output pipeline:

Code: Select all

videoconvert ! v4l2video11h264enc extra-controls=\"encode,h264_level=10,h264_profile=4,frame_level_rate_control_enable=1,video_bitrate=2097152\" ! h264parse ! matroskamux ! filesink location=detect.mkv
You can see having 10fps is much better than the 3fps in the previous post :lol: (but output quality is reduced because of odroid's MFC hw encoder :oops:)

I also successfully tested this gstreamer output pipeline, which can be used to live-stream a security camera ;)

Code: Select all

videoconvert ! v4l2video11h264enc extra-controls="encode,h264_level=10,h264_profile=4,frame_level_rate_control_enable=1,video_bitrate=10097152" ! h264parse ! mpegtsmux ! hlssink max-files=5 playlist-root="http://192.168.0.10/hls" playlist-location="/var/www/html/hls/stream0.m3u8" location="/var/www/html/hls/fragment%06d.ts"
[youtube]https://www.youtube.com/watch?v=rpCnunKdC3o[/youtube]
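For anyone wondering how pipeline fragments like these plug into OpenCV: one common approach (an assumption here, since the program itself is not shown in this thread) is to end the input pipeline with appsink and start the output pipeline with appsrc, then hand the strings to OpenCV's GStreamer backend. The helper names below are hypothetical, and the sketch only builds the strings.

```python
def capture_pipeline(device="/dev/video2", width=800, height=600, fps=10):
    """Input pipeline from the post, with an appsink appended (assumption)."""
    return (
        "v4l2src device=%s ! video/x-raw, width=%d, height=%d, framerate=%d/1 "
        "! videoconvert ! appsink" % (device, width, height, fps)
    )


def writer_pipeline(output_fragment):
    """Output pipeline from the post, with an appsrc prepended (assumption)."""
    return "appsrc ! " + output_fragment.strip()


# File output from the post; the encoder's extra-controls are omitted for brevity.
FILE_OUTPUT = (
    "videoconvert ! v4l2video11h264enc ! h264parse ! matroskamux "
    "! filesink location=detect.mkv"
)

print(capture_pipeline())
print(writer_pipeline(FILE_OUTPUT))
```

With a GStreamer-enabled build, the first string would typically be passed to cv2.VideoCapture together with cv2.CAP_GSTREAMER; how the writer side is opened depends on the OpenCV version.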

Stay tuned for a future magazine article! :twisted:

Re: Odroid XU4 + GPU Deep Learning / Tensorflow

Post by mad_ady » Fri Sep 29, 2017 1:08 pm

Great work!

flocke
Posts: 2
Joined: Sun Nov 05, 2017 6:06 pm
languages_spoken: english
ODROIDs: XU4Q
Has thanked: 0
Been thanked: 0
Contact:

Re: Odroid XU4 + GPU Deep Learning / Tensorflow

Post by flocke » Sun Nov 05, 2017 6:11 pm

Hi Memeka,

thank you for your efforts.
Would you be so kind and share your wisdom with us?

I am highly interested in your OpenCL TensorFlow package and the way you got it working.

greetings,
flocke

Re: Odroid XU4 + GPU Deep Learning / Tensorflow

Post by memeka » Sun Nov 05, 2017 6:18 pm

@flocke I did not actually save the package, since it was performing worse than the CPU TensorFlow.
So the best option is to download the RPi TensorFlow package.

larrylart
Posts: 4
Joined: Fri Dec 29, 2017 5:01 am
languages_spoken: english
ODROIDs: XU4,HC1,C0,C1+,C2
Location: Naas, Ireland
Has thanked: 0
Been thanked: 0
Contact:

Re: Odroid XU4 + GPU Deep Learning / Tensorflow

Post by larrylart » Tue Feb 27, 2018 1:21 am

Hi Memeka,

I think I probably misread your message: did you manage to get 10fps with DNN detection in the end, or a 10fps video stream (with every 4th frame going through the DNN)?
If you did manage to get 10fps inference out of optimization, please share, as I am running out of ideas right now :) I spent several days compiling and optimizing (NEON, ComputeLibrary, OpenCL and Caffe on OpenCL) and the latest OpenCV (3.4), and I cannot break through 3.5fps on either CPU or GPU.

Anyway, meanwhile I ended up buying a Movidius Neural Compute Stick* https://developer.movidius.com/ It works nicely with the XU4 – on a first run I get 9.2fps with SSD_MobileNet (I only got it a week ago, I need more time to experiment), and that with less than 10% load on the ODROID CPU (leaving plenty of room for other types of processing).

*note: it might be very nice if the next ODROID included an on-board VPU such as Intel’s Myriad 2 or, better, the new X :)

Re: Odroid XU4 + GPU Deep Learning / Tensorflow

Post by memeka » Tue Feb 27, 2018 6:09 am

I got 3fps inference, like you did
And doing it every 4th frame, I can process 10fps video
Looks like that’s the max the xu4 can do.

Note: there’s a rk3399pro version with excellent ai abilities, so maybe hk will do a odroid N1pro

Re: Odroid XU4 + GPU Deep Learning / Tensorflow

Post by elatllat » Tue Feb 27, 2018 7:12 am

memeka wrote:...Note: there’s a rk3399pro version with excellent ai abilities, so maybe hk will do a odroid N1pro
Yah, I looked at that, then read that the hardware was not used on walleye by Google, so I figured it would only be useful for unoptimized code.
Did you try Tensorflow on the N1 yet?

Re: Odroid XU4 + GPU Deep Learning / Tensorflow

Post by larrylart » Tue Feb 27, 2018 7:49 am

And then there is Lightspeeur - https://gyrfalcontech.com/solutions/ (https://www.bit-tech.net/news/gryfalcon ... lerator/1/) - 2.8 teraflops @ 0.3W; they say they are going into mass production in the coming months.
It remains to be seen how much these toys are going to cost and what NNets they support, for now I am happy enough to play with my xu4 and movidius @ 70 bucks

daksh
Posts: 1
Joined: Thu Jan 31, 2019 6:35 pm
languages_spoken: english
ODROIDs: Odroid XU4
Has thanked: 0
Been thanked: 0
Contact:

Re: Odroid XU4 + GPU Deep Learning / Tensorflow

Post by daksh » Thu Jan 31, 2019 6:40 pm

Hi memeka,

Can you help me out with the tensorflow configuration on odroid XU4?

Re: Odroid XU4 + GPU Deep Learning / Tensorflow

Post by memeka » Thu Jan 31, 2019 7:02 pm

Use the rpi package

favio
Posts: 1
Joined: Thu Feb 28, 2019 4:38 am
languages_spoken: english
ODROIDs: none yet
Has thanked: 0
Been thanked: 0
Contact:

Re: Odroid XU4 + GPU Deep Learning / Tensorflow

Post by favio » Thu Feb 28, 2019 4:40 am

Hello memeka, is it possible to compile OpenCV 3.4.5+ to work with the OpenCL backend on the GPU? I do not own an ODROID to test it, but it would be impressive against the RPi.

alexharnozd
Posts: 24
Joined: Wed Sep 26, 2018 11:05 pm
languages_spoken: english
ODROIDs: HC2
Has thanked: 0
Been thanked: 0
Contact:

Re: Odroid XU4 + GPU Deep Learning / Tensorflow

Post by alexharnozd » Mon Mar 11, 2019 9:29 pm

Hi Memeka.

Any guide here?

Would like to give it a shot this weekend.
Anything to prep other than odroid?
I got hc2 here.
Running HA. With google home
Would love to try this one :)

Re: Odroid XU4 + GPU Deep Learning / Tensorflow

Post by memeka » Mon Mar 11, 2019 9:49 pm

The GPU seems to be slower than the CPU (for machine learning).

wbgreen333
Posts: 15
Joined: Fri Apr 19, 2019 4:27 am
languages_spoken: english
ODROIDs: UX-4
Has thanked: 2 times
Been thanked: 0
Contact:

Re: Odroid XU4 + Theano Deep Learning

Post by wbgreen333 » Fri Apr 19, 2019 5:17 am

memeka wrote:
Tue Sep 05, 2017 6:52 pm
I will be testing Keras + Theano deep learning frameworks to see if xu4 can use GPU to accelerate object detections and/or classification in images.
It would be cool to use webcam + motion for security camera, then send alert only when people are detected (ignore animals and other false positives)..
I don't think current ARM processors, even with GPU acceleration, have the numerical chops for this. The difference between an i5 M40 and an i5 4200M is substantial because of the extra "vector processing" instructions Intel has been adding.

But running a Movidius NCS or Coral TPU accelerator can get usable frame rates on scaled down models like MobileNet-SSD.

See this:
https://www.pyimagesearch.com/2019/04/0 ... spberry-pi

or this project I put on GitHub:
https://github.com/wb666greene/AI_enhan ... /README.md


IMHO what is really needed is to bring some of the Motion and/or Zoneminder folks on-board to export motion capture frames with tags for the "motion detection box" they draw so that this focused area can be cropped out and sent to the AI for detection. MQTT is a good transport for this.
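A rough sketch of the cropping step described above, under stated assumptions: the motion box arrives as (x, y, width, height) in pixels, a margin is added around it, and the result is clamped to the frame. The message layout and field names are hypothetical, since neither Motion nor ZoneMinder is known to export this today, which is the point of the suggestion. Publishing the JSON over MQTT would be done with a client library of your choice and is left out.

```python
import json


def crop_box(box, frame_width, frame_height, margin=0.1):
    """Grow a motion box by a margin and clamp it to the frame.

    box is (x, y, width, height); returns (left, top, right, bottom).
    """
    x, y, w, h = box
    pad_x = int(w * margin)
    pad_y = int(h * margin)
    left = max(0, x - pad_x)
    top = max(0, y - pad_y)
    right = min(frame_width, x + w + pad_x)
    bottom = min(frame_height, y + h + pad_y)
    return left, top, right, bottom


def motion_message(camera, box, frame_width, frame_height):
    """Build a JSON payload describing the region to send to the detector."""
    left, top, right, bottom = crop_box(box, frame_width, frame_height)
    return json.dumps({"camera": camera, "crop": [left, top, right, bottom]}, sort_keys=True)


print(motion_message("front_door", (700, 500, 200, 200), 800, 600))
```

Sending only the cropped region keeps the object large relative to the network's input size, which tends to matter for scaled-down models.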

Re: Odroid XU4 + GPU Deep Learning / Tensorflow

Post by memeka » Fri Apr 19, 2019 5:23 am

@wbgreen333
Depends what you mean by “usable” frame rates.
On XU4 I got 3-4fps on mobilenets ssd.
With that, I could input 15fps security live video, and do detection every 5th frame.
This is comparable with movidius1 + rpi.
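The arithmetic behind this, as a small sketch: running detection on every Nth frame means the detector only has to keep up with video_fps / N. Function names are illustrative, and Python 3 division is assumed.

```python
def detections_per_second(video_fps, stride):
    """Detector calls per second when only every stride-th frame is analysed."""
    return video_fps / stride


def detector_keeps_up(video_fps, stride, inference_fps):
    """True if the detector's own frame rate covers the required call rate."""
    return detections_per_second(video_fps, stride) <= inference_fps


print(detections_per_second(10, 4))  # 2.5
print(detections_per_second(15, 5))  # 3.0
print(detector_keeps_up(15, 4, 3))   # False: 3.75 detections/s would be needed
```

Both setups mentioned in this thread (10fps with every 4th frame, 15fps with every 5th) stay at or below the roughly 3fps the XU4 manages for inference.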

Re: Odroid XU4 + GPU Deep Learning / Tensorflow

Post by larrylart » Tue May 07, 2019 7:05 pm

NCS2 performance on the XU4 for MobileNet SSD with OpenVINO was around 19FPS, and that was using a default C++ demo. I didn't follow up much in the direction of Intel's Myriad, as it seems unclear whether they are willing to support ARM platforms in the future.

However, Google Coral TPU performance is impressive: with the USB dongle on the XU4 you get up to 57FPS (17ms inference time) in direct mode, and 42FPS (23ms) when throttled, with MobileNet SSD v2. Power usage of the TPU is 1-1.2W, and 6W overall including the XU4, webcam and gigabit network. However, my sampling code is very basic and not optimized:
https://github.com/larrylart/codrive/tr ... ogle_coral
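As a side note on reading those figures, frame rate and per-frame latency are reciprocals (assuming one inference at a time, with no batching or pipelining). The helper below is just that conversion; its name is illustrative.

```python
def fps_from_latency_ms(latency_ms):
    """Upper bound on frames per second for a given per-frame latency."""
    return 1000.0 / latency_ms


for latency_ms in (17, 23):
    print("%d ms -> %.1f FPS" % (latency_ms, fps_from_latency_ms(latency_ms)))
```

17ms and 23ms correspond to roughly 58.8 and 43.5 inferences per second, slightly above the quoted 57 and 42FPS; the small gap would be consistent with some per-frame overhead outside the inference call, though that is a guess.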

Re: Odroid XU4 + GPU Deep Learning / Tensorflow

Post by odroid » Wed May 08, 2019 9:22 am

@larrylart,
Thank you for sharing the Coral TPU test result for ODROID users. It looks very promising.
Does the Google Coral USB dongle work in USB 3.0 mode with the XU4? Can you show me the "lsusb -t" output?

Re: Odroid XU4 + GPU Deep Learning / Tensorflow

Post by larrylart » Wed May 08, 2019 10:00 am

Yes, it seems to work with no problem, Ubuntu 18 / 4.14 - bus 4/dev 22 (the other unnamed one is Intel's NCS 2)

Code: Select all

> lsusb
Bus 006 Device 002: ID 0bda:8153 Realtek Semiconductor Corp.
Bus 006 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 005 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 004 Device 022: ID 18d1:9302 Google Inc.
Bus 004 Device 002: ID 05e3:0616 Genesys Logic, Inc. hub
Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 003 Device 004: ID 03e7:2485
Bus 003 Device 002: ID 05e3:0610 Genesys Logic, Inc. 4-port hub
Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 001 Device 002: ID 046d:082d Logitech, Inc. HD Pro Webcam C920
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub

Code: Select all

> lsusb -t
/:  Bus 06.Port 1: Dev 1, Class=root_hub, Driver=xhci-hcd/1p, 5000M
    |__ Port 1: Dev 2, If 0, Class=Vendor Specific Class, Driver=r8152, 5000M
/:  Bus 05.Port 1: Dev 1, Class=root_hub, Driver=xhci-hcd/1p, 480M
/:  Bus 04.Port 1: Dev 1, Class=root_hub, Driver=xhci-hcd/1p, 5000M
    |__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/2p, 5000M
        |__ Port 1: Dev 22, If 0, Class=Vendor Specific Class, Driver=, 5000M
/:  Bus 03.Port 1: Dev 1, Class=root_hub, Driver=xhci-hcd/1p, 480M
    |__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/2p, 480M
        |__ Port 2: Dev 4, If 0, Class=Vendor Specific Class, Driver=, 480M
/:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=exynos-ohci/3p, 12M
/:  Bus 01.Port 1: Dev 1, Class=root_hub, Driver=exynos-ehci/3p, 480M
    |__ Port 1: Dev 2, If 0, Class=Video, Driver=uvcvideo, 480M
    |__ Port 1: Dev 2, If 1, Class=Video, Driver=uvcvideo, 480M
    |__ Port 1: Dev 2, If 2, Class=Audio, Driver=snd-usb-audio, 480M
    |__ Port 1: Dev 2, If 3, Class=Audio, Driver=snd-usb-audio, 480M

Re: Odroid XU4 + GPU Deep Learning / Tensorflow

Post by odroid » Wed May 08, 2019 10:13 am

Glad to know that works in USB 3.0 mode. :)
Port 1: Dev 22, If 0, Class=Vendor Specific Class, Driver=, 5000M
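To check the negotiated speed without reading the tree by eye, the lsusb -t output can be parsed. A rough sketch, assuming the line format shown in the post above (the regular expressions and the device_speeds helper are made up for this example and may need adjusting for other usbutils versions):

```python
import re

# Excerpt of the "lsusb -t" output posted above.
SAMPLE = """\
/:  Bus 04.Port 1: Dev 1, Class=root_hub, Driver=xhci-hcd/1p, 5000M
    |__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/2p, 5000M
        |__ Port 1: Dev 22, If 0, Class=Vendor Specific Class, Driver=, 5000M
/:  Bus 03.Port 1: Dev 1, Class=root_hub, Driver=xhci-hcd/1p, 480M
    |__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/2p, 480M
        |__ Port 2: Dev 4, If 0, Class=Vendor Specific Class, Driver=, 480M
"""

BUS = re.compile(r"Bus (\d+)\.")
DEVICE = re.compile(r"Dev (\d+),.*\s(\d+)M\s*$")


def device_speeds(lsusb_tree_text):
    """Map (bus, device) to the link speed in Mbit/s reported by lsusb -t."""
    speeds = {}
    bus = None
    for line in lsusb_tree_text.splitlines():
        bus_match = BUS.search(line)
        if bus_match:
            bus = int(bus_match.group(1))
        device_match = DEVICE.search(line)
        if device_match and bus is not None:
            speeds[(bus, int(device_match.group(1)))] = int(device_match.group(2))
    return speeds


speeds = device_speeds(SAMPLE)
print(speeds[(4, 22)] >= 5000)  # True: the Coral on bus 4, device 22
print(speeds[(3, 4)] >= 5000)   # False: the device on bus 3 runs at 480M
```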
