ECE Capstone Design: Deep Learning

Tennessee Tech University

About Our Project

We are a group of ECE students working on our capstone project for Dr. Hasan and Muluken. The project is sponsored by Intel and aims to develop a low-cost, low-power embedded platform for real-time vehicle surveillance. Below is a brief overview of our project.

Introduction:

We developed a low-cost embedded platform that can easily be used for real-time vehicle video surveillance. We created a new model using a transfer-learning approach on GoogLeNet, preserving the pretrained weights and training the network on a new dataset. By joining two GoogLeNet networks, we created a model that can classify a vehicle by both its type and its make. We trained our model on various public car datasets along with a dataset we collected ourselves. We deployed the model on a low-cost Raspberry Pi 3, as it delivers a low-budget, small-size, and low-power hardware solution; these features also make it easier to mount the platform in different public places. To boost the performance of the Raspberry Pi, we utilized Intel's low-budget Movidius Neural Compute Stick (NCS) as a coprocessor to accelerate the computationally intensive part of the task, the inference. Our implementation using the NCS shows approximately a 31.7x speedup over the Raspberry Pi-only implementation, and we achieved near real-time classification at about 11 frames/sec from a live streaming camera with approximately 87% accuracy.

Fig: Framework of the project
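
To make the framework above concrete, here is a minimal sketch of the capture-and-classify loop in Python with OpenCV. Note that classify_frame() is a hypothetical placeholder for the actual inference call, not our real implementation.

    import cv2

    def classify_frame(frame):
        # Placeholder: the real version runs the trained networks
        # and returns the predicted type/make labels.
        return 'sedan / Toyota'

    cap = cv2.VideoCapture(0)              # Pi camera or USB webcam
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        label = classify_frame(frame)
        cv2.putText(frame, label, (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)
        cv2.imshow('surveillance', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
    cap.release()
    cv2.destroyAllWindows()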

Project Details

Fig: Number of images per car make on the left, number of images per car type on the right.

To train our networks, we needed to collect a dataset of cars. We chose 20 different vehicle makes and gathered images for each to create our dataset. As shown in the left table, the Make dataset contains 24,979 images in total; the percentage of images for each make is shown in the right-hand column.

The table on the right shows the dataset we used to train the car Type classifier network. Cars come in all different shapes and sizes, and our network is trained on 8 different types of cars.


Fig: Loss vs Accuracy for Car Makes on the left, Loss vs Accuracy for Car Types on the right.

The figure above shows the increase in accuracy and the reduction in loss of each network versus the number of training iterations. A loss function in Caffe maps the goodness of the currently learned weights to a scalar value. Hence, as shown in the figures above, the loss decreases as training iterations increase, since the network keeps learning from the new data and the weight values become more refined.
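
For reference, here is a minimal sketch of this training loop through Caffe's Python interface; the solver and weight file names are placeholders for our actual files.

    import caffe

    caffe.set_mode_gpu()
    solver = caffe.SGDSolver('make_solver.prototxt')   # placeholder name
    # Transfer learning: start from the ImageNet-pretrained GoogLeNet
    # weights instead of random initialization.
    solver.net.copy_from('bvlc_googlenet.caffemodel')

    for it in range(1, 10001):
        solver.step(1)                 # one forward/backward iteration
        if it % 500 == 0:
            # 'loss' is the SoftmaxWithLoss output blob: the scalar
            # that summarizes how good the current weights are.
            print(it, float(solver.net.blobs['loss'].data))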

After training, we created one network using two fine-tuned GoogLeNets, shown in the figure below as GoogLeNet-1 and GoogLeNet-2. The figure coarsely depicts a model with two fine-tuned GoogLeNet architectures sharing an input. The network on top (GoogLeNet-1) classifies the input image/frame by the type of the vehicle, and the network on the bottom (GoogLeNet-2) classifies the same input image/frame by the make of the vehicle in the image. In Caffe, combining the two networks is done by creating one network definition file containing both GoogLeNet-1 and GoogLeNet-2, and renaming the original GoogLeNet's layers to different, unique names in each branch.

Fig: Network Architecture for combined GoogLeNet Networks
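
A coarse sketch of that combined definition using Caffe's Python NetSpec, with the full GoogLeNet stacks elided; the point is the shared input and the unique type_/make_ layer-name prefixes, and the convolution parameters shown are just GoogLeNet's first layer.

    import caffe
    from caffe import layers as L

    n = caffe.NetSpec()
    n.data = L.Input(shape=dict(dim=[1, 3, 224, 224]))   # shared input

    # GoogLeNet-1 (vehicle type): every layer gets a 'type_' prefix.
    n.type_conv1 = L.Convolution(n.data, num_output=64, kernel_size=7,
                                 stride=2, pad=3)
    # ... remaining GoogLeNet-1 layers elided ...

    # GoogLeNet-2 (vehicle make): same architecture, 'make_' prefix.
    n.make_conv1 = L.Convolution(n.data, num_output=64, kernel_size=7,
                                 stride=2, pad=3)
    # ... remaining GoogLeNet-2 layers elided ...

    with open('combined.prototxt', 'w') as f:
        f.write(str(n.to_proto()))     # one definition file, two networks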

Fig: Base Learning Rate vs. Accuracy. Data collected from Spearmint

Using a software package called Spearmint, we were able to find the combination of hyper-parameters that resulted in the best accuracy. The figure above shows the network's performance across different values of the base learning rate.
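
Spearmint drives the search by repeatedly calling a user-supplied main(job_id, params) function. Below is a hypothetical sketch of such an objective; the parameter name base_lr and the train_and_eval() helper are illustrative assumptions, not our exact code.

    # spearmint_objective.py (hypothetical file name)

    def train_and_eval(base_lr):
        # Placeholder: write base_lr into the solver prototxt, run
        # Caffe training, and parse the final validation accuracy.
        return 0.0

    def main(job_id, params):
        base_lr = float(params['base_lr'][0])
        accuracy = train_and_eval(base_lr)
        return 1.0 - accuracy       # Spearmint minimizes, so return error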


Fig: Classification results for car Make
Fig: Classification results for car Type

The above two figures show the classification results for car Make and Type respectively.


Fig: Computation Time per image classified using Pi CPU on the left, and time taken by the NCS on the right

The figure on the left shows the time taken per image using the Pi and external swap space defined on a USB 3.0 flash drive. The average time taken per image was 3.105 seconds. This contrasts starkly with the data presented in the figure on the right, which shows the time taken for the Pi to classify images with the help of the NCS. The average time taken there is 0.0988 seconds per image.
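
For context, here is a minimal sketch of timing a single classification on the NCS with the NCSDK v1 Python API (mvnc); the graph file name is a placeholder, and the graph itself would first be compiled from the Caffe model with mvNCCompile.

    import time
    import numpy
    from mvnc import mvncapi as mvnc

    devices = mvnc.EnumerateDevices()
    device = mvnc.Device(devices[0])
    device.OpenDevice()

    with open('make.graph', 'rb') as f:          # placeholder graph file
        graph = device.AllocateGraph(f.read())

    img = numpy.zeros((224, 224, 3), numpy.float16)  # a preprocessed frame

    start = time.time()
    graph.LoadTensor(img, 'user object')   # push the input to the stick
    output, userobj = graph.GetResult()    # blocks until inference is done
    print('inference took %.4f s' % (time.time() - start))

    graph.DeallocateGraph()
    device.CloseDevice()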

Fig: Memory usage using Intel’s Movidius NCS and the Pi CPU.

The memory usage between the two tests is shown in the above figure. The memory usage measurement includes all memory necessary to classify the images one at a time. It neglects all memory needed to display the images.
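
One simple way to take such a measurement (a sketch using psutil, not necessarily how we logged it) is to sample the resident set size of the classification process around each inference:

    import os
    import psutil

    proc = psutil.Process(os.getpid())
    # ... classify one image here ...
    rss = proc.memory_info().rss           # resident memory, in bytes
    print('RSS: %.1f MB' % (rss / 1e6))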


Fig: Image distribution of our added color dataset.

For our second semester, we added a network that classifies vehicle color, measured latency and throughput with multiple NCS sticks and with multiprocessing, and tested the speeds of different networks. We started by trying to improve classification speed by adding more NCS sticks; meanwhile, the Caffe team worked on a color network capable of classifying 8 different colors. Shown in the figure above is the image distribution for the new dataset.


Fig: Image latency in seconds of images per core on the left, and images with 1 NCS stick on the right.

The figure above on the left shows the latency, in seconds, for the Raspberry Pi 3 to classify one image per core, using 3 cores in total for classification. The 1000 test images are split across the cores, so each core classifies roughly 333 images as fast as it can. We see from this that the Pi takes about 3-4 seconds per image per core. On the right we show the latency, in seconds, of classifying with one Movidius stick. Note that the y-axis scales differ significantly between the two plots.
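
A sketch of that per-core latency test with Python's multiprocessing module; classify_image() is a placeholder for the real CPU forward pass.

    import time
    from multiprocessing import Pool

    def classify_image(path):
        # Placeholder: the real version loads `path` and runs one
        # Caffe forward pass on the Pi CPU.
        time.sleep(0.01)

    def timed_classify(path):
        start = time.time()
        classify_image(path)
        return time.time() - start

    if __name__ == '__main__':
        paths = ['img_%04d.jpg' % i for i in range(1000)]  # 1000 test images
        with Pool(processes=3) as pool:    # 3 of the Pi's 4 cores
            latencies = pool.map(timed_classify, paths)
        print('mean latency: %.3f s per image per core'
              % (sum(latencies) / len(latencies)))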

Fig: Image throughput for Pi CPU cores on the left, and a direct comparison to the NCS sticks on the right.

Our throughput is measured in frames per second (fps), and the figures above show the speed difference between adding Pi cores and adding NCS sticks: about a 23.13x increase from the sticks. The Pi by itself can classify images at around 0.3 - 0.75 frames per second. If you've ever seen a video feed at around 1 frame per second, you know how insanely slow that is. Conversely, with 3 NCS sticks each running the same network we see a speedup to almost 16 frames per second! That is a significant difference, but we are still using the same network, and the network used affects both the speed we see and our accuracy. We compared several networks and their accuracies next.
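
To illustrate how several sticks are kept busy at once, here is a sketch of the multi-stick throughput measurement with the NCSDK v1 API; the graph file and the dummy frames are placeholders for our real inputs.

    import time
    import numpy
    from mvnc import mvncapi as mvnc

    devices, graphs = [], []
    for name in mvnc.EnumerateDevices():
        dev = mvnc.Device(name)
        dev.OpenDevice()
        with open('make.graph', 'rb') as f:    # placeholder graph file
            graphs.append(dev.AllocateGraph(f.read()))
        devices.append(dev)

    frames = [numpy.zeros((224, 224, 3), numpy.float16) for _ in range(300)]

    start = time.time()
    for i in range(0, len(frames), len(graphs)):
        batch = frames[i:i + len(graphs)]
        for g, frame in zip(graphs, batch):
            g.LoadTensor(frame, None)      # queue one input per stick
        for g, _ in zip(graphs, batch):
            g.GetResult()                  # drain results as they finish
    print('%.1f fps' % (len(frames) / (time.time() - start)))

    for g in graphs:
        g.DeallocateGraph()
    for dev in devices:
        dev.CloseDevice()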


Fig: Comparison of different networks (GoogLeNet, SqueezeNet, and MobileNet) and their accuracies.

We trained all of our networks (color, type, make) on several of the top CNN architectures available, with the results shown above. We already knew GoogLeNet was accurate, so it serves mainly as a baseline. Next we trained MobileNet and found its accuracy was about 11% worse than GoogLeNet's, but it was much faster, as you will see below. Then we trained SqueezeNet and found its accuracies to fall between MobileNet's and GoogLeNet's. Color classification performed well on all three, so we chose MobileNet for color since it was the fastest at classification.


Fig: Image latency for MobileNet on the left, and a comparison of GoogLeNet and MobileNet throughput on the right.

MobileNet is a much smaller network and was the first network we tested after GoogLeNet. We ran the same latency test with MobileNet, and the figure on the left shows that the latency of each image is a little less than half of what GoogLeNet took per image. Looking at the figure on the right, we can see just how much faster MobileNet is than GoogLeNet on the NCS sticks. With 3 NCS sticks running the same network, MobileNet achieved almost 34 frames per second, a little over double the maximum of GoogLeNet in our previous throughput testing.


Fig: SqueezeNet throughput on the left, and comparison to all three networks on the right.

SqueezeNet was the last network we tested, and comparing the speeds in the figures above, it performs about on par with MobileNet while offering better accuracy. With 3 sticks on the same network we see a slight increase in frames per second, too!


Fig: Power Consumption of the Pi 3 with connected devices.

One of the reasons we went with a platform such as the Pi is its low energy consumption. A Raspberry Pi consumes about 2.5 watts at idle with nothing connected, which is insignificant. We added up to 4 NCS sticks for our testing, and with a camera, keyboard, mouse, and flash drives also attached to the Pi, we found ourselves needing a USB hub. We purchased one that could supply up to 60 watts total across its 6 ports, or 10 watts per port. This was a little overkill, but it meant power was never a limiting factor in our testing. From the figure above you can see that even as we add more sticks and peripherals, the power consumption of our test bench only increases to about 11 watts. The USB hub alone draws more than the Pi, at almost 4 watts at idle. This shows that our platform consumes very little power even with all the added peripherals.


Future Work:

In the future, we would improve this system in a variety of ways. First, we would expand our dataset to include a larger universe of vehicles in more realistic settings. We would also work toward a more accurate implementation of our networks by further tuning the hyper-parameters, and toward a single network that could classify vehicle images by multiple criteria, i.e., one network that classifies cars by both type and make. We would then test that implementation against the same metrics outlined above.

To improve the accuracy of the device at the Raspberry Pi stage, we would implement a Single Shot Detector (SSD) that would isolate vehicles in the frame, crop them out, and then classify the cropped images. In this way, we could reduce the amount of noise in the image, as well as classify all of the cars in the frame. We were very close to realizing this goal within our timeframe, but we were unable to achieve an accurate classification without saving the images as a .JPG file first, reopening them, and then classifying them. We believe that this has to do with the way the image is encoded within our Python script, but we have been unable to test this theory.
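
A hypothetical sketch of that crop-then-classify pipeline; detect_vehicles() and classify_crop() stand in for the SSD detector and our classifier, and are placeholders rather than the code we ran.

    import cv2

    def detect_vehicles(frame):
        # Placeholder: run the SSD graph and return bounding boxes
        # as a list of (x, y, w, h) tuples.
        return []

    def classify_crop(crop):
        # Placeholder: run the make/type/color networks on one crop.
        return 'unknown'

    frame = cv2.imread('street.jpg')           # one captured frame
    for (x, y, w, h) in detect_vehicles(frame):
        crop = frame[y:y + h, x:x + w]         # isolate a single vehicle
        label = classify_crop(crop)
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(frame, label, (x, y - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)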

To improve the speed of the device, we could realize this project again on slightly more powerful hardware. Systems like the Orange Pi and the ODROID offer substantial improvements in system resources for only slightly higher cost and power consumption. We could also rework our test scripts in a lower-level language, such as C++, to achieve faster image classification. Since lower-level languages are inherently closer to machine code, they generally run faster than scripting languages, although they are substantially harder to work with.


Demonstration

A short video showing image classification on a Raspberry Pi, and real-time image classification with and without the Movidius Neural Compute Stick.

Comparison: Pi 3 vs. NCS Classification

This video shows a side-by-side comparison of the Raspberry Pi 3 classifying cars on its CPU on the left and on the NCS sticks on the right.

Classifying Cars in Real Time from a Moving Video Capture

We made a short video while walking around cars, showing real-time classification of make, type, and color from a moving video capture.