Fig: Number of images per car make on the left, number of images per car type on the right.
For training our networks, we needed to collect a dataset of cars. We chose 20 different vehicle makes and gathered images to create our dataset. As shown in the left table, the Make dataset contains 24,979 images in total; the percentage of images for each make is shown in the right-hand column.
The top right table shows the dataset we used to train the car Type classifier network. Cars come in all
different shapes and sizes, and our network is trained on these 8 different types of cars.
Fig: Loss vs Accuracy for Car Makes on the left, Loss vs Accuracy for Car Types on the right.
The above figure shows the increase in accuracy and the reduction in loss of the network versus the number of training iterations. A loss function in Caffe maps the goodness of the current learned weights to a scalar value. Hence, as shown in the figures above, the loss decreases as training progresses, since the network learns more from the data and the weight values become more refined.
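As a concrete illustration (this is not our training code), the softmax cross-entropy that Caffe's `SoftmaxWithLoss` layer computes maps one prediction to a scalar: a confident correct prediction gives a small loss, while an uncertain one gives a large loss.

```python
import math

def softmax_cross_entropy(logits, label):
    """Map one prediction (raw class scores) to a scalar loss value."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]          # shift for numerical stability
    probs = [e / sum(exps) for e in exps]             # softmax probabilities
    return -math.log(probs[label])                    # negative log-likelihood of the true class

# A confident, correct prediction yields a small loss...
low = softmax_cross_entropy([6.0, 0.5, 0.2], label=0)
# ...while an uncertain prediction yields a larger loss.
high = softmax_cross_entropy([1.0, 0.9, 0.8], label=0)
```

Training drives this scalar down by refining the weights, which is exactly the downward curve seen in the loss plots.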
After training, we combined two fine-tuned GoogLeNets into a single network, shown in the figure below as GoogLeNet-1 and GoogLeNet-2. The figure coarsely depicts a model with two fine-tuned GoogLeNet architectures sharing an input. The network on top (GoogLeNet-1) classifies the input image/frame by vehicle type, and the network on the bottom (GoogLeNet-2) classifies the same input image/frame by vehicle make. In Caffe, combining the two networks is done by creating one network definition file containing both GoogLeNet-1 and GoogLeNet-2, and renaming the original GoogLeNet's layers so that every layer name is unique across the two copies.
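The renaming step can be automated with a small script. The sketch below is illustrative (the helper name and the `_make` suffix are our own, not part of Caffe); it rewrites the `name`, `bottom`, and `top` fields in a prototxt while leaving the shared input blobs untouched, so both copies still read the same input.

```python
import re

def rename_layers(prototxt_text, suffix, shared=("data", "label")):
    """Append a suffix to every layer/blob name in a Caffe prototxt so two
    copies of the same architecture can coexist in one definition file.
    Blobs listed in `shared` (the common input) are left unchanged."""
    def repl(match):
        key, name = match.group(1), match.group(2)
        if name in shared:
            return match.group(0)  # keep the shared input blob as-is
        return f'{key}: "{name}{suffix}"'
    return re.sub(r'(name|bottom|top):\s*"([^"]+)"', repl, prototxt_text)

snippet = 'layer { name: "conv1" type: "Convolution" bottom: "data" top: "conv1" }'
print(rename_layers(snippet, "_make"))
# -> layer { name: "conv1_make" type: "Convolution" bottom: "data" top: "conv1_make" }
```

Running the same helper once with `_make` and once with `_type` produces two non-colliding copies that can be concatenated into one definition file.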
Fig: Network Architecture for combined GoogLeNet Networks
Fig: Base Learning Rate vs. Accuracy. Data collected from Spearmint
Using a software package called Spearmint, we were able to find the combination of hyper-parameters that resulted in the best accuracy. The figure above shows the network's performance across different base learning rates.
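Spearmint itself performs Bayesian optimization; as a simplified stand-in for what the search does, here is a random search over base learning rates. The function names and the toy objective are our own illustrations, not Spearmint's API.

```python
import math
import random

def tune_learning_rate(train_and_eval, trials=10, lo=1e-5, hi=1e-1, seed=0):
    """Hypothetical stand-in for a hyper-parameter search: sample base learning
    rates log-uniformly and keep the one with the best validation accuracy."""
    rng = random.Random(seed)
    best_lr, best_acc = None, -1.0
    for _ in range(trials):
        lr = 10 ** rng.uniform(math.log10(lo), math.log10(hi))  # log-uniform sample
        acc = train_and_eval(lr)
        if acc > best_acc:
            best_lr, best_acc = lr, acc
    return best_lr, best_acc

# Toy objective whose "accuracy" peaks near lr = 1e-3; a real run would train the network.
toy = lambda lr: 1.0 - abs(math.log10(lr) + 3) / 5
best_lr, best_acc = tune_learning_rate(toy, trials=50)
```

Bayesian optimization improves on this by modeling the objective and sampling promising regions more often, which is why Spearmint needs fewer training runs than blind search.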
Fig: Classification results for car Make
Fig: Classification results for car Type
The above two figures show the classification results for car Make and Type respectively.
Fig: Computation Time per image classified using Pi CPU on the left, and time taken by the NCS on the right
The figure on the left shows the time taken per image using the Pi and the external swap space defined on the USB 3 flash drive. The average time taken per image was 3.105 seconds. This contrasts starkly with the figure on the right, which shows the time taken for the Pi to classify images with the help of the NCS: an average of 0.0988 seconds per image.
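A minimal sketch of how such a per-image average can be measured (timing harness only; `classify` is a placeholder for the real inference call):

```python
import time

def average_latency(classify, images):
    """Time a classification call over a batch and return seconds per image."""
    start = time.perf_counter()
    for img in images:
        classify(img)
    return (time.perf_counter() - start) / len(images)
```

The same harness was reusable for both the CPU-only and the NCS-assisted runs, so only the `classify` callable changes between tests.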
Fig: Memory usage using Intel’s Movidius NCS and the pi CPU.
The memory usage of the two tests is shown in the above figure. The measurement includes all memory necessary to classify the images one at a time, but excludes the memory needed to display them.
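One way to capture this kind of peak-memory figure from within Python is the standard `resource` module (a sketch of the technique; our actual measurement tooling may have differed):

```python
import resource

def peak_memory_kb():
    """Peak resident set size of this process (kilobytes on Linux;
    note ru_maxrss is reported in bytes on macOS)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

before = peak_memory_kb()
buf = bytearray(8 * 1024 * 1024)  # allocate ~8 MB so the peak visibly moves
after = peak_memory_kb()
```

Because `ru_maxrss` is a high-water mark, it captures the worst case across the whole run rather than the usage at any single moment.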
Fig: Image distribution of our added color dataset.
For our second semester, we added a color network that classifies vehicle color, measured the latency and throughput of multiple NCS sticks and of multiprocessing on the Pi, and tested the speeds of different network architectures. We started by trying to improve classification speed by adding more NCS sticks. Meanwhile, the Caffe team worked on a color network capable of classifying 8 different colors. Shown in the figure above is the image distribution for the new color dataset.
Fig: Image latency in seconds of images per core on the left, and images with 1 NCS stick on the right.
The figure above on the left shows the latency, in seconds, for the Raspberry Pi 3 to classify one image per core, using 3 cores for classification. Each core is handed images from a pool of 1,000, so each core processes roughly 333 images as fast as it can. We see from this that the Pi takes about 3-4 seconds per image per core. On the right we show the latency, in seconds, of classifying with one Movidius stick. Note that the y-axis scales differ significantly between the two plots.
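The per-core test can be sketched with Python's `multiprocessing` module; the classifier below is a hypothetical stand-in for the real Caffe forward pass.

```python
from multiprocessing import Pool

def classify(image):
    """Hypothetical per-image classifier; the real version runs a network forward pass."""
    return sum(image) % 8  # stand-in "type" label

def classify_all(images, cores=3):
    """Split the image list across worker processes, one classifier per core."""
    with Pool(processes=cores) as pool:
        return pool.map(classify, images)

if __name__ == "__main__":
    fake_images = [[i, i + 1, i + 2] for i in range(9)]
    labels = classify_all(fake_images, cores=3)
```

`Pool.map` handles the roughly-equal split of the 1,000 images across the 3 workers, which is why each core ends up classifying about a third of the batch.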
Fig: Image throughput for Pi CPU cores on the left, and a direct comparison to the NCS sticks on the right.
Our throughput is measured in frames per second (fps), and the figures above show the speed difference between adding Pi CPU cores and adding NCS sticks. We see a speed increase of about 23.13x from the sticks. The Pi by itself can classify images at around 0.3 - 0.75 frames per second; if you've ever seen a video feed at around 1 frame per second, you know how insanely slow that is. Conversely, with 3 NCS sticks processing the same network we see a speedup to almost 16 frames per second! That is a significant difference, but we are still using the same network. The network used affects both the speed we see and our accuracy, so we compared several networks and their accuracies next.
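As a back-of-envelope check on these numbers (the baseline value here is an assumption chosen to sit inside the Pi's measured 0.3-0.75 fps range):

```python
def speedup(fps_baseline, fps_accelerated):
    """Throughput speedup factor: how many times faster the accelerated setup is."""
    return fps_accelerated / fps_baseline

# Assuming a Pi-only baseline of ~0.69 fps against ~16 fps with three NCS sticks:
print(round(speedup(0.69, 16.0), 2))  # -> 23.19
```

A baseline near 0.69 fps is consistent with the ~23x speedup reported above.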
Fig: Accuracy comparison of GoogLeNet, SqueezeNet, and MobileNet across our networks.
We trained all of our networks (color, type, make) on several of the top CNN architectures available, with the results shown above. For GoogLeNet we already knew the accuracy, so this is more a baseline for comparison than anything new. Next we trained MobileNet and found its accuracy about 11% worse than GoogLeNet's, but it was much faster, as shown below. Then we trained on SqueezeNet and found its accuracy to fall between MobileNet and GoogLeNet. Our color classifier performed well on each architecture, so we chose MobileNet for color as it was the fastest at classification.
Fig: Image latency for MobileNet on the left, and GoogLeNet vs. MobileNet throughput comparison on the right.
MobileNet is a much smaller network and was the first one we tested after GoogLeNet. We ran the same latency test with MobileNet, and the figure on the left shows that the latency per image is a little under half of what GoogLeNet took. The figure on the right shows just how much faster MobileNet is on the NCS sticks than GoogLeNet: with 3 NCS sticks running the same network, MobileNet achieves almost 34 frames per second, a little over double the maximum of GoogLeNet in our previous throughput testing.
Fig: SqueezeNet throughput on the left, and comparison to all three networks on the right.
SqueezeNet was the last network we tested; comparing the speeds in the figures above, it performs roughly on par with MobileNet but with the benefit of higher accuracy. With 3 sticks running the same network we see a slight increase in frames per second too!
Fig: Power Consumption of the Pi 3 with connected devices.
One of the reasons we chose a platform such as the Pi is its low energy consumption. A Raspberry Pi consumes about 2.5 Watts at idle with nothing connected, which is negligible. We added up to 4 NCS sticks for our testing, and with a camera, keyboard, mouse, and flash drives also attached to the Pi, we found ourselves needing a USB hub. We purchased one that can supply up to 60 Watts total across its 6 ports, or 10 Watts per port. This was a little overkill, but it meant power was not a limiting factor in our testing. As the figure above shows, even with all the sticks and peripherals added, the power consumption of our test bench only rises to about 11 Watts. The USB hub alone draws more than the Pi, at almost 4 Watts at idle. This shows our project has very low power consumption even with all the added peripherals.
In the future, we would improve this system in a variety of ways. First, we would expand our dataset to include a larger universe of vehicles in more realistic settings. We would also work toward a more accurate implementation of our networks by further tuning the hyper-parameters, and toward a single network that classifies vehicle images by multiple criteria, i.e. a network that classifies cars by both type and make. We would then test this implementation against the same metrics outlined above.
To improve the accuracy of the device at the Raspberry Pi stage, we would implement a Single Shot Detector (SSD) to isolate vehicles in the frame, crop them out, and then classify the cropped images. In this way, we could reduce the amount of noise in the image, as well as classify all of the cars in the frame. We were very close to realizing this goal within our timeframe, but we were unable to achieve accurate classification without first saving the images as .JPG files, reopening them, and then classifying them. We believe this has to do with how the image is encoded within our Python script, but we have been unable to test this theory.
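The cropping step itself is plain array slicing on the camera frame; a sketch with hypothetical SSD box outputs (the frame and boxes below are stand-ins, and this does not address the encoding issue described above):

```python
import numpy as np

def crop_detections(frame, boxes):
    """Crop each detection (x0, y0, x1, y1 in pixels) out of the frame
    so every vehicle can be classified separately."""
    return [frame[y0:y1, x0:x1] for (x0, y0, x1, y1) in boxes]

frame = np.zeros((480, 640, 3), dtype=np.uint8)    # stand-in for a camera frame
boxes = [(10, 20, 110, 220), (300, 50, 400, 150)]  # hypothetical SSD box outputs
crops = crop_detections(frame, boxes)
```

Each crop would then be resized to the classifier's input dimensions before its forward pass.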
To improve the speed of the device, we could rebuild this project on slightly more powerful hardware. Systems like the Orange Pi and the ODROID offer substantial improvements in system resources for only slightly higher cost and power consumption. We could also rework our test scripts in a lower-level language, such as C++, to achieve faster image classification. Since lower-level languages are inherently closer to machine code, they generally run faster than scripting languages, although they are substantially harder to work with.