Co-authored by Archana Iyer
As the Internet of Things (IoT) becomes more mainstream, the number of devices connected to the web is increasing by the millions. In fact, Forbes estimates that the number of Internet-connected devices will exceed 75 billion by 2025 [1]. Many, if not most, of these devices will be smart.
The existence of such devices demands cloud-based services for data collection and analysis. A significant part of this analysis is being done using Deep Learning models hosted on external servers.
Imagine a security camera installed outside your home. You decide to make it a “smart camera” by running machine learning models that detect abnormal behavior. Would you still prefer to run these models, with such valuable data, in the cloud? Latency, internet bandwidth costs, and security concerns are prompting many to move their analysis away from the cloud and closer to the devices where the data is being generated.
In this article, we will try to explain why moving computation to the edge can be beneficial and how to do it using a Raspberry Pi and the Intel Neural Compute Stick (NCS).
What is the Neural Compute Stick?
The Neural Compute Stick (NCS) is a low-power device the size of a USB stick that can be used to run machine learning model inference on the edge [2]. The NCS contains a Vision Processing Unit (VPU) that can run even large models like Google’s Inception without breaking a sweat (we benchmark some of these results later on). This is important because the Raspberry Pi struggles to run even small models effectively.
The NCS also comes with a set of tools that make prototyping, profiling, and deploying models on the edge easy. Furthermore, its tiny, fanless design takes up little space and consumes little power. Using the NCS on the Pi is therefore well suited to edge computing and is a low-cost replacement for cloud-based services.
Before we show you how to use the NCS to perform inference, it is worth understanding why the edge is so important for deploying models.
Why is Inference Vital for Edge Computing?
There are several compelling reasons for shifting computation away from the cloud and onto the edge, the most important being latency. Here, latency refers to the time it takes to send data to a server and receive the response. A delay of a few seconds might not be a problem for your smart home applications, but in a commercial setting those few precious seconds, or even microseconds, can cause a machine to break down.
Furthermore, many industrial processes take place where an internet line may not be available: a mine, for example. Even where a connection is possible, most companies are hesitant to send data over a potentially insecure internet connection and risk exposing it to hackers. This risk aversion prompts many companies to keep their data in-house.
Finally, if you have many sensors, you will probably be streaming data on the order of gigabytes every hour. Even a simple camera like a Pi camera will generate nearly 50 GB of data per hour at a resolution of 1280×720 and 30 fps [5]. It does not make sense for companies to pay for the bandwidth to send that much data when most of it is discarded anyway. Hence, it is crucial to shift that computation to where the data is generated.
However, machine learning models can be huge and require not only considerable computational power but also a lot of memory. This makes it difficult and time-consuming both to load the models into the memory of small devices and to perform all the calculations: Google’s Inception model has more than 120,000,000 [4] operations!
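To put that figure in perspective, here is a small back-of-the-envelope sketch that converts the roughly 50 GB per hour quoted above into the sustained uplink bandwidth it would require. It assumes decimal units (1 GB = 10^9 bytes), and the function name is ours, purely for illustration.

```python
def sustained_mbit_per_s(gb_per_hour):
    """Convert a data volume in GB per hour into the average Mbit/s needed to upload it."""
    bits_per_hour = gb_per_hour * 1e9 * 8   # decimal GB -> bits (assumption: 1 GB = 10^9 bytes)
    return bits_per_hour / 3600 / 1e6       # spread over one hour, expressed in megabits


camera_gb_per_hour = 50  # figure quoted in the text for one 1280x720, 30 fps camera

print(f"{sustained_mbit_per_s(camera_gb_per_hour):.0f} Mbit/s sustained")
print(f"{camera_gb_per_hour * 24} GB per day")
```

Under these assumptions, a single camera would need an uplink of about 111 Mbit/s around the clock, or 1,200 GB per day, before any other sensor is added.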
Most self-driving cars are becoming less reliant on Internet services [3]. There are several reasons for doing so:
- Firstly, it helps reduce the threat from hackers, which can compromise the security of the system; what if the system applies the brakes at the wrong time?
- Secondly, there are instances where the car is not in the vicinity of an Internet connection, like while moving through tunnels.
- Lastly, it is impossible to expect that each turn that a self-driving car takes will be calculated and recorded by cloud services. Imagine the massive amounts of data that will need to be sent to the cloud!
Hence, self-driving cars need edge computing for better reliability, and an accelerator such as the Movidius NCS can help speed up the machine learning models they depend on.
Steps to Use Movidius Neural Compute Stick
Requirements for using NCS: The NCS can be used on any system that has at least 1 GB of RAM and 4 GB of free storage space and runs either 64-bit Ubuntu 16.04 or Raspbian Stretch.
In this example, we will use the NCS to run a network that can predict handwritten MNIST digits.
STEP 1: Install the NCS Toolkit
To install the SDK, clone its repo and then run make install in the directory. The repo comes with a few examples to test your installation, which you can run with make examples.
STEP 2: Make the TensorFlow model
The NCS toolkit first compiles a TensorFlow or Caffe model into an optimized format that can be run on the Neural Compute Stick. Even though this process can take some time, it happens just once and will save you time in the long run when you have to load TensorFlow graphs multiple times (we benchmark this in the next section).
We make a simple TensorFlow model with a couple of convolutional layers and two fully connected layers. The NCS toolkit does not support many TensorFlow operations, i.e., many TensorFlow operations cannot be converted into the NCS graph type. This is the reason why we use low-level TensorFlow APIs.
The full model is available here.
STEP 3: Making an Inference specific model
Before moving on to this step, make sure that you have saved the previous trained model.
The NCS is used only for inference, so we remove any training and data-preprocessing specific operations to reduce the size of the graph.
Restore all the variables and weights from the previously saved model and then save them in this inference specific model. This step should not take much time as we are not training our model here. The inference specific model can be found here.
This step can be a bit confusing and cause a few errors. It is best to copy the code from the previous step into a new file and then make changes to it.
STEP 4: Converting TensorFlow model to NCS Graph type
Before moving on, make sure that you have saved the model from the previous step. The NCSDK accepts either .pb files or .meta files.
The conversion is done using the mvNCCompile command. Change to the directory where your saved models are and run this command:
mvNCCompile model.meta -s 12 -in input -on output -o mnist_inference.graph
(This is assuming that your model is saved with the name model.meta)
mvNCCompile takes a few arguments:
- -s = The number of SHAVEs to use. SHAVEs are the vector processing cores inside the VPU, so assigning more of them to the graph generally speeds up inference. The default value is 1; the command above uses 12.
- -in = The name of the input node. In this case, it is input.
- -on = The name of the output node. In this case, it is output.
- -o = The name of the output NCS graph file.
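If you prefer to drive the compilation from a script, the same command can be assembled in Python. This is only a sketch: the helper names build_compile_command and compile_graph are ours, the defaults mirror the command shown above, and actually running compile_graph assumes the NCSDK is installed and mvNCCompile is on your PATH.

```python
import subprocess


def build_compile_command(model_path, shaves=12, input_node="input",
                          output_node="output", graph_path="mnist_inference.graph"):
    """Return the mvNCCompile argument list for the flags described above."""
    return [
        "mvNCCompile", model_path,
        "-s", str(shaves),      # number of SHAVEs to use
        "-in", input_node,      # name of the input node
        "-on", output_node,     # name of the output node
        "-o", graph_path,       # output NCS graph file
    ]


def compile_graph(model_path, **options):
    """Run mvNCCompile (assumes the NCSDK is installed and on the PATH)."""
    command = build_compile_command(model_path, **options)
    subprocess.run(command, check=True)  # raises CalledProcessError if compilation fails
    return command


# Printing the command does not require the NCSDK
print(" ".join(build_compile_command("model.meta")))
```

Keeping the argument list in one function makes it easy to compile the same model with different SHAVE counts and compare the results.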
STEP 5: Performing Inference
Finally, once the Movidius graph is ready, we can write a simple Python script that uploads the graph to the Neural Compute Stick and performs inference. The complete code for this is available here.
- First, we import the mvncapi module from the mvnc library
- Then we check to see if any Neural Compute Stick devices are attached to the Pi. If there aren’t any, we exit the program
- Then we open the device and load the saved graph into it.
- Finally, we can load the image into the NCS for inference and get the result. NCS works with only 16-bit floating point values for now.
- After you are done, you can deallocate the graph and close the device.
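The steps above can be sketched roughly as follows. This is not the complete script linked earlier; it is a minimal outline that assumes the first-generation NCSDK Python API (EnumerateDevices, OpenDevice, AllocateGraph, LoadTensor, GetResult), and the helper names to_half and run_inference are ours. The mvnc import sits inside the function so the file can be loaded on a machine without the SDK.

```python
import numpy as np


def to_half(image):
    """Convert an image to the 16-bit floats the NCS expects."""
    return np.asarray(image, dtype=np.float16)


def run_inference(graph_path, image):
    """Load a compiled graph onto the first NCS found and classify one image."""
    from mvnc import mvncapi as mvnc  # assumption: NCSDK v1 Python API is installed

    devices = mvnc.EnumerateDevices()
    if not devices:
        raise RuntimeError("No Neural Compute Stick found")

    device = mvnc.Device(devices[0])
    device.OpenDevice()
    try:
        # The compiled graph file is read as raw bytes and allocated on the stick
        with open(graph_path, "rb") as graph_file:
            graph = device.AllocateGraph(graph_file.read())
        try:
            graph.LoadTensor(to_half(image), "user object")
            output, _ = graph.GetResult()
        finally:
            graph.DeallocateGraph()
    finally:
        device.CloseDevice()

    # Index of the highest score, i.e. the predicted digit for an MNIST model
    return int(np.argmax(output))
```

Calling run_inference("mnist_inference.graph", image) requires a stick to be plugged in; the try/finally blocks make sure the graph is deallocated and the device closed even if inference fails.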
To get more details about the various APIs that mvnc provides, you can see their documentation [6].
On running the model on a Raspberry Pi 3 with the Neural Compute Stick, we get the following results:
The time taken to load the model into the Neural Compute Stick is a bit less than 2.5 seconds whereas the time taken to perform inference on a single image is 0.052 seconds.
On running the normal TensorFlow model, we get these results:
This time, just loading the TensorFlow graph takes more than 16 seconds! That is nearly 7 times longer! You might notice that a single inference with the TensorFlow model takes about half as much time as on the NCS, but as we will see next, larger models take much longer to run.
Let’s run an Inception model to test it out. On the NCS, loading the graph and performing a prediction takes less than a second! In fact, performing the inference for 24 images takes slightly more than a second.
On the other hand, just loading the Inception TensorFlow graph takes more than 90 seconds! And performing a single inference takes twice as long!
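If you want to collect timings like these on your own setup, a small wrapper around time.perf_counter is enough. This is a minimal sketch and not necessarily how the numbers above were measured; fake_load_graph is a stand-in workload so the example runs without a stick, and you would replace it with your own graph-loading or inference call.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def fake_load_graph():
    """Stand-in for loading a graph; swap in the real call when benchmarking."""
    time.sleep(0.01)
    return "graph"


graph, load_seconds = timed(fake_load_graph)
print(f"load time: {load_seconds:.3f} s")
```

Measuring load time and per-image inference time separately, as this article does, shows whether a model is dominated by its one-off start-up cost or by its per-prediction cost.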
By now you should be convinced that shifting to the edge is far more advantageous than relying on the cloud. However, before you dive in, there are a few things that you should take into consideration.
In the next article, we will show you how to run this Inception model on the NCS to perform face recognition using FaceNet.
Why Not NCS?
Even though this processor is one of a kind and specializes in inference, the community built around it has several pitfalls. While setting up the Movidius, it takes a long time to find the right open-source resources, which are needed to work through the bugs and errors that one may come across.
Intel has been known for not offering adequate support for devices such as the Intel Galileo, the Intel Edison, and several other open-source devices. It would encourage the community to have more blogs and forums that help with this problem.
Reach Out to Us
We encourage people to reach out to us with queries and suggestions. We understand that the available knowledge about the Movidius is limited, and we would love to hear from you so we can expand on ideas together.