Pixetto Gesture Recognition: Rock Paper Scissors

In addition to self-driving cars and vision-based robotics, computer vision and AI technologies are increasingly being applied in the field of Human-Computer Interaction (HCI). If visual sensors can detect and recognize human motions, interactions between computers and users become much more intuitive and natural.

One example of HCI is hand gesture recognition, where we want the computer to be able to recognize different gestures we form with our hands in front of a camera. This tutorial will go through a simple gesture recognition project for “rock”, “paper”, and “scissor”.

By the end of this tutorial, you will learn:

  • how to collect and process training image data from video files
  • how to classify images of gestures with deep learning
  • how to run your deep neural network on a Pixetto
  • how to write a simple Scratch program to visualize Pixetto’s output

No prior knowledge is required for this project. Let us begin!

Project Overview

Our main task for this project is to let our Pixetto differentiate between gestures. Because Pixetto captures input in the form of video frames, this is essentially a task of image classification, i.e. classifying images of hands into different gestures.

A common tool for image classification is deep neural networks, which can convert input data into numeric labels. For example, our rock-paper-scissors project will need a neural network that can convert input to output as follows:

Gesture    Scissor   Rock   Paper   Background
Output     0         1      2      3

Neural networks correctly perform classifications through a process called training. During training, we show the neural network large amounts of training data (in our case, images of gestures) and tell it what each gesture is. Based on this information, the neural network adjusts its computations until eventually it performs the right classifications.

Therefore, before we can begin the deep learning process, we need to first collect large amounts of training images.

Collecting Training Data

One of the fastest ways to collect training data is through videos. Videos are, essentially, just sequences of static images (i.e. frames), so by recording a video of our gestures, we can extract the individual frames to form our image dataset.

To do this, connect a Pixetto to your computer and open up Pixetto Utility. Near the bottom of the right-hand panel, you will see a “Record Video” tool that records the camera input and saves it as an MP4 video file. Form a “scissor” gesture with your hand, then press the record button to start.

We want our neural network to learn that a “scissor” gesture, no matter the orientation, is still a “scissor”. To do this, we need to add a bit of variation in the training data. So as the video records, slowly shift your hand around, rotate it, move it towards and away from the camera, or change the spacing between your fingers.

For better performance, also make sure that your hand is in front of a plain background. After ten seconds, stop and save the recording. Repeat this process for the “rock” and “paper” gestures as well.

Besides these three gestures, we also need to record a fourth background video. This will be used to train the neural network to recognize when there is no gesture present. (Otherwise, it will try to interpret everything as a gesture, including an empty scene.) Simply record another ten-second video but without any hand gestures in view.

Processing Training Data

With our videos recorded, we can move on to the next step: processing the videos into the image data we need for training. Pixetto’s Machine Learning Accelerator provides an easy-to-use tool for this. Log in to the ML Accelerator, then go to the Machine Learning tool on the home page. (This is the interface we will use later to set up our neural network.)

Above the main workspace, you will see an option to “Upload Video.” Choose one of your videos to upload and enter an appropriate label name (e.g. “scissor”). Next, drag a rectangle around your hand gesture and click on “Start tracking”.

The video processing tool will start going through the video frame by frame to track the position of the gesture you highlighted. After processing completes, you will see a video playback of the results. The areas bounded by the green box are what will be extracted for your training images. If the bounding is consistent, click “yes” to confirm. If not, click “no” to adjust your target selection and try again. Repeat for all of your videos.

To see the images that were generated, return to the homepage and click on “Python”. Here, you can navigate through the files and directories under your account. Your training data are stored under the “data” directory and organized into folders for each label.

These are some examples of “scissor” images that were extracted from the video.

Now that we have our training data ready, we can move on to the next step: setting up a deep neural network to perform image classification.

Setting up a Convolutional Neural Network

There are many different types of deep neural networks, and the type typically used for image classification is called a Convolutional Neural Network (CNN). Let us start by creating a simple CNN model. In your Machine Learning workspace, go to the “Popular combinations” tab and drag the entire set of blocks onto your workspace.

These blocks, which are connected sequentially, represent layers in a neural network. Each layer performs specific computations to determine the final output “label”. Right now, your model should contain an input layer, two sets of 2D convolution and maxpooling layers, a flatten layer, two dense layers, and an output layer.

The sequence and types of layers you use can be adjusted to form different neural network architectures. The configurable settings of each layer, called hyperparameters, can also be tuned to improve performance. These adjustments are typically made through trial and error, guided by training metrics that we will discuss later.

For now, however, we will make a couple of quick adjustments that should improve performance. Add a third set of 2D Convolution and 2D Maxpooling layers right after the input layer by dragging the blocks from the “Core Layers” tab on the left.

Next, change the hyperparameters to be as follows:
• input with x-size of 60 and y-size of 60
• first 2D Convolution with 8 filters of kernel size 3
• second 2D Convolution with 8 filters of kernel size 5
• third 2D Convolution with 16 filters of kernel size 5

In addition, change the number of classes in the output layer from 2 to 4. This is because we want our neural network to classify gestures into four classes: “rock”, “paper”, “scissor”, and “background”. Finally, set the batch size to 64.
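To see how these numbers fit together, we can trace the shape of the data through the network. The arithmetic below assumes stride-1 “valid” convolutions and 2×2 max pooling, which are common defaults; the ML Accelerator’s exact settings may differ:

```python
def conv2d(h, w, kernel, filters):
    # stride-1 "valid" convolution: each spatial dimension shrinks by kernel - 1
    return h - kernel + 1, w - kernel + 1, filters

def maxpool(h, w, c, pool=2):
    # 2x2 max pooling halves each spatial dimension (integer division)
    return h // pool, w // pool, c

h, w, c = 60, 60, 3            # input layer: 60 x 60 RGB image
h, w, c = conv2d(h, w, 3, 8)   # 1st conv:  8 filters, kernel 3 -> 58 x 58 x 8
h, w, c = maxpool(h, w, c)     #                               -> 29 x 29 x 8
h, w, c = conv2d(h, w, 5, 8)   # 2nd conv:  8 filters, kernel 5 -> 25 x 25 x 8
h, w, c = maxpool(h, w, c)     #                               -> 12 x 12 x 8
h, w, c = conv2d(h, w, 5, 16)  # 3rd conv: 16 filters, kernel 5 ->  8 x  8 x 16
h, w, c = maxpool(h, w, c)     #                               ->  4 x  4 x 16
flat = h * w * c               # flatten layer feeds 256 values to the dense layers
```

Notice how each convolution/pooling pair shrinks the image while adding feature channels, so the dense layers at the end only have to work with 256 numbers instead of 10,800 raw pixel values.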

Later, we will go through what each of these hyperparameters does and how you can tune them on your own. However, for now we have a basic neural network that can be trained!

Training a Neural Network

To train the model, click on the “Start” button in the upper right-hand corner. The console that pops up will first print a summary of your model. Then, the training will begin.

Training is divided into epochs, where each epoch is a complete pass of all the training data through the neural network. For each epoch, metrics like training loss, training accuracy, validation loss, and validation accuracy will be printed on the console. (You can also track these metrics by toggling the “Chart” button above.)

  • loss: the error of the neural network that training tries to minimize.
  • accuracy: the fraction of correct classifications out of all classifications.
  • training metrics: metrics computed from examples the neural network has seen during training.
  • validation metrics: metrics computed from examples the neural network hasn’t seen; they indicate the model’s ability to “generalize” beyond the training data.

Generally, when your accuracy remains low throughout the training process, this is an indicator of underfitting: when your model is not complex enough relative to the task. On the other hand, if your training accuracy is high but your validation accuracy remains low, this is a likely indicator of overfitting: when your model fits the training data so well it loses its ability to generalize to additional, unseen data.
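This rule of thumb can be written down as a tiny diagnostic function. The 0.7 and 0.15 thresholds below are illustrative assumptions, not standard values; in practice you judge the curves by eye:

```python
def diagnose_fit(train_acc, val_acc, low=0.7, gap=0.15):
    """Rough heuristic for reading training metrics (thresholds are illustrative)."""
    if train_acc < low:
        return "underfitting"   # model too simple relative to the task
    if train_acc - val_acc > gap:
        return "overfitting"    # fits training data but fails to generalize
    return "ok"
```

For example, a training accuracy of 0.95 with a validation accuracy of 0.60 would be flagged as overfitting, while 0.90 versus 0.85 would look healthy.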

Training will end after five epochs. You should notice the accuracy gradually increasing and the loss decreasing throughout the process.

Once training completes, click “Download to file” to save your trained model as a .tflite file.

Next, open up the Pixetto Utility application and make sure your Pixetto is connected to your computer. Under the “Install Neural Network Model” panel, click on the “Model Path” field. Upload your .tflite file by navigating to it, then press “OK.” You will see that the Pixetto has been automatically configured to the Neural Network function. Change the object detection algorithm to “Central” and apply the changes.

Now, your Pixetto will feed the portion of each frame bounded by the green box into your neural network model for hand gesture recognition! You will see two numbers above the bounding box. The first is the number associated with the gesture, which depends on the order you uploaded your training data (e.g. if you uploaded “scissor” first, it will be represented by 0). The second number, in parentheses, is the confidence of the neural network in making that particular classification.

Tuning Hyperparameters (Advanced)

If you are not satisfied with the performance of the neural network, there are many ways to improve it through tuning its hyperparameters. Hyperparameters come in two flavors: those associated with the model, and those associated with the training process.

To tune the former, we need to first understand the purpose of each layer in a CNN:

  • The input layer takes our image data and scales it down to a given size. This reduces the computational cost of training when not all of the detail of a high-resolution image may be needed to extract relevant, distinguishing features. Therefore, increasing input size means higher resolution image data is used by the neural network, giving it more details to work with.
  • In convolutional layers, small filters (or kernels) are applied across different regions of the input image, where each filter looks for a specific feature in that region, such as a line, edge or curve.

    Therefore, increasing the # of convolution filters allows the layer to look for more features. And increasing the kernel size allows it to look for larger features.
  • The maxpooling layers down-sample input data by only outputting the maximum value from each region of the previous layer. This “summarizes” the features to reduce computational costs and build translation invariance.
    Therefore, increasing the maxpooling pool size decreases the resolution of the processed image data going into the following layers.
  • The dense layers are composed of neurons each connected with all of the neurons of the previous layer. Whereas convolutional and maxpooling layers extract local features, dense layers are used for classification based on these extracted features.
    Therefore, increasing the # of dense layer units better fits the model to the data, but layers that are too large can cause overfitting.
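As a concrete illustration of the maxpooling operation described above, here is 2×2 max pooling applied to a small 4×4 grid (the numbers are made up for the example):

```python
def maxpool_2x2(x):
    """Down-sample a square grid by keeping only the max of each 2x2 region."""
    n = len(x)
    return [[max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
             for j in range(0, n, 2)]
            for i in range(0, n, 2)]

grid = [[1, 3, 2, 1],
        [4, 6, 5, 1],
        [1, 2, 3, 2],
        [0, 1, 1, 4]]
pooled = maxpool_2x2(grid)  # [[6, 5], [2, 4]]
```

Each 2×2 region collapses to its single largest value, so the 4×4 input becomes a 2×2 output: the strongest feature responses survive while the resolution (and computational cost) drops by a factor of four.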

In general, the number and sizes of layers should increase with the complexity of the classification task you want to perform. For example, a neural network classifying cats and dogs would typically be larger than a neural network classifying circles and squares.

Beyond adjusting model hyperparameters, you can also tune hyperparameters related to the training process. The two most frequently used are batch size and epochs.

  • Epochs controls how long the training occurs. If you see that your model is slowly training but it needs more time, you can increase the number of epochs.
  • Batch size is the number of images that the model sees in one iteration of training. With large batches, “noise” in the training data becomes less influential. However, that noise also acts as a mild regularizer, so removing it can make the model more likely to overfit and generalize poorly.
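Batch size and epochs relate to training length in a simple way: one epoch takes ceil(number of images / batch size) iterations. With a hypothetical dataset of 1,200 images and the batch size of 64 set earlier:

```python
import math

def steps_per_epoch(n_images, batch_size):
    # iterations needed for every training image to pass through the model once
    return math.ceil(n_images / batch_size)

steps = steps_per_epoch(1200, 64)  # 19 iterations per epoch
```

Doubling the batch size roughly halves the iterations per epoch, which is one reason larger batches train faster per epoch on the same hardware.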

There are countless other hyperparameters, such as activation functions and learning rate. However, the seven mentioned above are the easiest to understand and arguably the most important for getting your neural network to a decent level of accuracy.

Take some time to play around with the different hyperparameters and see their effects on your neural network model. Hopefully, you’ll see its performance improve!

Visualizing Results with Scratch

Right now, as you can see in the Pixetto Utility, the output of each neural network classification is a number (0 through 3) corresponding to the identified gesture. We can create a simple Scratch program for a more appealing visualization of these results: instead of a number or text label, our program will display a picture of the recognized gesture.

Return to the homepage of the ML Accelerator and click on “Blocks”. Here, you will see an interface for writing and running Scratch programs. In the lower left corner, click on “Add Extension” and select the “Via AI learning kit” extension at the bottom of the list. This will allow our Scratch program to communicate with the Pixetto and read in its output.

First, let us set up the costumes of our sprite. Click on the “Costumes” tab on the top of the interface, and on the lower left corner you will see a button to add costumes. For each gesture, add a new costume by uploading a representative image from your computer, and name it accordingly. Delete the default costume.

Next, return to the “Code” tab. From the “Events” section on the left, drag the “When green flag clicked” block onto the workspace. Add a “forever” loop below it. Now, any code inside the loop will run continuously after you press the green flag button.

After the program begins, the first thing we want to check is whether the Pixetto detects the background or a hand gesture. Inside the loop, add an if-else block that checks whether the “object type” output by the Pixetto is equal to whichever number your neural network associates with the background. If this condition is satisfied, hide the sprite; otherwise, show it.

Under else, add another set of if-else blocks (as shown below) for all other values of “object type”. For each object type, switch the sprite’s costume accordingly. Here, object type 0 corresponds to scissor, 1 corresponds to rock, and 2 (if none of the other conditions are satisfied) corresponds to paper.
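Since the Scratch blocks are hard to show in text, here is the same branching logic sketched in Python. The number-to-gesture mapping depends on the order you uploaded your training data; the values below match the example in this tutorial:

```python
BACKGROUND = 3  # whichever number your model assigned to the background class

def costume_for(object_type):
    """Mirror of the Scratch if-else chain; None means 'hide the sprite'."""
    if object_type == BACKGROUND:
        return None
    if object_type == 0:
        return "scissor"
    if object_type == 1:
        return "rock"
    return "paper"  # object type 2: the remaining case
```

In the Scratch version, the outermost if-else hides or shows the sprite, and the nested if-else blocks perform the costume switch.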

With this, your simple Scratch program is complete.

Before running this program, you’ll need to establish a connection with your Pixetto. To do this, click on “Not Connected” on the menu bar. Make sure that your Pixetto is plugged in and that Pixetto Utility is closed. Connect to your Pixetto device.

Now, our Scratch program is ready to run! Click on the green flag, and you will see that the program displays different images depending on the gestures you make!

Conclusion and Future Work

Congratulations! In this tutorial, you have successfully collected training data, trained a deep neural network, and built a Scratch program that uses your neural network for basic gesture recognition.

However, there are a lot of ways that you can use the skills you have learned to continue working on this project and make it even better. For example, you can:

  • develop the Scratch program into a rock paper scissors game
  • train your neural network on more gestures, such as gestures for numbers
  • continue tuning hyperparameters or experiment with different architectures

For casual users, the neural network model used in this article can be downloaded from here.

Author | Ethan Wu is currently a rising sophomore at Columbia University’s School of Engineering and Applied Science. He is pursuing a B.S. degree in Computer Science and is interested in exploring the applications of machine learning and computer vision. In his free time, Ethan also enjoys free-writing, lion dancing, and lap swimming. Contact him at ew664@columbia.edu.