I think it’s a great project! But it could do with a few improvements:

# Neuron Type^{(1)}

Suppose we have a network of perceptrons that we’d like to use to learn to solve some problem. For example, the inputs to the network might be the raw pixel data from a scanned image of a signature. And we’d like the network to learn weights and biases so that the output from the network correctly classifies the image. To see how learning might work, suppose we make a small change in some weight (or bias) in the network. What we’d like is for this small change in weight to cause only a small corresponding change in the output from the network.

If it were true that a small change in a weight (or bias) causes only a small change in output, then we could use this fact to modify the weights and biases to get our network to behave more in the manner we want. For example, suppose the network was mistakenly classifying an image as an “c” when it should be a “o”. We could figure out how to make a small change in the weights and biases so the network gets a little closer to classifying the image as a “o”. And then we’d repeat this, changing the weights and biases over and over to produce better and better output. The network would be learning.

The problem is that *this isn’t what happens when our network contains perceptrons*. In fact, a small change in the weights or bias of any single perceptron in the network can sometimes cause the output of that perceptron to completely flip, say from 0 to 1. That flip may then cause the behavior of the rest of the network to completely change in some very complicated way. So while your “o” might now be classified correctly, the behavior of the network on all the other images is likely to have completely changed in some hard-to-control way. That makes it difficult to see how to gradually modify the weights and biases so that the network gets closer to the desired behavior. Perhaps there’s some clever way of getting around this problem. But it’s not immediately obvious how we can get a network of perceptrons to learn.

We can overcome this problem by introducing a new type of artificial neuron called a **sigmoid neuron**. Sigmoid neurons are similar to perceptrons, but modified so that small changes in their weights and bias cause only a small change in their output. That’s the crucial fact which will allow a network of sigmoid neurons to learn.

Just like a perceptron, the sigmoid neuron has inputs, $ x_1 $, $ x_2 $, … But instead of being just 0 or 1, these inputs can also take on any values between 0 and 1. So, for instance, 0.638 is a valid input for a sigmoid neuron.

A sigmoid neuron’s output is $ \sigma(w \cdot x + b) $, where the sigmoid function is defined as:

$$ \sigma(z) = \dfrac{1}{1 + e^{-z}} $$

Torch implements this neuron type here.
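To make the “small change in, small change out” property concrete, here’s a minimal sketch in plain Python/NumPy (not Lua/Torch; the function names are mine, not from any library):

```python
import numpy as np

def sigmoid(z):
    """Squash any real-valued input smoothly into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_neuron(x, w, b):
    """Output of a single sigmoid neuron: sigma(w . x + b)."""
    return sigmoid(np.dot(w, x) + b)
```

Unlike a perceptron’s hard 0/1 step, nudging a weight here nudges the output by a proportionally small amount, which is exactly what gradient-based learning needs.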

_{(1) Excerpt with minor edits from Neural Networks and Deep Learning}

# Cost Function

I don’t see any use of a cost function in your code. I recommend you read this section in Neural Networks and Deep Learning for a good explanation of why you should be using one.

In short, the cost function returns a number representing how well the neural network maps training examples to the correct outputs. The basic idea is that the more “wrong” our network is at achieving the desired results, the higher the cost, and the more we’ll want to adjust the weights and biases to achieve a lower cost. We try to minimize this cost using methods such as gradient descent.

There are certain properties that you look for in a cost function, such as convexity (so gradient descent finds a global optimum instead of getting stuck in a local optimum). As the book suggests, I would lean towards using the cross-entropy cost function.

The way we implement this in Torch is with `Criterions`. Torch implements a bunch of these cost functions, and I encourage you to try different ones and see how they affect your neural net’s accuracy.
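To show what the cross-entropy cost actually computes, here’s a short sketch in plain Python/NumPy (this is the formula itself, not Torch’s `Criterion` API):

```python
import numpy as np

def cross_entropy_cost(a, y):
    """Cross-entropy cost for sigmoid outputs a and binary targets y.

    Low when a is close to y, and grows rapidly as the
    network becomes confidently wrong.
    """
    a = np.clip(a, 1e-12, 1.0 - 1e-12)  # avoid log(0)
    return -np.mean(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))
```

A confidently wrong output is penalized far more than a slightly wrong one, which is what drives the weights toward the desired behavior during gradient descent.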

# Over-fitting

It’s possible that you fit your training data **too** well, to the point where the network doesn’t generalize to new examples. An example of this is given in the picture:

Noisy, linear-ish data is fitted to both linear and polynomial functions. Although the polynomial function is a perfect fit, the linear version generalizes the data better.

I don’t know Lua very well, but looking at your code I don’t see any attempt to reduce over-fitting. A common approach to this is **regularization**. Since it’s too large a topic to cover in depth here, I’ll leave you to read up on it if you would like. It’s quite simple to use once its concepts are understood, as you can see from this Torch implementation here.
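The core idea behind the most common variant, L2 regularization, is small enough to sketch in plain Python/NumPy (the function names below are mine, for illustration only):

```python
import numpy as np

def l2_regularized_cost(base_cost, weights, lam, n):
    """Add the L2 penalty (lam / 2n) * sum(w^2) to the unregularized cost.

    Large weights now cost something, so the network prefers
    smaller weights unless the data really justifies big ones.
    """
    penalty = (lam / (2.0 * n)) * sum(np.sum(w ** 2) for w in weights)
    return base_cost + penalty

def l2_weight_update(w, grad_w, eta, lam, n):
    """Gradient-descent step with the extra decay term from the penalty:
    w <- (1 - eta * lam / n) * w - eta * grad_w."""
    return (1.0 - eta * lam / n) * w - eta * grad_w
```

The net effect is “weight decay”: every update shrinks the weights slightly, which discourages the wiggly, over-fitted solutions like the polynomial in the picture above.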

Another way to reduce over-fitting is by introducing **dropout**. At each training stage, individual nodes are “dropped out” of the net so that a reduced network is left. Only the reduced network is trained on the data in that stage. The removed nodes are then reinserted into the network with their original weights. The nodes become somewhat more insensitive to the weights of the other nodes, and they learn how to decide more on their own.

Dropout can also *significantly improve performance on unseen data*, which is especially important for deep networks!
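The mechanism is just a random mask applied to a layer’s activations during training. Here’s a sketch of the common “inverted dropout” formulation in plain Python/NumPy (not Torch’s `nn.Dropout`; the function name is mine):

```python
import numpy as np

def dropout_forward(activations, p_drop, rng, train=True):
    """Inverted dropout: during training, zero each unit with
    probability p_drop and scale the survivors by 1/(1 - p_drop)
    so the expected activation is unchanged. At test time the
    activations pass through untouched."""
    if not train:
        return activations
    mask = (rng.random(activations.shape) >= p_drop) / (1.0 - p_drop)
    return activations * mask
```

Because the surviving units are rescaled at training time, no adjustment is needed at test time; the full network simply runs as-is.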

# Gradient Checking

For more complex models, gradient computation can be notoriously difficult to debug and get right. Sometimes a buggy implementation will manage to learn something that can look surprisingly reasonable (while performing less well than a correct implementation). Thus, even with a buggy implementation, it may not at all be apparent that anything is amiss. Therefore, you should numerically check the derivatives computed by your code to make sure that your implementation is correct.

I found an implementation of gradient checking with Torch here. Be warned that this check is computationally expensive, so once you’ve verified that your implementation of backpropagation is correct you should turn off gradient checking.
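The check itself is simple: estimate each partial derivative numerically and compare it to what your backpropagation computes. A sketch in plain Python/NumPy (not the Torch implementation linked above):

```python
import numpy as np

def numerical_gradient(f, w, eps=1e-5):
    """Central-difference estimate of df/dw for each entry of w.

    f is a scalar-valued cost function of the weight array w.
    Compare the result against your backprop gradient; a large
    discrepancy points at a bug in the analytic gradient.
    """
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus = w.copy()
        w_plus.flat[i] += eps
        w_minus = w.copy()
        w_minus.flat[i] -= eps
        grad.flat[i] = (f(w_plus) - f(w_minus)) / (2.0 * eps)
    return grad
```

Note the cost: each weight requires two full forward passes, which is why you should disable the check once backpropagation is verified.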

# Principal Components Analysis

PCA can be used for data compression to *speed up* learning algorithms, and can also be used to visualize feature relations. Basically, in a situation where you have a whole bunch of correlated input variables, PCA helps you combine them into a smaller set of components that carry most of the information (think of having centimeters and inches both as input features: we only need one to get the same information).

Looking at the research paper you linked, it looks like there are only 10 features input into the neural net. But to me, it looks like we could get rid of 2, possibly 3 features! That’s quite a bit for the few features we have. The functions $ \sin $ and $ \cos $ are related to each other, so why do we need both to measure the direction and curvature of the trajectory when we could use just one and *get the same information into the neural network*?

One could also make the argument that the centripetal and tangential accelerations are related to each other, or that the velocity and curvature together rule out the need for the centripetal acceleration since $ a_c = \frac{v^2}{r} $. More analysis of the data would be needed to determine that thoroughly.

Be warned: if not applied correctly, PCA can *reduce* neural network accuracy. PCA is also **not** to be used to handle over-fitting; use regularization for that, even though over-fitting usually occurs when many features are present. There is a nice GitHub repo here covering PCA using Torch.
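To illustrate the centimeters-vs-inches point above, here’s a minimal PCA sketch in plain Python/NumPy (not the linked Torch repo; `pca_reduce` is my own illustrative name):

```python
import numpy as np

def pca_reduce(X, k):
    """Project an n x d data matrix X onto its top-k principal components.

    Returns the reduced n x k data and the k x d component matrix.
    """
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data: rows of Vt are the principal directions,
    # ordered by how much variance they explain.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:k]
    return X_centered @ components.T, components
```

With a centimeters column and an inches column in `X`, the two are perfectly correlated, so a single component reproduces the centered data exactly; the second feature was pure redundancy, just like the $\sin$/$\cos$ pair above.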