Neural Networks III: How would I implement one?
Fórum:
rwurl=https://imgur.com/FC1QvBY
In the third article in the series, I am attempting to keep everything fairly detailed and explain everything I do, as I dive deeply into actual implementation of the Feed-Forward Neural Network. To make the implementation process easier to comprehend, the article is divided into 5 sub-segments:
- Making a simple, Feed-Forward Neural Network structure
- Fixing the Neural Network’s bias/weight initial values
- Adding a learning algorithm to the Neural Network
- Multiple Input and Output Sets for our Neural Network
- Training for handwriting recognition with MNIST data set
Let’s jump right in!
I will use Java in this case, but any other programming language will follow the exact same route of ideas. I’m not going to use any exotic language specific solution, but will try to keep everything as generic as possible.
Making a simple, Feed-Forward Neural Network structure:
The structure of our entire Neural Network supposed to be very simple, as the theoretical example was in the previous article. I presume the entire codebase should cap around 150-200 lines of code, plus the helper utility classes.
First, let’s create a Network.java class, which represent our NN object.
We would need a few integer constant attributes here, which defines our network and doesn’t need to change over the program’s lifetime.
“NETWORK_LAYER_SIZES” Contains the number or neurons in each of our layer.
“NETWORK_SIZE” Contains the number of layers over our NN. We set this number to be associated from the NETWORK_LAYER_SIZES array’s length.
“INPUT_SIZE” Contains the number of input neurons. Input layer is the first in the network, so the first number in the NETWORK_LAYER_SIZES will represent this variable.
“OUTPUT_SIZE” Contains the number of output neurons in our NN. Output is the last layer, so it is represented by the last index in the NETWORK_LAYER_SIZES array.
Now we will declare a couple of variables to work with:
“output” Contains the calculated output value of every neuron over our entire network. This value needs to be as precise as possible to give us accurate results, so we use Double as a datatype. Using a two dimensional array here is sufficient enough to store the given layer, and the given neuron positions as well.
“weights” Stores all the weight data over the network. Note that this needs to be a three dimensional array to store all the necessary positions. The first value would be the given layer, the second is the given neuron, and the third is the previous neuron the weight is connected to. We need this previous neuron data because as we have learned in the previous article, a single neuron on the given layer is connected to all of the previous ones at the adjacent previous layer.
“bias” Is a two dimensional array, similar as the output, because every neuron has one bias variable as well.
Our constructor will receive the NETWORK_LAYER_SIZES value from the initialization method, and all the rest of the constant and variable data can be calculated from it.
For initializing the output, weight and bias values, we assign the NETWORK_SIZE as the first dimension’s size. Also we need to iterate through a FOR loop to initialize all the rest of the elements over the second layer. Note that while every neuron has an output and a bias, the very first layer doesn’t have weights on it (being the input layer), so we start assigning weight values from the second layer in this loop.
We now have a basic initialization constructor, but we need to have a method that will calculate the FEED-FORWARDING values as well. Let’s call it “calculate”. This methods takes an array of doubles as an input parameter, and returns an array of doubles as an output.
The very first IF block check just makes sure that the input array’s size matches our network’s previously set INPUT_SIZE constant value. If it doesn’t we cannot do any calculations.
The next line just passes these input values to the output array’s first element, since it doesn’t need to do any calculations with it.
After that, we have a nested FOR loop that iterates through all the rest of the layers, while iterating through all the neurons as well in the given layer. Here, each of the neurons will apply the summarization with the bias, apply the weight multiplication over each of the previous neurons, iterating through yet another FOR loop. Finally we apply the sigmoid function to this summarized value.
The math for the sigmoid function can be represented by this in Java:
We can quickly make a Main method to test our current version of the network with some random values. Let’s instantiate our network and have the input and output layers contain 5-5 neurons, and have two hidden layers, containing 4 and 3 neurons. We can feed some random values as inputs like 0.2, 0.3, 0.1, 0.2, 0.5. Java has a good amount of integrated helper methods to make our life easier as a programmers, and the “Arrays.toString” can print out all the values on the given array in a nicely formatted, coma separated list.
When we run this program, we notice an issue right away. No matter what input values are we entering, all five of the output values are always exactly 0.5. This occurs because all the weight and bias values are initialized as 0 instead of 1, which basically nullifies all of our calculated summary values, causing the sigmoid function to return us 0.5 every time as well.
Fixing the Neural Network’s bias/weight initial values:
We can change the weight/bias initialization lines at our constructor to start with 1, but I will go one step further and make those values randomized within a certain range. This will give us more flexible control over the network’s behavior right from the start.
I’m creating a helper class called “NetworkTools” to store all the array creating randomizer and related utility methods. These methods are going to become handy as we go on, and are very straight forward to understand. I’ve commented their functions at each of their header:
So going back to the Network constructor and changing the weight and bias initialization lines to produce some random values. These values absolutely doesn’t matter right now, they can be positive or negative also:
Now every time we run the program and make one pass of feed-forwarding calculations, it will give us random output values, proving that the network works as intended. Without a learning algorithm however, the network is fairly useless in this state, so let’s tackle that issue as well.
Adding a learning algorithm to the Neural Network:
So for every given input value combination, we would need to have a “targeted” output value combination as well. With these values we can measure and compare how far or close the network is from the currently calculated output values. Let’s say the previously declared input values (0.2,0.3,0.1,0.2,0.5) we want the network to output ideally (1,0,0,0,0) instead of any other seemingly random numbers.
As explained in the previous article, this is where Backpropagation, our learning mechanism comes in and trying to predict each of the previous adjacent layers supposed weight and bias values, which would produce this final desired output. It will start from the last layer and try to predict what weight and bias combination could produce a value closer to the desired output results, once that is figured out, will jump to the previous adjacent layer and do these modification again and so on until it gets to the starting layer. We start from the last layer because those related weight and bias values have the greatest influence over the final output. Note that by the rules of the network, we can only change these bias and weight values if we want to influence these output values, we cannot change any other values directly.
These differences over the desired output and current output are called the “error signal”. Before making any changes to the weight/bias values, we finish the backpropagation completely and store this error signal from each layer (except the very first input layer).
Intuitively this seems like a very easy task to do. Just subtracting the targeted value from the current value, try nudging the weight/biases over some positive values and measure if we got closer or further from the desired output. If we got closer, then try adding more positive values until we reach the desired output, but if we got further away, start applying negative values instead and keep doing so. This would be exactly true for a simple input range, representing a straightforward curve:
rwurl=https://imgur.com/W77WOyG
But unfortunately, as we get more and more input values, the complexity of the whole Neural Network function gets significantly more complex as well, and predicting the exact “right” position and the path towards it becomes less and less obvious:
rwurl=https://imgur.com/3LgjejG
As you can see, having multiple local minimums can easily “fool” the algorithm, thinking that it goes the right way, but in reality it may just pursuit some local ones, which never going to produce the desired output. You can think of the algorithm as a “heavy ball” for weights. This ball will run down the slope where it started from, and stop at the bottom that it happen to find. Now for instance, if the initial weight value would be between 0 and 0.5 somewhere, and no matter where it would start to adjust, with a naïve “heavy ball” approach, it would slide down to be around ~0.65 and stop there, while we can clearly see that that would always produce a wrong result. This is the primary reason we use randomized values each time we start the training process, instead of setting them to 1, so every time there would be a new chance for these weights to propagate over the proper global minimum values.
Furthermore, to successful tackle this backpropagation, each neuron needs to have an “error signal” and an “output derivative” values, beside their regular output values. Our backpropagation error function, which calculates how close we are from the target output values looks like this:
E = ½ (target-output)^2.
I am not going to go into the details of all the related math in this article, because it’s a fairly large subject of its own. But for anyone interested, I can refer you to Ryan Harris over Youtube. He has excellent tutorial series for backpropagation algorithms, just to help you comprehend the whole concept easier:
rwurl=https://www.youtube.com/watch?v=aVId8KMsdUU
There are many good written articles over the net for it, Wikipedia is a great detailed source as well:
https://en.wikipedia.org/wiki/Backpropagation
Alright, going back to coding. We need to declare these two additional variables and initialize them at the constructor before we can use them:
The feed-forwarding calculation needs to be updated with these variables as well:
Calculating the error needs another method, we will call this “backpropError”, which receives the target output array and does the error calculations for each layer, starting from the last one:
Once we have these values, we can finally update the weights and biases over our network. We will need another method for this. Let’s call this “updateWeightsAndBiases”. It can receive 1 parameter called the “learning rate”. The learning rate is just a ratio value, indication how brave should the learning algorithm be over nudging those values in a positive or negative values. Setting this number to too small will produce a much slower learning periods, but setting it to too high may produce errors or anomalies over the calculations, making the whole learning process taking slower again.
Next, let’s have a method that will make our live easier and connect all these learning functionalities together. Let’s call it “train”. It receives the input array, the output array and a learning rate. It goes over all these previously mentioned calculatios:
We are now set to use our learning algorithm! Let’s change the main method to do so. We can use the same similar network setup, having for instance the input layer as 3 neurons values: 0.1, 0.5, 0.2, while having the output layer of 5 neurons, expecting the output to be for this given input combination to be 0, 1, 0, 0, 0. The FOR loop represents the number of times we are running the learning algorithm and applying the changes to the weights/biases.
Running the program gives us a seemingly far away result from desired:
This is reasonable again, we ran the learning algorithm only once, and as we know already, it’s virtually impossible to “guess” the right weight/bias combination. The network needs many tries and measuring over and over, until it can get closer and closer. Let’s try running the learning algorithm 10 times for instance by changing the FOR loop value:
We can see that the output this time is getting actually closer to the desired values. The ones that should be zero are ~0.2, and the one that should be 1, is almost ~0.8. Ok, let’s try running the learning algorithm, say 10,000 times:
We can see that the values are getting really close to the desired ones, and the more and more we train the network, the more accurate will it actually be. It depends on us how close do we want to get to the desired values before we can safely say that the network knows the right output for the right input, and how much processing power do we want to trade in for the training. You can imagine that over a large network and large amount of input data, a couple of million iterations can take hours or even days.
Multiple Input and Output Sets for our Neural Network:
In most of the cases, we would have a large amount of different input sets and all of them need to produce a given targeted output sets. We could have different variable names for each selected input and desired output arrays, but you can imagine that this would start to get tedious even for a couple of hundred values, not talking about millions.
To tackle this issue, we need to be as efficient as possible and create a new class, which can contain and work with many-many input and their corresponding expected output values. Let’s call it “TrainSet”. I’m briefly going the talk about a few methods in it, because most of them are straight forward to understand just by looking at them.
So we have a constructor that can accept the input and output size, this will represent the number of neurons at the network’s input and output layer.
“addData(input[], expected[])” will expect two parameters, the first one being the currently inserted input array values and the second being currently expected output array values for it. You can call this method as many time you need, and add as many input/expected array combinations to the set, for instance with a simple FOR loop.
“getInput(index)” and “getOutput(index)” will get you back these input/expected array values from the given index point.
“extractBatch” Gives us the ability to extract only a given range of the preloaded set, instead of the all. This can be handy for instance if we have 7000 entries in the set, but we would like to work with only 20 for a given task.
The main method just generates some random input/expected values, stores them in the set with the help of the FOR loop, and outputs them in the end as a demonstration.
Going back to our Network class, let’s create a method called “trainWithSet”. This method will accept a whole trainset to work with, a number of training loops we would like to go through the whole set, and the batchSize we would like to work with:
We need a new main method to handle traning sets, let’s make one:
I’ve made a network with 5 neurons at the input layer, 2 at the output and 3 at both of the hidden layers. For this example this will be sufficient, but this is the time when we need to think of the hidden layer’s size. If we define the number of neurons too small here, the network won’t have enough space to “store” very large number of data combinations, because the new input values that would set the weights and biases, can override the already properly defined ones, resulting in never-ending try and error iterations, that will never produce accurate result for all the desired values.
On the other hand, having too large network size will make the network extremely slow to work with, and making it significantly slower to learn as well.
So we instantiated a new trainset in the main method, having the same number of input and output neuron numbers as our network does. We added 6 data sets, each containing the input set, and the expected output set for it.
Finally, we called each input set entry values (in our example, that’s 6 entries to loop through), and verify out network if it produces the expected output values, after the training.
If we run our program, we can see that the values are all random numbers all over the place:
This is fully expected now that we know how the training process works. Let’s notch up the training to run 1000 times:
We can see that the numbers are converging closer and closer to the expected output values, the more and more training do we make before testing. This is fully expected again. Let’s try 100,000 training iterations anyway:
Yep, as expected, all the tested output values are getting extremely close to the expected values, after this many training iterations.
You can see where we are going with this. Yes, again referring back to the previous article, we can use this to train the network with large number of written single digit numbers, to “guess” our uniquely handwritten sample that we will provide to it.
Finally, training for handwriting recognition with MNIST data set:
MNIST is a large, open and free database of handwritten digit values, and their supposed output labels. It has a training set over 60,000 examples and test set of 10,000 examples:
http://yann.lecun.com/exdb/mnist/
Let’s download the training set of images and training set of labels and store them in /res folder. We can make another file here, called number.png, a 28*28pixel large file that will eventually contain our personally handwritten testable image.
We will make several classes to work with the MNIST dataset values and connect them with our network. First the “MnistDbFile.java” to help us work with the database files:
Next is the “MnistImageFile.java” to work with the images in the database:
Next is the “MnistLabelFile.java” to help us work with the labels over the database:
And finally, the “Mnist.java” file, that will contain our main method to run the training algorithms, connect them with the training sets, and finally test our handwritten number and try to guess its value:
Let’s take a look at the main method in the “Mnist.java” class. The input neuron array size is 784, each one representing a single pixel value from a written number (from 28*28 images). An output neuron array size can be 10, each one representing a single digit number, from 0-9. The two hidden layer’s neuron sizes are set 70 and 35 in this case.
The “createTrainSet” method gives us the ability to choose only a range of batch set’s from the 60,000+ values. Setting this to a reasonable small number helps us reduce the time needed to load up the desired number of training values. In this example, we are using the values from 0 to 5000 from the 60,000.
The “trainData” method does all the training steps. We pass the created neural network to it, then the created trainSet. After that, we pass the number of “epoch” we want to loop through, then the number of “iterations” we would like to loop through. Finally, we pass the number of batch size we would like to work with. We preloaded 5000 images so might as well pass all those, but we can always chose a smaller or bigger number.
By “iteration” we mean the regular of times we loop through the training methods, as we did in the previous examples. On the other hand, “Epoch” in the neural network industry means the number of times rerun all the iteration loops with all the working datasets. You can think of the two terms as nested loops. “Iterations” being the inner loop, while “epoch” being the outer loop. The final results will be more accurate as more and more training iterations and epochs do we make, with more and more unique datasets. Naturally, the larger these numbers get, the slower our whole training process will get also.
The “testMyImage” methods will load our custom handwritten image, and tries to guess its digit value, based on the network’s trained knowledge. Let’s write any number with a mouse, with white brush over a completely black background. These pixels will represent input values from range from 0 (being completely black) to 1 (being completely white). In my case, I wrote number 3:
rwurl=https://imgur.com/00CJKGH
Let’s run the program. As you can see, this significantly takes longer than our previous small examples. We are working with larger amount of data over a larger network. This is a good time to mention, that the training normally only needs to occur once, even if it takes hours or days to setup. Once we properly trained our network, all the weight and bias values can be serialized and saved to a file for instance, so every time when we want to read a new handwritten image and ask the network for its guessed digit value, it can process it almost instantaneously. I will not go in detail of discussing the serialization process in this article, but will assume that the reader at this level does know what I’m talking about, or can figure it out very easily.
Did the network produce an accurate result? If yes, excellent! If not, don’t give up! Keep fiddling with the parameter values, give the network some more storage range in form of neurons, more testing data, or more iteration/epoch loops and retry the results until you manage to get it right.
Got any inspiration where else could you use this technology?