rwurl=https://imgur.com/FC1QvBY
In this second article of the series, I will attempt to:
- Very briefly mention a few examples of the Neural Network types and branches, since there are many.
- Focus on the oldest and simplest one, the “Fully Connected, Feed Forward Neural Network”.
- Explain in detail how it works using intuition and graphs rather than math, to make it as easy as possible to understand.
- Explain the commonly used related terminology.
- Show a real-life example of where and how you could use it.
The first steps towards artificial neural networks were made 75 years ago, and they have become one of the hottest emerging technologies in recent years. The original idea was to produce a working mathematical abstraction of how a biological brain would function in theory, as I've mentioned in the previous article.
You don't have to be a neuroscientist to have at least a very basic understanding of how a biological brain works. It has a large number of brain cells called "neurons", which can form connections called "synapses" between each other, based on the various signals they receive from our body over our lifetime. If you have a similar experience again, similar neurons will fire along those connections, so you will remember the given situation more easily and react to it faster and more accurately.
There are many, many types of Neural Network branches and sub-branches nowadays, all of them trying to get as close as possible to the "perfect" solution for the given idea. The search is still ongoing: we still don't know exactly how the biological brain works, and we don't even know whether that is the best way to achieve intelligence. We may come up with an even more efficient way than our current biological solution, like we did in many other areas of the modern industrial world.
Some of the main artificial neural network (ANN) branches include the "Feed Forward Neural Networks", which are sometimes referred to as "Conventional" neural networks. This is the earliest and oldest solution, based on the idea that the connections "feed forward" from one neuron to the next, so the information travels through them in a simple, intuitive way, usually drawn as starting at the leftmost and ending at the rightmost positions.
The most well-known sub-branches here include the "Convolutional Neural Networks", where the connections between neurons are filtered and grouped to simplify and scale down large amounts of information into abstracted representations. These are generally used for image recognition nowadays. Another well-known sub-branch is the "Fully Connected Neural Networks". Here, each neuron in a given layer is connected to every single neuron in the previous layer.
More modern main branches include the "Recurrent Neural Networks", where connections can form cycles or other similar non-conventional connections between neurons. Some sub-branch examples include "Bi-directional NN" and "Long Short-Term Memory NN". The latter is commonly used for speech recognition.
"Spiking Neural Networks" are sometimes referred to as the third generation of NN. They activate neuron connections in a seemingly random "spiking" way, and are probably the closest representation of the biological brain nowadays.
In this article we are going to deal with (you guessed it) the oldest and simplest one to tackle: the Fully Connected, Feed Forward Neural Network.
Let’s first understand step by step what they consist of and how they work, then later on we can talk about how we can use them.
What is a Fully Connected, Feed Forward Neural Network?
From the highest level, think of it as a calculating box where on one side you can feed in some information, and on the other side you can receive the calculated results:
rwurl=https://imgur.com/A0LWkLq
You can have more than one input and output value; technically any number of inputs or outputs you require, even a very large number:
rwurl=https://imgur.com/subBMJW
If you open the box, you will see all the neurons, with some layers separating them. The very first layer is the “input layer”, and each neuron there will store an input value. Similarly, the very last layer is the “output layer”, and each neuron there will store a final output value:
rwurl=https://imgur.com/s5iHctX
The layers in between are referred to as “hidden layers”. They are called "hidden" because we never see (nor do we really care) what happens in them; we just want them to help figure out the right results for our final “output layer”. There can be several of these hidden layers, but usually a few are enough, since the more of them there are, the slower all the calculations get.
As I’ve said before, in an FCNN each neuron in a given layer is connected to all the neurons in the adjacent previous layer. A connection always joins two adjacent layers, we cannot skip over a layer, so one connection between two neurons would be represented like this:
rwurl=https://imgur.com/x6Wk5VI
Connecting one neuron to all from the previous layer can be represented like this:
rwurl=https://imgur.com/qzTJiqO
After finishing populating all the rest of the connections, the network will look like this, hence the name “Fully connected”:
rwurl=https://imgur.com/0RfKlUy
Let’s break this down some more. Probably the most interesting component here is the “Neuron”. What is it and how does it work?
This can get fairly “mathy”, but I will try to spare you by avoiding the math and just giving the intuitive explanation whenever I can.
If we focus on one neuron, we can see that it receives many values on one side, applies a sum function that adds these values up, and lastly applies a “Sigmoid” function to this sum before releasing the neuron’s calculated output.
rwurl=https://imgur.com/IKKutPg
The sigmoid is an “S” shaped function, as you can see in this graph, and its purpose is to squeeze the sum into a value between 0 and 1. Even if the sum turns out to be a crazily large positive or negative number, it will always be “corrected” back to somewhere between 0 and 1 by this function. We are doing this to simplify working with the data in the network. It’s much simpler to understand numbers close to 1 as “perhaps yes”, and numbers close to 0 as “perhaps no”.
rwurl=https://imgur.com/Lz82eVY
What do I mean by “perhaps”? As I’ve said in the first article, neural networks are by design not meant for super precise calculations like we would expect from normal computers, but for good approximations, and their approximations get better and better as they train more.
Going back to our example, let’s assume we have 3 neurons with output values somewhere between 0 and 1: 0.8, 0.3 and 0.5:
rwurl=https://imgur.com/HpiYEUE
The sum function will add all the received values up.
sum(0.8, 0.3, 0.5) = 0.8 + 0.3 + 0.5 = 1.6
After that, the neuron will apply the Sigmoid function to this value, squeezing the result back to somewhere between 0 and 1, which gives 0.832 as the output value of this neuron:
sigmoid(1.6) = 0.832
This would be the formula for the Sigmoid function, for those who would like to see the math as well:
rwurl=https://imgur.com/p3Su53a
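For those who prefer code to graphs, here is a minimal Python sketch of the single neuron described above: add up the incoming values, then squeeze the sum with the sigmoid, 1 / (1 + e^-x). The function names are made up for illustration, and the two-branch form of the sigmoid is just one way to keep it from overflowing on crazily large negative sums.

```python
import math

def sigmoid(x):
    """Squeeze any real number into the range 0..1 (the "S" curve)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x use an equivalent form that avoids overflow in exp()
    e = math.exp(x)
    return e / (1.0 + e)

def neuron_output(inputs):
    """Add up the incoming values, then apply the sigmoid to the sum."""
    total = sum(inputs)
    return sigmoid(total)

incoming = [0.8, 0.3, 0.5]
print(round(sum(incoming), 3))            # 1.6
print(round(neuron_output(incoming), 3))  # 0.832
```

However large or small the sum gets, the result stays between 0 and 1, which is exactly the “correcting” behavior described above.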
If we continue doing this for every neuron until we reach the final output layer, we will get our final calculated values. But as you have perhaps realized, we would get the same output every time for the same input values. In many practical cases we cannot modify the input values, since we receive them from some other source, and the behavior of the sum and sigmoid functions is fixed as well, but we would still like to influence and shape the output values somehow. Because of this need, the idea of “Weights” was invented: these are basically custom numbers, stored at the connections between the neurons. People usually refer to the connections between neurons simply as “Weights”.
So how do “Weights” come into play?
Each Weight gets multiplied with the output of the neuron it comes from, before that value gets added to the rest in the sum function. So, for example, if all the weights are 1, nothing changes:
rwurl=https://imgur.com/yPchuhO
sum (0.8, 0.3, 0.5) = 0.8*1 + 0.3*1 + 0.5*1 = 1.6
But if we turn these weight values up or down somewhere, the outputted value can be very different:
rwurl=https://imgur.com/idwKgeJ
sum (0.8, 0.3, 0.5) = 0.8*-0.5 + 0.3*2.2 + 0.5*0.4 = -0.4 + 0.66 + 0.2 = 0.46
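The two weighted sums above can be reproduced with a few lines of Python. This is only a sketch; `weighted_sum` is a name made up for illustration.

```python
def weighted_sum(inputs, weights):
    """Multiply every incoming value by the weight on its connection, then add up."""
    return sum(value * weight for value, weight in zip(inputs, weights))

inputs = [0.8, 0.3, 0.5]
print(round(weighted_sum(inputs, [1, 1, 1]), 2))         # 1.6
print(round(weighted_sum(inputs, [-0.5, 2.2, 0.4]), 2))  # 0.46
```

The inputs are identical in both calls; only the weights differ, and that alone moves the sum from 1.6 down to 0.46.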
Now this solution would be almost perfect, but people found out over time that there may still be cases where, even after applying heavy Weight modifications all around the network, the final output values would still not be close to the desired numbers, because of the design of the Sigmoid function. This is how the concept of “Bias” was born.
“Bias” is very similar to a Weight in being a single, arbitrary, modifiable number, but the difference is that it applies only once per neuron, in the Sigmoid function, to translate it left or right.
Imagine a situation where your values after applying the sum function with Weights are converging to near 0. After applying the Sigmoid function, the output will be bumped to somewhere around 0.5, while you would rather keep that value indicating 0. This is where a Bias can be applied: it basically translates the whole sigmoid function in one direction, modifying the output greatly. Let’s see the difference with a bias of -5 or +5:
rwurl=https://imgur.com/Lj0Rk3N
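As we can see, a Bias of -5 (red graph) moves the whole curve 5 units to the left, so a sum near 0 results in a neuron output very close to 1, while with a Bias of 5 (blue graph) the curve moves to the right and the output is very close to 0. (Here the Bias is the amount the curve is translated by. Many libraries instead add the bias directly to the sum before applying the Sigmoid, in which case the signs are flipped.)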
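Here is a small Python sketch of that shift, assuming the Bias is treated as the amount the sigmoid curve is translated along the horizontal axis (negative means left). `shifted_sigmoid` is a made-up name. Note that many libraries use the other convention and add a bias `b` to the sum, which corresponds to `b = -shift` here.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def shifted_sigmoid(total, shift):
    """Sigmoid curve moved `shift` units to the right (to the left if negative)."""
    return sigmoid(total - shift)

total = 0.0  # a weighted sum that landed near 0
for shift in (-5, 0, 5):
    print(shift, round(shifted_sigmoid(total, shift), 3))
# -5 0.993
# 0 0.5
# 5 0.007
```

The same sum of 0 gives three very different outputs, which is the extra freedom the Bias adds on top of the Weights.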
So we’re happy now: with all this flexibility we really could achieve any desired final output values!
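Putting the pieces together, a whole Feed-Forward pass could be sketched like this in Python. Everything specific here is an assumption for illustration: the tiny 3-2-1 layout, the weight values, and the choice to add each bias to the weighted sum (the form most libraries use, equivalent to shifting the sigmoid in the opposite direction).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weights, biases):
    """weights[j][i] is the weight on the connection from input i to neuron j."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(sigmoid(total))
    return outputs

def feed_forward(inputs, layers):
    """Push the values through every layer, left to right."""
    values = inputs
    for weights, biases in layers:
        values = layer_forward(values, weights, biases)
    return values

# Hypothetical network: 3 inputs -> 2 hidden neurons -> 1 output neuron
layers = [
    ([[-0.5, 2.2, 0.4], [1.0, 1.0, 1.0]], [0.0, 0.0]),  # hidden layer
    ([[1.0, -1.0]], [0.0]),                             # output layer
]
output = feed_forward([0.8, 0.3, 0.5], layers)
print(output)
```

Each layer only ever looks at the outputs of the layer right before it, which is why the layers can be processed in a simple loop.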
The basic concept of “Fully Connected, Feed Forward Neural Network” is established. How or where could we use it?
Let’s have a nice practical example: we want it to read handwritten numbers from 0 to 9. How can we approach this problem with our newly set up Neural Network?
First of all, let’s make our objective clear: to turn any of these written “three” images, or any similar ones, into the result “3”:
rwurl=https://imgur.com/lUsf7X9
That also includes turning all these written “four” images into “4”:
rwurl=https://imgur.com/iecL0HO
… and so on.
We would need to turn all these images into input values first.
Let’s take a closer look at one of them. We can see that it is made up of individual pixels, 28 rows * 28 columns of them:
rwurl=https://imgur.com/zAKEqpT
Each of these pixels has a brightness value: some of them are very bright, and some of them are darker. Let’s represent the brightest, “unused” pixels with 0.0 and, as they get darker, use a number closer and closer to 1.0, indicating that there is some sort of “activated” value there:
rwurl=https://imgur.com/CeYu7a6
If we convert all the remaining pixels to number representations as well, and write these values down in one long row, we have all the input neuron values ready to be processed with our NN, all 784 (28*28) of them!
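As a rough Python sketch of this conversion: I am assuming here that the image arrives as 28 rows of 8-bit brightness values, where 255 is the bright “unused” background and 0 is the darkest ink. The function name and the sample image are made up for illustration.

```python
def image_to_inputs(pixels):
    """Flatten rows of 0..255 brightness values into one long row of 0.0..1.0 inputs.

    Bright pixels become values near 0.0, dark pixels values near 1.0.
    """
    return [1.0 - brightness / 255 for row in pixels for brightness in row]

# Hypothetical image: blank white 28x28 canvas with a single black pixel
image = [[255] * 28 for _ in range(28)]
image[14][14] = 0

inputs = image_to_inputs(image)
print(len(inputs))            # 784
print(inputs[0])              # 0.0
print(inputs[14 * 28 + 14])   # 1.0
```

The rows are simply written one after another, so the pixel at row 14, column 14 ends up at position 14*28 + 14 of the long row.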
As for the output neurons, the most straightforward approach is to have one for each desired number (0-9), so 10 neurons in total.
rwurl=https://imgur.com/KkJUhGQ
If we plug the digitized values from the image of a written number three into the input layer, we would ideally like to receive 0.0 on all of the output neurons except the fourth one, which should be 1.0, to clearly represent the number “3”. (This assumes the first neuron represents “0”, the second “1”, and so on until the 10th neuron, which represents “9”.)
rwurl=https://imgur.com/CyWDBrz
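In code, the ideal output for a digit and the way to read an answer back from the 10 output neurons could look like the Python sketch below. Both function names are made up for illustration, and picking the neuron with the largest value is one simple, common way to read the answer.

```python
def desired_output(digit):
    """Ideal output layer: 1.0 on the neuron for `digit`, 0.0 everywhere else."""
    return [1.0 if i == digit else 0.0 for i in range(10)]

def predicted_digit(outputs):
    """Read the answer as the position of the most activated output neuron."""
    return max(range(len(outputs)), key=lambda i: outputs[i])

print(desired_output(3))
# [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(predicted_digit([0.1, 0.0, 0.2, 0.9, 0.1, 0.0, 0.3, 0.0, 0.1, 0.0]))  # 3
```

Since counting starts at 0, the fourth neuron sits at position 3, which conveniently matches the digit it stands for.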
But if we do that, we will find out that the output layer’s neuron values are nowhere near this; they show some utter garbage:
rwurl=https://imgur.com/30oMUWC
That’s because the network hasn’t been “Trained” yet.
“Training” the network means (re)adjusting all the Weights and Biases across the network to values such that, if we plug in the said input values, the network produces a calculated output as close as possible to the desired ideal output.
We could try to manually adjust any of the Weights or Biases to some positive or negative values, but we would quickly realize that even with a modest number of neurons there are so many combinations that it is not humanly possible to do so.
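To get a feeling for how many knobs there are, here is a short Python sketch that counts them. The 784 inputs and 10 outputs come from the example above; the two hidden layers of 16 neurons each are purely an assumption for illustration.

```python
def count_parameters(layer_sizes):
    """Count the weights and biases of a fully connected network."""
    # Every neuron connects to every neuron in the previous layer
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    # Every neuron after the input layer has one bias
    biases = sum(layer_sizes[1:])
    return weights, biases

weights, biases = count_parameters([784, 16, 16, 10])
print(weights, biases, weights + biases)  # 12960 42 13002
```

Even this small network has about thirteen thousand adjustable numbers, each of which can take any positive or negative value.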
This is where the concept of “Backpropagation” comes in extremely handy.
Backpropagation is one of the key features of neural networks and is a form of learning algorithm. It’s probably also one of their most confusing concepts. Simplifying as much as possible, the basic idea is to take that utter garbage output from the network, compare it to our desired output, and see how far each of the output values is from the desired one.
This distance is usually called the “Error” (you may also see it called the “loss” or “cost”), and once we have it, the algorithm adjusts the weights and biases accordingly, starting from the rightmost layer and working its way back until it reaches the input layer. We start from the back because the final output is at the back, and the Weights and Biases that affect that layer directly sit just before it, and we then apply the same idea layer by layer.
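The comparison step can be sketched in Python like this. Summing the squared differences is one common way to boil the comparison down to a single number; it is an assumption here, not the only option. The “garbage” values are made up.

```python
def output_errors(actual, desired):
    """How far each output neuron is from its desired value."""
    return [a - d for a, d in zip(actual, desired)]

def total_error(actual, desired):
    """Sum of squared differences; 0.0 would mean a perfect match."""
    return sum(e ** 2 for e in output_errors(actual, desired))

desired = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # a written "3"
garbage = [0.6, 0.4, 0.7, 0.2, 0.5, 0.9, 0.1, 0.3, 0.8, 0.5]  # made-up untrained output

print(round(total_error(garbage, desired), 2))  # 3.7
print(total_error(desired, desired))            # 0.0
```

Squaring makes every difference positive, so an output that is too high cannot cancel out one that is too low.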
After the Backpropagation is finished, we do the Feed-Forward step again and see whether we got closer to the desired values, by comparing the actual and the desired numbers again. Proper training can take hundreds, or even millions, of Feed-Forward and Backpropagation steps, until the network is conditioned to give us numbers as close as possible to the desired ones.
We then need to repeat this training process for every new set of input values, while making sure that the network still gives valid results for the previous input values as well. You can begin to see that properly training a network over a large number of input values, so that it always outputs values accurately close to the desired ones, is extremely hard to achieve and usually takes a very large number of training steps. But this is what the whole industry is working hard on, discovering creative and different ways to approach this difficult issue.
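For the curious, here is a compact Python sketch of that Feed-Forward and Backpropagation loop. It is a toy under stated assumptions, not a production recipe: the network is a tiny hypothetical 3-4-2 one, the error is the sum of squared differences, the adjustments use plain gradient descent with a made-up learning rate of 0.5, biases are added to the weighted sum, and it trains on one single example only, whereas real training cycles over many.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, layers):
    """Return the outputs of every layer, starting with the input layer."""
    activations = [inputs]
    for weights, biases in layers:
        prev = activations[-1]
        activations.append([
            sigmoid(sum(w * x for w, x in zip(row, prev)) + b)
            for row, b in zip(weights, biases)
        ])
    return activations

def train_step(inputs, desired, layers, rate=0.5):
    """One Feed-Forward pass, then one Backpropagation pass.

    Returns the squared error measured before the adjustment.
    """
    activations = forward(inputs, layers)
    outputs = activations[-1]
    # Error signal at the output layer: distance times the sigmoid's slope
    deltas = [(a - d) * a * (1 - a) for a, d in zip(outputs, desired)]
    # Walk from the rightmost layer back towards the input layer
    for index in reversed(range(len(layers))):
        weights, biases = layers[index]
        prev = activations[index]
        # Error signal for the layer before this one, computed with the
        # weights as they were before this step's adjustment
        prev_deltas = [
            sum(weights[j][i] * deltas[j] for j in range(len(deltas)))
            * prev[i] * (1 - prev[i])
            for i in range(len(prev))
        ] if index > 0 else []
        for j, delta in enumerate(deltas):
            for i in range(len(prev)):
                weights[j][i] -= rate * delta * prev[i]
            biases[j] -= rate * delta
        deltas = prev_deltas
    return sum((a - d) ** 2 for a, d in zip(outputs, desired))

def random_layer(n_inputs, n_neurons):
    """Untrained layer: random weights, zero biases."""
    weights = [[random.uniform(-1, 1) for _ in range(n_inputs)]
               for _ in range(n_neurons)]
    return weights, [0.0] * n_neurons

random.seed(0)  # fixed seed so the sketch is repeatable
layers = [random_layer(3, 4), random_layer(4, 2)]
inputs, desired = [0.8, 0.3, 0.5], [1.0, 0.0]

first_error = train_step(inputs, desired, layers)
for _ in range(1000):
    last_error = train_step(inputs, desired, layers)
print(first_error, last_error)
```

The second printed number should be far smaller than the first: the network has been conditioned for this one input. Keeping the error low for many different inputs at the same time is the hard part described above.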