Please draw a digit with your mouse or with your finger on the canvas! In the table with the digits you will see how this network classifies what you draw.
Please check out all parts in this series describing a neural network from scratch, covering both theory and practical examples as well as a complete Java implementation:
Also, please check out all parts in this series:
Data augmentation is all about fabricating more data from the data you actually have: adding variance without losing the information the data carries. Doing this reduces the risk of overfitting, and the accuracy on unseen data can generally be improved.
In the specific case of images as input data (as is the case in the MNIST dataset) augmentation can for instance be:
I decided to go for affine transformations. I’ve used them many times before in computer graphics and know they are very straightforward.
Affine transformations map one affine space to another. To be a bit more concrete: an affine transformation can transform a specific coordinate via operations such as rotation, scaling and translation, and tell what that coordinate would be after these changes. Affine transformations can be represented as matrices and combined, such that a series of transformations can still be expressed as a single matrix.
For instance (and here described in two dimensions) we could compose a transformation M like this:
$$
M = T_{(x, y)}R_{\theta} S_{(x, y)}T_{(-x, -y)}\\
$$
… where:
When we have this M it is just a matter of multiplying it with input coordinates to get their new location in the target space as defined by M. Conversely we could multiply coordinates from the target space with the inverse of M to go back to the original space.
In Java creating these affine transformation matrices is as simple as this:
// Center the input image on origin
AffineTransform m = getTranslateInstance(14, 14);

// Randomly rotate a bit
m.rotate(toRadians(rnd() * 20));

// Randomly rescale somewhat
m.scale(rnd() * 0.25 + 1, rnd() * 0.25 + 1);

// Restore it from origin with a slight random translation
m.translate(-14 + (rnd() * 3), -14 + (rnd() * 3));
This is all we need to transform coordinates in the original MNIST digits to coordinates on new fabricated digits, slightly modified versions of the originals to train the network on.
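As a tiny self-contained illustration of this forward/inverse mapping, here is a sketch using java.awt.geom.AffineTransform. The class name and the specific transform values are mine, purely for demonstration:

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.NoninvertibleTransformException;
import java.awt.geom.Point2D;

public class AffineRoundTrip {

    // Compose M once: rotate 90 degrees about the origin, then translate by (10, 0).
    private static final AffineTransform M = buildM();

    private static AffineTransform buildM() {
        AffineTransform m = AffineTransform.getTranslateInstance(10, 0);
        m.rotate(Math.toRadians(90)); // concatenated: applied to points before the translation
        return m;
    }

    /** Forward mapping: source coordinate -> target space. */
    public static Point2D forward(double x, double y) {
        return M.transform(new Point2D.Double(x, y), null);
    }

    /** Inverse mapping: target coordinate -> back to the source space. */
    public static Point2D backward(Point2D p) {
        try {
            return M.inverseTransform(p, null);
        } catch (NoninvertibleTransformException e) {
            throw new IllegalStateException(e); // M is invertible, cannot happen here
        }
    }

    public static void main(String[] args) {
        Point2D dst = forward(1, 0);   // (1,0) -rotate-> (0,1) -translate-> (10,1)
        Point2D back = backward(dst);  // approximately (1, 0) again
        System.out.println(dst + " / " + back);
    }
}
```

Note how AffineTransform concatenates: operations added later are applied to the point first, which is why the code in the article can read top-down like the matrix product.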
The method as a whole, which is a mutator on the DigitData entity (see part 5), looks like this:
/**
 * Creates a slightly modified version of the original digit.
 */
public void transformDigit() {
    double[] dst = new double[data.length];
    boolean potentialOverspill;
    int overspillCounter = 0;

    do {
        potentialOverspill = false;

        AffineTransform m = getTranslateInstance(14, 14);
        m.rotate(toRadians(rnd() * 20));
        m.scale(rnd() * 0.25 + 1, rnd() * 0.25 + 1);
        m.translate(-14 + (rnd() * 3), -14 + (rnd() * 3));

        Point2D wPoint = new Point2D.Double();
        Point2D rPoint = new Point2D.Double();

        looping:
        for (int y = 0; y < 28; y++) {
            for (int x = 0; x < 28; x++) {
                wPoint.setLocation(x, y);
                m.inverseTransform(wPoint, rPoint);
                clamp(rPoint, 0, 28);

                // integer part
                int xi = (int) rPoint.getX();
                int yi = (int) rPoint.getY();

                // fractional part
                double xf = rPoint.getX() - xi;
                double yf = rPoint.getY() - yi;

                double interpolatedValue =
                        (1 - xf) * (1 - yf) * pixelValue(xi, yi, data) +
                        (1 - xf) * yf * pixelValue(xi, yi + 1, data) +
                        xf * (1 - yf) * pixelValue(xi + 1, yi, data) +
                        xf * yf * pixelValue(xi + 1, yi + 1, data);

                if (interpolatedValue > 0 && onBorder(x, y)) {
                    potentialOverspill = true;
                    overspillCounter++;
                    break looping;
                }

                dst[y * 28 + x] = interpolatedValue;
            }
        }
    } while (potentialOverspill && overspillCounter < 5);

    if (overspillCounter < 5)
        transformedData = dst;
}
As you can see, the code above also features bilinear interpolation, which makes the transformed result smoother.
There is also a check to see whether the resulting digit was transformed in such a way that some part of it spills outside the target 28×28 array. When that seems to be the case we discard the change and try again. If we cannot reach a good transformed digit within 5 retries we skip the transformation for this round and fall back on the original digit. This very rarely happens, and next round we might get luckier and get a valid transformation.
And speaking of rounds: how often do we mutate the input data? After each epoch of training I transform the entire dataset like this:
trainData.parallelStream().forEach(DigitData::transformDigit);
This way the neural network never sees the same digit twice (with the rare exception of a few bad-luck transformation attempts as described above). In other words: by data augmentation we have, in a sense, created an infinite dataset. Of course this is not true in a strict mathematical sense, but for all our training purposes the variance of the distribution we picked for our random affine transformations is absolutely sufficient to create a stream of unique data.
The error rate of the small neural network I had settled on (only 50 hidden neurons, see the previous article) dropped by about one percentage point on average, and the network can now steadily be trained to an error rate in the range 1.7%–2%.
Small change. Great impact.
Please try out how well one of these trained networks performs on a small MNIST Playground I created.
Also, if you want to take a closer look, the code is here: https://bitbucket.org/tobias_hill/mnistexample/src/Augmentation/
This is the fifth and last part in this series of articles:
The MNIST database contains handwritten digits and has a training set of 60,000 samples and a test set of 10,000 samples. The digits are centered in a fixed-size image of 28×28 pixels.
This dataset is super convenient for anyone who just wants to explore their machine learning implementation. It requires minimal effort on preprocessing and formatting.
All code for this little experiment is available here. This project is of course dependent on the neural network implementation too.
Reading the datasets in Java is straightforward. The data format is described on the MNIST pages.
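For readers who want to parse the files themselves: the IDX format used by MNIST is just a big-endian integer header followed by raw unsigned pixel bytes. Here is a minimal sketch; the helper names are mine (not the project's FileUtil), and the 2×2 demo "file" is synthetic:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

public class IdxReader {

    /** Reads an IDX image file: four big-endian header ints, then raw unsigned pixel bytes. */
    public static double[][] readImages(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);

        int magic = data.readInt();            // 2051 for image files (2049 for label files)
        if (magic != 2051)
            throw new IOException("Not an IDX image file, magic = " + magic);

        int count = data.readInt();
        int rows = data.readInt();
        int cols = data.readInt();

        double[][] images = new double[count][rows * cols];
        for (int i = 0; i < count; i++)
            for (int p = 0; p < rows * cols; p++)
                images[i][p] = data.readUnsignedByte() / 255.0;   // scale to [0, 1]
        return images;
    }

    /** Builds a tiny synthetic "file" (one 2x2 image) and parses it. */
    public static double[][] demo() {
        ByteBuffer buf = ByteBuffer.allocate(16 + 4);             // header + 4 pixels
        buf.putInt(2051).putInt(1).putInt(2).putInt(2);
        buf.put((byte) 0).put((byte) 255).put((byte) 255).put((byte) 0);
        try {
            return readImages(new ByteArrayInputStream(buf.array()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()[0][1]);   // the 255 byte, scaled to 1.0
    }
}
```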
Each digit is stored in a class called DigitData. That class of course contains the data (i.e. the input) and the label (i.e. the expectation). I also added a small trick to enhance the toString() of the DigitData class:
/**
 * Simply converts grayscale to an ascii shade.
 */
private char toChar(double val) {
    return " .:-=+*#%@".charAt(min((int) (val * 10), 9));
}
Calling toString() actually gives an ascii-shaded output of the data:
This has been convenient when examining which digits the network confuses with other digits. We will get back to that at the end of this article.
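As a standalone, runnable variant of the trick (my own class; note that the ramp needs ten characters so that index 9 from min((int)(val * 10), 9) is always in range):

```java
public class AsciiShade {

    /** Grayscale in [0, 1] mapped to one of ten ascii shades, darkest last. */
    public static char toChar(double val) {
        return " .:-=+*#%@".charAt(Math.min((int) (val * 10), 9));
    }

    /** Renders a row of grayscale values as a shaded string. */
    public static String render(double[] row) {
        StringBuilder sb = new StringBuilder();
        for (double v : row)
            sb.append(toChar(v));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(render(new double[]{0.0, 0.15, 0.5, 0.85, 1.0}));
    }
}
```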
The boundary layers are given by our data:
The hidden layers require more exploring and testing. I tried a few different network layouts and realized that it was not that hard to get good accuracy with a monstrous net. As a consequence I decided to keep the number of hidden neurons down, just to see whether I could still get decent results. I figured that a constrained setup would teach me more about how to get that extra percentage of accuracy. I decided to set the maximum number of hidden neurons to 50 and started exploring what I could attain.
I had quite good results early on with a funnel-like structure with only 2 hidden layers and kept exploring that: for instance the first layer with 36 neurons and the second with 14.
784 input ⇒ 36 hidden ⇒ 14 hidden ⇒ 10 output neurons
After some trial and error I decided to use two activation functions which have not been presented in previous articles in this series: the Leaky ReLU and the Softmax.
Leaky ReLU is a variant of ReLU. The only difference is that it is not totally flat for negative inputs; instead it has a small positive gradient there.
It was initially designed to work around the problem that the zero-gradient part of ReLU might shut down neurons. Also see this question on Quora for details on when and why you might want to try Leaky ReLU instead of ReLU.
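A minimal sketch of what Leaky ReLU and its derivative look like in code. The slope 0.01 for negative inputs is a common default and an assumption here, not necessarily the value used in the series:

```java
import java.util.function.DoubleUnaryOperator;

public class LeakyReLUDemo {

    // Slope for negative inputs; 0.01 is a common default, the series may use another value.
    static final double ALPHA = 0.01;

    // Forward pass: identity for positive inputs, a small slope for negative ones ...
    public static final DoubleUnaryOperator FN = x -> x > 0 ? x : ALPHA * x;

    // ... and its derivative for backpropagation: never exactly zero.
    public static final DoubleUnaryOperator DFN = x -> x > 0 ? 1 : ALPHA;

    public static void main(String[] args) {
        System.out.println(FN.applyAsDouble(5.0));    // 5.0
        System.out.println(DFN.applyAsDouble(-5.0));  // 0.01
    }
}
```

Because the derivative is ALPHA rather than 0 on the negative side, gradient updates keep flowing even for neurons whose input is negative.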
Softmax is an activation function typically used in the output layer when doing classification. The nice thing about softmax is that it gives you a categorical probability distribution: it will tell you the probability for each class in the output layer. So, suppose we send the digit data representing the digit 7 through the network; it might output something like:
Class |   P
------+-------
  0   | 0.002
  1   | 0.011
  2   | 0.012
  3   | 0.002
  4   | 0.001
  5   | 0.001
  6   | 0.002
  7   | 0.963
  8   | 0.001
  9   | 0.005
------+-------
      Σ = 1.0
As you can see the probability is highest for digit 7. Also note that the probabilities sum to 1.
Softmax can of course also be used with a threshold, so that if none of the classes gets a probability above that threshold we can say that the network did not recognize anything in the input data.
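A small sketch of softmax with such a threshold. This is my own standalone array-based code, not the Vec-based implementation from the project:

```java
import java.util.Arrays;

public class SoftmaxClassify {

    /** Numerically stable softmax: shift by the max element before exponentiation. */
    public static double[] softmax(double[] logits) {
        double max = Arrays.stream(logits).max().orElse(0);
        double[] p = Arrays.stream(logits).map(a -> Math.exp(a - max)).toArray();
        double sum = Arrays.stream(p).sum();
        for (int i = 0; i < p.length; i++)
            p[i] /= sum;
        return p;
    }

    /** Index of the most probable class, or -1 if nothing clears the threshold. */
    public static int classify(double[] logits, double threshold) {
        double[] p = softmax(logits);
        int best = 0;
        for (int i = 1; i < p.length; i++)
            if (p[i] > p[best]) best = i;
        return p[best] >= threshold ? best : -1;
    }

    public static void main(String[] args) {
        System.out.println(classify(new double[]{0.1, 0.2, 4.0}, 0.5));      // clear winner: 2
        System.out.println(classify(new double[]{1.0, 1.0, 1.0}, 0.5));      // all ~0.33: -1
    }
}
```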
The bad thing about Softmax is that it is not as simple as other activation functions. It gets a bit uglier in both the forward and the backpropagation pass. It actually broke my activation abstraction: until the introduction of softmax I could define an Activation as only the function itself for the forward pass, fn(), and the derivative of that function for backpropagation, dFn().
The reason is that the dFn() of Softmax (\(\frac{\partial o}{\partial i}\)) can utilize the last factor from the chain rule (\(\frac{\partial C}{\partial o}\)) to make the calculation clearer/easier. Hence the Activation abstraction had to be extended to deal with calculating the \(\frac{\partial C}{\partial i}\) product.
In all other activation functions this is simply a multiplication:
// Also when calculating the error change rate in terms of the input (dCdI)
// it is just a matter of multiplying, i.e. ∂C/∂I = ∂C/∂O * ∂O/∂I.
public Vec dCdI(Vec out, Vec dCdO) {
    return dCdO.elementProduct(dFn(out));
}
But in softmax it looks like this (see the dCdI() function below):
// -------------------------------------------------------------------------
// Softmax needs a little extra love since element output depends on more
// than one component of the vector. Simple element mapping will not suffice.
// -------------------------------------------------------------------------
public static Activation Softmax = new Activation("Softmax") {
    @Override
    public Vec fn(Vec in) {
        double[] data = in.getData();
        double sum = 0;
        double max = in.max();    // Trick: translate the input by largest element to avoid overflow.
        for (double a : data)
            sum += exp(a - max);

        double finalSum = sum;
        return in.map(a -> exp(a - max) / finalSum);
    }

    @Override
    public Vec dCdI(Vec out, Vec dCdO) {
        double x = out.elementProduct(dCdO).sumElements();
        Vec sub = dCdO.sub(x);
        return out.elementProduct(sub);
    }
};
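For reference, this is the standard derivation behind that dCdI(). The Jacobian of softmax is

$$
\frac{\partial o_i}{\partial i_j} = o_i(\delta_{ij} - o_j)
$$

and multiplying it with \(\frac{\partial C}{\partial o}\) per the chain rule collapses nicely:

$$
\frac{\partial C}{\partial i_j} = \sum_i \frac{\partial C}{\partial o_i}\, o_i (\delta_{ij} - o_j) = o_j\Big(\frac{\partial C}{\partial o_j} - \sum_i \frac{\partial C}{\partial o_i}\, o_i\Big)
$$

The parenthesis is exactly dCdO.sub(x) with \(x = \sum_i \frac{\partial C}{\partial o_i} o_i\), and the outer factor \(o_j\) is the element-wise product with out.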
Read more about Softmax in general here and how the derivative of Softmax is calculated here.
So, all in all, my typical setup is:
NeuralNetwork network =
    new NeuralNetwork.Builder(28 * 28)
        .addLayer(new Layer(36, Leaky_ReLU))
        .addLayer(new Layer(14, Leaky_ReLU))
        .addLayer(new Layer(10, Softmax))
        .initWeights(new Initializer.XavierNormal())
        .setCostFunction(new CostFunction.Quadratic())
        .setOptimizer(new Nesterov(0.02, 0.87))
        .l2(0.00010)
        .create();
With that in place, let’s start training!
The training loop is simple.
The code looks like this:
List<DigitData> trainData = FileUtil.loadImageData("train");
List<DigitData> testData = FileUtil.loadImageData("t10k");

int epoch = 0;
double errorRateOnTrainDS;
double errorRateOnTestDS;
StopEvaluator evaluator = new StopEvaluator(network, 10, null);
boolean shouldStop = false;

long t0 = currentTimeMillis();

do {
    epoch++;

    shuffle(trainData, getRnd());

    int correctTrainDS = applyDataToNet(trainData, network, true);
    errorRateOnTrainDS = 100 - (100.0 * correctTrainDS / trainData.size());

    if (epoch % 5 == 0) {
        int correctOnTestDS = applyDataToNet(testData, network, false);
        errorRateOnTestDS = 100 - (100.0 * correctOnTestDS / testData.size());

        shouldStop = evaluator.stop(errorRateOnTestDS);

        double epocsPerMinute = epoch * 60000.0 / (currentTimeMillis() - t0);
        log.info(format("Epoch: %3d - Train error rate: %6.3f %% - Test error rate: %5.2f %% - Epocs/min: %5.2f",
                epoch, errorRateOnTrainDS, errorRateOnTestDS, epocsPerMinute));
    } else {
        log.info(format("Epoch: %3d - Train error rate: %6.3f %%", epoch, errorRateOnTrainDS));
    }
} while (!shouldStop);
And the run through the entire dataset is done in batches and in parallel (as already shown in the last article):
/**
 * Runs the entire dataset <code>data</code> through the network.
 * If <code>learn</code> is true the network will learn from the data.
 */
private static int applyDataToNet(List<DigitData> data, NeuralNetwork network, boolean learn) {
    final AtomicInteger correct = new AtomicInteger();

    for (int i = 0; i <= data.size() / BATCH_SIZE; i++) {
        getBatch(i, data).stream().parallel().forEach(img -> {
            Vec input = new Vec(img.getData());
            Vec wanted = new Vec(img.getLabelAsArray());

            Result result = learn ?
                    network.evaluate(input, wanted) :
                    network.evaluate(input);

            if (result.getOutput().indexOfLargestElement() == img.getLabel())
                correct.incrementAndGet();
        });

        network.updateFromLearning();
    }
    return correct.get();
}
The only missing piece from the loop above is to know when to stop …
As mentioned before (when introducing L2 regularization) we really want to avoid overfitting the network. When that happens the accuracy on the training data might still improve while the accuracy on the test data starts to decline. We keep track of that with the StopEvaluator. This utility class keeps a moving average of the error rate on the test data to detect when it has definitively started to decline. It also stores what the network looked like at the best test run (trying to find the peak of the test runs).
The code looks like this:
/**
 * The StopEvaluator keeps track of whether it is meaningful to continue
 * to train the network or if the error rate of the test data seems to
 * be on the rise (i.e. when we might be on our way to overfit the network).
 *
 * It also keeps a copy of the best neural network seen.
 */
class StopEvaluator {

    private int windowSize;
    private final NeuralNetwork network;
    private Double acceptableErrorRate;
    private final LinkedList<Double> errorRates;

    private String bestNetSoFar;
    private double lowestErrorRate = Double.MAX_VALUE;
    private double lastErrorAverage = Double.MAX_VALUE;

    public StopEvaluator(NeuralNetwork network, int windowSize, Double acceptableErrorRate) {
        this.windowSize = windowSize;
        this.network = network;
        this.acceptableErrorRate = acceptableErrorRate;
        this.errorRates = new LinkedList<>();
    }

    // See if there is any point in continuing ...
    public boolean stop(double errorRate) {
        // Save config of neural network if error rate is the lowest we have seen
        if (errorRate < lowestErrorRate) {
            lowestErrorRate = errorRate;
            bestNetSoFar = network.toJson(true);
        }

        if (acceptableErrorRate != null && lowestErrorRate < acceptableErrorRate)
            return true;

        // update moving average
        errorRates.addLast(errorRate);
        if (errorRates.size() < windowSize) {
            return false;   // never stop if we have not filled the moving average
        }

        if (errorRates.size() > windowSize)
            errorRates.removeFirst();

        double avg = getAverage(errorRates);

        // see if we should stop
        if (avg > lastErrorAverage) {
            return true;
        } else {
            lastErrorAverage = avg;
            return false;
        }
    }

    public String getBestNetSoFar() {
        return bestNetSoFar;
    }

    public double getLowestErrorRate() {
        return lowestErrorRate;
    }

    private double getAverage(LinkedList<Double> list) {
        return list.stream().mapToDouble(Double::doubleValue).average().getAsDouble();
    }
}
With the network layout shown above (50 hidden neurons, Nesterov and L2 configured roughly as above) I consistently train the network to an error rate of about 2.5%.
The record run had an error rate of only 2.24%, but I don’t think that is relevant unless the majority of my runs would be around that rate. The reason is: tuning hyperparameters back and forth, trying to beat a record on the test data (although fun), could just as well mean that we are overfitting to the test data. In other words: I might have found a lucky mix of parameters that happens to perform very well but still is not that good on unseen data^{1}.
So, let’s have a look at a few of the digits that the network typically confuses:
Digit    | Label | Neural net
---------+-------+-----------
(image)  |   5   |     6
(image)  |   5   |     3
(image)  |   2   |     6
These are just a few examples; several other border cases exist too. It makes sense that the network finds it quite hard to classify these, and the reason ties back to what we discussed in the closing section of the first part in this series: some of these points in the 784-dimensional space are simply too far away from their group of digits and possibly closer to some other group of points/digits. Or in natural language: they look more like some other digit than the one they are supposed to represent.
This is not to say that we don’t care about those hard cases. Quite the contrary. The world is a truly ambiguous place and machine learning needs to be able to handle ambiguity and nuance. You know, to be less machine-like (“Affirmative”) and more human (“No problemo”). But what I think this indicates is that the solution is not always in the data at hand. Quite often the context gives the clues needed for a human to correctly classify (or understand) something, such as a badly written digit. But that is another big and fascinating topic which falls outside this introduction.
This was the fifth and final part in this series. I learned a lot while writing this and I hope you learned something by reading it.
Feel free to reach out. Feedback is welcome!
These are the resources I have found to be better than most others. Please dive in if you want a slightly deeper understanding:
This article has also been published in the Medium publication Towards Data Science. If you liked what you’ve just read, please head over to the Medium article and give it a few claps; it will help others find it too. And of course I hope you spread the word in any other way you see fit. Thanks!
Footnotes:
This is the fourth part in a series of articles:
Until now we have seen how weights and biases can be automatically adjusted via backpropagation to let the network improve. So what is a good starting value for these parameters?
Generally you are quite fine just randomizing the parameters between -0.5 and 0.5 with a mean of 0.
However, on deep networks, research has shown that we can get improved convergence if we let the values for the weights depend inversely on the number of neurons in the connected layer (or sometimes the two connected layers). In other words: many neurons ⇒ initialize weights closer to 0; fewer neurons ⇒ higher variance can be allowed. You can read more about why here and here.
I have implemented a few of the more popular ones and made it possible to inject them as a strategy when creating the neural network:
NeuralNetwork network =
    new NeuralNetwork.Builder(4)
        .addLayer(new Layer(6, Sigmoid, 0.5))
        .addLayer(new Layer(14, Softplus, 0.5))
        .initWeights(new Initializer.XavierNormal())
        .setCostFunction(new CostFunction.Quadratic())
        .setOptimizer(new GradientDescent(0.01))
        .create();
The implemented strategies for initialization are: XavierNormal, XavierUniform, LeCunNormal, LeCunUniform, HeNormal, HeUniform and Random. You can also specify the weights for a layer explicitly by providing the weight matrix when the layer is being created.
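As a sketch of one of these strategies, Xavier (Glorot) normal initialization draws zero-mean gaussians whose standard deviation shrinks with the number of connected neurons. This is standalone code with my own names; the project's Initializer.XavierNormal may differ in details such as the weight matrix orientation:

```java
import java.util.Random;

public class XavierInit {

    private static final Random RND = new Random();

    /**
     * Xavier/Glorot normal initialization: zero-mean gaussian with
     * stddev = sqrt(2 / (fanIn + fanOut)). More neurons -> smaller weights,
     * exactly the inverse dependence described in the text.
     */
    public static double[][] xavierNormal(int fanIn, int fanOut) {
        double std = Math.sqrt(2.0 / (fanIn + fanOut));
        double[][] w = new double[fanOut][fanIn];   // assumption: one row per output neuron
        for (int r = 0; r < fanOut; r++)
            for (int c = 0; c < fanIn; c++)
                w[r][c] = RND.nextGaussian() * std;
        return w;
    }

    public static void main(String[] args) {
        double[][] w = xavierNormal(784, 36);
        System.out.println("First weight: " + w[0][0]);
    }
}
```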
In the last article we briefly touched on the fact that we can feed samples through the network for as long as we wish and then, at our own discretion, choose when to update the weights. The separation in the API between the evaluate() method (which gathers learning/impressions) and the updateFromLearning() method (which lets that learning sink in) is there to allow exactly that. It leaves it open to the user of the API which of the following strategies to use:
In Stochastic Gradient Descent we update the weights and biases after every evaluation. Remember from part 2 that the cost function depends on the input and the expectation: \(C = C(W, b, S_σ, x, exp)\). As a consequence the cost landscape will look slightly different for every sample, and the negative gradient will point in the direction of steepest descent in that per-sample-unique landscape.
Now, let’s assume we have the total cost landscape which would be the averaged cost function for the entire dataset. All samples. Within that a Stochastic Gradient Descent would look like a chaotic walk.
As long as the learning rate is reasonably small that walk would nevertheless, on average, descend towards a local minimum.
The problem is that SGD is hard to parallelize efficiently: each update of the weights has to be part of the next calculation.
The opposite of Stochastic Gradient Descent is the Batch Gradient Descent (or sometimes simply Gradient Descent). In this strategy all training data is evaluated before an update to the weights is made. The average gradient for all samples is calculated. This will converge towards a local minimum as carefully calculated small descending steps. The drawback is that the feedback loop might be very long for big datasets.
Mini-batch gradient descent is a compromise between SGD and BGD: batches of N samples are run through the network before the weights are updated. Typically N is between 16 and 256. This way we get something that is quite stable in its descent (although not optimal):
Only some minor changes had to be made to the code base to allow parallel execution of the evaluate() method. As always, state is the enemy of parallelization. In the feed-forward phase the only state to treat with caution is the fact that the out-value from each neuron is stored per neuron. Competing threads would surely overwrite this data before it had been used in backpropagation.
By using thread confinement this could be handled.
// Before:
private Vec out;

// After change:
private ThreadLocal<Vec> out = new ThreadLocal<>();
Also, the part where the deltas for the weights and biases are accumulated and finally applied had to be synchronized:
/**
 * Add upcoming changes to the Weights and Biases.
 * NOTE! This does not mean that the network is updated.
 */
public synchronized void addDeltaWeightsAndBiases(Matrix dW, Vec dB) {
    deltaWeights.add(dW);
    deltaBias = deltaBias.add(dB);
}

public synchronized void updateWeightsAndBias() {
    if (deltaWeightsAdded > 0) {
        Matrix average_dW = deltaWeights.mul(1.0 / deltaWeightsAdded);
        optimizer.updateWeights(weights, average_dW);

        deltaWeights.map(a -> 0);   // Clear
        deltaWeightsAdded = 0;
    }

    if (deltaBiasAdded > 0) {
        Vec average_bias = deltaBias.mul(1.0 / deltaBiasAdded);
        bias = optimizer.updateBias(bias, average_bias);

        deltaBias = deltaBias.map(a -> 0);   // Clear
        deltaBiasAdded = 0;
    }
}
With that in place it is perfectly fine to use as many cores as desired (or available) when processing training data.
A typical way to parallelize mini batch execution on the neural net looks like this:
for (int i = 0; i <= data.size() / BATCH_SIZE; i++) {
    getBatch(i, data).parallelStream().forEach(img -> {
        Vec input = new Vec(img.getData());
        Vec expected = new Vec(img.getLabel());
        network.evaluate(input, expected);
    });

    network.updateFromLearning();
}
Especially note the construct for spreading the processing of all samples within the batch as a parallel stream in line 2.
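The getBatch() helper itself is not shown in the article. Here is a reasonable sketch of what it could look like, consistent with the <= loop bound above; the names and the BATCH_SIZE value are my own assumptions:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class Batching {

    public static final int BATCH_SIZE = 16;

    /** Returns the i:th slice of the data, at most BATCH_SIZE elements. */
    public static <T> List<T> getBatch(int i, List<T> data) {
        int from = i * BATCH_SIZE;
        int to = Math.min(from + BATCH_SIZE, data.size());
        return from >= to ? Collections.<T>emptyList() : data.subList(from, to);
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int k = 0; k < 40; k++)
            data.add(k);

        // 40 samples with BATCH_SIZE 16: the <= loop bound visits batches of 16, 16 and 8.
        for (int i = 0; i <= data.size() / BATCH_SIZE; i++)
            System.out.println("batch " + i + ": " + getBatch(i, data).size());
    }
}
```

The empty-list case matters because the <= bound visits one extra index when the data size is an exact multiple of the batch size.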
That’s all that is needed.
Best of all: The speedup is significant^{1}.
One thing to keep in mind here is that spreading the execution this way (running a batch over several parallel threads) adds entropy to the calculation that is sometimes not wanted. To illustrate: when the error deltas are summed in the addDeltaWeightsAndBiases method they may be added in a slightly different order every time you run. Although all contributing terms from the batch are the same, the changed order in which they are summed may give tiny differences which, over time and in large neural nets, start to show, resulting in non-reproducible runs. This is probably fine in this kind of playground neural network implementation, but if you want to do research you would have to approach parallelism in a different way (typically making big matrices of the input batch and parallelizing all the matrix multiplications of both feed-forward and backpropagation on GPUs/TPUs).
In the last article we touched on the fact that we made the actual update of the weights and biases pluggable as a strategy. The reason is that there are other ways to do it than basic gradient descent.
Remember, this:
$$W^+ = W - \eta\nabla C\tag{eq 1}\label{eq 1}$$
Which in code looks like:
public void updateWeights(Matrix weights, Matrix dCdW) {
    weights.sub(dCdW.mul(learningRate));
}
Although striving downhill, SGD (and its batched variants) suffers from the fact that the magnitude of the gradient is proportional to how steep the slope is at the point of evaluation. A flatter surface means a shorter gradient, giving a smaller step. As a consequence it might get stuck on saddle points, for instance halfway down this path:
(as you can see there are better local minima in several directions than getting stuck at that saddle point)
One way to better handle situations like this is to let the update of the weights depend not only on the calculated gradient but also on the gradient calculated in the last step. Somewhat like incorporating echoes from previous calculations.
A physical analogy is to think of a marble with mass rolling down a hill. Such a marble might have enough momentum to keep enough of its speed to cover flatter parts or even climb small hills, this way escaping the saddle point and continuing towards better local minima.
Not surprisingly one of the simplest Optimizers is called …
In momentum we introduce another constant to tell how much echo we want from previous calculations. This factor, γ, is often called momentum. Equation 1 above now becomes:
$$
\begin{align}
v^+ &= \gamma v + \eta \nabla_w C\\[0.5em]
W^+ &= W - v^+
\end{align}
$$
As you can see the Momentum optimizer keeps the last delta as v and calculates a new v based on the previous one. Typically γ is in the range 0.7 to 0.9, which means that 70% to 90% of previous gradient calculations contribute to the new calculation.
In code it is configured as:
NeuralNetwork network =
    new NeuralNetwork.Builder(2)
        .addLayer(new Layer(3, Sigmoid))
        .addLayer(new Layer(3, activation))
        .setOptimizer(new Momentum(0.02, 0.9))
        .setCostFunction(new CostFunction.Quadratic())
        .create();
And applied in the Momentum optimizer like this:
@Override
public void updateWeights(Matrix weights, Matrix dCdW) {
    if (lastDW == null) {
        lastDW = dCdW.copy().mul(learningRate);
    } else {
        lastDW.mul(momentum).add(dCdW.copy().mul(learningRate));
    }
    weights.sub(lastDW);
}
This small change on how we update the weights often improves convergence.
We can do even better. Another simple yet powerful optimizer is the Nesterov accelerated gradient (NAG).
The only difference from momentum is that in NAG we calculate the cost at a position which has already incorporated the echo from the last gradient calculation.
$$
\begin{align}
v^+ &= \gamma v + \eta \nabla_w C(W - \gamma v)\\[0.5em]
W^+ &= W - v^+
\end{align}
$$
From a programming perspective this is bad news. A nice property of the optimizers so far has been that they are only applied when the weights are updated. Here, all of a sudden, we need to incorporate the last gradient v already when calculating the cost function. Of course we could extend the idea of what an Optimizer is and allow it to also augment weight lookups, for instance. That would not be coherent though, since all other Optimizers (that I know of) only focus on the update.
With a small trick we can work around that problem. By introducing another variable x = W - γv we can rewrite the equations in terms of x. Specifically, this means that the problematic cost function depends only on x and no historic v-value^{2}. After rewriting we can rename x back to W again to make it look more like something we recognise. The equation becomes:
$$
\begin{align}
v^+ &= \gamma v - \eta \nabla_w C\\[0.5em]
W^+ &= W - \gamma v + (1 + \gamma) v^+
\end{align}
$$
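For completeness, this is the substitution at work. With \(x = W - \gamma v\) the original equations become

$$
\begin{align}
v^+ &= \gamma v + \eta \nabla C(x)\\[0.5em]
x^+ &= W^+ - \gamma v^+ = (W - v^+) - \gamma v^+ = x + \gamma v - (1 + \gamma)\,v^+
\end{align}
$$

and renaming \(x\) to \(W\) and flipping the sign of \(v\) gives the update-time-only form above.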
Much better. Now the Nesterov optimizer can be applied only at update time:
@Override
public void updateWeights(Matrix weights, Matrix dCdW) {
    if (lastDW == null) {
        lastDW = new Matrix(dCdW.rows(), dCdW.cols());
    }

    Matrix lastDWCopy = lastDW.copy();
    lastDW.mul(momentum).sub(dCdW.mul(learningRate));
    weights.add(lastDWCopy.mul(momentum).add(lastDW.copy().mul(1 + momentum)));
}
To read more about how and why NAG often works better than Momentum please have a look here and here.
There are of course a lot of other fancy Optimizers that could have been implemented too. A few of them often perform better than NAG, but at the cost of added complexity. In terms of improved convergence per lines of code, NAG strikes a sweet spot.
When I had just finished writing the first version of the neural network code I was eager to test it. I promptly downloaded the MNIST dataset and started playing around. Almost right off the bat I was up to quite remarkable figures in terms of accuracy. Compared to what others had reached on equivalent network designs mine was better, way better. This of course made me suspicious.
Then it struck me. I had compared my stats on the training dataset with their stats on the test dataset. I had trained my network in an (almost) infinite loop on the training data and never even loaded the test dataset. And when I did, I got a shock. My network was totally crap on unseen data even though it was close to demigod on the training dataset. I had just swallowed the bitter pill of overfitting.
When you train your network too aggressively you might end up in a situation where the network has learned (almost memorised) the training data, but as a side effect has also lost the general broad strokes of what the training data represents. This can be seen by plotting the network’s accuracy on the training data vs the test data during training. Typically you will notice that the accuracy on the training data continues to increase, while the accuracy on the test data initially increases, then flattens out and starts decreasing again.
That is when you’ve gone too far.
One observation here is that we need to be able to detect when this happens and stop training at that point. This is often referred to as early stopping, something I will get back to in the last article in this series, Part 5 – Training the network to read handwritten digits.
But there are a few other ways to reduce the risk of overfitting too. I will wrap up this article by presenting a simple, but powerful approach.
In L2 regularization we try to avoid letting individual weights in the network get too big (or to be more precise: too far away from zero). The idea is that instead of letting a few weights have a very strong influence over the output, we prefer that lots of weights interact, each with a moderate contribution. Based on the discussion in Part 1 – Foundation it does not seem too far-fetched that several smaller weights (compared to fewer bigger ones) will give a smoother and less sharp/peaky separation of the input data. Moreover, it seems reasonable that a smoother separation will retain the general characteristics of the input data and, conversely, that a sharp and peaky one would very precisely cut out the specific characteristics of the input data.
So how do we avoid that individual weights in the network get too big?
In L2 regularization this is accomplished by adding another cost term to the cost function:
$$
\begin{eqnarray}
C = C_0 +\underbrace{\frac{\lambda}{2}\sum_w w^2}_{\text{new cost term}}
\end{eqnarray}
$$
As you can see, big weights will contribute to a higher cost since they are squared in that additional cost sum. Smaller weights will not matter much at all.
The λ factor tells how much L2 regularization we want. Setting it to zero will cancel the L2 effect altogether. Setting it to, say, 0.5 will let the L2 regularization be a substantial part of the total cost of the network. In practice you will have to find the best possible λ given your network layout and your data.
Quite frankly, we do not care much about the cost as a specific scalar value. We are more interested in what happens when we train the network via gradient descent. Let’s see what happens with the gradient calculation of this new cost function for a specific weight k.
$$
\begin{align}
\frac{\partial C}{\partial w_k} &=\frac{\partial C_0}{\partial w_k} +\frac{\partial}{\partial w_k}\frac{\lambda}{2}\sum_w w^2\\
&=\frac{\partial C_0}{\partial w_k} + \lambda w_k
\end{align}\tag{eq 2}\label{eq 2}
$$
As you can see we only get an additional term \(\lambda w_k\) to subtract from the weight \(w_k\) in each update.
Sometimes you’ll see that the number of samples, n, and the learning rate, η, are part of the update term – i.e. \(\frac{\eta\lambda}{n}w_k\). I see no reason to do so since \(\frac{\eta\lambda}{n}\) is just another constant.
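As a single-weight sketch (hypothetical stand-alone code; the constants are illustrative only), the resulting update looks like this:

```java
public class L2Step {

    // One gradient descent step for a single weight where the L2 term
    // lambda * w has been added to the gradient.
    static double step(double w, double dC0dw, double lambda, double eta) {
        return w - eta * (dC0dw + lambda * w);
    }

    public static void main(String[] args) {
        // With a zero "plain" gradient the L2 term alone pulls the weight
        // towards zero: this is the weight decay effect.
        System.out.println(step(1.0, 0.0, 0.1, 0.1));
    }
}
```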
In code, L2 regularization can be configured as:
NeuralNetwork network =
    new NeuralNetwork.Builder(28 * 28)
        .addLayer(new Layer(30, Leaky_ReLU))
        .addLayer(new Layer(20, Leaky_ReLU))
        .addLayer(new Layer(10, Softmax))
        .initWeights(new Initializer.XavierNormal())
        .setCostFunction(new CostFunction.Quadratic())
        .setOptimizer(new Nesterov(0.01, 0.9))
        .l2(0.0002)
        .create();
And is applied at update time in this way:
public synchronized void updateWeightsAndBias() {
    if (deltaWeightsAdded > 0) {
        if (l2 > 0)
            weights.map(value -> value - l2 * value);

        Matrix average_dW = deltaWeights.mul(1.0 / deltaWeightsAdded);
        optimizer.updateWeights(weights, average_dW);
        deltaWeights.map(a -> 0);   // Clear
        deltaWeightsAdded = 0;
    }
}
This minor change makes the neural network implementation a lot easier to control when it comes to overfitting.
So, that’s a wrap for this article. Several small gems have been presented in both theory and practice … all of which, with only a few lines of code, make the neural network better, faster and stronger.
In the next article, which will be the final one in this series, we will see how well the network performs on a classic dataset, MNIST: Part 5 – Training the network to read handwritten digits.
Feedback is welcome!
This article has also been published in the Medium publication Towards Data Science. If you liked what you’ve just read, please head over to the Medium article and give it a few claps. That will help others find it too. And of course I hope you spread the word in any other way you see fit. Thanks!
Footnotes:
This is the third part in a series of articles:
I assume you have read both previous articles and have a fairly good understanding about the forward pass and the learning/training pass of a neural network.
This article will be quite different. It will show and describe it all in a few snippets of Java code.
All my code – a fully working neural net implementation – can be found, examined & downloaded here.
Normally when writing software I use a fair amount of open source – as a way to get good results faster and to be able to use fine abstractions someone else put a lot of thought into.
This time it is different though. My ambition is to show how a neural network works with a minimal requirement on your part in terms of knowing other open source libraries. If you manage to read Java 8 code you should be fine. It is all there in a few short and tidy files.
A consequence of this ambition is that I have not even imported a library for Linear Algebra. Instead two simple classes have been created with just the operations needed – meet the usual suspects: the Vec and the Matrix class. These are both very ordinary implementations and contain the typical arithmetic operations additions, subtractions, multiplications and dotproduct.
The fine thing about vectors and matrices is that they often can be used to make code expressive and tidy. Typically when you are faced with doing operation on every element from one set with every other on the other set, like for instance weighted sums over a set of inputs, there is a good chance you can arrange your data in vectors and/or matrices and get a very compact and expressive code. A neural network is no exceptions to this^{1}.
I too have collapsed the calculations as much as possible by using objects of types Vec and Matrix. The result is neat and tidy and not far from mathematical expressions. No forloops in forloops are blurring the top view. However, when inspecting the code I encourage you to make sure that any neat looking call on objects of type Vec or Matrix in fact results in exactly the series of arithmetic operations which were defined in part 1 and part 2.
Two operations on the Vec class is not as common as the typical ones I mentioned above. Those are:
A typical NeuralNetwork has many Layers. Each Layer can be of any size and contains the weights between the preceding layer and itself as well as the biases. Each layer is also configured with an Activation and an Optimizer (more on what that is later).
Whenever designing something that should be highly configurable the builder pattern is a good choice^{2}. This gives us a very straightforward and readable way to define and create neural networks:
NeuralNetwork network =
    new NeuralNetwork.Builder(30)          // input to network is of size 30
        .addLayer(new Layer(20, ReLU))
        .addLayer(new Layer(16, Sigmoid))
        .addLayer(new Layer(8, Sigmoid))
        .setCostFunction(new CostFunction.Quadratic())
        .setOptimizer(new GradientDescent(0.01))
        .create();
To use such a network you typically either feed a single input vector to it, or you also add an expected result in the call to the evaluate() method. When doing the latter the network will observe the difference between the actual output and the expected one and store an impression of this. In other words, it will backpropagate the error. For good reasons (which I will get back to in the next article) the network does not immediately update its weights and biases. To do so – i.e. to let any impressions “sink in” – a call to updateFromLearning() has to be made.
For example:
// Evaluate without learning
Vec in1 = new Vec(0.1, 0.7);
Result out1 = network.evaluate(in1);

// Evaluate with learning (i.e. an expected outcome is provided)
Vec exp1 = new Vec(0.3, 0.2);
Result out2 = network.evaluate(in1, exp1);

// Let the observation "sink in"
network.updateFromLearning();

// ... at this point we have a better network and the result of this call will be improved
out2 = network.evaluate(in1, exp1);
Let’s dive into the feed forward pass – the call to evaluate(). Please note in the marked lines below that the input data is passed through layer by layer:
/**
 * Evaluates an input vector, returning the network's output.
 * If <code>expected</code> is specified the result will contain
 * a cost and the network will gather some learning from this
 * operation.
 */
public Result evaluate(Vec input, Vec expected) {
    Vec signal = input;
    for (Layer layer : layers)
        signal = layer.evaluate(signal);

    if (expected != null) {
        learnFrom(expected);
        double cost = costFunction.getTotal(expected, signal);
        return new Result(signal, cost);
    }
    return new Result(signal);
}
Within the layer the input vector is multiplied with the weights and then the bias is added. That is in turn used as input to the activation function. See the marked line. The layer also stores the output vector since it is used in the backpropagation.
/**
 * Feed the in-vector, i, through this layer.
 * Stores a copy of the out vector.
 * @param i The input vector
 * @return The out vector o (i.e. the result of o = iW + b)
 */
public Vec evaluate(Vec i) {
    if (!hasPrecedingLayer()) {
        out.set(i);   // No calculation in the input layer, just store data
    } else {
        out.set(activation.fn(i.mul(weights).add(bias)));
    }
    return out.get();
}
(The reason there are get and set operations on the out variable is something I will get back to in the next article. For now just think of it as a variable of type Vec.)
The code contains a few predefined activation functions. Each of these contains both the actual activation function, σ, as well as its derivative, σ’.
// --------------------------------------------------------------------
// --- A few predefined ones ---
// --------------------------------------------------------------------
// The simple properties of most activation functions as stated above makes
// it easy to create the majority of them by just providing lambdas for
// fn and the diff dfn.

public static Activation ReLU = new Activation(
    "ReLU",
    x -> x <= 0 ? 0 : x,            // fn
    x -> x <= 0 ? 0 : 1             // dFn
);

public static Activation Leaky_ReLU = new Activation(
    "Leaky_ReLU",
    x -> x <= 0 ? 0.01 * x : x,     // fn
    x -> x <= 0 ? 0.01 : 1          // dFn
);

public static Activation Sigmoid = new Activation(
    "Sigmoid",
    x -> 1.0 / (1.0 + exp(-x)),     // fn
    x -> x * (1.0 - x)              // dFn (expressed in terms of the output)
);

public static Activation Softplus = new Activation(
    "Softplus",
    x -> log(1.0 + exp(x)),         // fn
    x -> 1.0 / (1.0 + exp(-x))      // dFn
);

public static Activation Identity = new Activation(
    "Identity",
    x -> x,                         // fn
    x -> 1                          // dFn
);
Also there are a few cost functions included. The Quadratic looks like this:
/**
 * Cost function: Quadratic, C = ∑(y − exp)^2
 */
class Quadratic implements CostFunction {
    @Override
    public String getName() {
        return "Quadratic";
    }

    @Override
    public double getTotal(Vec expected, Vec actual) {
        Vec diff = actual.sub(expected);
        return diff.dot(diff);
    }

    @Override
    public Vec getDerivative(Vec expected, Vec actual) {
        return actual.sub(expected).mul(2);
    }
}
A cost function has one method for calculating the total cost (as a scalar) but also the important derivative of the cost function to be used in the backpropagation.
As I mentioned above: if an expected outcome is passed to the evaluate function the network will learn from it. See the marked lines.
/**
 * Evaluates an input vector, returning the network's output.
 * If <code>expected</code> is specified the result will contain
 * a cost and the network will gather some learning from this
 * operation.
 */
public Result evaluate(Vec input, Vec expected) {
    Vec signal = input;
    for (Layer layer : layers)
        signal = layer.evaluate(signal);

    if (expected != null) {
        learnFrom(expected);
        double cost = costFunction.getTotal(expected, signal);
        return new Result(signal, cost);
    }
    return new Result(signal);
}
In the learnFrom() method the actual backpropagation happens. Here you should be able to follow the steps from part 2 in detail in code. It is somewhat soothing to see that the rather lengthy mathematical expressions from part 2 just boil down to this:
/**
 * Will gather some learning based on the <code>expected</code> vector
 * and how that differs from the actual output of the network. This
 * difference (or error) is backpropagated through the net. To make
 * it possible to use mini batches the learning is not immediately
 * realized - i.e. <code>learnFrom</code> does not alter any weights.
 * Use <code>updateFromLearning()</code> to do that.
 */
private void learnFrom(Vec expected) {
    Layer layer = getLastLayer();

    // The error is initially the derivative of the cost function.
    Vec dCdO = costFunction.getDerivative(expected, layer.getOut());

    // Iterate backwards through the layers
    do {
        Vec dCdI = layer.getActivation().dCdI(layer.getOut(), dCdO);
        Matrix dCdW = dCdI.outerProduct(layer.getPrecedingLayer().getOut());

        // Store the deltas for weights and biases
        layer.addDeltaWeightsAndBiases(dCdW, dCdI);

        // Prepare error propagation and store for next iteration
        dCdO = layer.getWeights().multiply(dCdI);

        layer = layer.getPrecedingLayer();
    } while (layer.hasPrecedingLayer());   // Stop when we reach the input layer
}
Please note that the learning (the partial derivatives) in the backpropagation is stored per layer by a call to the addDeltaWeightsAndBiases() method.
Not until a call to the updateFromLearning() method has been made do the weights and biases change:
/**
 * Let all gathered (but not yet realised) learning "sink in".
 * That is: update the weights and biases based on the deltas
 * collected during evaluation & training.
 */
public void updateFromLearning() {
    for (Layer l : layers)
        if (l.hasPrecedingLayer())         // Skip input layer
            l.updateWeightsAndBias();
}
The reason why this is designed as two separate steps is that it allows the network to observe a lot of samples and learn from these … and only then update the weights and biases as an average over all observations. This is in fact called Batch Gradient Descent or Mini-Batch Gradient Descent. We will get back to these variants in the next article. For now you can just as well call updateFromLearning after each call to evaluate (with expectations) to make the network improve after each sample. Updating the network after each sample is called Stochastic Gradient Descent.
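The accumulate-then-average mechanics can be sketched for a single weight in isolation (hypothetical stand-alone code, not the library's Layer class):

```java
public class DeltaAccumulator {

    private double deltaSum = 0;
    private int added = 0;

    // Called once per sample in the (mini) batch.
    void addDelta(double delta) {
        deltaSum += delta;
        added++;
    }

    // Applies the average of all collected deltas, then clears the state.
    double updateWeight(double weight, double learningRate) {
        if (added == 0) return weight;
        double averageDelta = deltaSum / added;
        deltaSum = 0;
        added = 0;
        return weight - learningRate * averageDelta;
    }

    public static void main(String[] args) {
        DeltaAccumulator acc = new DeltaAccumulator();
        acc.addDelta(1.0);   // sample 1
        acc.addDelta(3.0);   // sample 2
        System.out.println(acc.updateWeight(10.0, 0.5)); // prints 9.0 (average delta is 2.0)
    }
}
```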
This is what the updateWeightsAndBias() method looks like. Notice that an average of all calculated changes to the weights and biases is fed into the two methods updateWeights() and updateBias() on the optimizer object.
public synchronized void updateWeightsAndBias() {
    if (deltaWeightsAdded > 0) {
        Matrix average_dW = deltaWeights.mul(1.0 / deltaWeightsAdded);
        optimizer.updateWeights(weights, average_dW);
        deltaWeights.map(a -> 0);           // Clear
        deltaWeightsAdded = 0;
    }

    if (deltaBiasAdded > 0) {
        Vec average_bias = deltaBias.mul(1.0 / deltaBiasAdded);
        bias = optimizer.updateBias(bias, average_bias);
        deltaBias = deltaBias.map(a -> 0);  // Clear
        deltaBiasAdded = 0;
    }
}
We have not really talked about the concept of optimizers yet. We have only said that the weights and biases are updated by subtracting the partial derivatives scaled by some small learning rate – i.e.:
$$w^+=w – \eta \frac {\partial C}{\partial w}$$
And yes, that is a good way to do it and it is easy to understand. However, there are other ways. A few of these are a bit more complicated but offer faster convergence. The different strategies for updating weights and biases are often referred to as Optimizers. We will see another way to do it in the next article. As a consequence it is reasonable to leave this as a pluggable strategy that the network can be configured to use.
For now we just use the simple GradientDescent strategy for updating our weights and biases:
/**
 * Updates weights and biases based on a constant learning rate - i.e. W -= η * dC/dW
 */
public class GradientDescent implements Optimizer {

    private double learningRate;

    public GradientDescent(double learningRate) {
        this.learningRate = learningRate;
    }

    @Override
    public void updateWeights(Matrix weights, Matrix dCdW) {
        weights.sub(dCdW.mul(learningRate));
    }

    @Override
    public Vec updateBias(Vec bias, Vec dCdB) {
        return bias.sub(dCdB.mul(learningRate));
    }
}
And that’s pretty much it. With a few lines of code we can configure a neural network, feed data through it and make it learn how to classify unseen data. This will be shown in Part 5 – Training the network to read handwritten digits.
But first, let’s pimp this up a few notches. See you in Part 4 – Better, faster, stronger.
Feedback is welcome!
Footnotes:
This is the second part in a series of articles:
I assume you have read the last article and that you have a good idea about how a neural network can transform data. If the last article required a good imagination (thinking about subspaces in multiple dimensions), this article will be more demanding in terms of math. Brace yourself: paper and pen. A silent room. Careful thought. A good night’s sleep. Time, stamina and effort. It will sink in.
In the last article we concluded that a neural network can be used as a highly adjustable vector function. We adjust that function by changing weights and the biases but it is hard to change these by hand. They are often just too many and even if they were fewer it would nevertheless be very hard to get good results by hand.
The fine thing is that we can let the network adjust all this by itself by training it. This can be done in different ways. Here I will describe something called supervised learning. In this kind of learning we have a dataset that has been labeled – i.e. we already have the expected output for every input in this dataset. This will be our training dataset. We also make sure that we have a labeled dataset that we never train the network on. This will be our test dataset, used to verify how well the trained network classifies unseen data.
When training our neural network we feed sample after sample from the training dataset through the network, and for each of these we inspect the outcome. In particular we check how much the outcome differs from what we expected – i.e. the label. The difference between what we expected and what we got is called the Cost (sometimes also called Error or Loss). The cost tells us how right or wrong our neural network was on a particular sample. This measure can then be used to adjust the network slightly so that it will be less wrong the next time this sample is fed through the network.
There are several different cost functions that can be used (see this list for instance).
In this article I will use the quadratic cost function:
$$
C = \sum\limits_j \left(y_j – exp_j\right)^2
$$
(Sometimes this is also written with a constant 0.5 in front which will make it slightly cleaner when we differentiate it. We will stick to the version above.)
Returning to our example from part 1.
If we expected:
$$
exp = \begin{bmatrix}
1 \\
0.2 \\
\end{bmatrix}
$$
… and got …
$$
y = \begin{bmatrix}
0.712257432295742 \\
0.533097573871501 \\
\end{bmatrix}
$$
… the cost would be:
$$
\begin{align}
C & = (1 – 0.712257432295742)^2\\[0.8ex]
& + (0.2 – 0.533097573871501)^2\\[0.8ex]
& = 0.287742567704258^2\\[0.8ex]
& + (-0.333097573871501)^2\\[0.8ex]
& = 0.0827957852690395\\[0.8ex]
& + 0.11095399371908007\\[0.8ex]
& = 0.19374977898811957
\end{align}
$$
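This calculation is easy to double-check in code (a throwaway sketch; the numbers are copied from the example above):

```java
public class QuadraticCostCheck {

    // C = sum over j of (y_j - exp_j)^2
    static double cost(double[] y, double[] exp) {
        double c = 0;
        for (int j = 0; j < y.length; j++) {
            double diff = y[j] - exp[j];
            c += diff * diff;
        }
        return c;
    }

    public static void main(String[] args) {
        double[] y   = {0.712257432295742, 0.533097573871501};
        double[] exp = {1, 0.2};
        System.out.println(cost(y, exp)); // approximately 0.19374977898811957
    }
}
```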
As the cost function is written above, the size of the error explicitly depends on the network output and the value we expected. If we instead define the cost in terms of the input values (which of course are deterministically related to the output once we also consider all weights, biases and the activation functions used) we could write \(C = C(y, exp) = C(W, b, S_σ, x, exp)\) – i.e. the cost is a function of the weights, the biases, the set of activation functions, the input x, and the expectation.
The cost is just a scalar value for all this input. Since the function is continuous and differentiable (sometimes only piecewise differentiable, for example when ReLU activations are used) we can imagine a continuous landscape of hills and valleys for the cost function. In higher dimensions this landscape is hard to visualize but with only two weights w_{1} and w_{2} it might look somewhat like this:
Suppose we got exactly the cost value specified by the red dot in the image (based on just w_{1} and w_{2} in that simplified case). Our aim now is to improve the neural network. If we could reduce the cost, the neural network would be better at classifying our labeled data. Preferably we would like to find the global minimum of the cost function within this landscape. In other words: the deepest valley of them all. Doing so is hard and there are no explicit methods for a function as complex as a neural network. However, we can find a local minimum by using an iterative process called Gradient Descent. A local minimum might be sufficient for our needs, and if not, we can always adjust the network design to get a new cost landscape to explore locally and iteratively.
From multivariable calculus we know that the gradient of a function, \(\nabla f\), at a specific point will be a vector tangential to the surface pointing in the direction where the function increases most rapidly. Conversely, the negative gradient \(-\nabla f\) will point in the direction where the function decreases most rapidly. This fact we can use to calculate new weights W^{+} from our current weights W:
$$
W^+ = W - \eta\nabla C\tag{eq 1}\label{eq 1}
$$
In the equation above \(\eta\) is just a small constant called the learning rate. This constant tells us how much of the gradient vector we will use to change our current set of weights into new ones. If chosen too small the weights will be adjusted too slowly and our convergence towards a local minimum will take a long time. If set too high we might overshoot and miss (or get a bouncy, non-convergent iterative behaviour).
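Element by element, the update in equation (eq 1) is plain arithmetic. A raw-array sketch (the library's Vec and Matrix classes wrap the same operation):

```java
public class GradientDescentStep {

    // w+ = w - eta * gradC, applied element by element.
    static double[] step(double[] w, double[] gradC, double eta) {
        double[] wPlus = new double[w.length];
        for (int i = 0; i < w.length; i++) {
            wPlus[i] = w[i] - eta * gradC[i];
        }
        return wPlus;
    }

    public static void main(String[] args) {
        double[] w = {1.0, -2.0};
        double[] gradC = {4.0, -8.0};
        // A learning rate of 0.25 moves both weights exactly to zero here.
        System.out.println(java.util.Arrays.toString(step(w, gradC, 0.25)));
    }
}
```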
Everything in the equation above is simple matrix arithmetic. The thing we need to take a closer look at is the gradient of the cost function with respect to the weights:
$$
\nabla C = \begin{bmatrix}
\frac{\partial C}{\partial w_1} \\
\frac{\partial C}{\partial w_2} \\
\vdots\\
\frac{\partial C}{\partial w_n} \\
\end{bmatrix}
$$
As you see, we are for the time being not interested in the specific scalar cost value, C, but rather in how much the cost function changes when the weights change (calculated one by one).
If we expand the neat vector form \(\eqref{eq 1}\) it looks like this:
$$
\begin{bmatrix}
w^+_1\\
w^+_2\\
\vdots\\
w^+_n\\
\end{bmatrix}
=
\begin{bmatrix}
w_1\\
w_2\\
\vdots\\
w_n\\
\end{bmatrix}
– \eta
\begin{bmatrix}
\frac{\partial C}{\partial w_1} \\
\frac{\partial C}{\partial w_2} \\
\vdots\\
\frac{\partial C}{\partial w_n} \\
\end{bmatrix}
$$
The fine thing about using the gradient is that it will adjust the weights most in need of a change more, and the weights in less need of a change less. This is closely connected to the fact that the negative gradient vector points exactly in the direction of maximum descent. To see this, please have a look once again at the simplified cost-function landscape image above and try to visualize a separation of the red gradient vector into its component vectors along the axes.
The gradient descent idea works just as well in N dimensions, although it is hard to visualize. The gradient will still tell which components need to change more and which ones need to change less to reduce the function C.
So far I have only talked about weights. What about the biases? Well, the same reasoning is just as valid for them, but instead we calculate (which is simpler):
$$\frac{\partial C}{\partial b}$$
Now it is time to see how we can calculate all these partial derivatives for both weights and biases. This will lead us into more hairy territory, so buckle up. To make sense of it all we first need a word on …
The rest of this article is notation-intense. A lot of the symbols and letters you will see are just there as subscripted indexes to help us keep track of which layer and neuron I am referring to. Don’t let these indexes make the mathematical expressions intimidating. They are there to help and to make things precise. Here is a short guide on how to read them:
Also please note that the input to the neuron was called z in the last article (which is quite common too) but has been changed to i here. The reason is that I find it easier to remember i for input and o for output.
The easiest way to describe how to calculate the partial derivatives is to look at one specific single weight:
$$\frac{\partial C}{\partial w_k}$$
… i.e. how much does the total cost change when exactly that w_{k} changes.
We will also see that there is a slight difference in how to treat the partial derivative depending on whether that specific weight is connected to the last output layer or to one of the preceding hidden layers.
Now consider a single weight w_{L} in the last layer:
Our task is to find:
$$\frac{\partial C}{\partial w_L}$$
As outlined in last article there are a few steps between the weight and the cost function:
It is fair to say that it is hard to calculate …
$$\frac{\partial C}{\partial w_L}$$
… just by looking at it. However, if we separate it into the three steps I just described things get a lot easier. Since the three steps are chained functions we could separate the derivative of it all by using the chain rule from calculus:
$$
\frac{\partial C}{\partial w_L} = \frac{\partial i_L}{\partial w_L}\frac{\partial o_{L1}}{\partial i_L}\frac{\partial C}{\partial o_{L1}}\tag{eq 2}\label{eq 2}
$$
As a matter of fact these three partial derivatives happen to be straightforward to calculate:
$$\frac{\partial i_L}{\partial w_L}$$ – How much does the input value to the neuron change when w_{L} changes? The input to the neuron is o_{H}w_{L} + b. The derivative of this with respect to w_{L} is simply o_{H} – the output from the previous layer.
$$\frac{\partial o_{L1}}{\partial i_L}$$ – How much does the output from the neuron change when the input changes? The only function transforming the input to output is the activation function – hence we just need the derivative of the activation function, σ’.
$$\frac{\partial C}{\partial o_{L1}}$$ – How much does the cost change when the output from the neuron changes? This is simply the derivative of the cost function with respect to a specific output value. In the case of the Quadratic cost function this is \(2(o_{L1} - exp_{L1})\). (For this specific (nice) cost function the derivatives of all other terms in the total cost sum are 0 since they do not depend on o_{L1}.)
This means we have all three factors needed for calculating
$$\frac{\partial C}{\partial w_L}$$
In other words we now know how to update all weights in the last layer.
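For a sigmoid activation combined with the quadratic cost, the three factors collapse into a one-liner. A sketch, with one assumption worth noting: σ'(i) is expressed via the neuron's output as o(1 − o), which only holds for the sigmoid:

```java
public class LastLayerGradient {

    // dC/dw_L = o_H * sigma'(i_L) * 2 * (o_L - exp)
    // where o_H is the output of the preceding neuron and, for a sigmoid,
    // sigma'(i_L) = o_L * (1 - o_L).
    static double dCdW(double oPrev, double out, double expected) {
        return oPrev * (out * (1 - out)) * 2 * (out - expected);
    }

    public static void main(String[] args) {
        // When the output already matches the expectation the gradient is zero:
        System.out.println(dCdW(0.5, 0.8, 0.8)); // prints 0.0
        // Output below expectation gives a negative gradient, so the weight grows:
        System.out.println(dCdW(0.5, 0.8, 1.0) < 0); // prints true
    }
}
```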
Let’s work our way backwards from here.
Now consider a single weight w_{H} in the hidden layer before the last layer:
Our goal is to find
$$\frac{\partial C}{\partial w_H}$$
… i.e. how much a change in w_{H} would change the total cost.
We proceed as we did in the last layer. The chain rule gives us:
$$
\frac{\partial C}{\partial w_H} = \frac{\partial i_H}{\partial w_H}\frac{\partial o_{H}}{\partial i_H}\frac{\partial C}{\partial o_{H}}
$$
The first two factors are the same as before: the output from the layer feeding this weight and the derivative of the activation function, respectively.
However, the factor …
$$\frac{\partial C}{\partial o_{H}}$$
… is a bit more tricky. The reason is that a change in o_{H} obviously changes the input to all neurons in the last layer and as a consequence alters the cost function in a much broader sense as compared to when we just had to care about one output from a single last layer neuron (I have illustrated this by making the connections wider in the image above).
To resolve this we need to split the partial derivative of the total cost function into each and every contribution from the last layer.
$$
\frac{\partial C}{\partial o_{H}} = \frac{\partial C_{oL1}}{\partial o_{H}} + \frac{\partial C_{oL2}}{\partial o_{H}} + \cdots + \frac{\partial C_{oLn}}{\partial o_{H}}
$$
… where each term describes how much the Cost changes when o_{H} changes, but only for that part of the signal that is routed via that specific output neuron.
Let’s inspect that single term for the first neuron. Again it will get a bit clearer if we separate it by using the chain rule:
$$
\frac{\partial C_{oL1}}{\partial o_{H}} = \frac{\partial i_{L1}}{\partial o_{H}} \frac{\partial C_{oL1}}{\partial i_{L1}}
$$
Let’s look at each of these:
$$\frac{\partial i_{L1}}{\partial o_{H}}$$ – How much does the input to the last layer change when the output of the neuron in the current hidden layer changes? The input to the last-layer neuron is i_{L1} = o_{H}w_{L} + b. This time, however, note that we take the partial derivative with respect to o_{H}, which is why the result is simply the weight w_{L}.
$$\frac{\partial C_{oL1}}{\partial i_{L1}}$$ – How much does the cost change when the input to the last layer changes? But hang on … doesn’t this look familiar? It should, because this is exactly what we had a moment ago for this neuron when we calculated the weight adjustments of the last layer. In fact it is just the last two factors in the chain \(\eqref{eq 2}\). This is a very important realisation and the magic part of backpropagation.
Let’s pause for a moment and let it all sink in.
To summarize what we’ve just found out:
From a computational viewpoint this is very good news since we only depend on calculations we just did (in the last layer) and we do not have to traverse any deeper than that. This is sometimes called dynamic programming, which often is a radical way to speed up an algorithm once the dynamic formulation is found.
And since we only depend on generic calculations in the last layer (in particular we are not depending on the cost function) we can now proceed backwards layer by layer. No hidden layer is different from any other layer in terms of how we calculate the weight updates. We say that we let the cost (or error or loss) backpropagate. When we reach the first layer we are done. At that point we have partial derivatives for all weights in the equation \(\eqref{eq 1}\) and are ready to make our gradient descent step downhill in the cost function landscape.
And you know what: that’s it.
Ok, I hear you. Actually they follow the same pattern. The only difference is that the first term in the derivative chain of both the last layer and the hidden layer expressions would be:
$$\frac{\partial i}{\partial b}$$
… instead of
$$\frac{\partial i}{\partial w}$$
Since the input to the neuron is o_{H}w_{L} + b, the partial derivative of that with respect to b is simply 1. So where in the weight case we multiplied the chain by the output from the preceding layer, in the bias case we simply multiply by one.
All other calculations are identical. You can think of the biases as a simpler subcase to the weight calculations. The fact that a change in the biases does not depend on the output from previous neuron actually makes sense. The biases are added “from the side” regardless of the data coming on the wire to the neuron.
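In code, the bias case is simply the weight case with the first factor replaced by 1 (same sketch assumptions as before: sigmoid activation and quadratic cost):

```java
public class BiasGradient {

    // dC/db = 1 * sigma'(i) * 2 * (o - exp) -- identical chain to a weight,
    // except the first factor is 1 instead of the preceding layer's output.
    static double dCdB(double out, double expected) {
        return (out * (1 - out)) * 2 * (out - expected);
    }

    // For comparison: the weight gradient is the bias gradient scaled by o_H.
    static double dCdW(double oPrev, double out, double expected) {
        return oPrev * dCdB(out, expected);
    }

    public static void main(String[] args) {
        double biasGrad   = dCdB(0.7, 1.0);
        double weightGrad = dCdW(0.41, 0.7, 1.0);
        System.out.println(weightGrad == 0.41 * biasGrad); // prints true
    }
}
```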
From my quite recent descent into backpropagation-land I can imagine that the reading above can be quite something to digest. Don’t be paralyzed. In practice it is quite straightforward, and things probably get clearer and easier to understand when illustrated with an example.
Let’s build on the example from Part 1 – Foundation:
Let us start with w_{5}:
$$
\begin{align}
\frac{\partial C}{\partial w_5} & = \frac{\partial i_L}{\partial w_5}\frac{\partial o_{L1}}{\partial i_L}\frac{\partial C}{\partial o_{L1}} \\[1.2ex]
& = O_{h1}\cdot \sigma'(O_{L1}) \cdot 2 (O_{L1} – exp_{L1})\\[1.2ex]
\end{align}
$$
By picking data from the forward pass in part 1 and by using data from the example cost calculation above we will calculate factor by factor. Also let’s assume that we have a learning rate of η = 0.1 for this example.
$$
\begin{align}
O_{h1} & = 0.41338242108267 \\[0.8ex]
\\[0.8ex]
\sigma'(O_{L1}) & = \sigma(I_{L1})(1 - \sigma(I_{L1})) \\[0.8ex]
& = O_{L1} (1 - O_{L1}) \\[0.8ex]
& = 0.712257432295742\\[0.8ex]
& \cdot ( 1 – 0.712257432295742) \\[0.8ex]
& = 0.204946782435218 \\[0.8ex]
\\[0.8ex]
2 (O_{L1} – exp_{L1}) & = 2 \cdot (0.712257432295742 – 1)\\[0.8ex]
& = 2\cdot (-0.2877425677042583)\\[0.8ex]
&= -0.5754851354085166\\[0.8ex]
\\[0.8ex]
\frac{\partial C}{\partial w_5} & = 0.41338242108267\\[0.1ex]
& \cdot 0.204946782435218\\[0.1ex]
& \cdot (-0.57548513540851)\\[0.1ex]
& = -0.0487559046913\tag{eq 3}\label{eq 3}\\[0.8ex]
\\[0.8ex]
\boldsymbol{w^+_5} & = w_5 – \eta \cdot \frac{\partial C}{\partial w_5} \\[0.8ex]
&= 0.7 – 0.1 \cdot (-0.0487559046913) \\[0.8ex]
& = 0.70487559046914 \\[0.8ex]
\end{align}
$$
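These numbers are straightforward to verify with a few lines of code (values copied from the example above; σ' is expressed via the output, as in the calculation):

```java
public class VerifyW5 {

    // Recomputes dC/dw5 and the updated w5 from the worked example.
    static double[] compute() {
        double oH1 = 0.41338242108267;    // output of hidden neuron 1
        double oL1 = 0.712257432295742;   // output of last-layer neuron 1
        double exp = 1.0;                 // expected output
        double eta = 0.1;                 // learning rate
        double w5  = 0.7;                 // current weight

        double dCdW5 = oH1 * (oL1 * (1 - oL1)) * 2 * (oL1 - exp);
        double w5New = w5 - eta * dCdW5;
        return new double[]{dCdW5, w5New};
    }

    public static void main(String[] args) {
        double[] r = compute();
        System.out.println(r[0]); // close to -0.0487559046913
        System.out.println(r[1]); // close to  0.70487559046914
    }
}
```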
Likewise we can calculate all other parameters in the last layer:
$$
\begin{align}
\boldsymbol{w^+_6} & = -0.3068546661278576 \\[0.8ex]
\boldsymbol{w^+_7} & = 0.5110160830532412 \\[0.8ex]
\boldsymbol{w^+_8} & = -0.11548767720436505 \\[0.8ex]
\boldsymbol{b^+_3} & = 0.16179438268412716 \\[0.8ex]
\boldsymbol{b^+_4} & = 0.3334180996136582 \\[0.8ex]
\end{align}
$$
Now we move one layer backwards and focus on w_{1}:
$$
\begin{align}
\frac{\partial C}{\partial w_1} & = \frac{\partial i_{H1}}{\partial w_1} \frac{\partial o_{H1}}{\partial i_{H1}}\cdot \left(\sum\limits_j{\frac{\partial C_{oLj}}{\partial o_{H1}}}\right)\\[0.8ex]
& = \frac{\partial i_{H1}}{\partial w_1}\frac{\partial o_{H1}}{\partial i_{H1}}\cdot \left(\frac{\partial C_{oL1}}{\partial o_{H1}} + \frac{\partial C_{oL2}}{\partial o_{H1}}\right) \\[0.8ex]
& = \frac{\partial i_{H1}}{\partial w_1}\frac{\partial o_{H1}}{\partial i_{H1}}\cdot
\left(
\frac{\partial i_{L1}}{\partial o_{H1}} \frac{\partial C_{oL1}}{\partial i_{L1}} +
\frac{\partial i_{L2}}{\partial o_{H1}} \frac{\partial C_{oL2}}{\partial i_{L2}}
\right)
\\[1.2ex]
& = x_{1}\cdot \sigma'(I_{H1}) \cdot
\left(
w_5\frac{\partial C_{oL1}}{\partial i_{L1}} +
w_6\frac{\partial C_{oL2}}{\partial i_{L2}}
\right)
\\[1.2ex]
\end{align}
$$
Once again, by picking data from the forward pass in part 1 and by using data from above this is straightforward to calculate factor by factor.
$$
\begin{align}
x_1 & = 2 \\[0.1ex]
\\[0.1ex]
\sigma'(I_{H1}) & = \sigma(I_{H1})(1 - \sigma(I_{H1})) \\[0.8ex]
& = O_{H1} (1 - O_{H1}) \\[0.8ex]
& = 0.41338242108266987\\[0.8ex]
& \cdot ( 1 - 0.41338242108266987) \\[0.8ex]
& = 0.2424973950225001 \\[0.8ex]
\\[0.8ex]
w_5 &= 0.7\\[0.8ex]
w_6 &= -0.3\\[0.8ex]
\end{align}
$$
From calculations in last layer:
$$
\begin{align}
\frac{\partial C_{oL1}} {\partial i_{L1}} &= 0.2049… \cdot (-0.5754…) \text{ (from }\eqref{eq 3}\text{)}\\[0.1ex]
&= -0.1179438268412716\\[0.1ex]
\\[0.8ex]
\frac{\partial C_{oL2}} {\partial i_{L2}} &= 0.16581900386341794\\[0.1ex]
& \text{ (calculated likewise, not shown)}\\[0.8ex]
\end{align}
$$
Now we have it all and can finally find w_{1}:
$$
\begin{align}
\frac{\partial C}{\partial w_1} & = 2 \cdot 0.2424973950225001 \\[0.8ex]
& \cdot (0.7 \cdot (-0.117943…) + (-0.3) \cdot 0.165819…)\\[0.8ex]
& = -0.0641679049644533\\[0.8ex]
\\[0.8ex]
\boldsymbol{w^+_1} & = w_1 - \eta \cdot \frac{\partial C}{\partial w_1} \\[0.8ex]
&= 0.3 - 0.1\cdot(-0.0641679049644533) \\[0.8ex]
& = 0.306416790496445 \\[0.8ex]
\end{align}
$$
In the same manner we can calculate all other parameters in the hidden layer:
$$
\begin{align}
\boldsymbol{w^+_2} & = 0.20093134370476357 \\[0.8ex]
\boldsymbol{w^+_3} & = -0.39037481425533205\\[0.8ex]
\boldsymbol{w^+_4} & = 0.6013970155571453\\[0.8ex]
\boldsymbol{b^+_1} & = 0.25320839524822264\\[0.8ex]
\boldsymbol{b^+_2} & = 0.4504656718523818 \\[0.8ex]
\end{align}
$$
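The chain rule for the hidden layer can be sketched in Java as well. Again, this is only an illustration of the w₁ calculation with names of my own choosing, fed with the numbers from the example:

```java
public class HiddenLayerUpdate {
    // Chain rule: dC/dw1 = x1 * sigma'(i_H1) * (w5 * deltaL1 + w6 * deltaL2),
    // where sigma'(i_H1) = o_H1 * (1 - o_H1) and deltaLj = dC/di_Lj
    static double gradientW1(double x1, double oH1,
                             double w5, double deltaL1,
                             double w6, double deltaL2) {
        return x1 * oH1 * (1 - oH1) * (w5 * deltaL1 + w6 * deltaL2);
    }

    public static void main(String[] args) {
        double dCdW1 = gradientW1(2.0,                        // input x1
                                  0.41338242108266987,        // output of H1
                                  0.7, -0.1179438268412716,   // w5 and dC/di_L1
                                  -0.3, 0.16581900386341794); // w6 and dC/di_L2
        double w1New = 0.3 - 0.1 * dCdW1;                     // eta = 0.1
        System.out.println(dCdW1);  // ≈ -0.0641679
        System.out.println(w1New);  // ≈ 0.3064168
    }
}
```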
That’s it. We have successfully updated the weights and biases in the whole network!
Let us verify that the network now behaves slightly better on the input
x = [2, 3].
The first time we fed that vector through the network we got the output
y = [0.712257432295742, 0.533097573871501]
… and a cost of 0.19374977898811957.
Now, after we have updated the weights we get the output
y = [0.7187729999291985, 0.523807451860988]
… and a cost of 0.18393989144952877.
Please note that both components have moved slightly towards what we expected, [1, 0.2], and that the total cost for the network is now lower.
Repeated training will make it even better!
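The whole verification above can be reproduced as one self-contained training step in Java: a forward pass, backpropagation, a parameter update, and a second forward pass. This is a standalone sketch of this specific 2-2-2 network with my own naming, not the full implementation from part 3:

```java
public class OneTrainingStep {
    static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

    // Forward pass; returns {oH1, oH2, oL1, oL2}
    static double[] forward(double[] w, double[] b, double x1, double x2) {
        double oH1 = sigmoid(w[0] * x1 + w[2] * x2 + b[0]);
        double oH2 = sigmoid(w[1] * x1 + w[3] * x2 + b[1]);
        double oL1 = sigmoid(w[4] * oH1 + w[6] * oH2 + b[2]);
        double oL2 = sigmoid(w[5] * oH1 + w[7] * oH2 + b[3]);
        return new double[]{oH1, oH2, oL1, oL2};
    }

    // Quadratic cost over the two output neurons
    static double cost(double[] o, double e1, double e2) {
        return (o[2] - e1) * (o[2] - e1) + (o[3] - e2) * (o[3] - e2);
    }

    public static void main(String[] args) {
        double x1 = 2, x2 = 3, exp1 = 1.0, exp2 = 0.2, eta = 0.1;
        double[] w = {0.3, 0.2, -0.4, 0.6, 0.7, -0.3, 0.5, -0.1}; // w1..w8
        double[] b = {0.25, 0.45, 0.15, 0.35};                    // b1..b4

        double[] o = forward(w, b, x1, x2);
        double costBefore = cost(o, exp1, exp2);

        // Backward pass: error signals dC/di at the output neurons ...
        double dL1 = o[2] * (1 - o[2]) * 2 * (o[2] - exp1);
        double dL2 = o[3] * (1 - o[3]) * 2 * (o[3] - exp2);
        // ... and at the hidden neurons
        double dH1 = o[0] * (1 - o[0]) * (w[4] * dL1 + w[5] * dL2);
        double dH2 = o[1] * (1 - o[1]) * (w[6] * dL1 + w[7] * dL2);

        // Gradient descent updates for all weights and biases
        w[0] -= eta * x1 * dH1;    w[2] -= eta * x2 * dH1;    b[0] -= eta * dH1;
        w[1] -= eta * x1 * dH2;    w[3] -= eta * x2 * dH2;    b[1] -= eta * dH2;
        w[4] -= eta * o[0] * dL1;  w[6] -= eta * o[1] * dL1;  b[2] -= eta * dL1;
        w[5] -= eta * o[0] * dL2;  w[7] -= eta * o[1] * dL2;  b[3] -= eta * dL2;

        double costAfter = cost(forward(w, b, x1, x2), exp1, exp2);
        // costBefore ≈ 0.193750, costAfter ≈ 0.183940
        System.out.printf("cost before: %.6f, after: %.6f%n", costBefore, costAfter);
    }
}
```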
The next article shows how this looks in Java code. When you feel ready for it, please dive in: Part 3 – Implementation in Java.
Feedback is welcome!
This article has also been published in the Medium publication Towards Data Science. If you liked what you've just read, please head over to the Medium article and give it a few Claps. It will help others find it too. And of course I hope you spread the word in any other way you see fit. Thanks!
This is the first part in a series of articles:
Some weeks ago I decided to pick up on machine learning. My main interest is applied machine learning and the business opportunities and new software this paradigm might bring. I figured that a reasonable way forward would be to pick up a framework such as TensorFlow or DL4J and start playing around. So I did … and became frustrated. The reason is that just setting up a well-behaving neural network in any of these frameworks requires a fair amount of understanding of the concepts and the inner workings: activation functions, optimizers, regularization, dropout, learning-rate annealing, etc. – I was clearly groping in the dark.
I just had to get some deeper understanding of it all. Consequently I dived head first into the vast ocean of information that is the internet, and after a week of reading I unfortunately realized that not much of the information had turned into knowledge. I had to rethink.
Long story short: I decided to build a small neural network myself, as a get-the-knowledge-I-need playground.
It turned out fine. The practical/theoretical mix required when building from scratch was just the right way for me to get a deeper understanding. This is what I’ve learned.
From the outside a neural network is just a function. As such it can take an input and produce an output. This function is highly parameterized, and that is of fundamental importance.
Some of those parameters we set ourselves. Those parameters are called hyperparameters and can be seen as the configuration of our neural network. The majority of parameters, however, are inherent to the network and not something we control directly. To understand why those are important we need to take a look at the inside.
A simple neural network typically consists of a set of hidden layers each containing a number of neurons, marked yellow in the picture. The input layer is marked blue.
In the configuration above the input is a vector X of size 4 and output is a vector Y of size 3.
As you can see in the picture there is a connection from every neuron in one layer to every neuron in the next^{1}. Every such connection is in fact a parameter, or weight. In this example we already have 94 extra parameters in the form of weights. In bigger networks there can be magnitudes more. These weights define how the network behaves and how capable it is of transforming input into a desired output.
Before I explain how the network as a whole can be used to transform data we need to zoom in further. Say hello to a single neuron:
The input to every neuron is the weighted sum of the output from every neuron in the previous layer. In the example this would be:
$$\sum_{i=0}^n w_i \cdot x_i $$
To that we add a scalar value called a bias, b, which gives a total input to the neuron of:
$$z = \left(\sum_{i=0}^n w_i \cdot x_i\right) + b$$
A short side note: When considering the entire layer of neurons, we instead write this in vector form z = Wx + b, where z, x and b are vectors and W is a matrix^{2}.
 x contains all outgoing signals from the preceding layer.
 b contains all biases in the current layer.
 W contains all the weights for the connections between the preceding layer and the current one.
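The weighted sum plus bias above can be sketched in a few lines of Java. The names are illustrative, my own, not taken from the implementation later in the series:

```java
public class NeuronInput {
    // Total input to one neuron: z = (sum over i of w_i * x_i) + b
    static double neuronInput(double[] w, double[] x, double b) {
        double z = b;
        for (int i = 0; i < w.length; i++) {
            z += w[i] * x[i];
        }
        return z;
    }

    public static void main(String[] args) {
        // Example: two inputs with weights 0.3 and -0.4, and bias 0.25
        double z = neuronInput(new double[]{0.3, -0.4}, new double[]{2, 3}, 0.25);
        System.out.println(z); // ≈ -0.35
    }
}
```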
The input signal is then transformed within the neuron by applying something called an activation function, denoted σ. The name activation function stems from the fact that this function commonly is designed to let the signal pass through the neuron if the input signal z is big enough, but limit the output from the neuron if it is not. We can think of this as the neuron firing, or being active, if the stimulus is strong enough.
More importantly the activation function adds nonlinearity to the network which is important when trying to fit the network efficiently (by fit the network I mean train the network to produce the output we want). Without it the network would just be a linear combination of its input.
Quite often an activation function called Rectified Linear Unit, or ReLU, is used (or variants thereof). The ReLU is simply:
$$
σ(z) = max(z, 0)
$$
Another common activation function is the logistic function, called the sigmoid function:
$$
σ(z) ={\frac {1}{1+e^{-z}}}
$$
As you can see from the graphs they both behave the way I described: they let the signal pass if it is big enough, and limit it if not.
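Both activation functions are one-liners in Java. A minimal sketch:

```java
public class Activations {
    // Rectified Linear Unit: sigma(z) = max(z, 0)
    static double relu(double z) { return Math.max(z, 0.0); }

    // Logistic sigmoid: sigma(z) = 1 / (1 + e^(-z))
    static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

    public static void main(String[] args) {
        System.out.println(relu(-2.0));   // 0.0 -- negative signals are blocked
        System.out.println(relu(3.5));    // 3.5 -- positive signals pass through
        System.out.println(sigmoid(0.0)); // 0.5 -- the sigmoid's midpoint
    }
}
```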
Finally, after having applied the activation function to z, getting σ(z), we have the output from the neuron.
Or stated in vector form: after having applied the activation function to the vector z, getting σ(z) (where the function is applied to every element of z), we have the output from all the neurons in that layer.
Now we have all the bits and pieces to describe how to apply the entire neural network (i.e. “the function”) to some data x to get an output y = f(x).
We simply feed the data through every layer in the network. This is called feed forward and it works like this:
Let’s look at a simple example:
Assume that we have chosen the sigmoid function as activation in all layers:
$$σ(z) = {\frac {1}{1+e^{-z}}}$$
Now let us, layer by layer, neuron by neuron, calculate what output y this network would give on the input vector x = [2 3].
N1 = σ(0.3 · 2 + (−0.4) · 3 + 0.25) = σ(−0.35) = 0.41338242108267
N2 = σ(0.2 · 2 + 0.6 · 3 + 0.45) = σ(2.65) = 0.934010990508781
N3 = σ(0.7 · N1 + 0.5 · N2 + 0.15) = σ(0.90637319001226) = 0.712257432295742
N4 = σ((−0.3) · N1 + (−0.1) · N2 + 0.35) = σ(0.132584174624321) = 0.533097573871501
Hence, this network produces the output y = [0.712257… 0.533097…] on the input x = [2 3].
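The same forward pass can be reproduced in Java. This is a self-contained sketch of this specific 2-2-2 example network, with the weights and biases hard-coded as in the figures above:

```java
public class FeedForwardExample {
    static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

    // Feed x = [x1, x2] through the example network; returns y = [N3, N4]
    static double[] feedForward(double x1, double x2) {
        double n1 = sigmoid(0.3 * x1 + (-0.4) * x2 + 0.25);  // hidden neuron N1
        double n2 = sigmoid(0.2 * x1 + 0.6 * x2 + 0.45);     // hidden neuron N2
        double n3 = sigmoid(0.7 * n1 + 0.5 * n2 + 0.15);     // output neuron N3
        double n4 = sigmoid(-0.3 * n1 + (-0.1) * n2 + 0.35); // output neuron N4
        return new double[]{n3, n4};
    }

    public static void main(String[] args) {
        double[] y = feedForward(2, 3);
        System.out.println("y = [" + y[0] + ", " + y[1] + "]"); // ≈ [0.712257, 0.533098]
    }
}
```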
If we are lucky, or skilled at setting initial weights and biases, this might be exactly the output we want for that input. More likely it is not at all what we want. If the latter is the case we can adjust the weights and biases until we get the output we want.
Let’s for a while think about why a neural network is designed the way it is and why adjustments of weights and biases might be all that is needed to make the network behave more to our expectations.
The high parameterization of the network makes it very capable of mimicking almost any function. If the function we try to mimic is more complex than what is possible to express by the set of weights and biases we can just create a slightly larger network (deeper and/or wider) which will give us more parameters which in turn would be better to fit the network to the function we want.
Also note that, by the way a neural network is constructed, we are free to choose any dimension for the input and any dimension for the output. Often neural networks are designed to reduce dimensions – i.e. map points from a high-dimensional space to points in a low-dimensional space. This is typical for classification of data. I will get back to that with an example at the end of this article.
Now consider what happens in each neuron: σ(Wx + b) – i.e. we feed the activation function the signal from the preceding layer but we scale and translate it first. So, what does it mean to scale and translate an argument to a function? Think about it for a while.
For simplicity, let's see what happens in two dimensions when we scale the input to a function.
Here is what the sigmoid function looks like with an unscaled input, i.e. σ(x):
And here is what it looks like if we scale the input by a factor of 5, i.e. σ(5x). As you can see, and probably have guessed, scaling the input compresses or expands the function along the x-axis.
Finally, adding a scalar to the input moves the function along the x-axis. Here is σ(5x – 4):
So by scaling and translating the input to the activation function we can move it and stretch it.
Also remember that the output from one layer is the input to the next. That is, a curve (such as any of the ones above) produced in layer L will be scaled and translated before being fed to the activation function in layer L+1. So now we need to ask: what does it mean to scale and translate the output from a function in layer L? Scaling simply means altering the magnitude of that signal, i.e. stretching or compressing it along the y-axis. Translating obviously means moving it along the y-axis.
So what does this give us?
Although the discussion above does not prove anything, it strongly suggests that by altering the weights and biases of a neural network we can stretch and translate input values (even the individual vector components) in a nonlinear way pretty much to our liking.
Also consider that the depth of a network lets weights operate on different scales and with different contributions to the total function – i.e. an early weight alters the total function in broad strokes, while a weight just before the output layer operates on a more detailed level.
This gives the neural network very high expressiveness, but at the cost of having heaps of parameters to tune. Fortunately we do not have to tune these by hand. We can let the network self-adjust so that the output better meets our expectations. This is done via a process called gradient descent and backpropagation, which is the topic of the next article in this series, Part 2 – Gradient descent and backpropagation.
We have concluded that a neural network can mimic a mapping (a function) from a vector x to another vector y. At this point it is fair to ask: In what way does this help us?
One quite common way of using a neural network is to classify data it has never seen before. So, before wrapping up this already lengthy article, I will give a glimpse about a practical example of classification. An example that we will get back to in later articles.
When writing a neural network, one of the first tasks typically thrown at it is classification of handwritten digits (kind of the "Hello World" of machine learning). For this task there is a dataset of 60 000 images of handwritten digits called the MNIST dataset. The resolution is 28 x 28 and the color of each pixel is a greyscale value between 0 and 255. The dataset is labeled, which means that it is specified what digit each image represents.
If we flatten out each image – i.e. take each row of the image and put it in a long line – we get a vector of size 28 x 28 = 784. We also normalize the greyscale values to the range 0 to 1. Now it would be nice if we could feed this vector x to a neural network and get as output a vector y of size 10 telling us what digit the network thinks the input represents (i.e. each element of the output vector giving the probability the network assigns to the image being a zero, a one, a two, …, a nine). Since the MNIST dataset is labeled we can train this network, which essentially means: auto-adjust the weights and biases. The cool thing is that if the training is done right the network will be able to classify images of handwritten digits it has never seen before – something that would have been very hard to program declaratively.
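The flattening and normalization step might look like this in Java. This is only a sketch; reading the actual MNIST files requires parsing their binary format, which is omitted here:

```java
public class FlattenImage {
    // Flatten a 28x28 greyscale image (pixel values 0-255) into a
    // normalized input vector of length 784 (values 0.0-1.0)
    static double[] flatten(int[][] image) {
        double[] x = new double[28 * 28];
        for (int row = 0; row < 28; row++) {
            for (int col = 0; col < 28; col++) {
                x[row * 28 + col] = image[row][col] / 255.0; // normalize to [0, 1]
            }
        }
        return x;
    }

    public static void main(String[] args) {
        int[][] image = new int[28][28];
        image[0][1] = 255;            // one fully lit pixel, just for illustration
        double[] x = flatten(image);
        System.out.println(x.length); // 784
        System.out.println(x[1]);     // 1.0
    }
}
```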
How could this be?
I will try to explain that with some concluding food for thought: Each input vector x can be seen as a single point in a 784-dimensional space. Think about it. A vector of length 3 represents a point in 3D. A vector of length 784 then represents a point in 784D. Since each pixel value is normalized between 0 and 1 we know that all points in this dataset lie within the unit cube, i.e. between 0 and 1 on all 784 axes. It is reasonable to think that all points representing a digit N lie fairly close to each other in this space. For instance all images of the digit 2 will lie close to each other in some subspace, just as all 7:s will be close to each other, but in a different subspace. When designing this network we decided that the output should be a vector of size 10 where each component is a probability. This means the output is a point in a 10-dimensional unit cube. Our neural network maps points in the 784D cube to points in the 10D cube.
Now, what fitting the network really means in this particular case is finding those subspaces in the 784D input space and, with our neural network function, transforming (scaling, translating) them in such a way that they are clearly separable in 10D. For instance: we want all inputs showing the digit seven, however slightly they may differ, to output a vector y where the component representing the number seven is close to 1 and the other 9 components are close to 0.
I tend to think of the fitting process as shrink wrapping surfaces (hyperplanes) around those subspaces^{3}. If we do not shrink wrap those surfaces too hard (which would result in something called overfitting) it is quite likely that digits that the network has not yet seen still would end up within the correct subspace – in other words, the network would then be able to say: “Well, I have never seen this digit before but it is within the subspace which I consider to be number 7”.
And that is somewhat cool!
Ok, that will do for an intro. Feedback is welcome!
Now, on to the next article in this series, where you will learn how neural networks can be trained: Part 2 – Gradient descent and backpropagation.
Footnotes: