Part 3 – Implementation in Java

In this article you will see how the theories presented in previous two articles can be implemented in easy to understand java code. The full neural network implementation can be downloaded, inspected in detail, built upon and experimented with.

This is the third part in a series of articles:

I assume you have read both previous articles and have a fairly good understanding about the forward pass and the learning/training pass of a neural network.

This article will be quite different. It will show and describe it all in a few snippets of java code.

All my code – a fully working neural net implementation – can be found, examined & downloaded here.


Normally when writing software I use a fair amount of open source – as a way to get good results faster and to be able to use fine abstractions someone else put a lot of thought into.

This time it is different though. My ambition is to show how a neural network works with a minimal requirement on your part in terms of knowing other open source libraries. If you manage to read Java 8 code you should be fine. It is all there in a few short and tidy files.

A consequence of this ambition is that I have not even imported a library for Linear Algebra. Instead two simple classes have been created with just the operations needed – meet the usual suspects: the Vec and the Matrix class. These are both very ordinary implementations and contain the typical arithmetic operations additions, subtractions, multiplications and dot-product.

The fine thing about vectors and matrices is that they often can be used to make code expressive and tidy. Typically when you are faced with doing operation on every element from one set with every other on the other set, like for instance weighted sums over a set of inputs, there is a good chance you can arrange your data in vectors and/or matrices and get a very compact and expressive code. A neural network is no exceptions to this1.

I too have collapsed the calculations as much as possible by using objects of types Vec and Matrix. The result is neat and tidy and not far from mathematical expressions. No for-loops in for-loops are blurring the top view. However, when inspecting the code I encourage you to make sure that any neat looking call on objects of type Vec or Matrix in fact results in exactly the series of arithmetic operations which were defined in part 1 and part 2.

Two operations on the Vec class is not as common as the typical ones I mentioned above. Those are:

  • The outer product (Vec.outerProduct()) which is the element wise multiplication of a column vector and row vector resulting in a matrix.
  • The Hadamard product (Vec.elementProduct()) which simply is the element wise multiplication of two vectors (both being column or row vectors) resulting in a new vector.


A typical NeuralNetwork has many Layers. Each Layer can be of any size and contains the weights between the preceding layer and itself as well as the biases. Each layer is also configured with an Activation and an Optimizer (more on what that is later).

Whenever designing something that should be highly configurable the builder pattern is a good choice2. This gives us a very straightforward and readable way to define and create neural networks:

To use such a network you typically either feed a single input vector to it or you also add an expected result in the call to evaluate()-method. When doing the latter the network will observe the difference between the actual output and the expected and store an impression of this. In other words, it will backpropagate the error. For good reasons (which I will get back to in the next article) the network does not immediately update its weights and biases. To do so – i.e. let any impressions “sink in” – a call to updateFromLearning() has to be made.

For example:

Feed forward

Let’s dive into the feed forward pass – the call to evaluate(). Please note in the marked lines below that the input data is passed through layer by layer:

Within the layer the input vector is multiplied with the weights and then the biases is added. That is in turn used as input to the activation function. See the marked line. The layer also stores the output vector since it is used in the backpropagation.

(The reason there are get and and set operation on the out variable I will get back to in next article. For now just think of it as a variable of type Vec)

Activation functions

The code contains a few predefined activation functions. Each of these contain both the actual activation function, σ, as well as the derivative, σ’.

Cost functions

Also there are a few cost functions included. The Quadratic looks like this:

A cost function has one method for calculating the total cost (as a scalar) but also the important differentiation of the cost function to be used in the …


As I mentioned above: if an expected outcome is passed to the evaluate function the network will learn from it. See the marked lines.

In the learnFrom()-method the actual backpropagation happens. Here you should be able to follow the steps from part 2 in detail in code. It is somewhat soothing too see that the rather lengthy mathematical expressions from part 2 just boils down to this:

Please note that the learning (the partial derivatives) in the backpropagation is stored per layer by a call to addDeltaWeightsAndBiases()-method.

Not until a call to updateFromLearning()-method has been made the weights and biases change:

The reason why this is designed as two separate steps is that it allows the network to observe a lot of samples and learn from these … and then only finally update the weights and biases as an average of all observations. This is in fact call Batch Gradient Descent or Mini Batch Gradient Descent. We will get back to these variants in the next article. For now you can as well call updateFromLearning after each call to evaluate (with expectations) to make the network improve after each sample. That, to update the network after each sample, is called Stochastic Gradient Descent.

This is what the updateWeightsAndBias()-method looks like. Notice that an average of all calculated changes to the weights and biases is fed into the two methods updateWeights() and updateBias() on the optimizer object.

We have not really talked about the concept of optimizers yet. We have only said that the weights and biases are updated by subtracting the partial derivatives scaled with some small learning rate – i.e:

$$w^+=w – \eta \frac {\partial C}{\partial w}$$

And yes, that is a good way to do it and it is easy to understand. However, there are other ways. A few of these are a bit more complicated but offers faster convergence. The different strategies on how to update weights and biases is often referred to as Optimizers. We will see another way to do it in the next article. As a consequence it is reasonable to leave this as a pluggable strategy that the network can be configured to use.

For now we just use the simple GradientDescent strategy for updating our weights and biases:


And that’s pretty much it. With a few lines of code we have the possibility to configure a neural network, feed data trough it and make it learn how to classify unseen data. This will be shown in Part 5 -Training the network to read handwritten digits.

But first, let’s pimp this up a few notches. See you in Part 4 – Better, faster, stronger.

Feedback is welcome!


This article has also been published in the Medium-publication Towards Data Science. If you liked what you’ve just read please head over to the medium-article and give it a few Claps. It will help others finding it too. And of course I hope you spread the word in any other way you see fit. Thanks!



  1. Quite the contrary: the reason it is possible to speed up learning immensely on GPUs and TPUs is the huge amount of vector and matrix operations within the feed forward and backpropagation.
  2. Most code examples I have found explaining how a neural network works has quite an awful design. Almost without exception they define the network with a number of badly named multidimensional arrays with constants defining their sizes. A very static design not at all inviting you to try out different network setups.

Tobias Hill


One thought on “Part 3 – Implementation in Java

  1. Shouldn’t it be e.g. x -> x * (1.0 – x) for the derivative of the sigmoid function because you put the output of the layer after already having applied the activation function as x? The same applies to the other activation functions. In the case of Leaky ReLU it does make no difference because this function keeps the sign and softmax is defined separately.

Comments are closed.