Deep Neural Networks in Swift, lessons learned


I have recently become very hyped about Deep Learning in general. I have read a lot about it (though as a Swift developer I probably still know nothing) and I've come to the following conclusion:

“To learn and understand neural networks you have to implement them yourself from scratch.”

— people on the Internet

And I did so. My first few successful attempts were written in Python, which is the leading language in that area. There are a lot of materials, courses and books in case you are interested, and the community is there to help. But I felt Python was too much of an easy mode. You have all the support you need, especially with amazing libraries like numpy, pandas or matplotlib and the Jupyter notebook environment. Then I thought: why not do it in Swift? What could go wrong?

Problems

Vectors, Matrices, Derivatives

Like the rest of machine learning, neural networks are computer science + math. If you want to know what's going on inside, you cannot avoid them. There are some popular frameworks that abstract the math away, like Keras, but the more you know, the better you can get (oh, thank you, Captain Obvious). As there is no serious matrix calculation library in Swift, I decided to write my own. It is probably not the most efficient or the fastest, but I learned a ton in the process. You can find it on GitHub. It should work both on iOS and macOS, but no guarantees; it is very raw at the moment. I was loosely inspired by Python's numpy. I wanted a very clear API that is as close to mathematical notation as possible. Here is an example of some of the things you can do with it:
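A rough sketch of the kind of usage I had in mind follows; the exact initializers and operator names may differ from what Matswift actually exposes, so treat this as an illustration rather than the library's API:

```swift
// A rough sketch of the numpy-inspired usage I was aiming for.
// Initializer and operator names are approximate; check the Matswift README for the real API.
let a = Matrix(values: [[1.0, 2.0],
                        [3.0, 4.0]])
let b = Matrix(values: [[0.5, 0.5],
                        [0.5, 0.5]])

let sum = a + b               // element-wise addition
let product = a.dot(b)        // matrix multiplication
let scaled = a * 2.0          // scalar multiplication
let transposed = a.transposed // transpose, mimicking numpy's a.T
```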

It might be useful for building proofs of concept of machine learning algorithms.

Environment

At the beginning, I wanted to tackle some popular but reasonably large dataset. The work environment was the major pain point for me. As Swift is a compiled language, everything compiles at once. Processing and shuffling the MNIST dataset took something like 8-10 minutes, which is a problem if you want to iterate fast, look out for bugs and tune hyperparameters. I eventually decided to use a small dataset. But I missed Jupyter notebooks and Python in this kind of R&D exploration: there, the dataset could just sit in memory (if it fit) while I developed or tuned the network.

Debugging

I find debugging neural networks notoriously hard, especially when I couldn't tell whether the library I wrote for the calculations is 100% correct. Yes, it is tested, but still. Maybe it will get easier with experience. Once your matrix sizes are okay, you are left with a large number of very small floating point values (weights, biases, derivatives). I found the cost to be a pretty reasonable indicator of network health: in general, it should go down, otherwise something could be wrong. What I should have done is gradient checking, and I definitely will before developing my next neural network. But that's a subject for another post.

Implementation

In a nutshell, a neural network consists of forward and backward propagation. The former is used for making predictions, the latter for computing the weight updates. I don't feel qualified yet to explain what a neural network is and how it works; the Internet is quite saturated with blog posts, books and courses about it, so you'll have no trouble finding out. Instead, I wanted to focus on nice code and a nice API, which is often omitted in such posts. Here is what I wanted to create, inspired by the Keras API.

https://gist.github.com/jknthn/cf53f93b1e84981b7a387332ff2e916d#file-nngoal-swift
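In case the gist does not render, here is roughly the shape of that goal. This is a hedged sketch: the layer sizes, activations and method names are illustrative, not the final code from the gist.

```swift
// Illustrative sketch of the Keras-inspired API I was aiming for.
let network = DeepNeuralNetwork(learningRate: 0.01, iterations: 5000)

network.add(layer: Layer.input(neurons: 4))
network.add(layer: Layer.fullyConnected(neurons: 8, activation: .relu))
network.add(layer: Layer.fullyConnected(neurons: 1, activation: .sigmoid))

network.compile()                 // wire layers together and initialize weights
network.fit(X: trainX, y: trainY) // trainX/trainY: Matrix values loaded from the dataset
network.test(X: testX, y: testY)
```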

I decided to implement it with 2 classes: DeepNeuralNetwork and Layer. The network contains Layers and functions which allow you to train and test it. The Layer knows how to do forward and backward propagation, update parameters and cache values, and is aware of its neighbouring layers.

Layer Initialization

Before I could use a layer I had to initialize its parameters. But there is not enough information to do that until the model of the neural network is complete. First, I created the Layer class with an init and convenience static methods for verbose creation of the 2 kinds of layers, input and fully connected.
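A minimal sketch of that shape might look like this; the property and method names are assumptions on my part, not the exact code from the project:

```swift
// Sketch of the Layer class and its convenience constructors (names approximate).
final class Layer {
    let neurons: Int
    let activation: Activation?

    var weights: Matrix?
    var biases: Matrix?

    // Caches filled during forward and backward propagation.
    var cachedA: Matrix?
    var cachedZ: Matrix?
    var cachedDZ: Matrix?
    var cachedDW: Matrix?
    var cachedDb: Matrix?

    var previousLayer: Layer?
    var nextLayer: Layer?

    init(neurons: Int, activation: Activation?) {
        self.neurons = neurons
        self.activation = activation
    }

    // The input layer has no activation and no trainable parameters.
    static func input(neurons: Int) -> Layer {
        return Layer(neurons: neurons, activation: nil)
    }

    // A fully connected layer with a chosen activation function.
    static func fullyConnected(neurons: Int, activation: Activation) -> Layer {
        return Layer(neurons: neurons, activation: activation)
    }
}
```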

I use the initialize function after the neural network graph is built to create the initial weights and biases. The weights matrix is created with small random values to break the network's symmetry (so the network can actually learn). Biases initially are just zeros.
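A hedged sketch of that initialization; Matrix.random and Matrix.zeros are assumed helpers, not necessarily Matswift's real API:

```swift
extension Layer {
    // Called once the whole graph is wired up, so the previous layer's size is known.
    func initialize() {
        guard let previous = previousLayer else { return } // input layer has no parameters

        // Small random weights break the symmetry between neurons.
        weights = Matrix.random(rows: neurons, columns: previous.neurons) * 0.01
        // Biases can simply start at zero.
        biases = Matrix.zeros(rows: neurons, columns: 1)
    }
}
```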

Layer Forward Propagation

Forward propagation is the simplest part of a network. For each layer (excluding the input, which I counted as an actual layer), calculate the dot product of the weights matrix and the output of the previous layer, usually called A, and add the biases vector. Variables A and Z are cached in the layer so they can be used in backpropagation. The else case is for the first layer, which just passes values through and caches them. It's not yet typesafe, but I will work on that. Activation is an enum which contains all implemented activation functions with 2 properties (which are actually (Matrix) -> Matrix closures): the forward variable applies the activation function to the input, and backward computes its derivative.
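A sketch of that forward pass, assuming the Matrix helpers from above (map, dot and + are assumptions about the library's element-wise and matrix operations):

```swift
import Foundation

// Sketch of the Activation enum: each case exposes a forward closure and its derivative.
enum Activation {
    case relu
    case sigmoid

    var forward: (Matrix) -> Matrix {
        switch self {
        case .relu:    return { $0.map { max(0.0, $0) } }
        case .sigmoid: return { $0.map { 1.0 / (1.0 + exp(-$0)) } }
        }
    }

    var backward: (Matrix) -> Matrix {
        switch self {
        case .relu:    return { $0.map { $0 > 0.0 ? 1.0 : 0.0 } }
        case .sigmoid: return { z in
            z.map { x -> Double in
                let s = 1.0 / (1.0 + exp(-x))
                return s * (1.0 - s)
            }
        }
        }
    }
}

extension Layer {
    // Z = W · A_prev + b, then A = g(Z); both are cached for backpropagation.
    func forward(input: Matrix) -> Matrix {
        guard let weights = weights, let biases = biases, let activation = activation else {
            // Input layer: just pass the values through and cache them.
            cachedA = input
            return input
        }
        let z = weights.dot(input) + biases
        let a = activation.forward(z)
        cachedZ = z
        cachedA = a
        return a
    }
}
```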

Layer Backpropagation

So it begins. Backpropagation is the reason why neural networks are so hot and why they actually work pretty well. If there is a bug in my code, it's most probably here. Again, I needed to cache derivatives for the other layers to use and to update parameters. I anticipated 3 cases. When the layer has both a previousLayer and a nextLayer, it's one of the hidden layers, and it performs its calculations using dZ from the nextLayer (which comes first in backpropagation). The derivatives dW and db are needed to update the layer's weights and biases so the network can actually learn. When there is no nextLayer and the y matrix is passed to the function, the current layer is the output layer; it has to calculate the post-activation gradient, which is the base for the next steps of the algorithm. The real values y and the predicted yHat are used to calculate the error and start the propagation throughout the network. When there is no previousLayer, the current layer is an input layer without weights and there is no need to backpropagate through it at all.
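A hedged sketch of those three cases, again assuming the Matrix helpers used above (dot, transposed, sumRows and element-wise * are assumptions); the real code surely differs in details:

```swift
extension Layer {
    // Computes and caches dZ, dW and db for this layer.
    func backward(y: Matrix? = nil) {
        guard let previous = previousLayer, let previousA = previous.cachedA else {
            return // input layer: no weights, nothing to backpropagate through
        }

        let dZ: Matrix
        if let y = y, let yHat = cachedA, nextLayer == nil {
            // Output layer: post-activation gradient from the real values y and predictions yHat.
            // With a sigmoid output and cross-entropy cost this simplifies to (yHat - y).
            dZ = yHat - y
        } else if let next = nextLayer, let nextDZ = next.cachedDZ,
                  let nextWeights = next.weights, let z = cachedZ, let activation = activation {
            // Hidden layer: propagate dZ coming from the next layer (which runs first),
            // multiplied element-wise by the derivative of this layer's activation.
            dZ = nextWeights.transposed.dot(nextDZ) * activation.backward(z)
        } else {
            return
        }

        let m = Double(dZ.columns)
        cachedDW = dZ.dot(previousA.transposed) * (1.0 / m) // gradient for the weights
        cachedDb = dZ.sumRows() * (1.0 / m)                 // gradient for the biases
        cachedDZ = dZ
    }
}
```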

Update Weights

For the network to actually learn something, I had to update its weights and biases. That's how the network learns: it fits its function to the data.
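In code, the update is a plain gradient descent step using the gradients cached during backpropagation (a sketch, following the naming used above):

```swift
extension Layer {
    // One gradient descent step: move the parameters against the cached gradients.
    func updateParameters(learningRate: Double) {
        guard let weights = weights, let biases = biases,
              let dW = cachedDW, let db = cachedDb else { return }

        self.weights = weights - dW * learningRate
        self.biases = biases - db * learningRate
    }
}
```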

DeepNeuralNetwork Initialization

Before I could assemble everything together, I had to create the DeepNeuralNetwork class with some hyperparameters and the ability to add layers and create weights. The good thing is that Layer does the heavy lifting there, while DeepNeuralNetwork is just a manager. As for the hyperparameters, I chose 2 to begin with: the number of iterations and the learning rate. The compile function iterates through the layer structure and initializes each layer with its neighbours.
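A minimal sketch of that manager class, assuming the Layer type from above:

```swift
// Sketch of the manager class; the Layers do the heavy lifting.
final class DeepNeuralNetwork {
    let learningRate: Double
    let iterations: Int
    private(set) var layers: [Layer] = []

    init(learningRate: Double, iterations: Int) {
        self.learningRate = learningRate
        self.iterations = iterations
    }

    func add(layer: Layer) {
        layers.append(layer)
    }

    // Wires each layer to its neighbours, then lets it create its weights and biases.
    func compile() {
        for (index, layer) in layers.enumerated() {
            layer.previousLayer = index > 0 ? layers[index - 1] : nil
            layer.nextLayer = index < layers.count - 1 ? layers[index + 1] : nil
        }
        layers.forEach { $0.initialize() }
    }
}
```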

Helper Methods

Right now, each Layer can go forward and backward on its own, but the network has to work as a whole. The functions layersForward, layersBackward and layersUpdate iterate through every layer and perform those operations for the whole network. The cost function is cross-entropy, which is my objective to minimise here.
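A sketch of those helpers and of the cross-entropy cost, assuming element-wise Matrix operators and a sum() helper (both assumptions about the library):

```swift
extension DeepNeuralNetwork {
    // Forward propagation through every layer; the final output is yHat.
    func layersForward(_ input: Matrix) -> Matrix {
        return layers.reduce(input) { output, layer in layer.forward(input: output) }
    }

    // Walk the layers in reverse; only the output layer receives the real labels y.
    func layersBackward(y: Matrix) {
        for (index, layer) in layers.enumerated().reversed() {
            layer.backward(y: index == layers.count - 1 ? y : nil)
        }
    }

    // Gradient descent step for every layer.
    func layersUpdate() {
        layers.forEach { $0.updateParameters(learningRate: learningRate) }
    }

    // Cross-entropy cost: -1/m * Σ [ y·log(yHat) + (1 − y)·log(1 − yHat) ]
    func cost(yHat: Matrix, y: Matrix) -> Double {
        let m = Double(y.columns)
        let losses = y * yHat.map { log($0) } + (1.0 - y) * (1.0 - yHat).map { log($0) }
        return -losses.sum() / m
    }
}
```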

Fit And Test

I know it’s a lot to take in without even touching the maths. Final methods of the DeepNeuralNetwork class. The fit function is used to actually teach the network with given amount of iterations. For each one I did in order to train the network:

  • Ran forward propagation to obtain predictions
  • Calculated the cost function
  • Ran backpropagation with the predicted values
  • Updated each layer's weights and biases
  • Printed the cost every 100 iterations to see how training was going

The test method is just forward propagation, where I print the total accuracy and each example's prediction + real value to check whether they match. Both methods are sketched below.
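A sketch of fit and test built on the helpers above; the accuracy computation and the 0.5 decision threshold for the binary case are assumptions:

```swift
extension DeepNeuralNetwork {
    // Train: forward pass, cost, backpropagation, parameter update, repeated.
    func fit(X: Matrix, y: Matrix) {
        for iteration in 0..<iterations {
            let yHat = layersForward(X)              // 1. predictions
            let currentCost = cost(yHat: yHat, y: y) // 2. cross-entropy cost
            layersBackward(y: y)                     // 3. backpropagation
            layersUpdate()                           // 4. update weights and biases

            if iteration % 100 == 0 {
                print("Iteration \(iteration), cost: \(currentCost)") // 5. progress check
            }
        }
    }

    // Test: a single forward pass, printing predictions next to the real labels.
    func test(X: Matrix, y: Matrix) {
        let yHat = layersForward(X)
        print("Predicted: \(yHat)")
        print("Expected:  \(y)")
    }
}
```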

Result

The goal of my work and of this blog post was to learn about and implement a simple, fully connected deep neural network. I managed to complete this objective, though it took me a lot of time. Here is the result of training for 5000 iterations on the Iris dataset.

Although the Iris dataset contains 3 classes, I simplified my task a little bit by only checking whether a flower is Iris-setosa or not. I managed to get ~96% accuracy on the training set, and 100% on the test set. The dataset itself is quite linearly separable, hence the high accuracy of my network.

It is possible to implement machine learning algorithms in Swift, even though it is not very easy right now. As Swift matures, I'm pretty sure there could be some space for doing data science in it, especially with the development of environments like Playgrounds. Right now, however, it is no match for the possibilities offered by Python and its community.

For me, it was a great adventure. As Swift is my main language, I understand neural networks better after this implementation. I am going to pursue it a little further, adding more features to my neural network and maybe open sourcing it in the future.


  • Demo project – everything happens in the console
  • Matswift