Lately I have been very hyped up about deep learning in general. I have read a lot about it (though, for a Swift developer, I probably still know next to nothing) and I’ve come to the following conclusion:

“To learn and understand neural networks you have to implement them yourself from scratch.”

— people on the Internet

And I did so. My first few successful attempts were written in Python, which is the leading language in that area. There are plenty of materials, courses and books if you are interested, and the community is there to help. But Python felt like playing on easy mode. You have all the support you need, especially with amazing libraries like numpy, pandas or matplotlib and the Jupyter notebook environment. Then I thought, why not do it in Swift? What could go wrong?

## Problems

### Vectors, matrices, derivatives

Like the rest of machine learning, neural networks are computer science + math. If you want to know what’s going on inside, you cannot avoid either of them. There are some popular frameworks, like Keras, that abstract the details away, but the more you know, the better you can get (oh, thank you Captain Obvious). As there is no serious matrix calculation library in Swift, I decided to write my own. It is probably not the most efficient or the fastest, but I learned a ton in the process. You can find it on GitHub. It should work on both `iOS` and `macOS`, but no guarantees. It is very raw at the moment. I was loosely inspired by Python’s numpy. I wanted a very clear API that is as close to mathematical notation as possible. Here is an example of some of the things you can do with it:
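As a rough illustration only, this is what a small numpy-inspired matrix API could look like in Swift. The `Matrix` type, its initializer, the `transposed` property and the operators are hypothetical names chosen for this sketch; they are not taken from the library itself.

```swift
// Hypothetical minimal matrix type (an assumption, not the library's API).
struct Matrix: Equatable {
    let rows: Int
    let columns: Int
    var values: [Double]   // stored in row-major order

    init(_ elements: [[Double]]) {
        let width = elements.first?.count ?? 0
        precondition(width > 0, "matrix must not be empty")
        precondition(elements.allSatisfy { $0.count == width }, "rows must have equal length")
        rows = elements.count
        columns = width
        values = elements.flatMap { $0 }
    }

    subscript(row: Int, column: Int) -> Double {
        get { return values[row * columns + column] }
        set { values[row * columns + column] = newValue }
    }

    // Swaps rows and columns.
    var transposed: Matrix {
        let zeros = [[Double]](repeating: [Double](repeating: 0, count: rows), count: columns)
        var result = Matrix(zeros)
        for r in 0..<rows {
            for c in 0..<columns {
                result[c, r] = self[r, c]
            }
        }
        return result
    }

    // Element-wise addition of two matrices with the same shape.
    static func + (lhs: Matrix, rhs: Matrix) -> Matrix {
        precondition(lhs.rows == rhs.rows && lhs.columns == rhs.columns, "shape mismatch")
        var result = lhs
        for i in result.values.indices {
            result.values[i] += rhs.values[i]
        }
        return result
    }

    // Matrix product (rows of lhs times columns of rhs).
    static func * (lhs: Matrix, rhs: Matrix) -> Matrix {
        precondition(lhs.columns == rhs.rows, "inner dimensions must match")
        let zeros = [[Double]](repeating: [Double](repeating: 0, count: rhs.columns), count: lhs.rows)
        var result = Matrix(zeros)
        for i in 0..<lhs.rows {
            for j in 0..<rhs.columns {
                for k in 0..<lhs.columns {
                    result[i, j] += lhs[i, k] * rhs[k, j]
                }
            }
        }
        return result
    }
}

let a = Matrix([[1, 2], [3, 4]])
let b = Matrix([[0, 1], [1, 0]])
let sum = a + b
let product = a * b
let flipped = a.transposed
```

Overloading `+` and `*` keeps the call site close to the mathematical notation, at the cost of hiding the shape checks inside the operators.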

It might be useful for building proofs of concept of machine learning algorithms.

### Environment

At the beginning, I wanted to tackle some popular but reasonably large dataset. The work environment was the major pain for me. As `Swift` is a compiled language, everything compiles at once. Processing and shuffling the MNIST dataset took around 8-10 minutes, which is a problem if you want to iterate fast, hunt for bugs and tune hyperparameters. I eventually decided to use a small dataset. But I miss Jupyter notebooks and Python in such R&D explorations: there, the dataset can just sit in memory (if it fits) while I develop or tune the network.

### Debugging

I find debugging neural networks notoriously hard, especially when I can’t tell whether the library I wrote for the calculations is 100% correct. Yes, it is tested, but still. Maybe it will get easier with experience. Once your matrix sizes are okay, you are left with a large number of very small floating point values (weights, biases, derivatives). I found the cost to be a pretty reasonable indicator of network health. In general, it should go down; otherwise something is probably wrong. What I should have done is gradient checking, and I definitely will before developing my next neural network. But that’s a subject for another post.
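For readers who have not met gradient checking before, here is a minimal sketch of the idea: compare a hand-derived gradient with a central-difference estimate and look at the relative error. The function names and the toy cost function are assumptions made for this example; a real check would flatten the network's weights and biases into one vector and use the network's own cost.

```swift
import Foundation

// Central-difference estimate of the gradient of `f` at `x`.
func numericalGradient(of f: ([Double]) -> Double, at x: [Double], epsilon: Double = 1e-5) -> [Double] {
    var gradient = [Double](repeating: 0, count: x.count)
    for i in x.indices {
        var plus = x
        var minus = x
        plus[i] += epsilon
        minus[i] -= epsilon
        gradient[i] = (f(plus) - f(minus)) / (2 * epsilon)
    }
    return gradient
}

// Euclidean norm of a vector.
func norm(_ v: [Double]) -> Double {
    var sumOfSquares = 0.0
    for value in v {
        sumOfSquares += value * value
    }
    return sqrt(sumOfSquares)
}

// Relative difference ||a - b|| / (||a|| + ||b||); small values mean the gradients agree.
func relativeError(_ a: [Double], _ b: [Double]) -> Double {
    precondition(a.count == b.count, "gradients must have the same length")
    var difference = [Double](repeating: 0, count: a.count)
    for i in a.indices {
        difference[i] = a[i] - b[i]
    }
    let denominator = norm(a) + norm(b)
    return denominator == 0 ? 0 : norm(difference) / denominator
}

// Toy cost f(w) = w0^2 + 3 * w1, whose analytic gradient is [2 * w0, 3].
let point = [1.5, -2.0]
let toyCost: ([Double]) -> Double = { w in w[0] * w[0] + 3 * w[1] }
let analyticGradient = [2 * point[0], 3.0]
let numericGradient = numericalGradient(of: toyCost, at: point)
let gradientError = relativeError(analyticGradient, numericGradient)
print("relative error:", gradientError)
```

A large relative error usually points at a bug in the backward pass, which is exactly the part that is hardest to verify by eye.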

## Implementation

In a nutshell, a neural network consists of forward and backward propagation. The former is used for making predictions, the latter for computing the weight updates. I don’t feel qualified yet to explain what a neural network is and how it works. The Internet is quite saturated with blog posts, books and courses about it, so you’ll have no trouble finding out more. Instead, I wanted to focus on nice code and a nice API, which are often neglected in such posts. Here is what I wanted to create, inspired by the Keras API.

I decided to implement it with 2 classes: `DeepNeuralNetwork` and `Layer`. The network contains layers and the functions that allow you to train and test it. A `Layer` knows how to do forward and backward propagation, update its parameters and cache values, and it is aware of its neighbouring layers.

### Layer initialization

Before I could use a layer I had to initialize its parameters. But there is not enough information to do that until the model of the neural network is complete. First, I created the `Layer` class with an init and convenience static methods for verbose creation of 2 kinds of layers: the input layer and the fully connected layer.

I use the `initialize` function after the neural network graph is built to create the initial weights and biases. The weights matrix is filled with small random values to break the network’s symmetry (so the network can actually learn). Biases are initially just zeros.
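A minimal sketch of that initialization, assuming weights drawn uniformly from a small range around zero. The `LayerParameters` type, the `makeParameters` function and the `scale` value are assumptions for this example; the post does not say which distribution or scale was actually used.

```swift
// Hypothetical container for one layer's parameters.
struct LayerParameters {
    var weights: [[Double]]   // shape: units x inputs
    var biases: [Double]      // one bias per unit
}

// Small random weights break the symmetry between units; biases start at zero.
func makeParameters(units: Int, inputs: Int, scale: Double = 0.01) -> LayerParameters {
    var weights: [[Double]] = []
    for _ in 0..<units {
        var row: [Double] = []
        for _ in 0..<inputs {
            // Assumed distribution: uniform in [-scale, scale].
            row.append(Double.random(in: -1...1) * scale)
        }
        weights.append(row)
    }
    let biases = [Double](repeating: 0, count: units)
    return LayerParameters(weights: weights, biases: biases)
}

let parameters = makeParameters(units: 4, inputs: 3)
```

If every weight started at the same value, every unit in a layer would compute the same output and receive the same gradient, so the units would never become different from each other.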

### Layer forward propagation

Forward propagation is the simplest part of a network. For each layer (excluding the input, which I counted as an actual layer), calculate the dot product of the `weights` matrix and the output of the previous layer, usually called `A`, and add the `biases` vector. The result is `Z`, and applying the activation function to `Z` gives the layer’s own output `A`. Variables `A` and `Z` are cached in the layer for use in backpropagation. The `else` case is for the first layer, which just passes the values through and caches them. It’s not yet typesafe, but I will work on it. `Activation` is an `enum` which contains all the implemented activation functions, with 2 properties (which are actually `(Matrix) -> Matrix` closures). The `forward` variable applies the activation function to its input, and `backward` computes its derivative.
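The forward step described above can be sketched as follows for a single example. To stay self-contained, this sketch uses plain `[Double]` arrays instead of the `Matrix` type, so the closures here are `([Double]) -> [Double]`; the `layerForward` function and the two activation cases are assumptions for illustration, not the original code.

```swift
import Foundation

// Activation functions with a forward closure and a derivative closure.
enum Activation {
    case sigmoid
    case relu

    // Applies the activation element-wise.
    var forward: ([Double]) -> [Double] {
        switch self {
        case .sigmoid:
            return { z in z.map { (value: Double) -> Double in 1 / (1 + exp(-value)) } }
        case .relu:
            return { z in z.map { (value: Double) -> Double in max(0, value) } }
        }
    }

    // Element-wise derivative of the activation with respect to its input.
    var backward: ([Double]) -> [Double] {
        switch self {
        case .sigmoid:
            return { z in
                z.map { (value: Double) -> Double in
                    let s = 1 / (1 + exp(-value))
                    return s * (1 - s)
                }
            }
        case .relu:
            return { z in z.map { (value: Double) -> Double in value > 0 ? 1 : 0 } }
        }
    }
}

// Computes Z = W * input + b and A = activation(Z) for one example.
func layerForward(weights: [[Double]], biases: [Double], input: [Double],
                  activation: Activation) -> (z: [Double], a: [Double]) {
    precondition(weights.count == biases.count, "one bias per unit expected")
    var z = biases
    for (i, row) in weights.enumerated() {
        precondition(row.count == input.count, "weight row must match input size")
        for (j, weight) in row.enumerated() {
            z[i] += weight * input[j]
        }
    }
    return (z: z, a: activation.forward(z))
}

let output = layerForward(weights: [[1, -1], [0.5, 0.5]],
                          biases: [0, 1],
                          input: [2, 3],
                          activation: .relu)
print(output.z, output.a)
```

Returning both `z` and `a` mirrors the caching mentioned above: the backward pass needs `Z` for the activation derivative and the previous layer's `A` for the weight gradient.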

### Layer backpropagation

So it begins. Backpropagation is the reason why neural networks are so hot, and why they actually work pretty well. If there is a bug in my code, it’s most probably here. Again, I needed to cache derivatives for the other layers to use and for updating the parameters. I anticipated 3 cases. When the layer has both a `previousLayer` and a `nextLayer`, it’s one of the hidden layers, and it performs its calculations using `dZ` from the `nextLayer` (which comes first in backpropagation). Derivatives `dW` and `db` are needed to update the layer’s `weights` and `biases` so the network can actually learn. When there is no `nextLayer` and the `y` matrix is passed to the function, the current layer is the output layer, which has to calculate the post-activation gradient that serves as the base for the next steps of the algorithm. The real values `y` and the predicted `yHat` are used to calculate the error and start the propagation back through the network. When there is no `previousLayer`, the current layer is the input layer, which has no weights, so there is no need to backpropagate through it at all.
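For reference, the three cases above correspond to the textbook gradient formulas for a fully connected layer. This is a sketch under stated assumptions, not the post's code: the cost is binary cross-entropy averaged over $m$ examples, examples are stored as columns, $g^{[l]}$ is the activation of layer $l$, $\odot$ is the element-wise product, and the $1/m$ factor is placed in $dW$ and $db$ by convention.

```latex
\begin{aligned}
dA^{[L]} &= -\left(\frac{y}{\hat{y}} - \frac{1 - y}{1 - \hat{y}}\right)
  && \text{(output layer, from } y \text{ and } \hat{y}\text{)} \\
dA^{[l]} &= W^{[l+1]\top}\, dZ^{[l+1]}
  && \text{(hidden layer, from the next layer's } dZ\text{)} \\
dZ^{[l]} &= dA^{[l]} \odot g'^{[l]}\!\left(Z^{[l]}\right) \\
dW^{[l]} &= \frac{1}{m}\, dZ^{[l]} A^{[l-1]\top} \\
db^{[l]} &= \frac{1}{m} \sum_{i=1}^{m} dZ^{[l](i)}
\end{aligned}
```

The cached `A` of the previous layer and the cached `Z` of the current layer are exactly the values these formulas need.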

### Update weights

For the network to actually learn something, I had to update its weights and biases using the derivatives computed during backpropagation. That’s how the network learns: step by step, it fits its function to the data.
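A minimal sketch of one such update, assuming plain gradient descent (each parameter moves against its gradient, scaled by the learning rate). The function name and the flat `[Double]` representation are assumptions for this example.

```swift
// One gradient descent step: parameter = parameter - learningRate * gradient.
func gradientDescentStep(parameters: [Double], gradients: [Double], learningRate: Double) -> [Double] {
    precondition(parameters.count == gradients.count, "one gradient per parameter expected")
    var updated = parameters
    for i in updated.indices {
        updated[i] -= learningRate * gradients[i]
    }
    return updated
}

let updatedWeights = gradientDescentStep(parameters: [0.5, -0.25],
                                         gradients: [1.0, -2.0],
                                         learningRate: 0.25)
print(updatedWeights)
```

The same rule applies to both `weights` (with `dW`) and `biases` (with `db`).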

### DeepNeuralNetwork initialization

Before I could assemble everything together I had to create the `DeepNeuralNetwork` class with some hyperparameters and the ability to add layers and create weights. The good thing is that `Layer` does the heavy lifting there, while `DeepNeuralNetwork` is just a manager. As for the hyperparameters, I chose 2 to begin with: the number of iterations and the learning rate. The `compile` function iterates through the layer structure and initializes each layer with its neighbours.

### Helper methods

Still, right now each `Layer` can only go forward and backward on its own. The network has to work as a whole. Functions `layersForward`, `layersBackward` and `layersUpdate` iterate through every layer and perform those operations for the whole network. The `cost` function is cross entropy, which is the objective I minimise here.
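As a sketch of what such a cost can look like, here is binary cross-entropy averaged over the examples. The binary form, the function name and the `epsilon` clamp (which avoids `log(0)`) are assumptions for this example.

```swift
import Foundation

// Binary cross-entropy: -(1/m) * sum(y * log(yHat) + (1 - y) * log(1 - yHat)).
func crossEntropyCost(predictions: [Double], labels: [Double], epsilon: Double = 1e-12) -> Double {
    precondition(predictions.count == labels.count && !predictions.isEmpty,
                 "need one label per prediction")
    var total = 0.0
    for (yHat, y) in zip(predictions, labels) {
        // Clamp predictions away from 0 and 1 so the logarithm stays finite.
        let p = min(max(yHat, epsilon), 1 - epsilon)
        total += y * log(p) + (1 - y) * log(1 - p)
    }
    return -total / Double(predictions.count)
}

let uncertainCost = crossEntropyCost(predictions: [0.5, 0.5], labels: [1, 0])
let confidentCost = crossEntropyCost(predictions: [0.9, 0.1], labels: [1, 0])
print(uncertainCost, confidentCost)
```

Predictions of 0.5 for every example give a cost of ln 2, which is a handy sanity check for a freshly initialized network.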

### Fit and test

I know it’s a lot to take in without even touching the maths. These are the final methods of the `DeepNeuralNetwork` class. The `fit` function is used to actually teach the network for a given number of iterations. In each iteration I did the following to train the network:

- Forward propagation to obtain predictions
- Calculate cost function
- Backpropagation with predicted values
- Update each layer’s `weights` and `biases`
- Every 100 iterations I print the cost to see how it’s going
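The steps above can be sketched as a loop. To keep the example short and runnable, it trains a single sigmoid unit rather than the multi-layer network from this post, and the names `fit`, `predict` and the toy data are assumptions made for illustration.

```swift
import Foundation

func sigmoid(_ z: Double) -> Double {
    return 1 / (1 + exp(-z))
}

// Forward pass of one sigmoid unit for a single example.
func predict(_ x: [Double], weights: [Double], bias: Double) -> Double {
    var z = bias
    for j in x.indices {
        z += weights[j] * x[j]
    }
    return sigmoid(z)
}

// Trains the unit and returns the learned parameters plus the cost of every iteration.
func fit(inputs: [[Double]], labels: [Double], iterations: Int,
         learningRate: Double) -> (weights: [Double], bias: Double, costs: [Double]) {
    precondition(!inputs.isEmpty && inputs.count == labels.count, "need one label per example")
    var weights = [Double](repeating: 0, count: inputs[0].count)
    var bias = 0.0
    var costs: [Double] = []
    let m = Double(inputs.count)

    for iteration in 0..<iterations {
        // 1. Forward propagation to obtain predictions.
        var predictions: [Double] = []
        for x in inputs {
            predictions.append(predict(x, weights: weights, bias: bias))
        }

        // 2. Calculate the cross-entropy cost.
        var cost = 0.0
        for i in labels.indices {
            cost -= (labels[i] * log(predictions[i]) + (1 - labels[i]) * log(1 - predictions[i])) / m
        }
        costs.append(cost)

        // 3. Backpropagation: for sigmoid with cross-entropy, dZ = yHat - y.
        var dW = [Double](repeating: 0, count: weights.count)
        var db = 0.0
        for i in inputs.indices {
            let dZ = predictions[i] - labels[i]
            for j in weights.indices {
                dW[j] += dZ * inputs[i][j] / m
            }
            db += dZ / m
        }

        // 4. Update weights and bias.
        for j in weights.indices {
            weights[j] -= learningRate * dW[j]
        }
        bias -= learningRate * db

        // 5. Print the cost every 100 iterations.
        if iteration % 100 == 0 {
            print("Iteration \(iteration): cost \(cost)")
        }
    }
    return (weights: weights, bias: bias, costs: costs)
}

// Toy data: the label is 1 when the single feature is above 1.5.
let trained = fit(inputs: [[0], [1], [2], [3]], labels: [0, 0, 1, 1],
                  iterations: 1000, learningRate: 0.5)
```

Keeping the cost history around, rather than only printing it, makes it easy to check afterwards that training actually went downhill.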

The test method is just forward propagation, after which I print the total accuracy along with each example’s prediction and real value to check whether they match.
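A minimal sketch of the accuracy part, assuming binary labels and a 0.5 decision threshold (the function name and the threshold are assumptions for this example).

```swift
// Fraction of examples whose thresholded prediction matches the label.
func accuracy(predictions: [Double], labels: [Double], threshold: Double = 0.5) -> Double {
    precondition(predictions.count == labels.count && !predictions.isEmpty,
                 "need one label per prediction")
    var correct = 0
    for (yHat, y) in zip(predictions, labels) {
        let predictedClass: Double = yHat >= threshold ? 1 : 0
        if predictedClass == y {
            correct += 1
        }
        print("predicted \(predictedClass), real \(y)")
    }
    return Double(correct) / Double(predictions.count)
}

let score = accuracy(predictions: [0.9, 0.2, 0.6, 0.4], labels: [1, 0, 0, 0])
print("accuracy:", score)
```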

## Result

The goal of my work and of this blog post was to learn about and implement a simple, fully connected deep neural network. I managed to complete this objective, though it took me a lot of time. Here is the result of training for 5000 iterations on the Iris dataset.

Although the Iris dataset contains 3 classes, I simplified my task a little bit by only checking whether the flower is Iris-setosa or not. I managed to get ~96% accuracy on the training set, and 100% on the test set. The dataset itself is quite linearly separable, hence the high accuracy of my network.

It is possible to implement machine learning algorithms in Swift, even though it is not very easy right now. As Swift matures, I’m pretty sure there will be some space for doing data science in it, especially as environments like Playgrounds develop. Right now, however, it is no match for the possibilities offered by Python and its community.

For me, it was a great adventure. As Swift is my main language, I understand neural networks better after this implementation. I am going to pursue it a little bit further by adding more features to my neural network and maybe open sourcing it in the future.

Demo project – everything happens in the console

Matswift