DNNs from Scratch in Zig

Sunday, April 23, 2023

Zig is one of the new names in the world of systems programming. I've been following its growth over the last few months and have been looking for an excuse to program with it. Needing a break from my experiments with Chaos Networks, I thought this would be the perfect time to take a small detour and have some fun with Zig.

What We Are Building
====================================================================================================================================================================================================================================================================================================================================

Today we will be exploring deep neural networks (DNNs): neural networks with at least two layers. DNNs are typically written with high-level Python libraries like TensorFlow, which delegate the heavy compute pieces, such as automatic differentiation, to more complex libraries written in C++.

We will not be using any such libraries; instead, we will write our own simple DNN from scratch using nothing but the Zig standard library.

By the end of this post our homemade DNN will be solving MNIST with roughly 96% accuracy!

The Architecture
====================================================================================================================================================================================================================================================================================================================================

In contrast to the majority of public libraries, we will not be using tensors or the associated idea of automatic differentiation, as doing so would be outside the scope of a simple weekend project. Instead, we will compute the gradients for each layer by hand. This shifts our architecture away from the standard abstraction, where layer structs have only forward methods, to layer structs with both forward and backwards methods. In other words, each layer will be in charge of computing its own gradients as well as the gradients for its inputs (so we can chain layers), as shown later.

Our simple executable will consist of four modules: layer.zig, relu.zig, nll.zig, and mnist.zig.

Let's break them down.

layer.zig
====================================================================================================================================================================================================================================================================================================================================

const std = @import("std");

pub fn Layer(comptime I: usize, comptime O: usize) type {
    const LayerGrads = struct { 
        weight_grads: []f64, 
        input_grads: []f64,
        const Self = @This();

        pub fn destruct(self: Self, allocator: *std.mem.Allocator) void {
            allocator.free(self.weight_grads);
            allocator.free(self.input_grads);
        }
    };

    return struct {
        inputs: usize,
        outputs: usize,
        weights: *[I * O]f64,
        last_inputs: []f64,
        const Self = @This();

        pub fn forward(self: *Self, inputs: []f64, allocator: *std.mem.Allocator) ![]f64 {
            const batch_size = inputs.len / I;
            var outputs = try allocator.alloc(f64, batch_size * O);
            var b: usize = 0;
            while (b < batch_size) : (b += 1) {
                var o: usize = 0;
                while (o < O) : (o += 1) {
                    var sum: f64 = 0;
                    var i: usize = 0;
                    while (i < I) : (i += 1) {
                        sum += inputs[b * I + i] * self.weights[O * i + o];
                    }
                    outputs[b * O + o] = sum;
                }
            }
            self.last_inputs = inputs;
            return outputs;
        }

        pub fn backwards(self: *Self, grads: []f64, allocator: *std.mem.Allocator) !LayerGrads {
            // The gradient buffers are accumulated with +=, so zero them out first;
            // alloc does not guarantee zeroed memory.
            var weight_grads = try allocator.alloc(f64, I * O);
            std.mem.set(f64, weight_grads, 0);

            const batch_size = self.last_inputs.len / I;
            var input_grads = try allocator.alloc(f64, batch_size * I);
            std.mem.set(f64, input_grads, 0);

            var b: usize = 0;
            while (b < batch_size) : (b += 1) {
                var i: usize = 0;
                while (i < I) : (i += 1) {
                    var o: usize = 0;
                    while (o < O) : (o += 1) {
                        weight_grads[i * O + o] += (grads[b * O + o] * self.last_inputs[b * I + i]) / @intToFloat(f64, batch_size);
                        input_grads[b * I + i] += grads[b * O + o] * self.weights[i * O + o];
                    }
                }
            }
            return LayerGrads{ .weight_grads = weight_grads, .input_grads = input_grads };
        }

        pub fn applyGradients(self: *Self, grads: []f64) void {
            var i: usize = 0;
            while (i < I * O): (i += 1) {
                self.weights[i] -= 0.01 * grads[i];
            }
        }

        pub fn init(allocator: *std.mem.Allocator) !Self {
            var memory = try allocator.alloc(f64, I * O);
            var weights = memory[0 .. I * O];
            var prng = std.rand.DefaultPrng.init(123);
            var w: usize = 0;
            while (w < I * O) : (w += 1) {
                weights[w] = prng.random().floatNorm(f64) * 0.2;
            }
            return Self{
                .inputs = I,
                .outputs = O,
                .weights = weights,
                .last_inputs = undefined,
            };
        }

        pub fn destruct(self: *Self, allocator: *std.mem.Allocator) void {
            allocator.free(self.weights);
        }
    };
}

If you are not familiar with neural networks, the most important thing to know is that the forward method is how inputs are passed into the layer to produce outputs. The method simply performs a matrix multiplication, multiplying the inputs by the layer's weights. There are more performant ways to implement the matrix multiplication in the forward method, but as this is simply an educational endeavor, the above is adequate.
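Concretely, for batch element $b$ and output index $o$, the inner loops above compute

$$\text{out}_{b,o} = \sum_{i=0}^{I-1} \text{in}_{b,i} \, W_{i,o}$$

with the weights stored row-major as an $I \times O$ matrix.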

Also note the backwards method. As stated above, our simple DNN is in charge of calculating its own gradients. The backwards method does just that: it takes the gradients flowing back from later in the network and returns the gradients for its own weights and for the inputs it received in forward. We need the input gradients so we can chain the layers together.
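Written out, with $g_{b,o}$ denoting the incoming gradient for output $o$ of batch element $b$, the loops above compute

$$\frac{\partial L}{\partial W_{i,o}} = \frac{1}{B}\sum_{b} g_{b,o} \, \text{in}_{b,i}, \qquad \frac{\partial L}{\partial \text{in}_{b,i}} = \sum_{o} g_{b,o} \, W_{i,o}$$

so the weight gradients are averaged over the batch of size $B$, while the input gradients are what gets handed to the previous layer in the chain.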

One of the very cool features of Zig is comptime, which allows us to write code that runs at compile time. In this case we are taking advantage of comptime to write a generic layer struct whose input and output sizes are compile-time parameters. Using this, we can write code like the following.

var allocator = std.heap.page_allocator;

var layer1 = try layer.Layer(784, 100).init(&allocator);
var layer2 = try layer.Layer(100, 10).init(&allocator);

const outputs1 = try layer1.forward(inputs, &allocator);
const outputs2 = try layer2.forward(outputs1, &allocator);

Because the layer sizes are baked in at comptime, we get a bit more confidence about general correctness and about the errors we would see if incorrectly shaped data were passed in at runtime. The end goal would be complete comptime guarantees: if the executable compiles, the structure of the neural network is correct. I have not accomplished that yet, and the comptime usage above is basically useless, but it was a great exercise in my learning of Zig.
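As a purely hypothetical sketch of what such a guarantee could look like (nothing below exists in the code above; it assumes the struct returned by Layer(I, O) were extended with pub const in = I; and pub const out = O; declarations), a comptime check could refuse to compile mismatched layers:

// Hypothetical helper; relies on the assumed `in`/`out` declarations described above
fn chainCheck(comptime A: type, comptime B: type) void {
    if (A.out != B.in) {
        @compileError("layer output size does not match the next layer's input size");
    }
}

// Usage, evaluated entirely at compile time:
// comptime chainCheck(layer.Layer(784, 100), layer.Layer(100, 10));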

relu.zig
====================================================================================================================================================================================================================================================================================================================================

const std = @import("std");

pub const Relu = struct {
    last_inputs: []f64,
    const Self = @This();

    pub fn new() Self {
        return Self {
            .last_inputs = undefined,
        };
    }

    pub fn forward(self: *Self, inputs: []f64, allocator: *std.mem.Allocator) ![]f64 {
        var outputs = try allocator.alloc(f64, inputs.len);
        var i: usize = 0;
        while (i < inputs.len): (i += 1) {
            if (inputs[i] < 0) {
                outputs[i] = 0.01 * inputs[i];
            } else {
                outputs[i] = inputs[i];
            }
        }
        self.last_inputs = inputs;
        return outputs;
    }

    pub fn backwards(self: *Self, grads: []f64, allocator: *std.mem.Allocator) ![]f64 {
        var outputs = try allocator.alloc(f64, grads.len);
        var i: usize = 0;
        while (i < self.last_inputs.len): (i += 1) {
            // Leaky ReLU derivative: scale the incoming gradient by 0.01 where the
            // original input was negative, otherwise pass it through unchanged
            if (self.last_inputs[i] < 0) {
                outputs[i] = 0.01 * grads[i];
            } else {
                outputs[i] = grads[i];
            }
        }
        return outputs;
    }
};

I should clarify that this is actually leaky ReLU, not plain ReLU.

Similar to the Layer struct above, it has both forward and backwards methods, as this struct is also in charge of computing its own gradients.
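For reference, the activation and its derivative as implemented above are

$$f(x) = \begin{cases} x & x \ge 0 \\ 0.01x & x < 0 \end{cases} \qquad f'(x) = \begin{cases} 1 & x \ge 0 \\ 0.01 & x < 0 \end{cases}$$

so backwards simply scales each incoming gradient by $f'$ of the corresponding saved input.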

nll.zig
====================================================================================================================================================================================================================================================================================================================================

const std = @import("std");

pub fn NLL(comptime I: usize) type {
    const NLLOutput = struct {
        loss: []f64,
        input_grads: []f64,
        const Self = @This();

        pub fn destruct(self: Self, allocator: *std.mem.Allocator) void {
            allocator.free(self.loss);
            allocator.free(self.input_grads);
        }
    };

    return struct {
        pub fn nll(inputs: []f64, targets: []u8, allocator: *std.mem.Allocator) !NLLOutput {
            const batch_size = targets.len;
            var sum_e = try allocator.alloc(f64, batch_size);
            defer allocator.free(sum_e);
            var b: usize = 0;
            while (b < batch_size) : (b += 1) {
                var sum: f64 = 0;
                var i: usize = 0;
                while (i < I): (i += 1) {
                    sum += std.math.exp(inputs[b * I + i]);
                }
                sum_e[b] = sum;
            }

            var loss = try allocator.alloc(f64, batch_size);
            b = 0;
            while (b < batch_size): (b += 1) {
                loss[b] = -1 * std.math.ln(std.math.exp(inputs[b * I + targets[b]]) / sum_e[b]);
            }

            var input_grads = try allocator.alloc(f64, batch_size * I);
            b = 0;
            while (b < batch_size): (b += 1) {
                var i: usize = 0;
                while (i < I): (i += 1) {
                    input_grads[b * I + i] = std.math.exp(inputs[b * I + i]) / sum_e[b];
                    if (i == targets[b]) {
                        input_grads[b * I + i] -= 1;
                    }
                }
            }

            return NLLOutput {
                .loss = loss,
                .input_grads = input_grads 
            };
        }
    };
}

Given some inputs and some targets, the nll method applies a softmax to the inputs, computes the negative log likelihood loss, and returns both the loss and the gradients for the inputs.
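In formulas, for a single batch element with raw outputs (logits) $x$ and target class $t$, the code above computes

$$p_j = \frac{e^{x_j}}{\sum_k e^{x_k}}, \qquad L = -\log p_t, \qquad \frac{\partial L}{\partial x_j} = p_j - \mathbb{1}[j = t]$$

which is the standard softmax-plus-negative-log-likelihood gradient.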

Similar to the Layer struct, we are utilizing Zig's comptime feature, but as mentioned above, this really is unnecessary and done mostly for my own personal education on how comptime works.

mnist.zig
====================================================================================================================================================================================================================================================================================================================================

const std = @import("std");


const Data = struct {
    train_images: []f64,
    train_labels: []u8,
    test_images: []f64,
    test_labels: []u8,
    const Self = @This();

    pub fn destruct(self: Self, allocator: *std.mem.Allocator) void {
        allocator.free(self.train_images);
        allocator.free(self.train_labels);
        allocator.free(self.test_images);
        allocator.free(self.test_labels);
    }
};


pub fn readMnist(allocator: *std.mem.Allocator) !Data {
    const train_images_path: []const u8 = "data/train-images-idx3-ubyte";
    const train_images_u8 = try readIdxFile(train_images_path, 16, allocator);
    defer allocator.free(train_images_u8);
    var train_images = try allocator.alloc(f64, 784 * 60000);
    var i: u32 = 0;
    while (i < 784 * 60000): (i += 1) {
        const x: f64 = @intToFloat(f64, train_images_u8[i]);
        train_images[i] = x / 255;
    }

    const train_labels_path: []const u8 = "data/train-labels-idx1-ubyte";
    const train_labels = try readIdxFile(train_labels_path, 8, allocator);

    const test_images_path: []const u8 = "data/t10k-images-idx3-ubyte";
    const test_images_u8 = try readIdxFile(test_images_path, 16, allocator);
    defer allocator.free(test_images_u8);
    var test_images = try allocator.alloc(f64, 784 * 10000);
    i = 0;
    while (i < 784 * 10000): (i += 1) {
        const x: f64 = @intToFloat(f64, test_images_u8[i]);
        test_images[i] = x / 255;
    }

    const test_labels_path: []const u8 = "data/t10k-labels-idx1-ubyte";
    const test_labels = try readIdxFile(test_labels_path, 8, allocator);

    return Data {
        .train_images = train_images,
        .train_labels = train_labels,
        .test_images = test_images,
        .test_labels = test_labels
    };
}

pub fn readIdxFile(path: []const u8, skip_bytes: u8, allocator: *std.mem.Allocator) ![]u8 {
    const file = try std.fs.cwd().openFile(
        path,
        .{},
    );
    defer file.close();

    const reader = file.reader();
    try reader.skipBytes(skip_bytes, .{});
    const data = try reader.readAllAlloc(
        allocator.*,
        1000000000,
    );
    return data;
}

MNIST is stored in the IDX file format, which has a small header specifying the type and shape of the data (16 bytes for the image files, 8 bytes for the label files). Since we already know both, we simply skip the header bytes, load the raw data, and divide the pixel values by 255 to normalize everything between 0 and 1.
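As a quick sanity check on the loading and normalization, a minimal test (a sketch; it assumes the data files sit at the paths used above) could live next to this module:

test "mnist loads and normalizes" {
    var allocator = std.heap.page_allocator;
    const data = try readMnist(&allocator);
    defer data.destruct(&allocator);
    // 60,000 training images of 28x28 = 784 pixels each, scaled into [0, 1]
    try std.testing.expectEqual(@as(usize, 784 * 60000), data.train_images.len);
    try std.testing.expectEqual(@as(usize, 60000), data.train_labels.len);
    try std.testing.expect(data.train_images[0] >= 0 and data.train_images[0] <= 1);
}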

Putting It All Together
====================================================================================================================================================================================================================================================================================================================================

const layer = @import("layer.zig");
const nll = @import("nll.zig");
const mnist = @import("mnist.zig");
const relu = @import("relu.zig");

const std = @import("std");

const INPUT_SIZE: u32 = 784;
const OUTPUT_SIZE: u32 = 10;
const BATCH_SIZE: u32 = 32;
const EPOCHS: u32 = 25;

pub fn main() !void {
    var allocator = std.heap.page_allocator;

    // Get MNIST data
    const mnist_data = try mnist.readMnist(&allocator);

    // Prep loss function
    const loss_function = nll.NLL(OUTPUT_SIZE);

    // Prep NN
    var layer1 = try layer.Layer(INPUT_SIZE, 100).init(&allocator);
    var relu1 = relu.Relu.new();
    var layer2 = try layer.Layer(100, OUTPUT_SIZE).init(&allocator);

    // Training loop
    var e: usize = 0;
    while (e < EPOCHS) : (e += 1) {
        // Loop over mini-batches
        var i: usize = 0;
        while (i < 60000 / BATCH_SIZE) : (i += 1) {
            // Prep inputs and targets
            const inputs = mnist_data.train_images[i * INPUT_SIZE * BATCH_SIZE .. (i + 1) * INPUT_SIZE * BATCH_SIZE];
            const targets = mnist_data.train_labels[i * BATCH_SIZE .. (i + 1) * BATCH_SIZE];

            // Go forward and get loss
            const outputs1 = try layer1.forward(inputs, &allocator);
            const outputs2 = try relu1.forward(outputs1, &allocator);
            const outputs3 = try layer2.forward(outputs2, &allocator);
            const loss = try loss_function.nll(outputs3, targets, &allocator);

            // Update network
            const grads1 = try layer2.backwards(loss.input_grads, &allocator);
            const grads2 = try relu1.backwards(grads1.input_grads, &allocator);
            const grads3 = try layer1.backwards(grads2, &allocator);
            layer1.applyGradients(grads3.weight_grads);
            layer2.applyGradients(grads1.weight_grads);

            // Free memory
            allocator.free(outputs1);
            allocator.free(outputs2);
            allocator.free(outputs3);
            grads1.destruct(&allocator);
            allocator.free(grads2);
            grads3.destruct(&allocator);
            loss.destruct(&allocator);
        }

        // Do validation
        i = 0;
        var correct: f64 = 0;
        const outputs1 = try layer1.forward(mnist_data.test_images, &allocator);
        const outputs2 = try relu1.forward(outputs1, &allocator);
        const outputs3 = try layer2.forward(outputs2, &allocator);
        var b: usize = 0;
        while (b < 10000) : (b += 1) {
            var max_guess: f64 = outputs3[b * OUTPUT_SIZE];
            var guess_index: usize = 0;
            for (outputs3[b * OUTPUT_SIZE .. (b + 1) * OUTPUT_SIZE]) |o, oi| {
                if (o > max_guess) {
                    max_guess = o;
                    guess_index = oi;
                }
            }
            if (guess_index == mnist_data.test_labels[b]) {
                correct += 1;
            }
        }

        // Free memory
        allocator.free(outputs1);
        allocator.free(outputs2);
        allocator.free(outputs3);

        correct = correct / 10000;
        std.debug.print("Average Validation Accuracy: {}\n", .{correct});
    }

    layer1.destruct(&allocator);
    layer2.destruct(&allocator);
    mnist_data.destruct(&allocator);
}

The above code is the entire main.zig file. Utilizing the modules we created, we can easily create a simple DNN that solves MNIST. The majority of the code above is simply allocator frees and validation on the test set.

Note the chaining of the different Layer and Relu structs. While we aren't utilizing tensors or performing automatic differentiation in the classic sense, by chaining Layer and Relu structs together we could build an arbitrarily deep DNN, as the sketch below shows.
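For example, adding another hidden layer only means extending the forward chain and walking the backwards chain in mirrored order. A rough sketch (hidden sizes picked arbitrarily, per-batch frees omitted):

// Three layers built from the same pieces
var layer1 = try layer.Layer(784, 128).init(&allocator);
var relu1 = relu.Relu.new();
var layer2 = try layer.Layer(128, 64).init(&allocator);
var relu2 = relu.Relu.new();
var layer3 = try layer.Layer(64, 10).init(&allocator);

// Forward: inputs -> layer1 -> relu1 -> layer2 -> relu2 -> layer3 -> loss
const o1 = try layer1.forward(inputs, &allocator);
const o2 = try relu1.forward(o1, &allocator);
const o3 = try layer2.forward(o2, &allocator);
const o4 = try relu2.forward(o3, &allocator);
const o5 = try layer3.forward(o4, &allocator);
const loss = try loss_function.nll(o5, targets, &allocator);

// Backward: feed each step the input gradients produced by the step after it
const g3 = try layer3.backwards(loss.input_grads, &allocator);
const gr2 = try relu2.backwards(g3.input_grads, &allocator);
const g2 = try layer2.backwards(gr2, &allocator);
const gr1 = try relu1.backwards(g2.input_grads, &allocator);
const g1 = try layer1.backwards(gr1, &allocator);
layer1.applyGradients(g1.weight_grads);
layer2.applyGradients(g2.weight_grads);
layer3.applyGradients(g3.weight_grads);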

In this case, two layers are more than enough to achieve an excellent score on MNIST. See the training results below:

> zig build run
Average Validation Accuracy: 8.782e-01
Average Validation Accuracy: 9.011e-01
Average Validation Accuracy: 9.103e-01
Average Validation Accuracy: 9.18e-01
Average Validation Accuracy: 9.239e-01
Average Validation Accuracy: 9.286e-01
Average Validation Accuracy: 9.318e-01
Average Validation Accuracy: 9.35e-01
Average Validation Accuracy: 9.384e-01
Average Validation Accuracy: 9.412e-01
Average Validation Accuracy: 9.435e-01
Average Validation Accuracy: 9.462e-01
Average Validation Accuracy: 9.476e-01
Average Validation Accuracy: 9.491e-01
Average Validation Accuracy: 9.501e-01
Average Validation Accuracy: 9.512e-01
Average Validation Accuracy: 9.516e-01
Average Validation Accuracy: 9.526e-01
Average Validation Accuracy: 9.539e-01
Average Validation Accuracy: 9.544e-01
Average Validation Accuracy: 9.553e-01
Average Validation Accuracy: 9.561e-01
Average Validation Accuracy: 9.572e-01
Average Validation Accuracy: 9.579e-01
Average Validation Accuracy: 9.586e-01

For our simple network, 95.86% accuracy is fantastic! If we let it train longer, it could probably do even better.

It does run fairly slowly, and though I have not profiled it, I would guess the constant memory allocations and frees in the training loop are the bottleneck. This could be fixed easily by reusing the same buffers, but it is currently more than fast enough for our purposes.
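A related (but different) option would be to keep the per-batch allocations and simply free them all at once with an arena. This is only a sketch of the training loop, not what the code above does, and it assumes a Zig version where std.heap.ArenaAllocator.allocator() returns a std.mem.Allocator value:

var i: usize = 0;
while (i < 60000 / BATCH_SIZE) : (i += 1) {
    // One arena per batch; every intermediate buffer comes from it
    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    defer arena.deinit(); // frees everything allocated this batch in one call
    var batch_allocator = arena.allocator();

    const inputs = mnist_data.train_images[i * INPUT_SIZE * BATCH_SIZE .. (i + 1) * INPUT_SIZE * BATCH_SIZE];
    const targets = mnist_data.train_labels[i * BATCH_SIZE .. (i + 1) * BATCH_SIZE];

    const outputs1 = try layer1.forward(inputs, &batch_allocator);
    const outputs2 = try relu1.forward(outputs1, &batch_allocator);
    const outputs3 = try layer2.forward(outputs2, &batch_allocator);
    const loss = try loss_function.nll(outputs3, targets, &batch_allocator);

    const grads1 = try layer2.backwards(loss.input_grads, &batch_allocator);
    const grads2 = try relu1.backwards(grads1.input_grads, &batch_allocator);
    const grads3 = try layer1.backwards(grads2, &batch_allocator);
    layer1.applyGradients(grads3.weight_grads);
    layer2.applyGradients(grads1.weight_grads);
    // no individual frees or destructs needed here
}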

Thank you so much for reading!

the repo