Chaos Networks Part 2

Monday, March 6, 2023

This is part two of Chaos Networks Part 1.

In the previous post, Chaos Networks Part 1, we discussed the lack of chaos current feed-forward layer-based neural networks have, and proposed our hypothesis that this lack of chaos, or rather, the paradigm of using layers, is detrimental, and leads to extremely inefficient neural networks that have magnitudes more weights and connections than required (I don't think we used these exact words, but now we know the point of all this). Please note, this is similar to something already well established called NEAT (neural evolution of augmenting topologies), though I have not seen it implemented as we do.

It seems, although not thoroughly tested, our hypothesis holds some weight.

Quick Recap

====================================================================================================================================================================================================================================================================================================================================

Last post we created a graph based feed-forward neural network from scratch, including the writing of our own very simple automatic differentiation library that handles 1-dimensional tensors. We trained this network on the MNIST dataset, and were able to achieve about 91% accuracy, so not great.

We hypothesized that the main contributor to our lack of validation accuracy was the lack of using batches. As we will see, it was not, or at least, not entirely that.

Finding the Hidden Error

Debugging our Chaos Network was painful. There was no undefined behaviour, there were no visible bugs, only a gut feeling that our network should be performing better.

To keep this section short I will give a bullet point list of the steps taken to rectify this bug.

Compare each and every Tensor Operation with that of Pytorch's
Get stuck checking the Negative Log Likelihood for three days. It was fine
Go back through the other Tensor Operations
Curse the fates for letting me be born in the era of computing
Recreate the exact layout, weights, and order of operations of my Chaos Network in Pytorch to compare gradients when going backwards
Add the split_on_add tensor operation

I'm not going to do a deep dive into the code for our automatic differentiation library, as we did a small dive in Part 1 (we will cover some updates later), but the main issue was fixed when adding the following tensor operation:

impl<const N: usize> Tensor1D<N> {
    pub fn split_on_add(self, count: usize) -> Vec<Self> {
        match &self.tape {
            Some(tape) => {
                let mut new_tensors: Vec<Tensor1D<N>> = (0..count)
                    .map(|_i| Self::new_with_tape(self.data, Some(tape.clone())))

                for t in new_tensors.iter_mut() {
                    let self_id = t.grad_for;
                    let old_self_id = self.grad_for;
                    // For each split element grab their gradients, and the parent gradients, and
                    // add them together
                    t.tape.as_mut().unwrap().write().unwrap().add_operation((
                        self_id,
                        Box::new(move |g| {
                            let tg = g.remove(self_id);
                            let mut tg2 = g.remove_or_0(old_self_id);
                            tg2.data = element_wise_addition::<N>(&tg2.data, &tg.data);
                            g.insert(old_self_id, tg2);
                        }),
                    ));
                }
                new_tensors
            }
            None => (0..count)
                .map(|_i| Tensor1D::new_without_tape(self.data))
                .collect(),
        }
    }
}

For brevity's sake, in version 0.1 of our Chaos Network, in our forward function, we would multiply the weight of the current node once for every connection it had, but our automatic differentiation library was not designed to handle multiple operations by the same tensor. By adding the above function, we first split_on_add the tensor we want to perform multiple operations with, and then can simply pop a version off of the returned Vec of our tensors.

It doesn't matter if the above makes sense, the context required to understand it fully is missing, the main point is in Chaos Networks Part 1 we had an error accumulating gradients while going backwards, and that is now fixed!

Minor Upgrades and Improvements

One of the improvements we made is switching from a traditional graph tree-structure where each node holds its connections to the next, to a streamlined version where the network holds a HashMap of edges. Because we are using Rust, this greatly reduced the use of Rc and RefCell although this was probably not a limiting factor for speed.

As it turns out, the two greatest restrictions to our speed were the HashMap we used when accumulating gradients while going forward through the network, and each tensor having to join its held tape after each operation.

Improving the speed of the HashMap was relatively simple. We used rustc_hash which greatly improved our hashing speeds.

Removing the need for tensors to handle their own tape was a little more involved. In Part 1 we mentioned the inspiration for our automatic differentiation implementation came from dfdx, an absolutely awesome deep learning library for Rust. Each tensors tape is really just Vec<Box<dyn FnOnce(&mut Gradients) + Send + Sync>)>. Maybe the author of dfdx has a much more efficient way to combine vectors than I do, but the combining of these tapes (hundreds of thousands every second) would greatly slow down the runtime performance. In the end, I switched to a system where there is one tape of type Arc<RwLock<Tape>>, of which each tensor has a copy.

These two minor improvements greatly increased the runtime performance of our Chaos Network.

Let There Be Batches

There are two reasons why we went through the pain of creating our own automatic differentiation library when libraries like dfdx already exist: to learn how they work, and to specialize it for our use case. The automatic differentiation library we wrote is highly specialized for our Chaos Network. It only has the functions we need, does not work in many general purpose situations unless great care is taken, and in the case of batches, is specially built to manage the gradients for 1-dimensional tensors differently.

Let's examine the new and updated code we created to support batches in the mul implementation.

#[derive(Debug, Clone)]
pub struct Tensor1D<const N: usize> {
    pub id: u64,
    pub grad_for: u64,
    pub data: [f64; N],
    pub tape: Option<Arc<RwLock<Tape<N>>>>,
}

First we introduce the concept of a 1-dimensional tensor (above).

impl<'a, 'b, const N: usize> Mul<&'b mut Tensor1D<N>> for &'a mut Tensor0D<N> {
    type Output = Tensor1D<N>

    fn mul(self, other: &'b mut Tensor1D<N>) -> Self::Output {
        let new_data: [f64; N] = other.data.map(|d| self.data * d);
        let mut new = Tensor1D::new_without_tape(new_data);

        match (&self.tape, &other.tape) {
            (Some(self_tape), Some(_other_tape)) => {
                let new_id = new.grad_for;
                let self_id = self.grad_for;
                let other_id = other.grad_for;
                let self_data = self.data;
                let other_data = other.data.clone();
                self_tape.write().unwrap().add_operation((
                    new_id,
                    Box::new(move |g| {
                        let mut tg1 = g.remove(new_id);
                        let mut tg2 = tg1.clone();
                        tg1.data = element_wise_mul::<N>(&tg1.data, &other_data);
                        tg2.data = tg2.data.map(|x| x * self_data);
                        g.insert(self_id, tg1);
                        g.insert(other_id, tg2);
                    }),
                ));
                new.set_tape(self.tape.clone());
            }
            (Some(self_tape), None) => {
                let new_id = new.grad_for;
                let self_id = self.grad_for;
                let other_data = other.data.clone();
                self_tape.write().unwrap().add_operation((
                    new_id,
                    Box::new(move |g| {
                        let mut tg = g.remove(new_id);
                        tg.data = element_wise_mul::<N>(&tg.data, &other_data);
                        g.insert(self_id, tg);
                    }),
                ));
                new.set_tape(self.tape.clone());
            }
            (None, Some(_other_tape)) => {
                panic!("Switch operator orientation");
            }
            (None, None) => (),
        }

        new
    }
}

The above is the only implementation of Mul for both 0-dimensional and 1-dimensional tensors. That would not make sense in the context of a normal automatic differentiation library, but because we have highly specialized one built for Chaos Networks, not only does it make sense, but it saves us an immense amount of work and computation.

Let's examine one of the passing test cases for our implementation.

#[test]
fn test_mul_1d() {
    let tape: Arc<RwLock<Tape<3>>> = Arc::new(RwLock::new(Tape::new()));
    let mut a = Tensor0D::new_with_tape(2., Some(tape.clone()));
    let mut b = Tensor1D::new_without_tape([1., 2., 3.]);
    let mut c = &mut a * &mut b;
    // Check value match
    assert_eq!([2., 4., 6.], c.data);
    // Check gradients
    let mut grads = c.backward();
    let a_grads = grads.remove(a.grad_for);
    assert_eq!([1., 2., 3.], a_grads.data);
}

If you haven't done a lot of work with automatic differentiation everything may seem normal but the above code is noteworthy because it is different than what Pytorch would yield.

import torch as torch
import torch.nn as nn

w = torch.tensor(2., dtype=torch.float64, requires_grad=True)
x = torch.tensor([1, 2, 3], dtype=torch.float64)
y = w * x
print(y) # Prints tensor([2., 4., 6.], dtype=torch.float64, grad_fn=<MulBackward0>)
y.backward(torch.ones_like(x))
print(w.grad) # Prints tensor(6., dtype=torch.float64)

Notice how the gradients of w (the analogous Python variable for mut a above) are the summation of the gradient vector produced by our implementation of Mul. This is done with purpose. Notice the way the tapes are combined in Mul function above. Most noticeably, they are combined through element wise multiplication.

All of the above to say, 1-dimensional tensors are treated as a batch. Each element is a separate input corresponding with that batch. This is why gradients are not combined. If you are curious to see how this works, check out the repo!

Adding this form of batching improved the runtime performance of the network by a factor close to the batch size being used (MASSIVE IMPROVEMENT).

Training the Chaos Network

We have a neural network represented by an acyclic graph, now we must train it. Before showing any code, let's ask ourselves a few questions.

What should the structure of the network be? We have complete freedom in how we want to structure our Chaos Network, but complete freedom is just that, complete freedom. It is our job to establish laws regulating this structure.

How should the structure of the network change as we train it? One of the core ideas behind the Chaos Network is that it is chaotic, morphing as we train it. Once again, it is up to us to create the laws regulating this growth.

The first question has a few relatively simple answers, and I went with the easiest.

#[derive(Default)]
pub struct Network<const N: usize> {
    pub inputs_count: usize,
    pub leaves_count: usize,
    pub nodes: Vec<Node<N>>,
    pub connections_to: HashMap<i32, Vec<usize>>,
    pub mode: NetworkMode,
    pub tape: Arc<RwLock<Tape<N>>>,
}

impl<const N: usize> Network<N> {
    pub fn new(inputs: usize, outputs: usize) -> Self {
        let mut network: Network<N> = Network::default();
        network.add_nodes(NodeKind::Leaf, outputs);
        network.add_nodes(NodeKind::Input, inputs);
        network
    }

    pub fn add_nodes(&mut self, kind: NodeKind, count: usize) {
        match kind {
            NodeKind::Normal => {
                let node_index = self.batch_insert_normal_nodes(count);
                for i in 0..(count) {
                    for _ii in 0..10 {
                        self.add_node_connection_to(node_index + i);
                        self.add_node_connection_from(node_index + i);
                    }
                }
            }
            NodeKind::Input => {
                self.inputs_count += count;
                let mut nodes: Vec<Node<N>> = (0..count).map(|_i| Node::new(kind)).collect();
                for _i in 0..count {
                    self.nodes.insert(0, nodes.remove(0));
                }
                self.shift_all_connections_after(0, count, ShiftDirection::Forward);
                if self.leaves_count == 0 {
                    panic!("This should be called after leaves are added for now")
                }
                let mut rng = rand::thread_rng();
                let odds_of_being_picked = 100. / (count as f64);
                for i in 0..count {
                    if odds_of_being_picked > rng.gen::<f64>() {
                        let input_node_index = distribution_between.sample(&mut rng);
                        self.add_connection_between(i, input_node_index);
                    }
                }
            }
            NodeKind::Leaf => {
                self.leaves_count += count;
                for _i in 0..count {
                    self.nodes.push(Node::new(kind));
                }
            }
        }
    }
}

The above is the definition of the Network and the new and add_nodes implemented functions. Nodes are one of three variations: Input, Normal and Leaf. Input Nodes receive input, Normal Nodes would be like nodes in hidden layers and Leaf Nodes are output.

The new function is syntactic sugar around the add_nodes function. The add_nodes function takes the NodeKind and the count, and adds the count of new NodeKinds. Most notably, not all Input Nodes start with connections as this is part of the over-paramaterization Chaos Networks are trying to fix.

The above answers our first posed question, but the second question about how the network should grow is a bit trickier. Outlining the problem further, at its core we have a graph that we want to change, and a heuristic (the validation accuracy) that we can use to evaluate that change. This sounds like a perfect use for evolutionary algorithms. For those not familiar, we essentially have some starting population, make some changes to that population, take the best subset N from the new population evaluated using our heuristic function, and repeat.

In our case, we start with some N Chaos Networks, train those and some randomly modified versions of those, evaluate how well they perform, mix the current population and the new randomly modified population, and repeat.

The code doing this is below. I've had to remove some of the context to keep things smaller.

pub struct StandardClassificationNetworkHandler<const I: usize, const O: usize, const N: usize> {
    max_training_steps: usize,
    steps_per_training_step: usize,
    train_data: Arc<RepeatingNetworkData<I, N>>,
    test_data: Arc<RepeatingNetworkData<I, N>>,
    validation_steps: usize,
}


impl<const I: usize, const O: usize, const N: usize> StandardClassificationNetworkHandler<I, O, N> {
    pub fn train(&mut self) {
        // Create the initial population
        let mut population: Vec<Network<N>> =
            (0..POPULATION_SIZE).map(|_i| Network::new(I, O)).collect();

        // Prep some stuff for getting random data points
        let train_data_distribution = Uniform::from(0..self.train_data.len());
        let test_data_distribution = Uniform::from(0..self.test_data.len());
        let mut rng = thread_rng();

        // Do the actual training
        for training_step in 0..self.max_training_steps {
            // Prep the random indexes
            let train_data_random_indexes = (0..self.steps_per_training_step)
                .map(|_i| {
                    (0..N)
                        .map(|_ii| train_data_distribution.sample(&mut rng))
                        .collect::<Vec<usize>>()
                        .try_into()
                        .unwrap()
                })
                .collect::<Vec<[usize; N]>>();
            let test_data_random_indexes = (0..self.validation_steps)
                .map(|_i| {
                    (0..N)
                        .map(|_ii| test_data_distribution.sample(&mut rng))
                        .collect::<Vec<usize>>()
                        .try_into()
                        .unwrap()
                })
                .collect::<Vec<[usize; N]>>();

            // Current population and maybe new networks
            population = if training_step != 0 && training_step % 10 == 0 {
                let new_networks = population.iter().map(|x| x.clone()).collect();
                let new_morphed_networks = self.train_population(
                    new_networks,
                    &train_data_random_indexes,
                    &test_data_random_indexes,
                    true,
                );
                let new_population_networks = self.train_population(
                    population,
                    &train_data_random_indexes,
                    &test_data_random_indexes,
                    false,
                );
                
                // Merge them together
                new_population_networks
                    .into_iter()
                    .zip(new_morphed_networks.into_iter())
                    .rev()
                    .skip(POPULATION_SIZE / 2)
                    .map(|(a, b)| vec![a.0, b.0])
                    .flatten()
                    .collect()
            } else {
                self.train_population(
                    population,
                    &train_data_random_indexes,
                    &test_data_random_indexes,
                    false,
                )
                .into_iter()
                .map(|(n, _, _)| n)
                .collect()
            }
        }
    }

    fn train_population(
        &self,
        population: Vec<Network<N>>,
        train_data_random_indexes: &Vec<[usize; N]>,
        test_data_random_indexes: &Vec<[usize; N]>,
        do_morph: bool,
    ) -> Vec<(Network<N>, f64, f64)> {
        // Do the training and validation
        let new_networks: Vec<(Network<N>, f64)> = population
            .into_par_iter()
            .map(|mut network| {
                if do_morph {
                    let mut rng = rand::thread_rng();
                    let percent_nodes_to_add = rng.gen::<f64>() / 10.;
                    let percent_connections_to_add = rng.gen::<f64>() / 10.;
                    let percent_connections_to_remove = (rng.gen::<f64>() / 10.) * 3.;
                    grow(
                        &mut network,
                        percent_nodes_to_add,
                        percent_connections_to_add,
                    );
                    prune(&mut network, percent_connections_to_remove);
                }
                let train_data = self.train_data.clone();
                let test_data = self.test_data.clone();
                for step in 0..self.steps_per_training_step {
                    train_next_batch(
                        &mut network,
                        train_data.next(&train_data_random_indexes[step]),
                    );
                }
                let average_validation_accuracy: f64 = (0..self.validation_steps)
                    .map(|step| {
                        validate_next_batch(
                            &mut network,
                            test_data.next(&test_data_random_indexes[step]),
                        )
                    })
                    .sum::<f64>()
                    / (self.validation_steps as f64);
                (network, average_validation_accuracy)
            })
            .collect();
        let network_sizes: Vec<f64> = new_networks
            .iter()
            .map(|(n, _)| n.get_connection_count() as f64)
            .collect();
        let min_network_size = network_sizes
            .iter()
            .min_by(|a, b| a.partial_cmp(&b).unwrap())
            .unwrap();
        let mut new_networks: Vec<(Network<N>, f64, f64)> = new_networks
            .into_iter()
            .map(|(n, ava)| {
                let connection_count = n.get_connection_count() as f64;
                (n, ava, ava + ((min_network_size / connection_count) * 0.05))
            })
            .collect();
        new_networks.sort_by(|a, b| b.2.partial_cmp(&a.2).unwrap());
        new_networks
    }
}

Essentially, this does exactly what is described above. Let's walk through it:

We first create a population of Chaos Networks (for most cases I tried, the population size was between 10 and 100)
Every training_step we train the population for a selected number of batches
Every 10 training_steps we take the population, clone it, and perform some random adding and removing of nodes and connections
We remove the bottom 50% of the population and replace it with the top 50% of the new networks created from making random changes to the existing population

I made the following changes to the traditional evolutionary algorithm. One, our population size is tiny. This is simply due to hardware restrictions. Two, we have no crossover between the population. Most of the time, there is an idea of crossover when mutating the population. We do not do this, instead each member of the population has its own "child" that gets mutated randomly. It may be beneficial to come back and explore the idea of crossover and changing the mutation function. Three, we do not only chose Chaos Networks with the highest score. When evaluating which Chaos Networks should survive to the next generation, we take the best 50% of the current population and the best 50% of the new morphed population. This is not standard, but the motivation for this is the same as the motivation in the next point. Four, we do not mutate the population every iteration. I found that mutating the population every training step did not give the new additions enough time to increase their average_validation_accuracy enough to stick around through another mutation such that our starting networks would become the only networks that ever last a round in the population.

The heuristic function used is worth mentioning. We score not only on average_validation_accuracy over some number of batches, but also on the comparison of the size of the networks. This encourages not only a high average_validation_accuracy but a low network size.

Evaluating the Chaos Network

How does it perform? Not very well, but it is important to qualify that statement.

Through testing, I have found Chaos Networks with about 1,300 connections that have an average_validation_accuracy of about 93.0% on the MNIST dataset (you could probably try for longer and get better). I have not tested it on other datasets yet.

Now 93.00% is not impressive, but the size of the network for that percent is, from my understanding, very good. To put this into perspective, a standard feed-forward neural network that will solve MNIST will have an input layer with 784 nodes, a hidden layer with 512 nodes, and an output layer with 10 nodes. This standard feed-forward network will have over 400,000 connections. I am confident with some tweaking to the heuristic and evolutionary algorithm approach we can improve these results even further.

Haters will say you could build a feed-forward neural network that is even smaller and performs better for MNIST which is (probably) correct, but the Chaos Network is not meant solely for MNIST, it is a framework for building feed-forward neural networks that are not over-paramaterized.

Thanks for reading!

the repo