Building an autograd engine from (almost) scratch

Josh Black-Star, pythonmlautograd


With all the recent AI hype, I've decided that it might be a good idea to learn how model training actually works so I'm going to try to build an autograd engine from (almost) scratch using only numpy and base Python. The natural first step seems like it will be to create a Tensor class wrapped around a numpy array, so let's do that. There are a large number of operations associated with tensors, but let's start with the basic arithmatic operations; addition, subtraction, multiplication and (true) division:

import numpy as np

class Tensor:
    def __init__(self, data) -> None: = data if isinstance(data, np.ndarray) else np.array(data)

    def __add__(self, other) -> "Tensor":
        return Tensor( +

    def __sub__(self, other) -> "Tensor":
        return Tensor( -

    def __mul__(self, other) -> "Tensor":
        return Tensor( *

    def __truediv__(self, other) -> "Tensor":
        return Tensor( /

Note: I'm not a web dev and haven't figured out how to get syntax highlighting working in NextJS yet. Forgive me.

Now that we have these four basic binary operations made, we can run something like the following to use them:

if __name__ == "__main__":
    x = Tensor([8])
    y = Tensor([5])
    z = x + y


Sweet. I think that's a pretty good starting point.

I've included a list of all of the binary operations below a tensor could perform, of which there are quite a few! We'll skip most of these for now, but I wanted to include the list anyway to act as a list for future reference. Note: I will try to implement these from scratch where possible, but the focus here is to understand how a tensor works, so I may lean on numpy a little bit here and there.

Arithmetic operations:
- Addition (+)
- Subtraction (-)
- Multiplication (*)
- Division (/)
- Modulus (%)
- Power (**)

Comparison operations:
- Equal (==)
- Not equal (!=)
- Less than (<)
- Less than or equal to (<=)
- Greater than (>)
- Greater than or equal to (>=)

Bitwise operations:
- Bitwise AND (&)
- Bitwise OR (|)
- Bitwise XOR (^)
- Bitwise NOT (~)
- Left shift (<<)
- Right shift (>>)

Logical operations:
- Logical AND (np.logical_and)
- Logical OR (np.logical_or)
- Logical NOT (np.logical_not)
- Logical XOR (np.logical_xor)

Other operations:
- Dot product (
- Cross product (np.cross)
- Outer product (np.outer)
- Matrix multiplication (@)

Let's zoom in a little more. In order to progress, we'll have to define both the forward and backward method for each of these binary operations. In the context of neural networks and automatic differentiation, the terms "forward" and "backward" methods refer to two key steps in the computation and optimization process.

The forward method is the process of passing the input data through the neural network to generate an output. This involves performing all the computations defined by the network's architecture, such as matrix multiplications, activations, convolutions, etc., in the order from the input layer to the output layer. The result of the forward method is the network's prediction given the input data.

The backward method, also known as backpropagation, is the process of calculating the gradient of the loss function with respect to the network's parameters. This is done by applying the chain rule of differentiation from calculus in a recursive manner from the output layer back to the input layer. The gradients obtained from the backward method are used to update the network's parameters during the optimization process (e.g., gradient descent), with the goal of minimizing the loss function.

Put simply, the forward method is about making predictions, and the backward method is about learning from the prediction errors to improve the model. In order to define these in our code, we'll define the forward method as the regular operation, and we'll treat the backward method as the derivative. Example for the Add class below:

class Add:
    def forward(x, y):
        z = x + y
        return x + y

    def backward(context, grad):
        d(x + y)/dx = 1
        d(x + y)/dy = 1

        return [1] for both x and y
        x, y = context.args
        return [1], [1]

Taking a little break for now... will be back to write more in a few hours.

Hello from 2024!

I have not been keeping up too much with the play-by-play writing of how development of punytorch has been going, because it was honestly pretty boring to write, and I don't think it served anyone to read about how I did something arguably pretty basic. If you care about the development, follow the link to the repository (and help me out, if you're able!) I'm having an incredibly fulfilling time working on this project, but I think I should only write about more interesting topics, which I still plan to do. No one wants to read an anthology of my commit messages. The repository is getting pretty cool, though, and you should check it out. As of writing this, I'm exploring transformers and text generation, which is crazy hard. Huge props to all the AI labs working on this stuff, they are all brilliant. I thought about writing a post titled "Dtype Hell" talking about the last 3 or 4 days of building, but figured the time would be better spent trying to get myself out of Dtype Hell, so I'm going to do that instead. See you next time!

© Josh Black-Star.RSS