The Chain Rule & Backpropagation
🧠 The Theory
AI/ML Concept: Backpropagation (Single Node)
This concept of multiplying derivatives backward through a chain of equations is called Backpropagation.
When data flows forward through our network to make a prediction, it is called the Forward Pass. But when we calculate the error, we have to trace that error backward to see exactly who is responsible for it. We propagate the error backwards.
- The Loss function says: "Hey, you were off by this much!" (The Outer Derivative)
- The Prediction function turns around to the weight and says: "Hey, because the input was this size, your portion of the blame is this!" (The Inner Derivative multiplied by the Outer)
By chaining these derivatives together, the neural network can assign exact mathematical blame to every single weight, no matter how many hidden layers deep it is.
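Here is a minimal numeric sketch of that blame assignment for a single weight (the variable names `outer`, `inner`, and `blame` are my own, chosen to mirror the bullets above):

```python
# One weight, one input: assign "blame" for the error to the weight.
x = 2.0      # input
w = 1.0      # current weight
y = 100.0    # true target

y_hat = w * x              # Forward Pass: the prediction
outer = 2 * (y_hat - y)    # Outer Derivative: "you were off by this much"
inner = x                  # Inner Derivative: blame scales with the input size
blame = outer * inner      # Chain Rule: exact blame assigned to this weight
print(blame)               # -392.0
```

The sign of `blame` tells the weight which direction to move; its magnitude tells it how far.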
📐 The Math
Math: The Chain Rule
In machine learning, our math is a chain of functions nested inside each other.
- We calculate a prediction: $\hat{y} = w \cdot x$
- We plug that prediction into our loss function: $L = (\hat{y} - y)^2$

If we substitute the first equation into the second, our full equation is:

$$L = (w \cdot x - y)^2$$

How do we find the derivative of this nested function with respect to our weight ($w$)? We use The Chain Rule.

The Chain Rule states that the derivative of nested functions is the product of their individual derivatives:

$$\frac{dL}{dw} = \frac{dL}{d\hat{y}} \cdot \frac{d\hat{y}}{dw}$$

Let's break that down:

- The Outer Derivative ($\frac{dL}{d\hat{y}}$): How does the prediction affect the loss? Using the power rule on $(\hat{y} - y)^2$, the derivative is $2(\hat{y} - y)$.
- The Inner Derivative ($\frac{d\hat{y}}{dw}$): How does the weight affect the prediction? The derivative of $w \cdot x$ with respect to $w$ is just $x$.

Multiply them together, and you have the exact formula for your gradient:

$$\frac{dL}{dw} = 2x(\hat{y} - y)$$
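The Chain Rule result can be sanity-checked numerically with a centered finite difference (a quick sketch; the function names here are my own, not from the original code):

```python
def loss(w: float, x: float = 2.0, y: float = 100.0) -> float:
    """The full nested function: L(w) = (w*x - y)**2."""
    return (w * x - y) ** 2

def chain_rule_gradient(w: float, x: float = 2.0, y: float = 100.0) -> float:
    """The Chain Rule result: dL/dw = 2 * x * (w*x - y)."""
    return 2 * x * (w * x - y)

w, h = 1.0, 1e-6
# Centered finite difference: nudge w both ways and measure the loss change.
numeric = (loss(w + h) - loss(w - h)) / (2 * h)
print(chain_rule_gradient(w), numeric)  # both close to -392.0
```

If the analytic formula ever disagrees with the finite difference, the derivation (or its implementation) has a bug; this check is a standard way to verify hand-written gradients.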
💡 Insights and Mistakes
Developer's Insight: Asymptotic Convergence
While running the Backpropagation loop, I noticed an interesting pattern with the Loss over time. I had to manually increase the epochs variable to track when the model actually finished learning.
Here is what I observed during execution:
- Epoch 20: Loss = 404.0076
- Epoch 50: Loss = 2.7143 (Still dropping rapidly)
- Epoch 100: Loss = 0.0006 (Slowing down significantly)
- Epoch 112: Loss = 0.00008 (Converged. Barely changing after this point).
The Insight: Why does the learning slow down so drastically at the end? It is mathematically baked into our gradient formula: 2 * x * (y_hat - y).
As the model gets smarter, the prediction y_hat gets closer and closer to the true target y. This means the error (y_hat - y) approaches 0. Since the error multiplies the entire gradient equation, the gradient itself shrinks toward 0.
Because the gradient is shrinking, our step sizes get microscopically small. The model takes massive leaps when it's wrong, but delicately tip-toes as it approaches the exact right answer. This is known as asymptotic convergence!
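The shrinking gradient is easy to see by plugging progressively better predictions into the gradient formula (a small sketch; the sample `y_hat` values are illustrative):

```python
x, y = 2.0, 100.0
# As the prediction approaches the target, the gradient collapses toward 0.
for y_hat in [50.0, 90.0, 99.0, 99.9, 99.99]:
    gradient = 2 * x * (y_hat - y)
    print(f"y_hat = {y_hat:6.2f} -> gradient = {gradient:9.4f}")
```

Each step closer to the target shrinks the gradient by the same factor as the error, which is exactly the "tip-toeing" behavior in the epoch log above.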
⚙️ The Code
```python
def forward_pass(x: float, w: float) -> float:
    """Return the prediction (y_hat) of our model."""
    return w * x

def calculate_loss(y: float, y_hat: float) -> float:
    """Return the squared error loss."""
    return (y_hat - y) ** 2

def get_gradient(x: float, y: float, y_hat: float) -> float:
    """Use the Chain Rule to calculate how much to change the weight."""
    return 2 * x * (y_hat - y)

# House that is 2000 SqFt (x = 2.0). True price is $100k (y = 100.0).
x = 2.0
y = 100.0
w = 1.0  # Initial weight (price per SqFt)
learning_rate = 0.01
epochs = 125

for epoch in range(epochs):
    y_hat = forward_pass(x, w)
    loss = calculate_loss(y, y_hat)
    gradient = get_gradient(x, y, y_hat)
    w = w - learning_rate * gradient
    print(f"Epoch {epoch + 1}: Weight = {w:.4f}, Loss = {loss:.8f}")
```

Code Breakdown

- `get_gradient(...)`: Implements the Chain Rule formula `2 * x * (y_hat - y)`. This equation points the weight in the exact direction needed to reduce the error.
- The Loop: Notice the clear separation of the Forward Pass (`forward_pass`), Loss Calculation (`calculate_loss`), and Backpropagation (`get_gradient`). This mirrors exactly how PyTorch structures its training loops!
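One way to avoid hand-tuning `epochs` (the manual tweaking mentioned in the Insights section) is to let the loop stop itself once the loss stops improving. This is a sketch, not part of the original code: the `tolerance` value and the stopping rule are my own choices.

```python
def forward_pass(x: float, w: float) -> float:
    return w * x

def calculate_loss(y: float, y_hat: float) -> float:
    return (y_hat - y) ** 2

def get_gradient(x: float, y: float, y_hat: float) -> float:
    return 2 * x * (y_hat - y)

x, y, w = 2.0, 100.0, 1.0
learning_rate = 0.01
tolerance = 1e-8            # stop once the loss improves by less than this
prev_loss = float("inf")

for epoch in range(1, 10_001):
    y_hat = forward_pass(x, w)
    loss = calculate_loss(y, y_hat)
    if prev_loss - loss < tolerance:
        # Asymptotic convergence: further steps are microscopically small.
        print(f"Converged at epoch {epoch}: w = {w:.4f}, loss = {loss:.8f}")
        break
    prev_loss = loss
    w -= learning_rate * get_gradient(x, y, y_hat)
```

Because the gradient shrinks geometrically near the answer, the loop exits long before the 10,000-epoch safety cap, with `w` settled at the true price per SqFt.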