AI Logbook
Live Learning Feed

Understanding intelligent systems from first principles.

Partial Derivatives: Isolating the Blame


🧠 The Theory

AI/ML Concept: Isolating the Blame

Why do we calculate two separate gradients? Because w and b do entirely different things geometrically.

  • The weight (w) changes the angle or slope of our prediction line.
  • The bias (b) shifts the entire prediction line up or down the y-axis.

If a prediction is wrong, the network needs to know exactly how much of the error was caused by a bad angle, and exactly how much was caused by a bad vertical shift. Partial derivatives allow the neural network to isolate the blame.

The Loss function looks at the total error and splits it up:

  • "Weight, if you change by this specific amount (\frac{\partial L}{\partial w}), the error will go down."
  • "Bias, if you shift by this entirely different amount (\frac{\partial L}{\partial b}), the error will also go down."

We then apply the Gradient Descent update rule to both of them simultaneously.
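That simultaneous update can be sketched in a few lines. The learning rate of 0.05 and the starting values in the example are illustrative choices, not part of the theory:

```python
def update_step(w: float, b: float, x: float, y: float, alpha: float = 0.05):
    """One gradient-descent step that updates w and b simultaneously."""
    y_hat = (w * x) + b
    dL_dw = 2 * (y_hat - y) * x  # blame assigned to the slope
    dL_db = 2 * (y_hat - y)      # blame assigned to the vertical shift
    # Both parameters step downhill at once, each along its own axis.
    return w - alpha * dL_dw, b - alpha * dL_db
```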

๐Ÿ“The Math

Math: Partial Derivatives

Yesterday, we took the derivative of our loss function with respect to a single weight. But a real prediction equation has a bias (b) as well:
\hat{y} = (w \cdot x) + b

When we plug this into our squared error loss function, it looks like this:
L = ((w \cdot x) + b - y)^2

Now we have two parameters we can change to reduce the error: w and b. To figure out how to update them, we need to take the derivative of the loss with respect to w, and also the derivative of the loss with respect to b.

When you have an equation with multiple variables and you take the derivative for just one of them, it is called a Partial Derivative (denoted with the symbol \partial instead of a standard d).

To take a partial derivative, you differentiate with respect to the variable you are focusing on as usual, and you treat every other variable as a fixed constant.

  1. The Partial Derivative with respect to w (\frac{\partial L}{\partial w}):
    Using the Chain Rule, the outer derivative is 2(\hat{y} - y). The inner derivative of (w \cdot x) + b with respect to w is just x (because the constant b disappears).
    \frac{\partial L}{\partial w} = 2(\hat{y} - y) \cdot x

  2. The Partial Derivative with respect to b (\frac{\partial L}{\partial b}):
    The outer derivative is exactly the same: 2(\hat{y} - y). The inner derivative of (w \cdot x) + b with respect to b is just 1 (because the derivative of b is 1, and the w \cdot x term is treated as a constant and disappears).
    \frac{\partial L}{\partial b} = 2(\hat{y} - y) \cdot 1
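A quick way to trust these two formulas is to check them numerically. The sketch below (my own sanity check, not part of the derivation) compares each analytic partial derivative against a central finite difference:

```python
def loss(w: float, b: float, x: float, y: float) -> float:
    """Squared error loss: L = ((w*x + b) - y)^2."""
    return ((w * x) + b - y) ** 2

def numeric_partial(f, i: int, args: tuple, h: float = 1e-6) -> float:
    """Central-difference estimate of the partial derivative of f w.r.t. args[i]."""
    lo, hi = list(args), list(args)
    lo[i] -= h
    hi[i] += h
    return (f(*hi) - f(*lo)) / (2 * h)

w, b, x, y = 1.0, 1.0, 2.0, 200.0
y_hat = (w * x) + b
analytic_dw = 2 * (y_hat - y) * x  # dL/dw from the chain rule
analytic_db = 2 * (y_hat - y)      # dL/db from the chain rule
```

If the analytic and numeric values disagree, the derivation (or its implementation) is wrong.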

💡 Insights and Mistakes

Developer's Insight: Disconnected Pipelines & Self-Scaling Math

During this implementation, I ran into an architectural bug and discovered a fascinating mathematical property of gradient descent.

1. The Disconnected Pipeline Bug
Initially, the math wasn't working because I forgot to add the bias into the forward_pass function. Even though my gradient formulas were perfect, the parameters updating at the bottom of the loop were disconnected from the prediction generation at the top. If a parameter isn't used in the forward pass, it cannot impact the loss, rendering backpropagation useless.
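A minimal reproduction of that bug (with a hypothetical buggy_forward standing in for my broken function): it accepts b but never uses it, so the bias can "learn" forever without changing a single prediction.

```python
def buggy_forward(x: float, w: float, b: float) -> float:
    return w * x  # bug: b is accepted but never used in the prediction

x, y = 2.0, 200.0
w, b = 1.0, 1.0
for _ in range(5):
    y_hat = buggy_forward(x, w, b)
    b -= 0.05 * 2 * (y_hat - y)  # the bias dutifully updates every epoch...

# ...but since b never reaches the loss, the prediction is unchanged by it.
```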

2. Non-Linear Step Scaling
I ran experiments adjusting the learning_rate (α). I noticed that increasing the learning rate by 5 times (from 0.01 to 0.05) decreased the required epochs by almost 7 times (from 140 to 22). The speedup isn't a simple 1:1 ratio; larger learning rates compound their efficiency right up to the point where they cause the model to diverge.
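The experiment is easy to reproduce with a small helper. This is a sketch: the 1e-6 loss threshold and the starting point are my own choices, so the exact epoch counts won't match the numbers above, but the shape of the result does.

```python
def epochs_to_converge(alpha: float, x: float = 2.0, y: float = 200.0,
                       tol: float = 1e-6, max_epochs: int = 10_000) -> int:
    """Count epochs until the squared error drops below tol."""
    w, b = 1.0, 1.0
    for epoch in range(1, max_epochs + 1):
        y_hat = (w * x) + b
        if (y_hat - y) ** 2 < tol:
            return epoch
        grad = 2 * (y_hat - y)
        w -= alpha * grad * x  # dL/dw = grad * x
        b -= alpha * grad      # dL/db = grad
    return max_epochs

slow = epochs_to_converge(0.01)
fast = epochs_to_converge(0.05)
```

On this one-point dataset the error shrinks by a constant factor per epoch, 1 - 2α(x² + 1), which is why (for this setup) a 5x larger α cuts the epoch count by more than 5x.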

3. The Self-Scaling Gradient
I tested changing the target dataset. I increased y from 150 to 200, and then increased both x and y simultaneously. To my surprise, the model converged in the exact same number of epochs.
Why? Because the gradient equation 2 * x * (y_hat - y) scales itself. If the target yy is much larger, the initial error is massive. This causes the gradient to output a massive step size right out of the gate. The math automatically leaps further to cover the larger distance in the same amount of time.
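The self-scaling effect also shows up in code. The sketch below (using a relative-error stopping rule, which is my own choice) counts epochs for two different targets; under this rule the counts come out the same, because the error and the step size grow together.

```python
def epochs_for_target(y: float, alpha: float = 0.05, x: float = 2.0,
                      rel_tol: float = 1e-5, max_epochs: int = 10_000) -> int:
    """Count epochs until the error falls below rel_tol * |y|."""
    w, b = 1.0, 1.0
    for epoch in range(1, max_epochs + 1):
        y_hat = (w * x) + b
        if abs(y_hat - y) < rel_tol * abs(y):
            return epoch
        grad = 2 * (y_hat - y)
        w -= alpha * grad * x  # larger y -> larger error -> larger first step
        b -= alpha * grad
    return max_epochs
```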

โš™๏ธThe Code

def forward_pass(x: float, w: float, b: float) -> float:
    """Return the prediction (y_hat) of our model."""
    return (w * x) + b

def calculate_loss(y: float, y_hat: float) -> float:
    """Return the squared error loss."""
    return (y_hat - y) ** 2

def get_gradient_w(x: float, y: float, y_hat: float) -> float:
    """Uses the Chain Rule to calculate how much to change the weight."""
    return 2 * x * (y_hat - y)

def get_gradient_b(x: float, y: float, y_hat: float) -> float:
    """Uses the Chain Rule to calculate how much to change the bias."""
    return 2 * (y_hat - y)

# House that is 2000 SqFt (x = 2.0). True price is $200k (y = 200.0).
x = 2.0
y = 200.0

# Initial random parameters
w = 1.0 
b = 1.0
learning_rate = 0.05
epochs = 200


for epoch in range(epochs):
    y_hat = forward_pass(x, w, b) 
    loss = calculate_loss(y, y_hat)
    w_gradient = get_gradient_w(x, y, y_hat)
    b_gradient = get_gradient_b(x, y, y_hat)
    w = w - learning_rate * w_gradient
    b = b - learning_rate * b_gradient
    print(f"Epoch {epoch + 1}: Weight = {w:.4f}, Bias = {b:.4f}, Loss = {loss:.8f}")

Code Breakdown

  • forward_pass(...): Updated to include the bias parameter. Without this, the bias gradient has no effect!
  • get_gradient_w vs get_gradient_b: We compute two separate partial derivatives. The weight gradient is scaled by x, while the bias gradient is scaled by 1.
  • Both parameters are updated simultaneously using their respective gradients and the shared learning rate.