Partial Derivatives: Isolating the Blame
🧠 The Theory
AI/ML Concept: Isolating the Blame
Why do we calculate two separate gradients? Because w and b do entirely different things geometrically.
- The weight (w) changes the angle or slope of our prediction line.
- The bias (b) shifts the entire prediction line up or down the y-axis.
If a prediction is wrong, the network needs to know exactly how much of the error was caused by a bad angle, and exactly how much was caused by a bad vertical shift. Partial derivatives allow the neural network to isolate the blame.
The Loss function looks at the total error and splits it up:
- "Weight, if you change by this specific amount (), the error will go down."
- "Bias, if you shift by this entirely different amount (), the error will also go down."
We then apply the Gradient Descent update rule to both of them simultaneously.
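In code, the simultaneous update is just two subtractions. Here is a minimal sketch; the `update` helper and the example gradient values are made up for illustration, not part of the original program:

```python
def update(w: float, b: float, grad_w: float, grad_b: float, lr: float) -> tuple[float, float]:
    # Each parameter steps downhill along its own axis, using its own gradient.
    return w - lr * grad_w, b - lr * grad_b

# Example: both parameters move at once, but by different amounts,
# because each one gets its own share of the blame.
w, b = update(w=1.0, b=1.0, grad_w=4.0, grad_b=2.0, lr=0.1)
```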
📐 The Math
Math: Partial Derivatives
Yesterday, we took the derivative of our loss function with respect to a single weight. But a real prediction equation has a bias (b) as well:
y_hat = (w * x) + b
When we plug this into our squared error loss function, it looks like this:
loss = (y_hat - y)^2 = ((w * x + b) - y)^2
Now we have two parameters we can change to reduce the error: w and b. To figure out how to update them, we need to take the derivative of the loss with respect to w, and also the derivative of the loss with respect to b.
When you have an equation with multiple variables and you take the derivative for just one of them, it is called a Partial Derivative (denoted with the symbol ∂ instead of a standard d).
To take a partial derivative, you treat the variable you are focusing on as normal, and you pretend every other variable is just a constant number.
- The Partial Derivative with respect to the weight (∂loss/∂w):
  Using the Chain Rule, the outer derivative is 2 * (y_hat - y). The inner derivative of y_hat = (w * x) + b with respect to w is just x (because the constant b disappears). Multiplying them gives:
  ∂loss/∂w = 2 * x * (y_hat - y)
- The Partial Derivative with respect to the bias (∂loss/∂b):
  The outer derivative is exactly the same: 2 * (y_hat - y). The inner derivative of y_hat with respect to b is just 1 (because the derivative of b is 1, and the w * x term is treated as a constant and disappears). Multiplying them gives:
  ∂loss/∂b = 2 * (y_hat - y)
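A quick way to sanity-check both formulas is a numerical finite-difference test: nudge one parameter while holding the other fixed, which is exactly what "treating the other variable as a constant" means. A minimal sketch (the helper names here are mine, not from the original code):

```python
def forward(x: float, w: float, b: float) -> float:
    return (w * x) + b

def loss(y: float, y_hat: float) -> float:
    return (y_hat - y) ** 2

x, y, w, b = 2.0, 150.0, 1.0, 1.0
y_hat = forward(x, w, b)

# Analytic partial derivatives from the Chain Rule above.
dL_dw = 2 * x * (y_hat - y)
dL_db = 2 * (y_hat - y)

# Numeric check: nudge one parameter, hold the other one constant.
eps = 1e-6
num_dw = (loss(y, forward(x, w + eps, b)) - loss(y, forward(x, w - eps, b))) / (2 * eps)
num_db = (loss(y, forward(x, w, b + eps)) - loss(y, forward(x, w, b - eps))) / (2 * eps)
```

If the formulas are right, the analytic and numeric values agree to within floating-point noise.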
💡 Insights and Mistakes
Developer's Insight: Disconnected Pipelines & Self-Scaling Math
During this implementation, I ran into an architectural bug and discovered a fascinating mathematical property of gradient descent.
1. The Disconnected Pipeline Bug
Initially, the math wasn't working because I forgot to add the bias into the forward_pass function. Even though my gradient formulas were perfect, the parameters updating at the bottom of the loop were disconnected from the prediction generation at the top. If a parameter isn't used in the forward pass, it cannot impact the loss, rendering backpropagation useless.
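A hypothetical reproduction of the bug: the forward pass below accepts b but never uses it, so no amount of bias updating can move the loss.

```python
def buggy_forward(x: float, w: float, b: float) -> float:
    return w * x  # BUG: the bias is accepted but never used in the prediction

x, y, w, b = 2.0, 150.0, 1.0, 1.0
loss_before = (buggy_forward(x, w, b) - y) ** 2

# "Update" the bias with its gradient formula anyway...
b = b - 0.05 * 2 * (buggy_forward(x, w, b) - y)

# ...and the loss does not move at all: the update never reaches the prediction.
loss_after = (buggy_forward(x, w, b) - y) ** 2
```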
2. Non-Linear Step Scaling
I ran experiments adjusting the learning_rate (). I noticed that increasing the learning rate by 5 times (from 0.01 to 0.05) decreased the required epochs by almost 7 times (from 140 to 22). The speedup isn't perfectly 1:1; larger learning rates compound their efficiency right up until the point they cause the model to diverge.
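The experiment can be sketched with a small helper that counts epochs until the squared error drops below a tolerance. The exact counts depend on that (arbitrary) tolerance, so they won't match the 140 and 22 above exactly, but the non-linear speedup shows up the same way:

```python
def epochs_to_converge(lr: float, tol: float = 1e-6, max_epochs: int = 10_000) -> int:
    """Count epochs until the squared error drops below tol (sketch; tol is arbitrary)."""
    x, y, w, b = 2.0, 200.0, 1.0, 1.0
    for epoch in range(1, max_epochs + 1):
        y_hat = (w * x) + b
        if (y_hat - y) ** 2 < tol:
            return epoch
        w -= lr * 2 * x * (y_hat - y)
        b -= lr * 2 * (y_hat - y)
    return max_epochs

slow = epochs_to_converge(0.01)  # small steps: many epochs
fast = epochs_to_converge(0.05)  # 5x the learning rate: more than 5x fewer epochs
```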
3. The Self-Scaling Gradient
I tested changing the target dataset. I increased y from 150 to 200, and then increased both x and y simultaneously. To my surprise, the model converged in the exact same number of epochs.
Why? Because the gradient equation 2 * x * (y_hat - y) scales itself. If the target is much larger, the initial error is massive. This causes the gradient to output a massive step size right out of the gate. The math automatically leaps further to cover the larger distance in the same amount of time.
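With a single training example and a fixed x, this has a clean explanation: every epoch multiplies the remaining error by the same fixed fraction, no matter how large the target is. A bigger target means a bigger starting error, and therefore proportionally bigger steps, but the same number of halvings to converge. A sketch (the `error_after` helper is mine, for illustration):

```python
def error_after(n_epochs: int, y: float, x: float = 2.0, lr: float = 0.05) -> float:
    """Remaining error (y_hat - y) after n_epochs of gradient descent (sketch)."""
    w, b = 1.0, 1.0
    for _ in range(n_epochs):
        y_hat = (w * x) + b
        w -= lr * 2 * x * (y_hat - y)
        b -= lr * 2 * (y_hat - y)
    return ((w * x) + b) - y

# The fraction of error remaining after one epoch is the same for both targets,
# even though the absolute step sizes are very different.
decay_150 = error_after(1, 150.0) / error_after(0, 150.0)
decay_300 = error_after(1, 300.0) / error_after(0, 300.0)
```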
⚙️ The Code
def forward_pass(x: float, w: float, b: float) -> float:
    """Return the prediction (y_hat) of our model."""
    return (w * x) + b

def calculate_loss(y: float, y_hat: float) -> float:
    """Return the squared error loss."""
    return (y_hat - y) ** 2

def get_gradient_w(x: float, y: float, y_hat: float) -> float:
    """Uses the Chain Rule to calculate how much to change the weight."""
    return 2 * x * (y_hat - y)

def get_gradient_b(x: float, y: float, y_hat: float) -> float:
    """Uses the Chain Rule to calculate how much to change the bias."""
    return 2 * (y_hat - y)
# House that is 2000 SqFt (x = 2.0). True price is $200k (y = 200.0).
x = 2.0
y = 200.0
# Initial random parameters
w = 1.0
b = 1.0
learning_rate = 0.05
epochs = 200
for epoch in range(epochs):
    y_hat = forward_pass(x, w, b)
    loss = calculate_loss(y, y_hat)
    w_gradient = get_gradient_w(x, y, y_hat)
    b_gradient = get_gradient_b(x, y, y_hat)
    w = w - learning_rate * w_gradient
    b = b - learning_rate * b_gradient
    print(f"Epoch {epoch + 1}: Weight = {w:.4f}, Bias = {b:.4f}, Loss = {loss:.8f}")

Code Breakdown
- forward_pass(...): Updated to include the bias parameter. Without this, the bias gradient has no effect!
- get_gradient_w vs get_gradient_b: We compute two separate partial derivatives. The weight gradient is scaled by x, while the bias gradient is scaled by 1.
- Both parameters are updated simultaneously using their respective gradients and the shared learning rate.