The Calculus of Regularization: Ridge (L2)
🧠The Theory
AI/ML Concept: The Rubber Band Effect
Why does squaring the weights fix multicollinearity? It comes down to the geometry of exponents.
Imagine two features are perfectly correlated, and the model needs them to contribute a total value of 1.0 to the prediction.
- Scenario A (Unregularized): The model sets, for example, $w_1 = 100$ and $w_2 = -99$. The net effect is 1.0. The MSE is happy. But the penalty calculates: $100^2 + (-99)^2 = 19801$. Massive penalty!
- Scenario B (Regularized): The model sets $w_1 = 0.5$ and $w_2 = 0.5$. The net effect is still 1.0. The MSE is still happy. But the penalty calculates: $0.5^2 + 0.5^2 = 0.5$. Very low penalty!
Regularization acts like a rubber band attached to every weight, pulling them toward zero. The larger the weight grows, the harder the rubber band pulls back. This mathematically forces the algorithm to distribute the workload evenly across correlated features instead of letting one dominate.
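The two scenarios can be checked numerically. The sketch below uses hypothetical weight pairs chosen for illustration (both sum to 1.0) and a made-up sample with two identical features; `l2_penalty` is a helper name introduced here, not part of the article's code.

```python
import numpy as np

# Hypothetical weight pairs: both sum to 1.0, so two identical
# (perfectly correlated) features yield the same prediction either way.
w_cancelling = np.array([100.0, -99.0])  # large weights that cancel out
w_balanced = np.array([0.5, 0.5])        # workload shared evenly


def l2_penalty(w: np.ndarray) -> float:
    """Sum of squared weights (the Ridge penalty before scaling by lambda)."""
    return float(np.sum(w ** 2))


x = np.array([2.0, 2.0])  # one sample whose two features are identical

print("Predictions:", x @ w_cancelling, x @ w_balanced)
print("Penalties:  ", l2_penalty(w_cancelling), l2_penalty(w_balanced))
```

Both weight pairs give the same prediction, so the MSE cannot tell them apart; only the penalty separates them.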
📐The Math
Math: Penalizing Confidence ($L_2$ Ridge)
If a model is allowed to grow its weights to infinity, it will exploit highly correlated features by assigning massive positive and negative weights that cancel each other out. This makes the model extremely unstable in production.
To stop this, we change the Loss Function. We add a Penalty Term that punishes the model simply for having large weights.
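The $L_2$ (Ridge) penalty adds the sum of the squared weights to the Mean Squared Error, scaled by a tuning parameter called Lambda ($\lambda$):
$$J(w, b) = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2 + \frac{\lambda}{N} \sum_{j} w_j^2$$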
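As a rough sketch, the penalized loss could be computed as follows. It assumes a linear model `y_hat = X @ w + b` and a penalty divided by `N`, matching the gradient code later in this section; the function name `ridge_loss` and the toy data are illustrative only.

```python
import numpy as np


def ridge_loss(X, y, w, b, lambda_param):
    """MSE plus the L2 penalty, with the penalty divided by N."""
    N = len(y)
    y_hat = X @ w + b
    mse = np.sum((y_hat - y) ** 2) / N
    penalty = (lambda_param / N) * np.sum(w ** 2)  # the bias b is not penalized
    return mse + penalty


# Toy data, made up for illustration.
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
w = np.array([0.5, 0.5])

print("Loss with lambda = 0: ", ridge_loss(X, y, w, b=0.0, lambda_param=0.0))
print("Loss with lambda = 10:", ridge_loss(X, y, w, b=0.0, lambda_param=10.0))
```

With `lambda_param=0.0` the function reduces to the plain MSE; any positive value adds a cost that depends only on the size of the weights.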
The Calculus Update:
Because we changed the Loss function, we must take the derivative of this new penalty term to update our gradients. The derivative of $w_j^2$ with respect to a specific weight $w_j$ is $2w_j$.
So, our new Batch Gradient for the weights becomes:
$$\frac{\partial J}{\partial w} = \frac{2}{N} X^T (\hat{y} - y) + \frac{2\lambda}{N} w$$
(Note: We divide the penalty by $N$ to keep it on the same scale as our MSE average).
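One way to sanity-check a gradient formula like this is to compare it against a finite-difference estimate of the loss. The sketch below assumes the loss is MSE plus `(lambda / N) * sum(w**2)`; the helper names, random data, and step size `eps` are all illustrative choices.

```python
import numpy as np


def ridge_loss(X, y, w, b, lam):
    N = len(y)
    return np.sum((X @ w + b - y) ** 2) / N + (lam / N) * np.sum(w ** 2)


def ridge_weight_gradient(X, y, w, b, lam):
    N = len(y)
    error = X @ w + b - y
    return (2 / N) * (X.T @ error) + (2 * lam / N) * w


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
w = rng.normal(size=3)
b, lam, eps = 0.3, 5.0, 1e-6

# Central differences: nudge one weight at a time and measure the loss change.
numeric = np.zeros_like(w)
for j in range(len(w)):
    step = np.zeros_like(w)
    step[j] = eps
    loss_up = ridge_loss(X, y, w + step, b, lam)
    loss_down = ridge_loss(X, y, w - step, b, lam)
    numeric[j] = (loss_up - loss_down) / (2 * eps)

analytic = ridge_weight_gradient(X, y, w, b, lam)
print("Max abs difference:", np.max(np.abs(analytic - numeric)))
```

If the analytic gradient is right, the difference should be tiny, limited only by floating-point rounding.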
Crucial Rule: We never regularize the bias ($b$). The bias just shifts the baseline up or down; it doesn't cause overfitting. The gradient for the bias remains unchanged: $\frac{\partial J}{\partial b} = \frac{2}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)$.
💡Insights and Mistakes
Developer's Insight: Reading the Rubber Band
When I first wrote my regularized gradient function, I ran a quick test using a massive fake weight ($w = 200$) and a $\lambda$ of $100$.
The base, unpenalized MSE gradient was -5.0. In gradient descent, we subtract the gradient, meaning the raw MSE was actively telling the model to increase the weight and make it even more massive!
When I printed the final regularized gradient, it had flipped to 39999.97. The penalty completely overpowered the MSE. By flipping the gradient positive, it forces the update rule to heavily subtract from the weight, snapping it back down toward zero.
The Bug: I also realized exactly why my output was basically $40000$ (which is $2 \times 100 \times 200$, i.e. $2 \lambda w$). I forgot to divide my penalty calculation by $N$! If you don't divide by the number of data points, the penalty gradient is $N$ times stronger than intended relative to the averaged MSE gradient (100 times stronger in this test), and it crushes your weights toward zero regardless of the MSE. Dividing by $N$ puts the penalty on the same per-sample scale as the error term.
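The size of the bug is easy to see by computing the penalty gradient both ways. This sketch reuses the test values from the code in this section (`N = 100`, `lambda_param = 100`, weights of 200 and 150); the variable names `penalty_without_n` and `penalty_with_n` are introduced here for illustration.

```python
import numpy as np

N = 100
lambda_param = 100
w = np.array([200.0, 150.0])

penalty_without_n = 2 * lambda_param * w     # the buggy version
penalty_with_n = (2 * lambda_param / N) * w  # the corrected version

print("Without dividing by N:", penalty_without_n)
print("Dividing by N:        ", penalty_with_n)
```

The two results differ by exactly a factor of `N`, so the gap widens as the dataset grows.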
⚙️The Code
```python
import numpy as np


def get_regularized_gradients(
    X: np.ndarray,
    y: np.ndarray,
    y_hat: np.ndarray,
    w: np.ndarray,
    lambda_param: float,
) -> tuple[np.ndarray, float]:
    """
    Calculates the gradients for w and b, including the L2 (Ridge) penalty.
    """
    N = len(y)
    error_vector = y_hat - y

    # Bias gradient: plain MSE gradient, never penalized.
    b_gradient = (2 / N) * np.sum(error_vector)

    # Base MSE gradient for the weights.
    w_gradients_base = (2 / N) * np.dot(X.T, error_vector)

    # L2 penalty gradient, divided by N to match the scale of the averaged MSE.
    penalty_gradient = ((2 * lambda_param) / N) * w

    w_gradients_final = w_gradients_base + penalty_gradient
    return w_gradients_final, b_gradient


# --- A Quick Test ---
N = 100
w_massive = np.array([200.0, 150.0])
# Hard-coded illustrative values; the function computes its own base
# gradient from X, y and y_hat, so these are not used in the calculation.
w_gradients_base = np.array([-5.0, -2.0])
np.random.seed(42)  # For reproducibility
lambda_param = 100

w_final, b = get_regularized_gradients(
    X=np.random.rand(N, 2),
    y=np.random.rand(N),
    y_hat=np.random.rand(N),
    w=w_massive,
    lambda_param=lambda_param,
)

print("Illustrative base gradients (hard-coded, without penalty):", w_gradients_base)
print("Final Gradients (with L2 penalty) for Lambda =", lambda_param, ":", w_final)
print("Bias Gradient:", b)
```
Code Breakdown
This is the updated gradient descent math. We isolate the penalty gradient, explicitly ensure we divide it by N so it scales properly with the dataset, and add it to our base weight gradients. The bias is left completely unpenalized.
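To see the rubber band at work, the same gradient can be dropped into a small training loop. This is a minimal sketch under simplifying assumptions: no bias term, two identical features, and arbitrary choices of learning rate, step count, and starting weights. The function name `ridge_gradient_descent` is hypothetical.

```python
import numpy as np


def ridge_gradient_descent(X, y, w_init, lambda_param, learning_rate=0.5, steps=2000):
    """Batch gradient descent on MSE + (lambda / N) * sum(w**2), no bias term."""
    N = len(y)
    w = w_init.astype(float).copy()
    for _ in range(steps):
        error = X @ w - y
        gradient = (2 / N) * (X.T @ error) + (2 * lambda_param / N) * w
        w -= learning_rate * gradient
    return w


x = np.linspace(0.0, 1.0, 100)
X = np.column_stack([x, x])      # two perfectly correlated (identical) features
y = x.copy()                     # the features must contribute 1.0 in total
w_start = np.array([3.0, -2.0])  # cancelling weights that already fit y

w_plain = ridge_gradient_descent(X, y, w_start, lambda_param=0.0)
w_ridge = ridge_gradient_descent(X, y, w_start, lambda_param=1.0)

print("No penalty:  ", w_plain)
print("With penalty:", w_ridge)
```

Without the penalty, the starting weights already fit the data, so gradient descent has no reason to move them. With the penalty, the two weights are pulled to the same value, and their sum lands slightly below 1.0 because the penalty also shrinks the total a little.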