Evaluation Metrics: The R-Squared Proof
The Theory
AI/ML Concept: Proving Your Representation
Mean Squared Error (MSE) is great for the gradient descent loop, but it is terrible for human evaluation. An MSE of 15,000 doesn't mean anything unless you know the scale of the dataset. R-squared ($R^2$) normalizes the error into a scale-free ratio, acting much like a percentage score.
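To make the scale problem concrete, here is a minimal sketch in plain Python. The readings are made-up numbers, and the helper names `mse` and `r2` are my own for illustration. The same predictions are scored once in kWh and once in Wh:

```python
def mse(y_true, y_pred):
    """Mean squared error: the average squared residual."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """R-squared: 1 minus (model error / mean-model error)."""
    mean = sum(y_true) / len(y_true)
    tss = sum((t - mean) ** 2 for t in y_true)
    rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return 1 - rss / tss


# Hypothetical readings in kWh, and the same readings expressed in Wh.
y_true_kwh = [10.0, 12.0, 15.0, 11.0]
y_pred_kwh = [11.0, 12.0, 14.0, 10.0]
y_true_wh = [v * 1000 for v in y_true_kwh]
y_pred_wh = [v * 1000 for v in y_pred_kwh]

mse_kwh, mse_wh = mse(y_true_kwh, y_pred_kwh), mse(y_true_wh, y_pred_wh)
r2_kwh, r2_wh = r2(y_true_kwh, y_pred_kwh), r2(y_true_wh, y_pred_wh)

print(mse_kwh, mse_wh)  # MSE grows by a factor of 1000**2
print(r2_kwh, r2_wh)    # R-squared is the same in both units
```

Changing the unit multiplies the MSE by a million, while the ratio RSS/TSS, and therefore $R^2$, does not move.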
By implementing $R^2$, we can run an A/B test to prove why feature engineering is so powerful.
When you train your model on the raw matrix, it will struggle to draw a straight line through parabolic temperatures and seasonal sine waves. Its Residual Sum of Squares (RSS) will be high, resulting in a low $R^2$ score. When you train on the engineered matrix, it will effortlessly bend to fit the data, dropping the RSS and skyrocketing the $R^2$ score. This metric scientifically proves that "Linear Regression is powerful if you give it the right representation."
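The A/B idea can be sketched on a toy problem. This is not the training loop from these notes: I am assuming a closed-form least-squares fit (`np.linalg.lstsq`) in place of gradient descent, and the helper `fit_and_score` and the temperature data are hypothetical.

```python
import numpy as np


def fit_and_score(X: np.ndarray, y: np.ndarray) -> float:
    """Fit ordinary least squares (with a bias column) and return training R^2."""
    X_bias = np.column_stack((np.ones(len(X)), X))
    weights, *_ = np.linalg.lstsq(X_bias, y, rcond=None)
    y_pred = X_bias @ weights
    rss = np.sum((y - y_pred) ** 2)
    tss = np.sum((y - np.mean(y)) ** 2)
    return float(1 - rss / tss)


rng = np.random.default_rng(0)
temp = rng.uniform(5, 35, 500)  # hypothetical temperatures in C
# Parabolic energy response centered at 22 C, plus noise
energy = 2.5 * (temp - 22) ** 2 + rng.normal(0, 20, 500)

X_raw = temp.reshape(-1, 1)                  # A: temperature only
X_eng = np.column_stack((temp, temp ** 2))   # B: temperature + its square

r2_raw = fit_and_score(X_raw, energy)
r2_eng = fit_and_score(X_eng, energy)
print(f"raw: {r2_raw:.3f}  engineered: {r2_eng:.3f}")
```

The algorithm is identical in both runs; only the columns it is allowed to see change.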
The Math
Math: The Coefficient of Determination ($R^2$)
To scientifically prove how much better our engineered matrix is, we compare it against the "dumbest possible model." The dumbest possible model ignores all features ($X$) and just predicts the average energy consumption ($\bar{y}$) for every single day.
We calculate two things:
- Total Sum of Squares (TSS): The total error of the "dumb" mean model. It measures how much the true energy values vary from their own average: $TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2$.
- Residual Sum of Squares (RSS): The total error of our model. It measures how much the true energy values vary from our predictions: $RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$.
Finally, we calculate $R^2$:
$$R^2 = 1 - \frac{RSS}{TSS}$$
- If our model's error (RSS) is exactly equal to the dumb model's error (TSS), $\frac{RSS}{TSS}$ equals 1, and $R^2 = 0$. Our model explains 0% of the data's variance.
- If our model makes zero mistakes, RSS is 0, and $R^2 = 1$. Our model explains 100% of the variance.
- Because RSS cannot be negative, $R^2$ can never be greater than 1.
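The three cases can be checked by hand on four made-up values. The sketch below uses plain Python lists and a hypothetical helper `r_squared`. It also shows a case the list does not mention: there is no lower bound, so a model that does worse than the mean gets a negative $R^2$.

```python
def r_squared(y_true, y_pred):
    """Return (tss, rss, r2) for plain Python lists."""
    mean = sum(y_true) / len(y_true)
    tss = sum((y - mean) ** 2 for y in y_true)
    rss = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return tss, rss, 1 - rss / tss


y_true = [2.0, 4.0, 6.0, 8.0]  # made-up values, mean = 5.0

mean_model = r_squared(y_true, [5.0, 5.0, 5.0, 5.0])  # predicts the mean
perfect = r_squared(y_true, [2.0, 4.0, 6.0, 8.0])     # no mistakes
bad_model = r_squared(y_true, [8.0, 6.0, 4.0, 2.0])   # worse than the mean

print(mean_model)  # (20.0, 20.0, 0.0)
print(perfect)     # (20.0, 0.0, 1.0)
print(bad_model)   # (20.0, 80.0, -3.0)
```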
Insights and Mistakes
Developer's Insight: The Overfitting Illusion
When I wrote my $R^2$ function, I decided to test it using a tiny subset of my data (7 rows). The output completely baffled me:
- Raw Matrix $R^2$: 0.9995
- Engineered Matrix $R^2$: 0.9977
Both models scored basically 100%, and the "worse" raw matrix actually scored slightly higher! Did my feature engineering fail?
The Insight: My feature engineering didn't fail; I fell into the trap of overfitting. My engineered matrix had 8 features, but I only tested it on 7 rows of data. In linear algebra, if you have more variables than data points, the algorithm doesn't have to learn any underlying patterns; it can just solve the system of equations perfectly to memorize the data points. Even the raw matrix, with its 6 feature columns, had almost as many unknowns as rows, which is why it scored just as high.
When a model just connects the dots through memorization, its RSS drops to near 0, creating an artificial $R^2$ score of ~1.0. This proved to me that you can never trust an evaluation metric if you don't have significantly more data points than features.
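The memorization effect is easy to reproduce without any real data. In this sketch both the features and the target are pure random noise, so there is nothing to learn. I am again assuming a closed-form least-squares fit via `np.linalg.lstsq`, not the gradient descent loop used elsewhere in these notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_features = 7, 8

# Features and target are independent random noise: there is no pattern to learn.
X = rng.normal(size=(n_rows, n_features))
y = rng.normal(size=n_rows)

# More unknown weights (8) than equations (7): least squares can hit every point.
weights, *_ = np.linalg.lstsq(X, y, rcond=None)
y_pred = X @ weights

rss = np.sum((y - y_pred) ** 2)
tss = np.sum((y - np.mean(y)) ** 2)
r2_tiny = float(1 - rss / tss)
print(f"R^2 on 7 rows of pure noise: {r2_tiny:.6f}")
```

A near-perfect score on data with no structure at all is the clearest sign that the number reflects the row count, not the model.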
After falling into the overfitting trap with a 7-row subset, I re-ran the evaluation on the full, 1000-day dataset. The metrics stabilized into their true mathematical reality:
- Engineered Feature Dataset $R^2$: 0.924086 (92.4%)
- Raw Dataset $R^2$: 0.895196 (89.5%)
The Insight: Volume exposes the truth. When the model had to generalize across 1000 days of random noise, cyclical seasons, and interacting variables, the naive straight-line matrix capped out at 89.5%.
By providing the model with a non-linear vocabulary (polynomials and sine waves), the engineered matrix successfully explained an additional ~3% of the chaotic variance. In real-world machine learning, squeezing an extra 3% out of a noisy system without changing the underlying algorithm is a massive architectural victory. It mathematically proves that linear regression's power is entirely bound by the representation of its data.
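The way the score settles as rows are added can be sketched on synthetic data where the honest answer is known in advance. Everything here is an assumption for illustration: one real signal column, 7 junk columns, a noise level chosen so the true $R^2$ is 0.9, and a closed-form least-squares fit through the hypothetical helper `training_r2`.

```python
import numpy as np


def training_r2(n_rows: int, rng: np.random.Generator) -> float:
    """Fit OLS on one real signal plus 7 junk columns and return training R^2."""
    signal = rng.normal(size=n_rows)
    junk = rng.normal(size=(n_rows, 7))           # columns unrelated to y
    y = 3.0 * signal + rng.normal(size=n_rows)    # true R^2 is 9 / (9 + 1) = 0.9
    X = np.column_stack((np.ones(n_rows), signal, junk))
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ weights) ** 2)
    tss = np.sum((y - np.mean(y)) ** 2)
    return float(1 - rss / tss)


rng = np.random.default_rng(0)
scores = {n: training_r2(n, rng) for n in (9, 30, 1000)}
for n, score in scores.items():
    print(f"{n:>5} rows -> training R^2 = {score:.4f}")
```

With 9 rows and 9 weights the fit is exact and the score is inflated to 1.0; with 1000 rows the junk columns can no longer memorize anything and the score lands close to the 0.9 the signal actually explains.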
The Code
```python
import math

import numpy as np


def generate_smart_building_data(num_days: int = 1000) -> np.ndarray:
    """Generates a synthetic, non-linear dataset for energy consumption."""
    np.random.seed(42)

    # Base features
    day_of_year = np.arange(1, num_days + 1) % 365
    day_of_year[day_of_year == 0] = 365
    is_weekend = (np.arange(num_days) % 7 >= 5).astype(int)
    # Temperature follows a seasonal sine wave + daily noise
    avg_temperature = 15 + 10 * np.sin(2 * np.pi * day_of_year / 365) + np.random.normal(0, 3, num_days)
    working_hours = np.where(is_weekend == 1, np.random.uniform(0, 4, num_days), np.random.uniform(8, 12, num_days))
    num_occupants = np.where(is_weekend == 1, np.random.randint(0, 50, num_days), np.random.randint(200, 500, num_days))
    # Devices strongly correlate with occupants
    num_devices_on = (num_occupants * np.random.uniform(1.5, 2.5, num_days)).astype(int)

    # Constructing the target variable (energy consumption)
    base_load = 500.0
    # Non-linear temperature effect (parabola centered at 22C)
    temp_effect = 2.5 * (avg_temperature - 22) ** 2
    # Interaction effect
    occupancy_effect = 0.5 * (num_occupants * working_hours)
    # Seasonal background load
    seasonal_effect = 100 * np.sin(2 * np.pi * day_of_year / 365)
    # Target generation
    energy_consumption = base_load + temp_effect + occupancy_effect + seasonal_effect

    # Lag dependency (yesterday's consumption affects today's baseline),
    # applied iteratively so each day builds on the already-lagged previous day
    for i in range(1, num_days):
        energy_consumption[i] += 0.2 * energy_consumption[i - 1]

    # The irreducible error (noise)
    noise = np.random.normal(0, 150, num_days)
    energy_consumption += noise

    dataset = np.column_stack((
        avg_temperature,
        num_occupants,
        working_hours,
        is_weekend,
        num_devices_on,
        day_of_year,
        energy_consumption,
    ))
    return dataset


def engineer_features(raw_data: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """
    Takes the raw dataset and generates a new feature matrix X and target y.
    Raw Columns: [Temp, Occupants, Hours, Weekend, Devices, DayOfYear, Energy]
    Indices:       0      1         2       3        4        5         6
    """
    X_engineered = []
    y = []
    for row in raw_data:
        # Devices is left out of the engineered matrix; DayOfYear enters
        # only through its sin/cos encoding below.
        temp, occupants, hours, weekend, _devices, day_of_year, target_energy = row

        # Base features (pass them through)
        new_row = [temp, occupants, hours, weekend]
        # Polynomial feature: Temp squared
        new_row.append(temp ** 2)
        # Interaction term: Occupants * Hours
        new_row.append(occupants * hours)
        # Cyclical encoding: sin and cos of DayOfYear
        sin_day = math.sin(2 * math.pi * day_of_year / 365)
        cos_day = math.cos(2 * math.pi * day_of_year / 365)
        new_row.append(sin_day)
        new_row.append(cos_day)

        X_engineered.append(new_row)
        y.append(target_energy)
    return np.array(X_engineered), np.array(y)


def calculate_r2(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """
    Calculates the R-squared metric.
    """
    y_true_mean = np.mean(y_true)
    tss = np.sum((y_true - y_true_mean) ** 2)  # error of the mean-only model
    rss = np.sum((y_true - y_pred) ** 2)       # error of our model
    # Guard against division by zero when every target value is identical
    r2 = 1 - (rss / tss) if tss != 0 else 0.0
    return float(r2)
```
Code Breakdown
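The calculate_r2 function implements the Coefficient of Determination.
Notice how calculating TSS simply subtracts the mean from every single true value, then squares and sums the differences. This is the error of the mean-only model, which serves as the baseline that RSS is compared against.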