Evaluation Metrics: The R-Squared Proof


🧠The Theory

AI/ML Concept: Proving Your Representation

Mean Squared Error (MSE) is great for the Gradient Descent loop, but it is terrible for human evaluation. An MSE of 15,000 doesn't mean anything unless you know the scale of the dataset. $R^2$ normalizes the error into a scale-free ratio, acting much like a percentage score.
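To make the scale problem concrete, here is a minimal sketch that scores the same predictions twice, once in kWh and once in Wh. The readings and the helper names `mse` and `r2` are invented for illustration; they are not part of the lesson's code.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: depends on the units of y."""
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """R^2 = 1 - RSS / TSS: a unitless ratio."""
    rss = np.sum((y_true - y_pred) ** 2)
    tss = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1 - rss / tss)

# Hypothetical energy readings (kWh) and a model's predictions for them.
y_kwh = np.array([500.0, 620.0, 710.0, 830.0, 940.0])
pred_kwh = np.array([520.0, 600.0, 700.0, 850.0, 930.0])

# The exact same data expressed in Wh.
y_wh = y_kwh * 1000
pred_wh = pred_kwh * 1000

mse_kwh, mse_wh = mse(y_kwh, pred_kwh), mse(y_wh, pred_wh)
r2_kwh, r2_wh = r2(y_kwh, pred_kwh), r2(y_wh, pred_wh)

print(f"MSE: {mse_kwh:.1f} (kWh) vs {mse_wh:.1f} (Wh)")
print(f"R^2: {r2_kwh:.4f} (kWh) vs {r2_wh:.4f} (Wh)")
```

The MSE grows by a factor of a million when the unit changes, while $R^2$ stays the same, because the unit cancels in the ratio RSS / TSS.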

By implementing $R^2$, we can run an A/B test to prove why feature engineering is so powerful.

When you train your model on the raw matrix, it will struggle to draw a straight line through parabolic temperatures and seasonal sine waves. Its RSS will be high, resulting in a low $R^2$ score. When you train on the engineered matrix, the model can bend to fit the data, dropping the RSS and raising the $R^2$ score. This metric gives quantitative evidence that "Linear Regression is powerful if you give it the right representation."
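The A/B idea can be sketched on a toy problem. The example below is an assumption-laden stand-in, not the lesson's dataset: it invents a single temperature feature with a parabolic effect centered at 22, fits ordinary least squares via `np.linalg.lstsq` (the lesson's own model may be trained differently, e.g. with Gradient Descent), and compares the raw column against a matrix that also contains the squared term. The helper `fit_and_score` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical toy data: energy is a parabola in temperature plus noise.
temp = rng.uniform(0, 40, size=500)
energy = 500 + 2.5 * (temp - 22) ** 2 + rng.normal(0, 50, size=500)

def fit_and_score(X, y):
    """Fit least squares with a bias column and return the in-sample R^2."""
    Xb = np.column_stack([np.ones(len(X)), X])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    pred = Xb @ w
    rss = np.sum((y - pred) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return float(1 - rss / tss)

# A: raw representation (temperature only).
X_raw = temp.reshape(-1, 1)
# B: engineered representation (temperature and temperature squared).
X_eng = np.column_stack([temp, temp ** 2])

r2_raw = fit_and_score(X_raw, energy)
r2_eng = fit_and_score(X_eng, energy)

print(f"raw R^2:        {r2_raw:.3f}")
print(f"engineered R^2: {r2_eng:.3f}")
```

One caveat on the design: both scores here are measured on the training rows, and adding columns can never lower the training $R^2$ of a least-squares fit. A fair comparison needs far more rows than features, or a held-out set.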

📐The Math

Math: The Coefficient of Determination ($R^2$)

To scientifically prove how much better our engineered matrix is, we compare it against the "dumbest possible model." The dumbest possible model ignores all features ($X$) and just predicts the average energy consumption ($\bar{y}$) for every single day.

We calculate two things:

Total Sum of Squares (TSS): The total error of the "dumb" mean model. It measures how much the true energy values vary from their own average.
$TSS = \sum (y_i - \bar{y})^2$
Residual Sum of Squares (RSS): The total error of our model. It measures how much the true energy values vary from our predictions.
$RSS = \sum (y_i - \hat{y}_i)^2$

Finally, we calculate $R^2$:
$R^2 = 1 - \frac{RSS}{TSS}$

If our model's error (RSS) is exactly equal to the dumb model's error (TSS), $\frac{RSS}{TSS}$ equals 1, and $1 - 1 = 0$. Our model explains 0% of the data's variance.
If our model makes zero mistakes, RSS is 0, and $1 - 0 = 1$. Our model explains 100% of the variance.
Because RSS cannot be negative, $R^2$ can never be greater than 1. It can, however, drop below 0: if our model's RSS is larger than the TSS, it is doing worse than the dumb mean model.
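The cases above can be checked by hand on four made-up values. This is a small worked sketch; the numbers and the helper `r2_from` are invented for illustration. It also includes a model that is worse than the mean, whose RSS exceeds the TSS and whose $R^2$ therefore falls below 0.

```python
import numpy as np

y = np.array([2.0, 4.0, 6.0, 8.0])
y_bar = y.mean()                    # 5.0
tss = np.sum((y - y_bar) ** 2)      # 9 + 1 + 1 + 9 = 20

def r2_from(pred):
    """R^2 of a prediction vector against the fixed targets y."""
    rss = np.sum((y - pred) ** 2)
    return float(1 - rss / tss)

perfect = y.copy()                          # RSS = 0
mean_model = np.full_like(y, y_bar)         # RSS = TSS = 20
decent = np.array([2.5, 3.5, 6.5, 7.5])     # RSS = 4 * 0.25 = 1
reversed_trend = np.array([8.0, 6.0, 4.0, 2.0])  # RSS = 36 + 4 + 4 + 36 = 80

r2_perfect = r2_from(perfect)           # 1 - 0/20  = 1.0
r2_mean = r2_from(mean_model)           # 1 - 20/20 = 0.0
r2_decent = r2_from(decent)             # 1 - 1/20  = 0.95
r2_reversed = r2_from(reversed_trend)   # 1 - 80/20 = -3.0

print(r2_perfect, r2_mean, r2_decent, r2_reversed)
```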

💡Insights and Mistakes

Developer's Insight: The Overfitting Illusion

When I wrote my $R^2$ function, I decided to test it using a tiny subset of my data (7 rows). The output completely baffled me:

Raw Matrix $R^2$ : 0.9995
Engineered Matrix $R^2$ : 0.9977

Both models scored basically 100%, and the "worse" raw matrix actually scored slightly higher! Did my feature engineering fail?

The Insight: My feature engineering didn't fail; I fell into the trap of overfitting. My engineered matrix had 8 features, but I only tested it on 7 rows of data. In linear algebra, if you have at least as many unknown parameters as data points, the algorithm doesn't have to learn any underlying pattern: it can solve the system of equations exactly and simply memorize the data points. With so few rows, even the raw matrix has almost as many parameters as data points, which is why both scores sit near 1.0.

When a model just connects the dots through memorization, its RSS drops to near 0, creating an artificial $R^2$ score of ~1.0. This proved to me that you can never trust an evaluation metric if you don't have significantly more data points than features.
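The memorization effect is easy to reproduce. The sketch below is a hypothetical setup rather than the lesson's data: it fits 7 rows of pure random noise with 8 random features using `np.linalg.lstsq`, then scores the same weights on fresh noise. The helper names are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_bias(X):
    """Prepend a column of ones so the model has an intercept."""
    return np.column_stack([np.ones(len(X)), X])

def r2(y_true, y_pred):
    rss = np.sum((y_true - y_pred) ** 2)
    tss = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1 - rss / tss)

# 7 rows, 8 features, and a target that is pure noise:
# there is no pattern here for the model to learn.
X_train = rng.normal(size=(7, 8))
y_train = rng.normal(size=7)

# More unknowns (8 weights + bias) than equations (7 rows),
# so least squares can pass through every training point.
w, *_ = np.linalg.lstsq(add_bias(X_train), y_train, rcond=None)
train_r2 = r2(y_train, add_bias(X_train) @ w)

# Fresh rows drawn from the same noise distribution.
X_new = rng.normal(size=(200, 8))
y_new = rng.normal(size=200)
new_r2 = r2(y_new, add_bias(X_new) @ w)

print(f"R^2 on the 7 training rows: {train_r2:.4f}")
print(f"R^2 on 200 unseen rows:     {new_r2:.4f}")
```

The training score is essentially 1.0 even though the target is random, while the score on unseen rows is far lower and will typically be negative. A near-perfect $R^2$ on a tiny sample says more about the row count than about the features.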

After falling into the overfitting trap with a 7-row subset, I re-ran the $R^2$ evaluation on the full, 1000-day dataset. The metrics stabilized into their true mathematical reality:

Engineered Feature Dataset $R^2$ : 0.924086 (92.4%)
Raw Dataset $R^2$ : 0.895196 (89.5%)

The Insight: Volume exposes the truth. When the model had to generalize across 1000 days of random noise, cyclical seasons, and interacting variables, the naive straight-line matrix capped out at 89.5%.

By providing the model with a non-linear vocabulary (polynomials and sine waves), the engineered matrix explained an additional ~2.9 percentage points of the variance (92.4% vs. 89.5%). In real-world machine learning, squeezing an extra 3 points out of a noisy system without changing the underlying algorithm is a massive architectural victory. It demonstrates that linear regression's power is bound by the representation of its data.

⚙️The Code

import numpy as np
import math

def generate_smart_building_data(num_days: int = 1000):
    """Generates a synthetic, non-linear dataset for energy consumption."""
    np.random.seed(42) 

    # Base Features
    day_of_year = np.arange(1, num_days + 1) % 365
    day_of_year[day_of_year == 0] = 365
    is_weekend = (np.arange(num_days) % 7 >= 5).astype(int)
    
    # Temperature follows a seasonal sine wave + daily noise
    avg_temperature = 15 + 10 * np.sin(2 * np.pi * day_of_year / 365) + np.random.normal(0, 3, num_days)
    
    working_hours = np.where(is_weekend == 1, np.random.uniform(0, 4, num_days), np.random.uniform(8, 12, num_days))
    num_occupants = np.where(is_weekend == 1, np.random.randint(0, 50, num_days), np.random.randint(200, 500, num_days))
    
    # Devices strongly correlate with occupants
    num_devices_on = (num_occupants * np.random.uniform(1.5, 2.5, num_days)).astype(int)
    
    # Constructing the Target Variable (Energy Consumption)
    base_load = 500.0
    
    # Non-linear temperature effect (parabola centered at 22C)
    temp_effect = 2.5 * (avg_temperature - 22)**2 
    
    # Interaction effect
    occupancy_effect = 0.5 * (num_occupants * working_hours)
    
    # Seasonal background load
    seasonal_effect = 100 * np.sin(2 * np.pi * day_of_year / 365)
    
    # Target generation
    energy_consumption = base_load + temp_effect + occupancy_effect + seasonal_effect
    
    # Lag Dependency (Yesterday's consumption affects today's baseline)
    # Applying the lag iteratively
    for i in range(1, num_days):
        energy_consumption[i] += 0.2 * energy_consumption[i-1]
        
    # The Irreducible Error (Noise)
    noise = np.random.normal(0, 150, num_days)
    energy_consumption += noise
    
    dataset = np.column_stack((
        avg_temperature,
        num_occupants,
        working_hours,
        is_weekend,
        num_devices_on,
        day_of_year,
        energy_consumption
    ))
    
    return dataset


def engineer_features(raw_data: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """
    Takes the raw dataset and generates a new feature matrix X and target y.
    Raw Columns: [Temp, Occupants, Hours, Weekend, Devices, DayOfYear, Energy]
    Indices:      0     1          2      3        4        5          6
    """
    X_engineered = []
    y = []
    for row in raw_data:
        temp = row[0]
        occupants = row[1]
        hours = row[2]
        weekend = row[3]
        devices = row[4]  # read for completeness; not included in the engineered matrix
        day_of_year = row[5]
        target_energy = row[6]

        # Base features (pass them through)
        new_row = [temp, occupants, hours, weekend]
        
        # Polynomial Feature: Temp Squared
        new_row.append(temp ** 2)
        
        # Interaction Term: Occupants * Hours
        new_row.append(occupants * hours)
        
        # Cyclical Encoding: Sin and Cos of DayOfYear (period of 365 days)
        sin_day = math.sin(2 * math.pi * day_of_year / 365)
        cos_day = math.cos(2 * math.pi * day_of_year / 365)
        new_row.append(sin_day)
        new_row.append(cos_day)
        
        X_engineered.append(new_row)
        y.append(target_energy)
        
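    return np.array(X_engineered), np.array(y)


def calculate_r2(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """
    Calculates the R-squared metric: R^2 = 1 - RSS / TSS.
    Returns 0.0 when TSS is 0 (all true values identical), where the ratio is undefined.
    """
    y_true_mean = np.mean(y_true)
    # TSS: squared error of the baseline that always predicts the mean
    tss = np.sum((y_true - y_true_mean) ** 2)
    # RSS: squared error of the model's predictions
    rss = np.sum((y_true - y_pred) ** 2)
    r2 = 1 - (rss / tss) if tss != 0 else 0.0
    return float(r2)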

Code Breakdown

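The calculate_r2 function implements the Coefficient of Determination.
Notice how TSS is computed by subtracting the mean from every single true value and summing the squared differences. It is the error of a model that always predicts the mean, which makes it the baseline our model's RSS is measured against. It is not a worst case: a model can score worse than the mean, which produces a negative $R^2$.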
