AI Logbook

Understanding intelligent systems from first principles.

Synthetic Data Generation: Simulating Reality

Multicollinearity and Non-Linearity
The Target Function & Irreducible Error
Building the Smart Building Dataset

🧠The Theory

AI/ML Concept: The Trap of Multicollinearity

When generating this dataset, we intentionally injected a fatal flaw that ruins naive linear regression: Multicollinearity.

This occurs when two features in your matrix are highly correlated with each other. For example, the num_occupants in a building heavily dictates the num_devices_on. If you know one, you basically know the other.

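Why is this dangerous? Because the Gradient Descent algorithm assigns blame using partial derivatives, one per weight. If two variables move together almost perfectly, the math cannot determine which variable is actually responsible for the rising energy bill: many different ways of splitting the weight between them fit the data almost equally well. The loss surface becomes a long, nearly flat valley along that trade-off, so the weights converge slowly and can swing widely with small changes in the data.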
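One way to see the problem in numbers is to measure the correlation and the condition number of the feature matrix. The sketch below is an illustration under assumptions, not the post's own code: the arrays are stand-ins that mimic how weekday num_occupants and num_devices_on are drawn, and `unrelated`, `standardize`, and `condition_number` are hypothetical names introduced only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch

# Stand-ins for the weekday num_occupants and num_devices_on columns
occupants = rng.integers(200, 500, size=1000).astype(float)
devices = occupants * rng.uniform(1.5, 2.5, size=1000)
unrelated = rng.normal(0.0, 1.0, size=1000)  # independent feature for comparison


def standardize(col):
    """Rescale a column to zero mean and unit standard deviation."""
    return (col - col.mean()) / col.std()


def condition_number(*cols):
    """Condition number of X^T X for the standardized columns."""
    X = np.column_stack([standardize(c) for c in cols])
    return np.linalg.cond(X.T @ X)


corr = np.corrcoef(occupants, devices)[0, 1]
cond_correlated = condition_number(occupants, devices)
cond_independent = condition_number(occupants, unrelated)

print(f"correlation(occupants, devices): {corr:.2f}")
print(f"condition number, correlated pair:  {cond_correlated:.1f}")
print(f"condition number, independent pair: {cond_independent:.1f}")
```

A condition number close to 1 means the two columns carry separate information. The larger it gets, the more stretched the loss surface is, and the harder it becomes to pin down each weight on its own.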

By building this into our synthetic data today, we are setting up the exact problem that $L_2$ (Ridge) Regularization is designed to solve.
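As a preview of why an $L_2$ penalty helps, here is a rough sketch using the closed-form solution rather than gradient descent. Everything in it is an assumption for illustration: the toy features `x1` and `x2`, the helper `fit`, and the penalty strength `lam=1.0` are not taken from the post.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)      # almost an exact copy of x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)  # only x1 truly drives the target


def fit(X, y, lam):
    """Closed-form least squares with an L2 penalty lam (lam=0 is plain least squares)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


w_ols = fit(X, y, lam=0.0)
w_ridge = fit(X, y, lam=1.0)

print("Unregularized weights:", np.round(w_ols, 2))
print("Ridge weights:        ", np.round(w_ridge, 2))
```

The penalty never increases the overall size of the weight vector, and with near-duplicate features it tends to share the effect between them instead of letting the two weights drift apart.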

📐The Math

Math: The Target Function & Irreducible Error

When we train a model, we are trying to approximate a true, hidden function that governs the universe. Mathematically, the real-world value ($y$) is a combination of a deterministic function ($f(X)$) and random, unpredictable noise ($\epsilon$).

$$y = f(X) + \epsilon$$

To simulate our Smart Building, we need to design $f(X)$ to be intentionally messy:

  1. Non-Linearity: Energy consumption isn't a straight line with temperature. If the ideal building temperature is 22°C, energy spikes when it gets hotter (AC) and when it gets colder (Heating). This forms a parabola.
    $$E_{\text{temp}} = \beta_1 (T - 22)^2$$
  2. Seasonality: The day of the year naturally cycles. We represent this continuous loop using a sine wave, where $d$ is the day of the year.
    $$E_{\text{season}} = \beta_2 \sin\left(\frac{2\pi \cdot d}{365}\right)$$
  3. Interaction Terms: A building only uses massive energy if there are many occupants and it is open for many hours. We multiply them to create a combined effect.
    $$E_{\text{active}} = \beta_3 (\text{occupants} \times \text{hours})$$
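Putting the three terms together gives one possible written-out form of $f(X)$. The constant $\beta_0$ is an addition here, standing for the base_load in the generator code, and the numeric values are the coefficients that code uses. The lag step the generator applies afterwards is left out of this sketch.

```latex
f(X) = \beta_0
     + \beta_1 (T - 22)^2
     + \beta_2 \sin\left(\frac{2\pi \cdot d}{365}\right)
     + \beta_3 (\text{occupants} \times \text{hours}),
\qquad \beta_0 = 500,\ \beta_1 = 2.5,\ \beta_2 = 100,\ \beta_3 = 0.5

% Worked example: T = 30, d = 91, occupants = 300, hours = 10
f(X) \approx 500 + 2.5 \cdot 8^2 + 100 \cdot 1 + 0.5 \cdot 3000
     = 500 + 160 + 100 + 1500 = 2260
```

In this example the interaction term contributes far more than the temperature or seasonal terms, so at these coefficient values occupancy dominates the weekday signal.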

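Finally, we add $\epsilon$ (Gaussian noise). This is the "Irreducible Error." No matter how perfect your AI is, it can never predict the noise, because the noise is drawn independently of every input. Even a model that recovered $f(X)$ exactly would still have a mean squared error equal to the noise variance.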
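The floor set by the noise can be checked numerically. In the sketch below, the "model" is the true function itself, so any remaining error comes from the noise alone. `true_f` is a hypothetical helper that keeps only the base load and the temperature term, and the temperature range is an assumption. The noise scale of 150 matches the generator.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch
sigma = 150.0                   # noise standard deviation, as in the generator

temperature = rng.uniform(0.0, 35.0, size=100_000)


def true_f(t):
    """Simplified deterministic part: base load plus the temperature parabola."""
    return 500.0 + 2.5 * (t - 22.0) ** 2


y = true_f(temperature) + rng.normal(0.0, sigma, size=temperature.size)

# An "oracle" that knows f exactly still cannot beat the noise variance
mse_oracle = np.mean((y - true_f(temperature)) ** 2)

print(f"oracle MSE:     {mse_oracle:.0f}")
print(f"noise variance: {sigma ** 2:.0f}")
```

The two printed numbers should land close to each other, which is the practical meaning of "irreducible": no model can be expected to score below that value on fresh data.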

💡Insights and Mistakes

Developer's Insight: Reading the Gradient Output

After generating the non-linear dataset, I fed it into my LinearRegressor from Week 3 and ran a full Batch Gradient Descent to see how it would attempt to solve it.

The model converged and output a prediction, but inspecting the internal weights ([-49.29, 215.77, 97.85, 0.0, 231.04, 142.48], in the dataset's column order: temperature, occupants, hours, weekend, devices, day of year) revealed two major conceptual flaws caused by the raw matrix representation.

1. The Multicollinearity Trap
The weights for Occupants (215.77) and Devices (231.04) are both large and nearly equal. In my data generator, I explicitly made Devices highly correlated with Occupants, yet num_devices_on never enters the energy formula at all. Because the two features move almost in lockstep, gradient descent couldn't isolate the blame, so it split the credit between them and handed Devices a large weight it does not deserve. This illustrates how correlated features make the individual weights of a linear model unreliable.
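The instability can be made visible by refitting the same features against freshly drawn noise and watching how much the weights move. This is a sketch under assumptions: it uses closed-form least squares (np.linalg.lstsq) instead of the gradient descent loop from this post, and the standardized stand-in features and noise scales are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)  # seed chosen arbitrarily for this sketch
n = 500
occupants = rng.normal(size=n)                        # standardized stand-in
devices = occupants + rng.normal(scale=0.05, size=n)  # nearly a copy of occupants
X = np.column_stack([occupants, devices])

weights = []
for _ in range(200):
    # Same features each time; only the noise in the target is redrawn.
    # As in the generator, devices has no direct effect on the target.
    y = 4.0 * occupants + rng.normal(scale=1.0, size=n)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    weights.append(w)
weights = np.array(weights)

spread_individual = weights[:, 0].std()     # how much the occupants weight moves
spread_sum = weights.sum(axis=1).std()      # how much the combined weight moves

print(f"std of occupants weight: {spread_individual:.3f}")
print(f"std of weight sum:       {spread_sum:.3f}")
```

The combined effect of the two features is estimated tightly, while the split between them changes from run to run. That is why a single pair of fitted weights says little about which feature matters.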

2. The Straight Line Fallacy
The model assigned a weight of 142.48 to the Day of the year feature. In linear regression, this means the model mathematically assumes that as the day number increases (from Day 1 to Day 365), energy consumption just keeps going up.

This is completely wrong. The year is a cycle. Day 365 and Day 1 share nearly the same winter weather and should have nearly identical energy profiles. Because my input matrix simply provided raw integers (1 to 365), the linear engine tried to draw a straight line through a continuous loop. To fix this, I need to change how the day is represented in the matrix before it reaches the gradient descent loop.
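One common way to change the representation is to map the day onto a circle with a sine and cosine pair. This is an assumed fix shown for illustration, not necessarily the one used later in this series, and `encode_day_of_year` is a hypothetical helper name.

```python
import numpy as np


def encode_day_of_year(day_of_year, period=365):
    """Map day numbers onto the unit circle as (sin, cos) columns."""
    angle = 2.0 * np.pi * np.asarray(day_of_year, dtype=float) / period
    return np.column_stack((np.sin(angle), np.cos(angle)))


days = np.array([1, 183, 365])
encoded = encode_day_of_year(days)

# Distances in the encoded space reflect position in the cycle
dist_wrap = np.linalg.norm(encoded[0] - encoded[2])  # Day 1 vs Day 365
dist_half = np.linalg.norm(encoded[0] - encoded[1])  # Day 1 vs Day 183

print(f"Day 1 vs Day 365: {dist_wrap:.3f}")
print(f"Day 1 vs Day 183: {dist_half:.3f}")
```

With the raw integers, Day 1 and Day 365 sit 364 units apart. After encoding they are neighbours, and the two columns together give a linear model enough to fit a sine wave of any phase across the year.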

⚙️The Code

import numpy as np

def generate_smart_building_data(num_days: int = 1000):
    """Generates a synthetic, non-linear dataset for energy consumption."""
    np.random.seed(42) 

    # 1. Base Features
    day_of_year = np.arange(1, num_days + 1) % 365
    day_of_year[day_of_year == 0] = 365
    is_weekend = (np.arange(num_days) % 7 >= 5).astype(int)
    
    # Temperature follows a seasonal sine wave + daily noise
    avg_temperature = 15 + 10 * np.sin(2 * np.pi * day_of_year / 365) + np.random.normal(0, 3, num_days)
    
    working_hours = np.where(is_weekend == 1, np.random.uniform(0, 4, num_days), np.random.uniform(8, 12, num_days))
    num_occupants = np.where(is_weekend == 1, np.random.randint(0, 50, num_days), np.random.randint(200, 500, num_days))
    
    # Devices strongly correlate with occupants
    num_devices_on = (num_occupants * np.random.uniform(1.5, 2.5, num_days)).astype(int)
    
    # 2. Constructing the Target Variable (Energy Consumption)
    base_load = 500.0
    
    # Non-linear temperature effect (parabola centered at 22C)
    temp_effect = 2.5 * (avg_temperature - 22)**2 
    
    # Interaction effect
    occupancy_effect = 0.5 * (num_occupants * working_hours)
    
    # Seasonal background load
    seasonal_effect = 100 * np.sin(2 * np.pi * day_of_year / 365)
    
    # Target generation
    energy_consumption = base_load + temp_effect + occupancy_effect + seasonal_effect
    
    # Lag Dependency (Yesterday's consumption affects today's baseline)
    # Applying the lag iteratively
    for i in range(1, num_days):
        energy_consumption[i] += 0.2 * energy_consumption[i-1]
        
    # The Irreducible Error (Noise)
    noise = np.random.normal(0, 150, num_days)
    energy_consumption += noise
    
    dataset = np.column_stack((
        avg_temperature,
        num_occupants,
        working_hours,
        is_weekend,
        num_devices_on,
        day_of_year,
        energy_consumption
    ))
    
    return dataset

# Generate the dataset and print every row as CSV
data = generate_smart_building_data(1000)
print("Temp,Occupants,Hours,Weekend,Devices,DayOfYear,EnergyConsumption")
for row in data:
    print(",".join(str(np.round(x, 2)) for x in row))

Code Breakdown

To generate the environment, I used NumPy to handle the random distributions and the array math quickly.
I then built the target variable energy_consumption step by step: adding the different mathematical effects together, applying the lag dependency, and finishing it off with random noise.