Feature Engineering: Bending the Matrix
🧠The Theory
AI/ML Concept: Linear in Weights, Non-Linear in Features
"Linear regression is not weak. It’s actually very powerful if you give it the right representation."
When you add an $x^2$ column or a $\sin(x)$ column to your matrix, the model is still executing a strictly linear equation:
$$\hat{y} = w_0 + w_1 x + w_2 x^2 + w_3 \sin(x)$$
The algorithm is just scaling each column by its weight ($w_i$) and adding the results up. It is entirely linear with respect to the weights. But because the input features are curved, the prediction the model outputs can bend to follow the non-linear patterns in the data.
Feature engineering is the art of giving your linear model a non-linear vocabulary.
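To make "linear in weights" concrete, here is a minimal sketch. The target function and its coefficients are made up for illustration, and the solve uses NumPy's closed-form least squares rather than the gradient descent engine used elsewhere in this article. The point is only that a purely linear solver recovers the weights of a curved function once the curved columns exist.

```python
import numpy as np

# Hypothetical ground truth: curved in x, but a plain weighted sum of features.
x = np.linspace(-3.0, 3.0, 50)
y = 3.0 + 2.0 * np.sin(x) - 0.5 * x**2

# Design matrix: a column of ones (bias), a sin(x) column, and an x^2 column.
X = np.column_stack((np.ones_like(x), np.sin(x), x**2))

# Ordinary least squares: a purely linear solve for the weights.
weights, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(weights, 3))
```

With noise-free data the recovered weights should land very close to 3, 2, and -0.5, even though the curve itself is nowhere near a straight line.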
📐The Math
Math: Expanding the Feature Space
To teach a linear model how to see curves, seasons, and interactions, we do not change the algorithm. We change the data. We physically add new columns to our matrix that contain non-linear math.
1. Polynomial Features (The Parabola)
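We know energy spikes when it is very hot and when it is very cold. A straight line can't model a U-shape, but a squared term can. We engineer a new column by squaring the temperature: $\text{Temp\_Sq} = \text{Temp}^2$. Kept alongside the raw temperature column, it lets the model place the bottom of the U wherever the data puts it.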
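A small sketch of the difference the squared column makes, assuming a noise-free U-shaped load centered at 22°C (the same shape the data generator in this article uses). The helper `fit_rmse` is a hypothetical name, and the fit again uses closed-form least squares for brevity.

```python
import numpy as np

# Noise-free U-shaped load: cheapest at 22 C, rising on both sides.
temp = np.linspace(0.0, 40.0, 41)
energy = 500.0 + 2.5 * (temp - 22.0) ** 2


def fit_rmse(X, y):
    """Fit least-squares weights and return the root-mean-square error."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sqrt(np.mean((X @ w - y) ** 2)))


ones = np.ones_like(temp)
rmse_line = fit_rmse(np.column_stack((ones, temp)), energy)
rmse_parabola = fit_rmse(np.column_stack((ones, temp, temp**2)), energy)

print(f"straight line RMSE: {rmse_line:.2f}")
print(f"with Temp_Sq RMSE:  {rmse_parabola:.2f}")
```

The straight-line fit is left with a large error, while adding the squared column drives the error to effectively zero on this idealized data.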
2. Interaction Terms (The Multiplier)
Energy doesn't just depend on occupants or hours in isolation. The true draw happens when both are high. We create a new column by multiplying them together.
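As an illustration, the sketch below builds a load that depends only on the product of occupants and hours (the value ranges and the 0.5 coefficient are assumptions for the demo) and compares a purely additive model with one that has the interaction column.

```python
import numpy as np

rng = np.random.default_rng(0)
occupants = rng.integers(0, 500, size=200).astype(float)
hours = rng.uniform(0.0, 12.0, size=200)

# Load driven purely by the product: people only draw power while present.
energy = 0.5 * occupants * hours

ones = np.ones_like(hours)
X_additive = np.column_stack((ones, occupants, hours))
X_interact = np.column_stack((ones, occupants, hours, occupants * hours))

w_add, *_ = np.linalg.lstsq(X_additive, energy, rcond=None)
w_int, *_ = np.linalg.lstsq(X_interact, energy, rcond=None)

rmse_add = float(np.sqrt(np.mean((X_additive @ w_add - energy) ** 2)))
rmse_int = float(np.sqrt(np.mean((X_interact @ w_int - energy) ** 2)))

print(f"additive only RMSE:    {rmse_add:.2f}")
print(f"with interaction RMSE: {rmse_int:.2f}")
print(f"interaction weight:    {w_int[3]:.3f}")
```

No combination of separate occupant and hour weights can reproduce a product, so the additive model keeps a sizeable error; with the extra column, the interaction weight comes out at about 0.5 and the error vanishes.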
3. Cyclical Encoding (The Calendar Loop)
How do we make Day 365 sit right next to Day 1? We map the 1D timeline onto a 2D circle using Trigonometry.
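We take the day of the year ($d$) and engineer two new columns:
$$\text{Sin\_Day} = \sin\left(\frac{2\pi d}{365}\right), \qquad \text{Cos\_Day} = \cos\left(\frac{2\pi d}{365}\right)$$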
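By feeding the model both the sine and cosine, it can track its position on the calendar loop without ever assuming that December is "greater" than January. Both columns are needed: the sine alone is ambiguous, because two different days of the year share each sine value, and the cosine breaks the tie.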
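A quick way to check the "calendar loop" claim is to measure distances between encoded days. This sketch uses only the standard library; `encode_day` and `circle_distance` are hypothetical helper names.

```python
import math


def encode_day(day_of_year, period=365):
    """Map a day number onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * day_of_year / period
    return math.sin(angle), math.cos(angle)


def circle_distance(day_a, day_b):
    """Straight-line distance between two days in (sin, cos) space."""
    return math.dist(encode_day(day_a), encode_day(day_b))


print(f"Dec 31 -> Jan 1: {circle_distance(365, 1):.4f}")
print(f"Jan 1  -> Jan 2: {circle_distance(1, 2):.4f}")
print(f"Jan 1  -> Jul 2: {circle_distance(1, 183):.4f}")
```

In the encoded space, the step from Day 365 to Day 1 is the same size as any other one-day step, while days half a year apart sit on opposite sides of the circle. As raw integers, Day 365 and Day 1 would be 364 units apart.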
💡Insights and Mistakes
Developer's Insight: Reading the Non-Linear Matrix
After transforming my dataset with polynomial and cyclical features, I standardized the new matrix and ran it through my Gradient Descent engine. The weights the model learned told a fascinating story about what it "sees" in the data:
[Temp: -78.8, Occupants: 236.8, Hours: -79.4, Weekend: 0.0, Temp_Sq: -71.2, Active: 202.0, Sin_Day: 247.0, Cos_Day: 0.0]
1. The Cyclical Discovery
The model assigned a large weight of 247.0 to Sin_Day, but 0.0 to Cos_Day. This matches how the data was built: when I generated the synthetic data, I programmed the seasonal background load using only a sine wave. The linear engine recovered the phase of the wave hidden in the data without being told what it was. (Because the matrix was standardized, the weight is on the standardized scale and is not directly comparable to the raw amplitude used in the generator.)
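The same behavior can be reproduced in miniature. The sketch below is not the gradient descent engine from this article; it is a bare-bones stand-in with an assumed learning rate and iteration count, run on one noise-free year of a sine-only seasonal load.

```python
import numpy as np

# One full year of a sine-only seasonal load (no noise, no other features).
days = np.arange(1, 366)
angle = 2 * np.pi * days / 365
X = np.column_stack((np.sin(angle), np.cos(angle)))
y = 500.0 + 100.0 * np.sin(angle)

# Standardize each column to zero mean and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Plain batch gradient descent on mean squared error.
weights = np.zeros(X_std.shape[1])
bias = 0.0
learning_rate = 0.1
for _ in range(500):
    error = X_std @ weights + bias - y
    weights -= learning_rate * (2.0 / len(y)) * (X_std.T @ error)
    bias -= learning_rate * 2.0 * error.mean()

print("Sin_Day weight:", round(float(weights[0]), 2))
print("Cos_Day weight:", round(float(weights[1]), 2))
print("bias:", round(float(bias), 2))
```

The cosine weight settles at essentially zero, and the sine weight settles near 70.71 rather than 100, because standardizing divides the sine column by its standard deviation (about 0.707 over a full year). Weights learned on standardized features are in "per standard deviation" units.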
2. Dropping Redundant Features
I intentionally excluded the raw devices column and the raw day_of_year integer column from the engineered matrix. If I had kept day_of_year, the model would have tried to assign a weight to it, effectively saying, "Energy follows a wave, BUT it also linearly drifts upward every single day." By dropping the raw integer representation, I forced the model to view time strictly as a seasonal loop, curing the Straight Line Fallacy.
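For the devices column, a quick correlation check shows why it adds little. This sketch regenerates weekday-style occupants and devices with the same recipe as the data generator (the seed and sample size here are arbitrary choices for the demo).

```python
import numpy as np

rng = np.random.default_rng(42)

# Same recipe as the generator: devices are occupants times a random factor.
num_occupants = rng.integers(200, 500, size=1000)
num_devices_on = (num_occupants * rng.uniform(1.5, 2.5, size=1000)).astype(int)

# Pearson correlation between the two raw columns.
correlation = np.corrcoef(num_occupants, num_devices_on)[0, 1]
print(f"occupants vs devices correlation: {correlation:.2f}")
```

A correlation this high means the two columns carry largely the same information, so keeping both would let the model split one effect across two weights in an unstable way.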
⚙️The Code
import numpy as np
import math


def generate_smart_building_data(num_days: int = 1000):
    """Generates a synthetic, non-linear dataset for energy consumption."""
    np.random.seed(42)

    # 1. Base Features
    # Day of year cycles through 1..365 (a multiple of 365 maps to 365, not 0)
    day_of_year = np.arange(1, num_days + 1) % 365
    day_of_year[day_of_year == 0] = 365
    is_weekend = (np.arange(num_days) % 7 >= 5).astype(int)

    # Temperature follows a seasonal sine wave + daily noise
    avg_temperature = 15 + 10 * np.sin(2 * np.pi * day_of_year / 365) + np.random.normal(0, 3, num_days)
    working_hours = np.where(is_weekend == 1, np.random.uniform(0, 4, num_days), np.random.uniform(8, 12, num_days))
    num_occupants = np.where(is_weekend == 1, np.random.randint(0, 50, num_days), np.random.randint(200, 500, num_days))

    # Devices strongly correlate with occupants
    num_devices_on = (num_occupants * np.random.uniform(1.5, 2.5, num_days)).astype(int)

    # 2. Constructing the Target Variable (Energy Consumption)
    base_load = 500.0

    # Non-linear temperature effect (parabola centered at 22C)
    temp_effect = 2.5 * (avg_temperature - 22)**2

    # Interaction effect
    occupancy_effect = 0.5 * (num_occupants * working_hours)

    # Seasonal background load
    seasonal_effect = 100 * np.sin(2 * np.pi * day_of_year / 365)

    # Target generation
    energy_consumption = base_load + temp_effect + occupancy_effect + seasonal_effect

    # Lag Dependency (Yesterday's consumption affects today's baseline)
    # Applying the lag iteratively
    for i in range(1, num_days):
        energy_consumption[i] += 0.2 * energy_consumption[i-1]

    # The Irreducible Error (Noise)
    noise = np.random.normal(0, 150, num_days)
    energy_consumption += noise

    dataset = np.column_stack((
        avg_temperature,
        num_occupants,
        working_hours,
        is_weekend,
        num_devices_on,
        day_of_year,
        energy_consumption
    ))
    return dataset


def engineer_features(raw_data: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """
    Takes the raw dataset and generates a new feature matrix X and target y.

    Raw Columns: [Temp, Occupants, Hours, Weekend, Devices, DayOfYear, Energy]
    Indices:        0       1        2       3        4         5        6
    """
    X_engineered = []
    y = []

    for row in raw_data:
        temp = row[0]
        occupants = row[1]
        hours = row[2]
        weekend = row[3]
        # row[4] (Devices) is intentionally skipped: it is redundant with Occupants
        day_of_year = row[5]
        target_energy = row[6]

        # 1. Base features (pass them through)
        new_row = [temp, occupants, hours, weekend]

        # 2. Polynomial Feature: Temp Squared
        new_row.append(temp ** 2)

        # 3. Interaction Term: Occupants * Hours
        new_row.append(occupants * hours)

        # 4. Cyclical Encoding: Sin and Cos of DayOfYear
        # (the raw day_of_year integer itself is not appended)
        sin_day = math.sin(2 * math.pi * day_of_year / 365)
        cos_day = math.cos(2 * math.pi * day_of_year / 365)
        new_row.append(sin_day)
        new_row.append(cos_day)

        X_engineered.append(new_row)
        y.append(target_energy)

    return np.array(X_engineered), np.array(y)


# Generate the data
raw_data = generate_smart_building_data(1000)

# Engineer the features
X_smart, y_true = engineer_features(raw_data)

# Print the engineered dataset as CSV: eight feature columns, then the target
print("Temp,Occupants,Hours,Weekend,temp_squared,occupancy_hours,sin_day,cos_day,EnergyConsumption")
try:
    for index, row in enumerate(X_smart):
        print(",".join(str(np.round(x, 2)) for x in row), end=",")
        print(np.round(y_true[index], 2))
except Exception as e:
    print(f"An error occurred: {e}")
Code Breakdown
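The engineer_features function widens our dataset. It loops through the raw data and generates new columns from mathematical transformations (squares, products, and trigonometry), turning six raw feature columns into eight engineered ones.
Crucially, we drop two raw columns. The devices column is nearly a rescaled copy of occupants, so keeping it would invite multicollinearity. The day_of_year integer would let the model fit a linear drift through the calendar instead of a loop.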
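The row-by-row loop keeps every step visible, which suits a walkthrough. For larger datasets the same transformation can be written with whole-column NumPy operations. The sketch below is one possible equivalent, not the version used in this article; `engineer_features_vectorized` is a hypothetical name and the two sample rows are made-up values in the raw column order.

```python
import numpy as np


def engineer_features_vectorized(raw_data):
    """Column-wise equivalent of the row loop: returns (X, y)."""
    temp = raw_data[:, 0]
    occupants = raw_data[:, 1]
    hours = raw_data[:, 2]
    weekend = raw_data[:, 3]
    # Column 4 (Devices) and the raw day integer are deliberately left out of X.
    angle = 2 * np.pi * raw_data[:, 5] / 365

    X = np.column_stack((
        temp,
        occupants,
        hours,
        weekend,
        temp ** 2,            # polynomial feature
        occupants * hours,    # interaction term
        np.sin(angle),        # cyclical encoding
        np.cos(angle),
    ))
    y = raw_data[:, 6]
    return X, y


# Made-up rows: [Temp, Occupants, Hours, Weekend, Devices, DayOfYear, Energy]
raw = np.array([
    [20.0, 300.0, 10.0, 0.0, 600.0, 1.0, 2500.0],
    [5.0, 10.0, 2.0, 1.0, 20.0, 365.0, 1800.0],
])
X, y = engineer_features_vectorized(raw)
print(X.shape)
print(np.round(X, 3))
```

Each input row of seven values becomes eight features plus a target, and Day 365 encodes to a point right next to where Day 1 sits on the circle.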