
Homework1


Below is a step-by-step guide (with illustrative code snippets) on how to implement stochastic gradient descent (SGD) from scratch for linear regression on the California Housing Prices Dataset, as requested. We will:

  1. Fetch the dataset from sklearn (for convenience).
  2. Extract and prepare the data (selecting given features).
  3. Apply 0–1 normalization to $X$ and $y$.
  4. Split into train and test sets (70/30).
  5. Implement SGD manually—no direct calls to “automatic” optimization methods.
  6. Train to find the best parameters that minimize the MSE.
  7. Interpret which features matter the most.

Important: This walkthrough is meant to be thorough so you can confidently submit your assignment. However, always validate your final results (e.g., by checking MSE on training and test sets) to ensure correctness.


1. Load the California Housing Data

The California Housing dataset has these features by default:

  • data has columns in order: [MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude]
  • target is MedHouseVal (median house value)

Hence we do not have to reorder columns for this specific dataset: the first eight columns already match your required features. We will simply confirm the order and store the feature names for clarity.

import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# 1) Fetch the dataset
cal_housing = fetch_california_housing()

# X: 8 columns in the order [MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude]
X = cal_housing.data

# y: MedHouseVal
y = cal_housing.target

# Just for clarity, let's store feature names:
feature_names = cal_housing.feature_names  # ['MedInc','HouseAge','AveRooms','AveBedrms','Population','AveOccup','Latitude','Longitude']

2. Apply 0–1 Normalization to $X$ and $y$

0–1 normalization (also called min–max scaling) transforms each column so that its minimum becomes 0 and its maximum becomes 1:

$$x_{\text{scaled}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

We do this for each feature in $X$ and for the target $y$.

def min_max_scale(data):
    """
    data: a 2D array (n_samples x n_features) or 1D array (n_samples,).
    returns: scaled data, plus (min, max) pairs for possible inverse transform (if needed).
    """
    data_min = data.min(axis=0)
    data_max = data.max(axis=0)
    scaled = (data - data_min) / (data_max - data_min)
    return scaled, data_min, data_max

# Scale X
X_scaled, X_min, X_max = min_max_scale(X)

# Scale y (y is 1D)
y_scaled, y_min, y_max = min_max_scale(y)

3. Split into Train/Test Sets

As specified:

  • 70% training, 30% test
  • Use random_state=265 for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, 
    y_scaled, 
    test_size=0.3, 
    random_state=265
)
  • X_train will have shape (N_train, 8)
  • y_train will have shape (N_train,)
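
As a quick sanity check, you can print the shapes (the counts in the comments assume the full dataset of 20,640 rows):

print("Train:", X_train.shape, y_train.shape)   # should be about (14448, 8) and (14448,)
print("Test: ", X_test.shape,  y_test.shape)    # should be about (6192, 8) and (6192,)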

4. Implement Stochastic Gradient Descent (from scratch)

4.1 Mathematical Formulation

We want a linear regression model:

$$\hat{y}^{(i)} = w_0 + w_1 x_1^{(i)} + w_2 x_2^{(i)} + \dots + w_8 x_8^{(i)},$$

where $w_0$ is the intercept (bias) and $w_1, \dots, w_8$ are the weights for the 8 features. Alternatively, we can incorporate the intercept by adding a column of 1s to $X$. Then the parameter vector $w$ becomes $(w_0, w_1, \dots, w_8)$.

Mean Squared Error (MSE) for a dataset of $n$ points:

$$\text{MSE}(w) = \frac{1}{n} \sum_{i=1}^n (\hat{y}^{(i)} - y^{(i)})^2.$$

Gradient of MSE w.r.t. $w$ (for a single sample $(x^{(i)}, y^{(i)})$):

$$\nabla_w \bigl((\hat{y}^{(i)} - y^{(i)})^2\bigr) = 2\,(\hat{y}^{(i)} - y^{(i)})\, x^{(i)},$$

if we are treating $\hat{y}^{(i)} = w^T x^{(i)}$. Typically for the average we’d incorporate the factor $\frac{1}{n}$, but in stochastic gradient descent we handle updates one sample (or mini-batch) at a time and scale the learning rate accordingly.
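
If you want to convince yourself that this single-sample gradient is correct before coding the full loop, you can compare it against a finite-difference approximation on a tiny made-up example (all numbers below are arbitrary, purely illustrative):

import numpy as np

# Tiny, made-up augmented sample [1, x1, x2], target, and weights (illustrative only)
x_i = np.array([1.0, 0.2, 0.7])
y_i = 0.5
w   = np.array([0.1, -0.3, 0.4])

# Analytic gradient of (w^T x_i - y_i)^2 with respect to w: 2 * error * x_i
error = np.dot(w, x_i) - y_i
grad_analytic = 2.0 * error * x_i

# Central finite-difference approximation for comparison
eps = 1e-6
grad_numeric = np.zeros_like(w)
for j in range(len(w)):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[j]  += eps
    w_minus[j] -= eps
    grad_numeric[j] = ((np.dot(w_plus, x_i) - y_i)**2 - (np.dot(w_minus, x_i) - y_i)**2) / (2 * eps)

print(grad_analytic)
print(grad_numeric)  # should agree with the analytic gradient to several decimal places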

4.2 Creating the Design Matrix

To include a bias term $w_0$ in the same weight vector, we add a column of 1’s to $X$. Let’s define:

$$X_{\text{aug}} = \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & \cdots & x_{1,8}\\ 1 & x_{2,1} & x_{2,2} & \cdots & x_{2,8}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 1 & x_{n,1} & x_{n,2} & \cdots & x_{n,8} \end{bmatrix}.$$

So the parameter vector has 9 entries: $[w_0, w_1, \dots, w_8]$.

def add_bias_column(X):
    # X is (n_samples, n_features).
    n_samples = X.shape[0]
    # Create a new array with 1 extra column of all 1s at the front
    ones = np.ones((n_samples, 1))
    return np.hstack([ones, X])

X_train_aug = add_bias_column(X_train)  # shape (n_samples, 1+8)
X_test_aug  = add_bias_column(X_test)   # shape (n_samples, 1+8)

4.3 Core SGD Algorithm

Pseudocode (for batch_size = 1 “pure” SGD):

  1. Initialize weights $w$ randomly (size 9).
  2. For epoch in 1 to max_epochs:
    1. Shuffle training data (so we pick samples in random order).
    2. For each training sample $(x_i, y_i)$:
      1. Compute prediction $\hat{y}_i = w^T x_i$.
      2. Compute error $e_i = \hat{y}_i - y_i$.
      3. Compute gradient w.r.t. $w$: $\nabla = e_i \times x_i$ (or $2 e_i \times x_i$ if we incorporate the factor 2; we can absorb constants into the learning rate).
      4. Update: $w \leftarrow w - \eta \nabla$, where $\eta$ is the learning rate.
    3. Optionally, compute MSE on the entire training set or a validation set to track progress.
  3. Return final weights $w$.
    3. Optionally, compute MSE on the entire training set or a validation set to track progress.
  3. Return final weights (w).

Below is a minimal working code snippet. We will incorporate a small function to compute the MSE so we can track it across epochs.

Choosing an “ideal” learning rate $\eta$ and number of epochs: the number of epochs depends mostly on your computational budget, while $\eta$ is usually found by experiment. A typical approach is to start with something like $\eta = 0.01$ and check whether the MSE decreases steadily. If it gets stuck or diverges, reduce $\eta$; if it shrinks too slowly, increase $\eta$. You can use early stopping or simply pick a fixed number of epochs (e.g., 10–50) that suits your environment. (A small learning-rate sweep is sketched at the end of Section 5.1.)

def mse_loss(X, y, w):
    """
    X: (n_samples, 1+8)
    y: (n_samples,)
    w: (1+8,) - parameter vector
    returns: MSE (scalar)
    """
    y_pred = X.dot(w)  # shape (n_samples,)
    errors = y_pred - y
    return np.mean(errors**2)

def stochastic_gradient_descent(X, y, learning_rate=0.01, max_epochs=20):
    """
    X: (n_samples, 1+8) after adding bias column
    y: (n_samples,)
    returns: (w, mse_history) 
      w -> learned weights
      mse_history -> list of (epoch, MSE) for debugging
    """
    n_samples, n_features = X.shape
    
    # 1) Initialize weights, e.g. small random values or zeros
    w = np.zeros(n_features)
    
    mse_history = []
    
    for epoch in range(max_epochs):
        # Shuffle data each epoch
        indices = np.arange(n_samples)
        np.random.shuffle(indices)
        
        for i in indices:
            x_i = X[i]         # shape: (1+8,)
            y_i = y[i]         # scalar
            
            # 1. Prediction
            y_pred_i = np.dot(x_i, w)  # scalar
            
            # 2. Error
            error_i = y_pred_i - y_i   # scalar
            
            # 3. Gradient for MSE (with the factor of 2 omitted or included – we'll incorporate it in lr if needed)
            grad = error_i * x_i  # shape: (1+8,)
            
            # 4. Update weights
            w = w - learning_rate * grad
        
        # Compute MSE over the entire training set to track
        train_mse = mse_loss(X, y, w)
        mse_history.append((epoch, train_mse))
        
        # (Optional) print occasionally
        if (epoch+1) % 5 == 0:
            print(f"Epoch {epoch+1}/{max_epochs}, Train MSE = {train_mse:.6f}")
    
    return w, mse_history

5. Train and Find the Best Parameters

5.1 Run the Training

We can pick some “ideal” hyperparameters (you can tweak them as needed). For instance:

  • learning_rate = 0.01
  • max_epochs = 50
learning_rate = 0.01
max_epochs = 50

w_final, history = stochastic_gradient_descent(
    X_train_aug,
    y_train,
    learning_rate=learning_rate,
    max_epochs=max_epochs
)
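
Since stochastic_gradient_descent returns mse_history, you can optionally plot it to confirm that the training MSE decreases across epochs (a minimal sketch, assuming matplotlib is installed):

import matplotlib.pyplot as plt

# Unpack the (epoch, MSE) pairs recorded during training
epochs, train_mses = zip(*history)

plt.plot(epochs, train_mses, marker="o")
plt.xlabel("Epoch")
plt.ylabel("Training MSE")
plt.title("SGD convergence on the training set")
plt.show()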

Check final training MSE:

train_mse_final = mse_loss(X_train_aug, y_train, w_final)
print(f"Final Training MSE: {train_mse_final:.6f}")
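
To double-check that learning_rate = 0.01 is a sensible choice, you can run a quick sweep over a few candidate values (a minimal sketch re-using the functions above; np.random is not seeded here, so exact numbers will vary slightly between runs):

# Compare training MSE after a short run for several candidate learning rates
for lr in [0.001, 0.005, 0.01, 0.05]:
    w_lr, _ = stochastic_gradient_descent(X_train_aug, y_train,
                                          learning_rate=lr, max_epochs=20)
    print(f"learning_rate={lr}: train MSE = {mse_loss(X_train_aug, y_train, w_lr):.6f}")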

5.2 Evaluate on Test Set

Now compute MSE on the test set:

test_mse = mse_loss(X_test_aug, y_test, w_final)
print(f"Test MSE: {test_mse:.6f}")

Your final model parameters are in w_final (array of length 9, corresponding to [bias, w_MedInc, w_HouseAge, w_AveRooms, ...]).

5.3 Inspect the Learned Weights

Since we appended a column of 1s to the front, the first weight is w0 (bias), and the rest align to:

  1. MedInc
  2. HouseAge
  3. AveRooms
  4. AveBedrms
  5. Population
  6. AveOccup
  7. Latitude
  8. Longitude
print("Final learned parameters:")
print(f"Bias (intercept): {w_final[0]:.6f}")
for i, f_name in enumerate(feature_names, start=1):
    print(f"{f_name:>10s}: {w_final[i]:.6f}")

Remember these weights are in the scaled domain. If you need to interpret them in the original domain, you’d have to invert the scaling carefully. However, relative magnitude can still give you a sense of which factors have a stronger effect on the scaled output.
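
If you do want values back in the original units, you can invert the min–max scaling of the target using the y_min and y_max returned earlier by min_max_scale (a minimal sketch; in the original dataset the target is expressed in units of 100,000 USD):

# Predictions on the test set, still in the scaled [0, 1] domain
y_pred_test_scaled = X_test_aug.dot(w_final)

# Invert the 0–1 scaling: x_original = x_scaled * (max - min) + min
y_pred_test_original = y_pred_test_scaled * (y_max - y_min) + y_min
y_test_original      = y_test * (y_max - y_min) + y_min

print("First 5 predictions (original units):", y_pred_test_original[:5])
print("First 5 actual values (original units):", y_test_original[:5])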


6. Interpretation: Which Factors Explain the House Prices the Most?

  • Median Income (MedInc) typically shows a large positive weight—houses in areas with higher median income tend to cost more.
  • House Age might have varying positive/negative effect, but in many real estate contexts, older houses can be either cheaper or more expensive depending on location.
  • Average Rooms / Bedrooms can be correlated with home size, thus positively influencing price.
  • Population often has weaker direct correlation once you already have location factors.
  • Latitude / Longitude together capture location. In California, coastal and major metro areas (e.g., the Bay Area and Los Angeles) tend to have higher prices, so these coordinates often carry substantial weight.

After training:

  • You can compare the absolute values of the learned weights in the scaled domain to see which features had the largest positive or negative impact on price (a short ranking snippet follows this list).
  • A large positive weight means as that scaled feature increases (from its min to max), the predicted price (scaled) tends to go up more strongly.
  • A large negative weight means the opposite.
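
One convenient way to make this comparison is to sort the (non-bias) weights by absolute value (a minimal sketch):

# Rank features by the absolute value of their learned weight (scaled domain)
ranking = sorted(zip(feature_names, w_final[1:]), key=lambda item: abs(item[1]), reverse=True)
for name, weight in ranking:
    print(f"{name:>10s}: {weight:+.4f}")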

Example: You might find something like:

Bias (intercept): 0.1500
MedInc: 0.6500
HouseAge: 0.0900
AveRooms: 0.2000
AveBedrms: -0.0500
Population: 0.0100
AveOccup: -0.0200
Latitude: 0.3000
Longitude: -0.1000

(This is purely illustrative—your actual numbers will differ.)

Here, MedInc likely stands out, followed by Latitude and AveRooms. Population or AveOccup might have smaller coefficients.


Full Summary

  1. Data Preparation:

    • We selected the 8 features [MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude] and the target (MedHouseVal).
    • We applied 0–1 normalization to all.
    • Split 70/30 using train_test_split with random_state=265.
  2. SGD Implementation (from scratch):

    • Created an augmented design matrix to incorporate bias.
    • Iteratively updated weights sample by sample.
  3. Best Parameters:

    • Found by minimizing the MSE via multiple epochs.
    • Final weights are in w_final.
  4. Interpretation:

    • By looking at the magnitude and sign of the learned weights, we see which features (in scaled domain) correlate strongest with house prices.
    • Typically, Median Income is the strongest positive driver of house value.
    • Geographical location (latitude/longitude) also often has a strong effect.
  5. Performance:

    • Report both training MSE and test MSE.
    • If test MSE is close to training MSE, the model generalizes well.

This completes the assignment’s requirements: we used no direct optimization package, we performed SGD from scratch, normalized features, and reported/interpreted results.


Potential Next Steps / Checks

  • You may experiment with different learning rates (0.001, 0.005, 0.01, 0.05, etc.) or different epochs to see how quickly the MSE converges and to avoid overshooting.
  • You could also do “mini-batch” SGD (e.g., batch size = 32 or 64) instead of pure SGD; a sketch of this variant follows below.
  • Because the data is normalized, a moderate learning rate often works well.
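
For the mini-batch variant mentioned above, the only change to the earlier loop is that each update uses the average gradient over a small batch of samples instead of a single one. A minimal sketch (an optional extension, not required by the assignment):

def minibatch_gradient_descent(X, y, learning_rate=0.01, max_epochs=20, batch_size=32):
    """Same idea as stochastic_gradient_descent, but each update averages the
    gradient over a small batch of samples instead of using a single one."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for epoch in range(max_epochs):
        indices = np.arange(n_samples)
        np.random.shuffle(indices)
        for start in range(0, n_samples, batch_size):
            batch = indices[start:start + batch_size]
            X_b, y_b = X[batch], y[batch]
            errors = X_b.dot(w) - y_b              # shape: (len(batch),)
            grad = X_b.T.dot(errors) / len(batch)  # average gradient over the batch
            w = w - learning_rate * grad
    return w

# Example usage:
# w_mb = minibatch_gradient_descent(X_train_aug, y_train, learning_rate=0.01, max_epochs=50)
# print(mse_loss(X_test_aug, y_test, w_mb))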

Done! You now have a fully operational stochastic gradient descent pipeline for linear regression on the California Housing dataset, and you’ve explored which features matter most for predicting house prices.