Machine Learning for Regression
A Comprehensive Overview of Regression Techniques in Machine Learning
Part 1
Car price prediction project
This project focuses on predicting car prices using a dataset from Kaggle. The objective is to build a predictive model through structured phases, each covered in individual blog posts. The main steps include:
Project Plan
- Prepare Data and Exploratory Data Analysis (EDA)
- Use Linear Regression for Predicting Price
- Understand the Internals of Linear Regression
- Evaluate the Model with RMSE (Root Mean Squared Error)
- Feature Engineering
- Regularization
- Using the Model
Part 2
Data Preparation
Key Considerations:
- Data Cleaning: Handle missing values using techniques like mean/median imputation, and address outliers by removing or transforming them. Ensure consistent data formats.
- Data Integration: If multiple datasets exist, they may need to be merged based on common identifiers or shared attributes.
- Data Transformation: Feature engineering can be applied, such as transforming categorical variables into numerical ones using one-hot or label encoding.
- Feature Scaling: Apply standardization or normalization to ensure features are on a similar scale.
- Train-Validation Split: Split the dataset into training and validation sets to better evaluate the model.
Pandas attributes and methods:
pd.read_csv(): reads a CSV file into a dataframe.
Example Code:
import pandas as pd
import numpy as np
# Loading the data
df = pd.read_csv('data.csv')
# First overview of the data
df.head()
# Standardizing column names: lowercase and replace spaces with underscores
df.columns = df.columns.str.lower().str.replace(' ', '_')
df.head()
Cleaning String Columns
String columns are standardized similarly to column names. First, we identify columns of type object:
# Identify string columns
strings = list(df.dtypes[df.dtypes == 'object'].index)
# Apply cleaning to all string columns
for col in strings:
    df[col] = df[col].str.lower().str.replace(' ', '_')
df.head()
Modeling: Linear Regression for Car Price Prediction
The goal is to predict MSRP (Manufacturer’s Suggested Retail Price) using a linear regression model. This includes:
Understanding Linear Regression: Learn how linear regression operates internally.
Feature Engineering: Create new features to improve model performance.
Regularization: Apply regularization techniques to prevent overfitting.
Model Evaluation
RMSE (Root Mean Squared Error): RMSE will be used as the evaluation metric to measure the accuracy of the model.
Using the Model
Once the model is trained and evaluated, it will be used for predicting car prices based on the input features.
Part 3
Exploratory Data Analysis (EDA)
General Information
Exploratory data analysis (EDA) is an essential step in the data analysis process. It involves summarizing and visualizing the main characteristics of a dataset to gain insights and identify patterns or trends. By exploring the data, researchers can uncover hidden relationships between variables and make informed decisions.
Common techniques in EDA include calculating summary statistics such as mean, median, and standard deviation to understand data distribution. These statistics help identify potential outliers or unusual patterns.
Visualizations play a crucial role in EDA. Graphical representations like histograms, scatter plots, and box plots help visualize data distribution, identify clusters, and detect unusual patterns or trends. They are particularly useful for understanding relationships between variables.
Data cleaning is another important aspect of EDA. This involves handling missing values, outliers, and inconsistencies. By carefully examining the data, researchers can decide how to handle missing values and address outliers or errors.
EDA is an iterative process. As researchers delve deeper into the data, they may uncover additional questions or areas of interest that require further exploration. This iterative approach helps refine understanding and uncover valuable insights.
In conclusion, EDA is crucial in the data analysis process. By summarizing, visualizing, and cleaning data, researchers can uncover patterns, identify relationships, and make informed decisions, providing a foundation for more advanced data analysis techniques.
EDA for Car Price Prediction Project
Getting an Overview
To understand the data, we examine each column and print some values. We can also look at unique values in each column to gain further insights.
Distribution of Price
Visualizing the price column is essential. We can use histograms to observe the distribution of prices. The initial histogram may reveal a long-tail distribution, with many cars at lower prices and few at higher prices. Zooming in on prices under a certain threshold can help clarify the distribution.
Applying a logarithmic transformation can address issues with long-tail distributions, resulting in a more normal distribution that is ideal for machine learning models.
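For illustration, these two steps might look like this in code (a sketch, assuming pandas and NumPy are imported as in Part 2 and seaborn as sns; the 100,000 cutoff is only an example threshold for zooming in):
import seaborn as sns
# Zoom in on prices below an example threshold to see the long tail
sns.histplot(df.msrp[df.msrp < 100000], bins=50)
# Log transformation: log(1 + x), which also handles zero values
price_logs = np.log1p(df.msrp)
sns.histplot(price_logs, bins=50)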
Missing Values
Identifying missing values is critical. We can use functions to find and sum missing values across columns, providing insights into which columns may need attention during model training.
Notes
- Pandas attributes and methods:
  - df[col].unique(): returns a list of unique values in the series.
  - df[col].nunique(): returns the number of unique values in the series.
  - df.isnull().sum(): returns the number of null values in the dataframe.
- Matplotlib and Seaborn methods:
  - %matplotlib inline: ensures that plots are displayed in Jupyter notebook cells.
  - sns.histplot(): shows the histogram of a series.
- NumPy methods:
  - np.log1p(): applies a log transformation to a variable, after adding one to each input value.
Long-tail distributions can confuse machine learning models, so it is recommended to transform the target variable distribution to a normal one whenever possible.
Part 4
Setting Up the Validation Framework
To validate a model, the dataset is split into three parts: training (60%), validation (20%), and test (20%). The model is trained on the training dataset, validated on the validation dataset, and the test dataset is used occasionally to evaluate overall performance. The feature matrix (X) and target variable (y) are created for each partition: X_train, y_train, X_val, y_val, X_test, and y_test.
To calculate the sizes of the partitions:
- Determine the total number of records in the dataset.
- Calculate 20% of the total records for validation and test datasets.
- The training dataset size is computed by subtracting the sizes of the validation and test datasets from the total.
The data is then split sequentially into three datasets. However, to avoid issues arising from any inherent order in the dataset, the indices are shuffled. Shuffling ensures that all partitions contain a mix of records, preventing bias.
After shuffling, the datasets are created using the shuffled indices, and the old index is dropped to reset the index for each partition. A log1p transformation is applied to the target variable (msrp) to improve model performance.
Finally, the msrp values are removed from the dataframes to avoid accidental usage during training.
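A minimal sketch of this splitting procedure, reusing the dataframe and imports from the earlier parts (the seed value is arbitrary):
n = len(df)
n_val = int(0.2 * n)
n_test = int(0.2 * n)
n_train = n - n_val - n_test

idx = np.arange(n)
np.random.seed(2)              # any fixed seed makes the shuffle reproducible
np.random.shuffle(idx)

df_train = df.iloc[idx[:n_train]].reset_index(drop=True)
df_val = df.iloc[idx[n_train:n_train + n_val]].reset_index(drop=True)
df_test = df.iloc[idx[n_train + n_val:]].reset_index(drop=True)

y_train = np.log1p(df_train.msrp.values)
y_val = np.log1p(df_val.msrp.values)
y_test = np.log1p(df_test.msrp.values)

del df_train['msrp']           # remove the target to avoid accidental use during training
del df_val['msrp']
del df_test['msrp']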
Important Methods
- Pandas:
  - df.iloc[]: returns subsets of records selected by numerical indices.
  - df.reset_index(): resets the original indices.
  - del df[col]: deletes a column from the dataframe.
- NumPy:
  - np.arange(): returns an array of evenly spaced numbers.
  - np.random.shuffle(): shuffles an array in place.
  - np.random.seed(): sets a seed for reproducibility.
The entire code for this project is available in the provided Jupyter notebook.
Part 5
Linear Regression
Overview
Linear regression is a statistical method used to model the relationship between one or more input features and a continuous outcome variable. The objective is to find the best-fitting line that represents this relationship.
Linear Regression Formula
The linear regression model can be expressed as:
$g(x_i) = w_0 + x_{i1} \cdot w_1 + x_{i2} \cdot w_2 + \dots + x_{in} \cdot w_n$
And that can be further simplified as:
$g(x_i) = w_0 + \displaystyle\sum_{j=1}^{n} w_j \cdot x_{ij}$
Implementation in Python
A simple implementation of linear regression can be done as follows:
w0 = 7.1

def linear_regression(xi):
    n = len(xi)
    pred = w0
    w = [0.01, 0.04, 0.002]
    for j in range(n):
        pred = pred + w[j] * xi[j]
    return pred
Objective
The main goal of linear regression is to estimate the coefficients $w_0, w_1, \dots, w_n$ such that the sum of squared differences between the predicted and actual values is minimized. This is achieved using the ordinary least squares method.
Single Observation Analysis
For a single observation, the function can be simplified as $g(x_i) \approx y_i$, where $x_i$ is a vector of characteristics (features) for one instance and $y_i$ is the corresponding target value.
Example of Feature Extraction
Given a dataset, one can extract features:
xi = [138, 24, 1385] # Example features
Full Function Implementation
The implementation can be expressed as:
w0 = 7.17                   # example bias term (matching the sample values used below)
w = [0.01, 0.04, 0.002]     # example weights for the three features

def linear_regression(xi):
    n = len(xi)
    pred = w0
    for j in range(n):
        pred = pred + w[j] * xi[j]
    return pred
Inverse Transformation
Since the target variable is log-transformed, predictions must be converted back to the original scale using:
np.expm1(predicted_value)
This process provides a comprehensive understanding of how linear regression works, its implementation, and the considerations for transforming predictions back to their original scale.
Part 6
Linear Regression vector form
The formula for linear regression can be represented using the dot product between a feature vector and a weight vector. The feature vector includes a bias term with an x value of one, denoted as $w_0 x_{i0}$, where $x_{i0} = 1$ for $w_0$.
When considering all records, linear regression predictions are derived from the dot product between a feature matrix $X$ and a weight vector $w$. For a single record this can be expressed as $g(x_i) = w_0 + x_i^T w$.
To implement the dot product, a function can be defined:
def dot(xi, w):
    n = len(xi)
    res = 0.0
    for j in range(n):
        res += xi[j] * w[j]
    return res
The linear regression function can then be defined as:
def linear_regression(xi):
    return w0 + dot(xi, w)
To simplify, we can introduce an additional feature that is always equal to 1, leading to:
$g(x_i) = w_0 + x_i^T w \;\rightarrow\; g(x_i) = w_0 x_{i0} + x_i^T w, \quad x_{i0} = 1$
This implies the weight vector $w$ and the feature vector $x_i$ expand to $(n+1)$-dimensional vectors:
$w = [w_0, w_1, w_2, \dots, w_n]$
$x_i = [x_{i0}, x_{i1}, x_{i2}, \dots, x_{in}] = [1, x_{i1}, x_{i2}, \dots, x_{in}]$
$w^T x_i = x_i^T w = w_0 + \displaystyle\sum_{j=1}^{n} w_j \cdot x_{ij}$
The dot product can now be used for the entire regression.
Given sample values:
xi = [138, 24, 1385]
w0 = 7.17
w = [0.01, 0.04, 0.002]
w_new = [w0] + w
The updated linear regression function becomes:
def linear_regression(xi):
    xi = [1] + xi
    return dot(xi, w_new)
For a matrix $X$ with dimensions $m \times (n + 1)$, predictions can be calculated as follows:
X = [[1, 148, 24, 1385], [1, 132, 25, 2031], [1, 453, 11, 86]]
X = np.array(X)
Predictions for each car price can be obtained using:
y = X.dot(w_new)
To adjust the output for the actual price:
np.expm1(y)
Finally, an adapted linear regression function can be expressed as:
def linear_regression(X):
    return X.dot(w_new)
Part 7
Training linear regression: Normal equation
Obtaining predictions as close as possible to the target values $y$ requires calculating the weights from the general linear regression equation. The feature matrix $X$ does not have an inverse because it is not square, so an approximate solution is obtained using the Gram matrix, the product $X^T X$ of the transpose and the feature matrix. The vector of weights (coefficients) $w$ obtained with this formula is the closest possible solution to the linear regression system.
Normal Equation:
$w = (X^T X)^{-1} X^T y$
Where $X^T X$ is the Gram matrix.
Training a linear regression model, we know that we need to multiply the feature matrix $X$ with the weight vector $w$ to get $y$ (the prediction for price).
$g(X) = Xw \approx y$
To achieve this, we need to find a way to compute $w$. The equation $Xw = y$ can be transformed into $Iw = X^{-1}y$ when multiplied by $X^{-1}$. However, $X^{-1}$ exists only for square matrices, and $X$ is of dimension $m \times (n+1)$, which is not square in almost every case.
Instead, we multiply both sides by $X^T$ to obtain $X^T X w = X^T y$. The matrix $X^T X$ is square, of dimension $(n+1) \times (n+1)$, and its inverse generally exists.
$(X^T X)^{-1} X^T X w = (X^T X)^{-1} X^T y$
$Iw = (X^T X)^{-1} X^T y$
Thus, the value obtained is the closest possible solution:
$w = (X^T X)^{-1} X^T y$
We need to implement the function train_linear_regression, which takes the feature matrix $X$ and the target variable $y$ and returns the weight vector $w$:
def train_linear_regression(X, y):
    pass
To approach this implementation, we first use a simplified example:
X = [
[148, 24, 1385],
[132, 25, 2031],
[453, 11, 86],
[158, 24, 185],
[172, 25, 201],
[413, 11, 83],
[38, 54, 185],
[142, 25, 431],
[453, 31, 86]
]
From the last article, we know that we need to add a new column of ones to the feature matrix $X$ for the multiplication with $w_0$. We can use np.ones() to create this vector of ones.
X = np.array(X)   # convert the nested list to a NumPy array so it has a .shape
ones = np.ones(X.shape[0])
Now we need to stack this vector of ones with our feature matrix $X$ using np.column_stack().
X = np.column_stack([ones, X])
y = [10000, 20000, 15000, 25000, 10000, 20000, 15000, 25000, 12000]
Next, we compute the Gram matrix and its inverse.
XTX = X.T.dot(X)
XTX_inv = np.linalg.inv(XTX)
To check that multiplying $X^T X$ by $(X^T X)^{-1}$ produces the identity matrix $I$:
XTX.dot(XTX_inv).round(1)
Now we can implement the formula to obtain the full weight vector.
w_full = XTX_inv.dot(X.T).dot(y)
From that vector $w_{full}$, we can extract $w_0$ and the remaining weights.
w0 = w_full[0]
w = w_full[1:]
Finally, we implement the function train_linear_regression.
def train_linear_regression(X, y):
    X = np.array(X)                    # accept plain lists as well as arrays
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])     # add the bias column of ones
    XTX = X.T.dot(X)                   # Gram matrix
    XTX_inv = np.linalg.inv(XTX)
    w_full = XTX_inv.dot(X.T).dot(y)
    return w_full[0], w_full[1:]
Testing the implemented function:
X = [
[148, 24, 1385],
[132, 25, 2031],
[453, 11, 86],
[158, 24, 185],
[172, 25, 201],
[413, 11, 83],
[38, 54, 185],
[142, 25, 431],
[453, 31, 86]
]
y = [10000, 20000, 15000, 25000, 10000, 20000, 15000, 25000, 12000]
train_linear_regression(X, y)
Part 8
Building a Baseline Model for Car Price Prediction
In this lesson, we build a baseline model using the df_train dataset to derive weights for the bias (w0) and the features (w). We utilize the train_linear_regression(X, y) function, focusing only on numerical features due to the nature of linear regression. Missing values in df_train are set to 0 for simplicity, although using non-zero values like the mean would be more appropriate.
The model’s prediction function is defined as $g(X) = w_0 + X \cdot w$. We then plot both predicted and actual values on a histogram for visual comparison.
Car Price Baseline Model
We begin by constructing a model for car price prediction, extracting only the numerical columns from the dataset. The relevant columns selected for the model are engine_hp, engine_cylinders, highway_mpg, city_mpg, and popularity.
To prepare for training, we extract the values from these columns. It is crucial to check for missing values, as they can adversely affect model performance. In df_train, we find missing values in engine_hp and engine_cylinders. While filling these with zeros is a simple solution, it may not be the most accurate representation of the data. Nonetheless, we proceed with this approach for the current example.
After addressing the missing values, we reassign the updated values to X_train. We also prepare our target variable, y_train.
We then use the train_linear_regression function to obtain values for w0 and the weight vector w. These variables allow us to apply the model to the training dataset to assess its performance.
To evaluate the model’s accuracy, we calculate predicted values using the derived weights. Finally, we visualize the comparison between actual and predicted values using histograms, illustrating that while the model is not perfect, it serves as a foundational step for further improvement.
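A condensed sketch of the steps just described (column names as listed above; the variable names are assumptions):
base = ['engine_hp', 'engine_cylinders', 'highway_mpg', 'city_mpg', 'popularity']

# Fill missing values with zeros and extract the feature matrix
X_train = df_train[base].fillna(0).values

# Train the baseline model and predict on the training data
w0, w = train_linear_regression(X_train, y_train)
y_pred = w0 + X_train.dot(w)

# Visual comparison of predicted vs. actual (log-transformed) prices
sns.histplot(y_pred, color='red', alpha=0.5, bins=50)
sns.histplot(y_train, color='blue', alpha=0.5, bins=50)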
The next lesson will focus on more objective methods to evaluate regression model performance.
Part 9
RMSE for Model Evaluation
In the previous lesson, we noted that our predictions were somewhat inaccurate compared to the actual target values. To quantify the model’s performance, we introduce Root Mean Squared Error (RMSE), a metric used to evaluate regression models by measuring the error associated with the predictions. RMSE enables comparison between models to determine which offers better predictions.
The formula for RMSE is given by:
$RMSE = \sqrt{\frac{1}{m} \sum_{i=1}^{m} (g(x_i) - y_i)^2}$
where $g(x_i)$ is the prediction, $y_i$ is the actual value, and $m$ is the number of observations.
Root Mean Squared Error (RMSE)
To calculate RMSE, we utilize the predictions and actual values from our model. The process involves calculating the difference between the predicted and actual values, squaring this difference to obtain the squared error, and then averaging these squared errors to compute the Mean Squared Error (MSE). Finally, we take the square root of the MSE to find the RMSE.
We can implement RMSE in code, which allows us to obtain a numerical value representing the model’s performance.
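One way to implement it, following exactly the steps above:
def rmse(y, y_pred):
    error = y_pred - y           # difference between predictions and actual values
    mse = (error ** 2).mean()    # mean squared error
    return np.sqrt(mse)          # root mean squared error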
Using our training data, we calculate the RMSE and find a value of approximately 0.746.
Validating the Model
Evaluating the model performance solely on the training data does not provide a reliable indication of its ability to generalize to unseen data. Therefore, we proceed to validate the model using a separate validation dataset. We apply the RMSE metric again to assess performance on this unseen data.
To prepare the dataset consistently, we implement a prepare_X function that handles both the training and validation sets. After preparing the datasets, we train the model and compute predictions for the validation data.
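A minimal version of prepare_X at this stage, together with the validation step (a sketch; the function is extended in later parts):
def prepare_X(df):
    df_num = df[base]            # 'base' is the list of numerical columns from Part 8
    df_num = df_num.fillna(0)    # same missing-value handling as for training
    return df_num.values

X_train = prepare_X(df_train)
w0, w = train_linear_regression(X_train, y_train)

X_val = prepare_X(df_val)
y_pred = w0 + X_val.dot(w)
rmse(y_val, y_pred)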
Upon calculating the RMSE for the validation dataset, we obtain a value of approximately 0.733. When comparing this RMSE with the training RMSE (0.746), we observe similar performance on both seen and unseen data, which aligns with our expectations for a well-generalized model.
Part 10
Computing RMSE on validation data
Summary of RMSE Calculation for Car Price Prediction Model
RMSE as a Performance Metric
- RMSE (Root Mean Squared Error) is introduced as a metric to evaluate model performance.
- It is calculated using the predictions and actual values from the dataset.
Calculation Steps
- Prediction vs Actual Values: Calculate the difference between predicted values $g(x_i)$ and actual values $y_i$.
- Squared Errors: Square the differences to obtain the squared errors.
- Mean Squared Error: Compute the mean of the squared errors.
- Root Mean Squared Error: Take the square root of the mean squared error to obtain RMSE.
Example Calculation
- Given predicted values and actual values, differences are computed, squared, averaged, and then the square root is taken to get RMSE.
Implementation
- RMSE can be implemented in code with a function that calculates the squared errors, averages them, and returns the square root.
Model Validation
- Evaluating the model on training data does not provide an accurate indication of its performance on unseen data.
- The model is applied to a validation dataset after training to assess performance using RMSE.
Data Preparation
- A function is implemented to prepare datasets consistently across training, validation, and test sets.
Results
- The RMSE is calculated for both training and validation datasets, showing similar performance on seen (training) and unseen (validation) data, indicating the model’s robustness.
Part 11
Feature Engineering
The feature “age” of the car was derived from the dataset by subtracting the year of each car from the maximum year (2017). This new feature enhanced model performance, evidenced by a decrease in RMSE and improved distributions of the target variable and predictions.
Simple Feature Engineering
To create the “age” feature, the following calculation was performed:
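A minimal sketch of this step, 2017 being the maximum year in the dataset as noted above:
age = 2017 - df_train.year     # a pandas Series with the age of each car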
This resulted in a series representing the age of each car in the dataset. The new feature “age” was added to the prepare_X function, ensuring a copy of the dataframe was used to avoid modifying the original data.
Implementation of prepare_X
The function prepare_X was defined to (a sketch follows the base-feature list below):
- Copy the dataframe.
- Calculate the “age” feature.
- Compile a list of features to extract numerical values for model training.
The essential base features included:
engine_hp
engine_cylinders
highway_mpg
city_mpg
popularity
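A sketch of this version of prepare_X, assuming the base feature list above (as defined in the Part 8 sketch):
def prepare_X(df):
    df = df.copy()                     # work on a copy to keep the original data intact
    df['age'] = 2017 - df.year         # engineered feature
    features = base + ['age']

    df_num = df[features].fillna(0)    # fill missing values and extract numerical values
    return df_num.values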
Model Training and Evaluation
The prepared training data was used to train a linear regression model:
X_train = prepare_X(df_train)
w0, w = train_linear_regression(X_train, y_train)
X_val = prepare_X(df_val)
y_pred = w0 + X_val.dot(w)
The RMSE was then calculated:
rmse(y_val, y_pred)
The results indicated a decrease in RMSE from approximately 0.733 to 0.515, demonstrating significant improvement in model performance.
Visualization Histograms comparing predicted values and actual values showed a clear enhancement in prediction accuracy, although further improvement was still possible.
sns.histplot(y_pred, color='red', alpha=0.5, bins=50)
sns.histplot(y_val, color='blue', alpha=0.5, bins=50)
Part 12
Categorical Variables and One-Hot Encoding in Machine Learning
Introduction
Categorical variables are often represented as strings in pandas, typically identified as object types. Some variables that seem numerical, like the number of doors in a car, are actually categorical. For machine learning (ML) models to interpret these variables, they need to be converted into a numerical format. This transformation is known as One-Hot Encoding.
Categorical Variables
In the dataset, categorical variables include:
make
model
engine_fuel_type
transmission_type
driven_wheels
market_category
vehicle_size
vehicle_style
Special Case: Number of Doors
The number_of_doors variable appears numerical but is categorical, as shown in the data type output, where it is classified as float64.
One-Hot Encoding Process
One-hot encoding creates binary columns for each category of a variable. For example, for number_of_doors, we generate:
num_doors_2
num_doors_3
num_doors_4
The implementation uses boolean conditions to create new binary features.
Implementation Example
for v in [2, 3, 4]:
    df_train['num_doors_%s' % v] = (df_train.number_of_doors == v).astype('int')
Feature Preparation Function
The prepare_X function processes the dataframe to create numerical features, including:
- Calculating the car’s age.
- Applying one-hot encoding for number_of_doors.
The function fills missing values and extracts the feature array.
Output Example
The processed output from the function shows the newly created features for the number of doors.
Model Training and Evaluation
After preparing the features, the model is trained, and performance is evaluated using RMSE. An initial run shows a minor improvement in performance with new features added.
Expanding Categorical Features
Next, we consider other categorical variables like make, where there are 48 unique values. We focus on the top 5 most common makes to avoid dimensionality issues.
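For instance, that list could be obtained as follows (a sketch):
makes = list(df_train.make.value_counts().head().index)   # five most frequent makes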
Updated Prepare Function
The prepare_X function is modified to include one-hot encoding for the make variable and others.
Evaluating Additional Features
Adding the new features again improves the model slightly.
Comprehensive Categorical Encoding
To enhance performance, a comprehensive list of categorical variables is created, and a loop is used to generate one-hot encodings for each.
Final Implementation
The final implementation of prepare_X incorporates two loops—one for the categorical variable names and one for their respective values.
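A sketch of what this final version might look like; limiting each variable to its five most frequent values is an assumption made here to keep the example small:
categorical = ['make', 'model', 'engine_fuel_type', 'transmission_type',
               'driven_wheels', 'market_category', 'vehicle_size', 'vehicle_style']

# For each categorical variable, the values to encode (here: the most frequent ones)
categories = {c: list(df_train[c].value_counts().head().index) for c in categorical}

def prepare_X(df):
    df = df.copy()
    df['age'] = 2017 - df.year
    features = base + ['age']

    for v in [2, 3, 4]:
        df['num_doors_%s' % v] = (df.number_of_doors == v).astype('int')
        features.append('num_doors_%s' % v)

    for c, values in categories.items():          # outer loop: variable names
        for v in values:                          # inner loop: their values
            df['%s_%s' % (c, v)] = (df[c] == v).astype('int')
            features.append('%s_%s' % (c, v))

    df_num = df[features].fillna(0)
    return df_num.values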
Results and Issues
Upon re-evaluating the model with all features, a significant increase in RMSE suggests a possible issue, indicating that the approach or the features may need to be reassessed.
Conclusion
The transition from categorical to numerical variables is crucial for ML model performance, as seen in the various implementations. However, care must be taken to ensure that the added complexity genuinely benefits model accuracy.
Part 13
Regularization in Linear Regression
Introduction
In linear regression, the feature matrix may contain duplicate columns or columns that can be expressed as linear combinations of others. This results in a singular matrix when calculating the inverse, leading to poor model performance. A common approach to address this issue is through regularization.
Problem with Duplicate Columns
When duplicate columns exist in the feature matrix $X$, the Gram matrix $X^T X$ becomes singular, making its inverse non-existent. For example, if two columns are identical, attempting to compute np.linalg.inv(XTX) results in a “Singular matrix” error.
Example
Given a matrix (X):
X = [
[4, 4, 4],
[3, 5, 5],
[5, 1, 1],
[5, 4, 4],
[7, 5, 5],
[4, 5, 5]
]
Converting $X$ to a NumPy array, the Gram matrix is computed as:
X = np.array(X)
XTX = X.T.dot(X)
Because two columns of $X$ are identical, the resulting Gram matrix is singular.
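A quick way to confirm this (the exact error message may vary across NumPy versions):
try:
    np.linalg.inv(XTX)
except np.linalg.LinAlgError as e:
    print(e)                 # "Singular matrix"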
Regularization Technique
To mitigate the effects of duplicate columns, a small value $\alpha$ can be added to the diagonal of the Gram matrix: $X^T X + \alpha \cdot I$, where $I$ is the identity matrix. This addition improves the likelihood of obtaining a non-singular matrix and stabilizes the computation of the weights.
Impact of Noise
Introducing slight noise to the duplicate columns can also make the columns no longer identical, thus allowing the computation of the inverse:
X = [
[4, 4, 4],
[3, 5, 5],
[5, 1, 1],
[5, 4, 4],
[7, 5, 5],
[4, 5, 5.0000001],
]
This adjustment makes the Gram matrix $X^T X$ non-singular, so its inverse can be computed.
Practical Implementation
To incorporate regularization in the linear regression training function:
def train_linear_regression_reg(X, y, r=0.001):
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])
    XTX = X.T.dot(X)
    XTX = XTX + r * np.eye(XTX.shape[0])   # regularization: add r to the diagonal
    XTX_inv = np.linalg.inv(XTX)
    w_full = XTX_inv.dot(X.T).dot(y)
    return w_full[0], w_full[1:]
Results
After applying the regularization technique with a regularization parameter r = 0.01, the root mean squared error (RMSE) improved significantly:
rmse(y_val, y_pred) # Output: 0.45685446091134857
This demonstrates the effectiveness of regularization in controlling the weights and improving model performance.
Conclusion
Regularization is an essential technique in linear regression to address issues arising from duplicate features. By adding a small value to the diagonal of the Gram matrix, we can stabilize the inverse calculation, resulting in better model performance. Future work will involve optimizing the regularization parameter r.
Part 14
Model Tuning
The process of tuning the linear regression model involved identifying the optimal regularization parameter r using a validation set. The goal was to determine how this parameter impacts model performance.
Hyperparameter Search
A range of values for r was tested:
for r in [0.0, 0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10]:
    X_train = prepare_X(df_train)
    w0, w = train_linear_regression_reg(X_train, y_train, r=r)

    X_val = prepare_X(df_val)
    y_pred = w0 + X_val.dot(w)
    score = rmse(y_val, y_pred)

    print("reg parameter: ", r, "bias term: ", w0, "rmse: ", score)
Results Summary
- r = 0.0: huge bias term and high RMSE.
- Optimal r = 0.001: RMSE of 0.4568807317131709, indicating good model performance.
Final Model Training
After identifying the optimal r, the model was retrained on the combined training and validation datasets.
Combining Datasets
The datasets were concatenated using:
df_full_train = pd.concat([df_train, df_val])
y_full_train = np.concatenate([y_train, y_val])
df_full_train = df_full_train.reset_index(drop=True)
Preparing Features
The feature matrix was prepared using:
X_full_train = prepare_X(df_full_train)
Final Training
The model was trained on the full dataset:
w0, w = train_linear_regression_reg(X_full_train, y_full_train, r=0.001)
Testing the Model
The final model was evaluated on a test dataset to check its performance:
X_test = prepare_X(df_test)
y_pred = w0 + X_test.dot(w)
score = rmse(y_test, y_pred)
print("rmse: ", score)
Results
Test RMSE: 0.5094518818513973, indicating good generalization as it was close to the validation RMSE.
Using the Model for Predictions
The final model can be utilized to predict the price of an unseen car by extracting features and applying the model.
Feature Extraction
For instance, extracting features from a car in the test dataset:
car = df_test.iloc[20].to_dict()
df_small = pd.DataFrame([car])
X_small = prepare_X(df_small)
Price Prediction
The model was applied to the feature vector:
y_pred = w0 + X_small.dot(w)
y_pred = np.expm1(y_pred[0]) # Undoing the logarithm
Final Predicted Price
The predicted price was approximately $21,044.36.
Actual Price Comparison
Comparing with the actual price:
actual_price = np.expm1(y_test[20]) # Output: 34975.0
This comprehensive approach illustrates the steps involved in tuning a linear regression model, training it on a combined dataset, and making predictions on unseen data. The model performed well, showing generalization capabilities through consistent RMSE values.
Part 15
Using the model
Using the model involves two main steps:
- Feature Extraction: Extracting the feature vector from a car’s attributes.
- Price Prediction: Applying the trained model to the feature vector to predict the car’s price.
Feature Extraction
To demonstrate the model’s functionality, we take a specific car from the test dataset as if it were a new car. For example, the selected car has the following features:
- Make: Saab
- Model: 9-3 Griffin
- Year: 2012
- Engine Fuel Type: Premium Unleaded (Recommended)
- Engine HP: 220.0
- Engine Cylinders: 4.0
- Transmission Type: Manual
- Driven Wheels: All Wheel Drive
- Number of Doors: 4.0
- Market Category: Luxury
- Vehicle Size: Compact
- Vehicle Style: Wagon
- Highway MPG: 30
- City MPG: 20
- Popularity: 376
This information can be represented as a Python dictionary, simulating data input from a user on a website or app.
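As a sketch, that dictionary might look as follows, with string values in the lowercase, underscore-separated form produced by the cleaning step in Part 2 (the exact spellings are assumptions):
car = {
    'make': 'saab',
    'model': '9-3_griffin',
    'year': 2012,
    'engine_fuel_type': 'premium_unleaded_(recommended)',
    'engine_hp': 220.0,
    'engine_cylinders': 4.0,
    'transmission_type': 'manual',
    'driven_wheels': 'all_wheel_drive',
    'number_of_doors': 4.0,
    'market_category': 'luxury',
    'vehicle_size': 'compact',
    'vehicle_style': 'wagon',
    'highway_mpg': 30,
    'city_mpg': 20,
    'popularity': 376
}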
Creating a DataFrame for the Model
To prepare the data for the model, we convert the dictionary into a single-row DataFrame:
df_small = pd.DataFrame([car])
This DataFrame is then passed to the prepare_X() function to generate the feature matrix (feature vector).
Price Prediction
Once we have the feature vector, we apply the final model to predict the price:
X_small = prepare_X(df_small)
y_pred = w0 + X_small.dot(w)
To obtain the actual price in dollars, we must undo the logarithm transformation applied during training:
predicted_price = np.expm1(y_pred[0])   # take the single prediction and undo the log transform
For our example, this results in a predicted price of approximately $21,044.36.
Model Performance Evaluation
Finally, we can evaluate the model’s performance by comparing the predicted price to the actual price of the car:
actual_price = np.expm1(y_test[20])
The actual price of the selected car was $34,975.00, highlighting the discrepancy between the predicted and actual values.
Summary of Linear Regression Process
Data Preparation
- Import necessary libraries, including NumPy and Pandas.
- Load the dataset containing information about cars or relevant features.
- Identify the feature columns to be used in the regression model.
Pre-Processing
- Fill missing values in the dataset with zeros.
- Calculate new features, such as the age of the vehicle based on the manufacturing year.
- Create dummy variables for categorical features.
Building the Linear Regression Model
- Develop the train_linear_regression function to calculate model weights (coefficients) using the least squares method.
- Implement regularization with the train_linear_regression_reg function to address multicollinearity by adding a regularization parameter (r).
Model Training
- Prepare the feature matrix (X) and target (y) from the training and validation data.
- Train the model using the training and validation data.
- Calculate predictions using the trained model and compute errors using the Root Mean Square Error (RMSE) function.
Model Evaluation
- Use the full training data (combined training and validation data) to train the final model.
- Compute predictions on the test dataset and calculate the RMSE score to evaluate model performance.
Individual Prediction
- Extract a single entry from the test data for individual prediction.
- Calculate the predicted value and convert it back to the original scale using the np.expm1 function.