Notes on "Machine Learning Bookcamp"

Table of Contents

1 Introduction

2 Chapter 2: Predicting car prices

2.1 Exercises and code

2.1.1 Utility functions

Convert the code from chapter 2 to a set of functions:

import numbers

import pandas as pd
import numpy as np

Read the data and clean up the column names / alphanumeric data:

def read_data(file):
    """Read a dataframe from a CSV file.

    Parameters:
    file (string): path to a CSV file.

    Returns:
    DataFrame holding the contents of the file.
    """
    df = pd.read_csv(file)
    return df


def clean_alphanum_data(df):
    """Clean up alphanumeric data in a dataframe.

    Convert all strings to lower case and replace spaces with underscores.

    Parameters:
    df (DataFrame): the dataframe to be cleaned.

    Returns:
    None. The dataframe is modified in place.
    """
    df.columns = df.columns.str.lower().str.replace(' ', '_')

    string_columns = list(df.dtypes[df.dtypes == 'object'].index)
    for col in string_columns:
        df[col] = df[col].str.lower().str.replace(' ', '_')
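
As a quick sanity check (my own toy example, not from the book), the cleanup lower-cases both the column names and the string values and replaces spaces with underscores:

df_toy = pd.DataFrame({'Engine HP': [300, 250],
                       'Transmission Type': ['MANUAL', 'Automatic']})
clean_alphanum_data(df_toy)
# df_toy now has columns ['engine_hp', 'transmission_type'] and the
# transmission values 'manual' and 'automatic'.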

Split the data frame into a train, validation and test set:

def split_data_frame(df, split=0.2, seed=None):
    """Split a dataframe into a train, validation and test set.

    The dataframe is first randomized and then split into three parts.

    Parameters:
    df (DataFrame): the dataframe to split.
    split (float):  fraction of the dataframe to use for each of the validation and test sets.
    seed (int):     the seed used for randomization.

    Returns:
    3-tuple of DataFrame, DataFrame, DataFrame (train, validation, test).

    """
    n = len(df)

    n_val = int(split * n)
    n_test = int(split * n)
    n_train = n - (n_val + n_test)

    idx = np.arange(n)
    if isinstance(seed, numbers.Number):
        np.random.seed(seed)
    np.random.shuffle(idx)

    df_shuffled = df.iloc[idx]

    df_train = df_shuffled.iloc[:n_train].copy()
    df_val = df_shuffled.iloc[n_train:n_train+n_val].copy()
    df_test = df_shuffled.iloc[n_train+n_val:].copy()

    return df_train, df_val, df_test
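
A small check of my own (not from the book): the three parts should together cover the whole dataframe, and the seed should make the split reproducible:

df_toy = pd.DataFrame({'x': range(10), 'y': range(10)})
train, val, test = split_data_frame(df_toy, split=0.2, seed=2)
assert len(train) + len(val) + len(test) == len(df_toy)  # 6 + 2 + 2 == 10

# The same seed gives the same shuffle, and hence the same split.
train2, _, _ = split_data_frame(df_toy, split=0.2, seed=2)
assert train.index.equals(train2.index)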

Prepare the data. I add two arguments to prepare_X: base, which is a list of the features in the dataframe that should be used, and fns, a list of functions to extract additional features.

def prepare_X(df, base, fns=[]):
    """Prepare a dataframe for learning.

    Convert the dataframe to a Numpy array:

    - Extract the features in `base`.

    - Apply the functions in `fns` to the dataframe to derive new features from
      existing ones (e.g., for binary encoding).

      The elements of `fns` should be sequences of the form `(fn, arg, arg,
      ...)`. Before calling each function, `df` is prepended to the list of
      arguments. Each function should add its new feature(s) to `df` and
      return a list of the names of the new feature(s) as strings.

    - Fill any missing data with 0.

    Note that `df` is not modified. The functions in `fns` should modify their
    dataframe argument, but they operate on a copy of `df`.

    Parameters:
    df (DataFrame): dataframe to convert.
    base (list of strings): list of fields in the dataframe to be used for the array.
    fns (list of tuples (function, arg list)): feature engineering functions.

    Returns:
    ndarray of the prepared data.

    """
    df = df.copy()
    features = base.copy()

    for fn, *args in fns:
        args = [df] + args
        new_features = fn(*args) # Note: `fn` should also modify the local copy of `df`!
        features += new_features

    df_num = df[features]
    df_num = df_num.fillna(0)
    X = df_num.values

    return X
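
To illustrate the calling convention for `fns` (my own toy example; `add_double` is a made-up helper, not one of the book's functions):

def add_double(df, feature):
    # Hypothetical feature function: add a doubled copy of `feature` to `df`
    # and return the names of the new features, as prepare_X expects.
    df[feature + '_x2'] = df[feature] * 2
    return [feature + '_x2']

df_toy = pd.DataFrame({'a': [1, 2], 'b': [3, None]})
X_toy = prepare_X(df_toy, base=['a', 'b'], fns=[[add_double, 'a']])
# X_toy is a 2x3 array with columns a, b (NaN filled with 0) and a_x2;
# df_toy itself is left unchanged because prepare_X works on a copy.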

Two functions for feature engineering:

def binary_encode(df, feature, n=5):
    """Binary encode a categorical feature.

    Take the top `n` values of `feature` and add features to `df` to binary
    encode `feature`. The dataframe is modified in place.

    Parameters:
    df (DataFrame): the dataframe to add the feature to.
    feature (string): feature in df to be binary encoded.
    n (int): number of values for feature to encode.

    Returns:
    List of new features.

    """
    assert feature in df

    top_values = df[feature].value_counts().head(n)
    new_features = []
    for v in top_values.keys():
        binary_feature = feature + '_%s' % v
        df[binary_feature] = (df[feature] == v).astype(int)
        new_features.append(binary_feature)

    return new_features
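
For example (a toy check of my own), with three fuel values and n=2, only the two most frequent values are encoded:

df_toy = pd.DataFrame({'fuel': ['diesel', 'diesel', 'diesel',
                                'petrol', 'petrol', 'electric']})
binary_encode(df_toy, 'fuel', n=2)
# Returns ['fuel_diesel', 'fuel_petrol'] and adds two 0/1 columns to df_toy;
# the least frequent value 'electric' is not encoded.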


def encode_age(df, year_field, current_year):
    """Encode the age of an item as a feature.

    The age is calculated on the basis of the contents of `year_field` and
    `current_year`.

    Parameters:
    df (DataFrame): dataframe to encode the age in.
    year_field (string): the feature that encodes the relevant year.
    current_year (int): the year used to calculate the age.

    Returns:
    The list ['age'] (the name of the new feature).

    """
    assert year_field in df
    assert df[year_field].dtype == 'int64'

    df['age'] = current_year - df[year_field]

    return ['age']

binary_encode can be generalized to a function that loops over a list of features:

def binary_encodes(df, features, n=5):
    """Binary encode a list of features.

    Each feature is passed to `binary_encode`. See there for details. Note that
    `df` is modified in place.

    Parameters:
    df (DataFrame): the dataframe to engineer features from.
    features (list of strings): list of features to binary encode.
    n (int): number of values for feature to encode.

    Returns:
    A list of features added to `df`.

    """
    all_new_features = []
    for feature in features:
        new_features = binary_encode(df, feature, n)
        all_new_features += new_features

    return all_new_features

The linear_regression and rmse functions are unchanged from chapter 2:

def linear_regression(X, y, r=0.0):
    """Perform linear regression.

    Parameters:
    X (ndarray): array of input values.
    y (ndarray): target values.
    r (float): regularization amount.

    Returns:
    Tuple of float, ndarray (bias, array of weights)

    """
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])

    XTX = X.T.dot(X)
    reg = r * np.eye(XTX.shape[0])
    XTX = XTX + reg

    XTX_inv = np.linalg.inv(XTX)
    w = XTX_inv.dot(X.T).dot(y)

    return w[0], w[1:]
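
A quick sanity check on synthetic data (my own, not from the book): with little noise and no regularization, the recovered bias and weights should be close to the true ones:

np.random.seed(0)
X_syn = np.random.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y_syn = 4.0 + X_syn.dot(true_w) + np.random.normal(scale=0.01, size=1000)

w_0, w = linear_regression(X_syn, y_syn)
assert abs(w_0 - 4.0) < 0.05              # bias close to 4.0
assert np.allclose(w, true_w, atol=0.05)  # weights close to [1.0, -2.0, 0.5]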


def rmse(y, y_pred):
    """Compute the root mean square error.

    Parameters:
    y (ndarray): target values.
    y_pred (ndarray): predicted values.

    Returns:
    float

    """
    error = y_pred - y
    mse = (error ** 2).mean()
    return np.sqrt(mse)

2.1.2 Car prices

The goal is to see if more feature engineering improves the model. The RMSE of the model as developed in chapter 2 is 0.46. Can this be improved?

Let us set up the model. First, read the data and clean it up:

df = read_data('../data/cars.csv')
clean_alphanum_data(df)
df.head()
  make       model  year             engine_fuel_type  engine_hp  engine_cylinders transmission_type  ...                        market_category  vehicle_size vehicle_style highway_mpg city_mpg  popularity   msrp
0  bmw  1_series_m  2011  premium_unleaded_(required)      335.0               6.0            manual  ...  factory_tuner,luxury,high-performance       compact         coupe          26       19        3916  46135
1  bmw    1_series  2011  premium_unleaded_(required)      300.0               6.0            manual  ...                     luxury,performance       compact   convertible          28       19        3916  40650
2  bmw    1_series  2011  premium_unleaded_(required)      300.0               6.0            manual  ...                luxury,high-performance       compact         coupe          28       20        3916  36350
3  bmw    1_series  2011  premium_unleaded_(required)      230.0               6.0            manual  ...                     luxury,performance       compact         coupe          28       18        3916  29450
4  bmw    1_series  2011  premium_unleaded_(required)      230.0               6.0            manual  ...                                 luxury       compact   convertible          28       18        3916  34500

[5 rows x 16 columns]

Split the data set into a train, validation and test set:

df_train, df_val, df_test = split_data_frame(df, split=0.2, seed=2)

y_train = np.log1p(df_train.msrp.values)
y_val = np.log1p(df_val.msrp.values)
y_test = np.log1p(df_test.msrp.values)

Remove the target value ("msrp" or "manufacturer's suggested retail price") from the data set:

del df_train['msrp']
del df_val['msrp']
del df_test['msrp']

Prepare the data. To confirm the results in the book (and make sure my code is working), I'll first use the same parameters:

# Prepare the training data.
base = ['engine_hp', 'engine_cylinders', 'highway_mpg', 'city_mpg', 'popularity']
fns = [[encode_age, 'year', 2017],
       [binary_encodes, ["number_of_doors",
                         "make",
                         "engine_fuel_type",
                         "transmission_type",
                         "driven_wheels",
                         "market_category",
                         "vehicle_size",
                         "vehicle_style"],
        5]]
X_train = prepare_X(df_train, base, fns)

Now train the model:

w_0, w = linear_regression(X_train, y_train, 0.01)

If the model were perfect, applying it to the training data would reproduce the original (log-transformed) prices. In reality, it doesn't, as a comparison of the predicted and actual distributions shows:

from matplotlib import pyplot as plt
import seaborn as sns
y_pred = w_0 + X_train.dot(w)
plt.clf()
sns.histplot(y_pred, label='pred')
sns.histplot(y_train, label='y', color='red')
plt.legend()
plt.savefig('figures/figure2-06.png')
'figures/figure2-06.png'

figure2-06.png

We can compute the RMSE for the model:

rmse(y_train, y_pred)
0.46020995201980425

We should of course compute the RMSE on the validation set:

X_val = prepare_X(df_val, base, fns)
y_pred = w_0 + X_val.dot(w)
rmse(y_val, y_pred)
0.476510145790575

Let's follow the suggestion in exercise 2.5.1 and include more values in the binary encoded features:

fns = [[encode_age, 'year', 2017],
       [binary_encodes, ["number_of_doors",
                         "make",
                           "engine_fuel_type",
                         "transmission_type",
                         "driven_wheels",
                         "market_category",
                         "vehicle_size",
                         "vehicle_style"],
        8]]

X_train = prepare_X(df_train, base, fns)

w_0, w = linear_regression(X_train, y_train, 0.01)

y_pred = w_0 + X_train.dot(w)

plt.clf()
sns.histplot(y_pred, label='pred')
sns.histplot(y_train, label='y', color='red')
plt.legend()
plt.savefig('figures/figure2-07.png')
'figures/figure2-07.png'

figure2-07.png

Evaluating against the validation set:

X_val = prepare_X(df_val, base, fns)
y_pred = w_0 + X_val.dot(w)
rmse(y_val, y_pred)
0.4850113357947244

The performance has degraded slightly rather than improved (0.485 against 0.477 on the validation set).

Note that trying to use 10 values for binary encoding fails, because the validation set then gains an extra feature. The error reported is:

ValueError: shapes (2382,61) and (60,) not aligned: 61 (dim 1) != 60 (dim 0)

Presumably one of the binary-encoded features has a different set of top values in the validation set than in the training set, so prepare_X produces a different number of columns there. The underlying cause is that binary_encode derives the top values from whatever dataframe it is given; a more robust version would compute them from the training set only and reuse them for the validation and test sets. A sketch of such a fix follows.
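
A sketch of a possible fix (my own, not from the book): make the encoder accept a precomputed list of values, derive those values from the training set only, and reuse them when preparing the validation and test sets. The helper below is hypothetical:

def binary_encode_fixed(df, feature, values):
    """Binary encode `feature` using a fixed, precomputed list of values."""
    new_features = []
    for v in values:
        binary_feature = feature + '_%s' % v
        df[binary_feature] = (df[feature] == v).astype(int)
        new_features.append(binary_feature)
    return new_features

# Take the top values from the training set only...
top_makes = df_train['make'].value_counts().head(10).index.tolist()
# ...and reuse them for all three sets, so they get exactly the same columns.
fns_fixed = [[encode_age, 'year', 2017],
             [binary_encode_fixed, 'make', top_makes]]
X_train_fixed = prepare_X(df_train, base, fns_fixed)
X_val_fixed = prepare_X(df_val, base, fns_fixed)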

Author: Joost Kremers

Created: 2020-11-26 Thu 00:27
