Predicting Bike Rentals

While I was taking the Data Scientist stream of courses on DataQuest.io, I worked on several exercise projects to gain hands-on experience with Data Wrangling, Exploratory Data Analysis and some basic Machine Learning algorithms.

In this post I have captured the work I did in one such exercise project: Predicting Bike Rentals. The intention was to gain a better understanding of the Data Science process and some experience working with the tools.

I have used Pandas for Data Wrangling, matplotlib + seaborn for visualization and sklearn for building and training predictive models. I have tried to bring out interesting insights from the data and use my understanding of basic Machine Learning algorithms to build a good predictive model.

The dataset is taken from http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset. It "contains the hourly and daily count of rental bikes between years 2011 and 2012 in Capital bikeshare system with the corresponding weather and seasonal information". Please follow the link for more information about the data set including attribute information.

At many places in this post (and my other posts) you will find that I use 'conversational' language, like "Let's", "We will". I find it easier to write posts in conversational style. I hope it doesn't bother you.

So, here we go. Let's begin by getting the dataset into memory and taking a sneak peek.

In [3]:
import numpy as np
import pandas as pd
from IPython.display import display
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from scipy.sparse import hstack
# Note: in newer scikit-learn releases, cross_validation and grid_search have been merged into sklearn.model_selection
from sklearn.cross_validation import train_test_split, KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.grid_search import GridSearchCV

# Read the data into a DataFrame. Use datatypes that are just big enough to hold the
# range of values in each field.
data_types = {'instant': np.int16, 'season': np.int8, 'yr': np.int8, 'mnth': np.int8, 
              'hr': np.int8, 'holiday': np.int8, 'weekday': np.int8, 'workingday': np.int8, 
              'weathersit': np.int8, 'temp': np.float64, 'atemp': np.float64, 
              'hum': np.float64, 'windspeed': np.float64, 'casual': np.int16, 
              'registered': np.int16, 'cnt': np.int16}
dataset = pd.read_csv('hour.csv', parse_dates=['dteday'], dtype=data_types)

# Have a sneak peek at the data
display(dataset.iloc[:, :9].head(3))
display(dataset.iloc[:, 9:].head(3))
   instant     dteday  season  yr  mnth  hr  holiday  weekday  workingday
0        1 2011-01-01       1   0     1   0        0        6           0
1        2 2011-01-01       1   0     1   1        0        6           0
2        3 2011-01-01       1   0     1   2        0        6           0

   weathersit  temp   atemp   hum  windspeed  casual  registered  cnt
0           1  0.24  0.2879  0.81          0       3          13   16
1           1  0.22  0.2727  0.80          0       8          32   40
2           1  0.22  0.2727  0.80          0       5          27   32

We have read the dataset using an appropriate datatype for each field. I always make it a point to do this, irrespective of the size of the dataset. Having worked as an embedded systems engineer for more than a decade, responsible use of memory has become one of my tenets.
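
As a quick sanity check - a sketch of my own, not part of the original analysis - the saving from explicit dtypes can be measured by reloading the file with pandas defaults and comparing memory footprints:

# Compare the memory footprint of the typed DataFrame with one loaded using pandas defaults
default_df = pd.read_csv('hour.csv', parse_dates=['dteday'])
typed_mb = dataset.memory_usage(deep=True).sum() / 1024.0**2
default_mb = default_df.memory_usage(deep=True).sum() / 1024.0**2
print('Explicit dtypes: {0:.2f} MB vs pandas defaults: {1:.2f} MB'.format(typed_mb, default_mb))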

Data looks good. All columns are labeled and numeric. Let's check if the dataset has any holes.

In [2]:
# Tabulate the value counts of isnull() across all columns; an index containing
# only False means there are no missing values anywhere in the dataset.
print(dataset.isnull().apply(lambda x: x.value_counts()).unstack().index.levels[1])
Index([False], dtype='object')

Cool...we are off to a quick start. The dataset has no holes. All the continuous numeric columns are already scaled. Categoricals have nice, sequential values. This is as good a dataset as they come.
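
A quick way to verify both claims - again a small sketch of my own, not from the original notebook - is to look at the ranges of the continuous columns and the unique values of the categoricals:

# Continuous columns should already lie in the [0, 1] range
print(dataset[['temp', 'atemp', 'hum', 'windspeed']].describe().loc[['min', 'max']])

# Categorical columns should hold small, sequential integer codes
for col in ['season', 'yr', 'mnth', 'hr', 'holiday', 'weekday', 'workingday', 'weathersit']:
    print(col, sorted(dataset[col].unique()))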

Exploratory Data Analysis

Let's plot our way through the dataset. We can start off with the distributions of rentals in various seasons of the year, months of the year, days of the week, hours of the day and weekdays vs holidays. It'll also be interesting to see how the rental counts vary with weather conditions.

In [10]:
from datetime import date, time

fig, axes = plt.subplots(3, 2, figsize=(10,12))
fig.subplots_adjust(hspace=.5)

# Seasons
seasons = ['springer', 'summer', 'fall', 'winter']
(dataset.cnt.groupby(dataset.season).sum() / 1000).plot.bar(ax=axes[0, 0])
p = axes[0, 0].set(title = 'Rental count in various seasons\n', xlabel = '', 
                   ylabel = 'Rental Count (in 1000s)')
tl = axes[0, 0].set_xticklabels(seasons, rotation='horizontal')

# Weather conditions
wxConditions = ['Clear', 'Misty', 'Light\nSnow/Rain', 'Heavy\nSnow/Rain']
(dataset.cnt.groupby(dataset.weathersit).sum() / 1000).plot.bar(ax=axes[0, 1])
p = axes[0, 1].set(title = 'Rental count in various weather conditions\n',  xlabel = '', 
                   ylabel = 'Rental Count (in 1000s)')
tl = axes[0, 1].set_xticklabels(wxConditions, rotation='horizontal')

# Months of the year
months = [date(2016, i, 1).strftime('%B') for i in range(1,13)]
(dataset.cnt.groupby(dataset.mnth).sum() / 1000).plot.bar(ax=axes[1, 0])
p = axes[1, 0].set(title = 'Rental count in different months\n', xlabel = '', 
                   ylabel = 'Rental Count (in 1000s)')
tl = axes[1, 0].set_xticklabels(months, rotation='60')

# Days of the week
days = [date(2016, 2, i).strftime('%A') for i in range(1,8)]
(dataset.cnt.groupby(dataset.weekday).sum() / 1000).plot.bar(ax=axes[1, 1])
p = axes[1, 1].set(title = 'Rental count on different weekdays\n', xlabel = '', 
                   ylabel = 'Rental Count (in 1000s)')
tl = axes[1, 1].set_xticklabels(days, rotation='60')

# Hours of the day
hours = [time(i).strftime('%I %p') for i in range(24)]
(dataset.cnt.groupby(dataset.hr).sum() / 1000).plot.bar(ax=axes[2, 0])
p = axes[2, 0].set(title = 'Rental count in different hours\n', xlabel = '', 
                   ylabel = 'Rental Count (in 1000s)')
tl = axes[2, 0].set_xticklabels(hours, rotation='vertical')

# Holiday vs Working day
dayTypes = ['Weekend', 'Working Day', 'Holiday']
(dataset.cnt.groupby([dataset.holiday, dataset.workingday]).sum() / 1000).plot.bar(ax=axes[2, 1])
p = axes[2, 1].set(title = 'Rental count - Working days / Weekends \n/ Holidays\n', 
                   xlabel = '', ylabel = 'Rental Count (in 1000s)')
tl = axes[2, 1].set_xticklabels(dayTypes, rotation='60')

All the graphs suggest common-sense observations, like rentals peaking on clear days and in the morning/evening hours, and dropping to near zero on holidays. A quick glance at the "Rental count - Working days / Weekends / Holidays" plot may give the wrong impression that working-day rentals are much higher than weekend rentals. However, the "Working Day" bar is a cumulative count over the five weekdays while the "Weekend" bar is a cumulative count over only the two weekend days; hence the difference. This is further supported by the "Rental count on different weekdays" plot, which shows a fairly uniform distribution across all days of the week. The sketch below makes the same point by averaging per day instead of summing.
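
To confirm that explanation, here is a small sketch (my own addition, not part of the original analysis) that averages the rentals per calendar day for each day type instead of summing them:

# Average rentals per calendar day for weekends, working days and holidays.
# Grouping by (holiday, workingday) gives (0,0)=weekend, (0,1)=working day, (1,0)=holiday.
daily_totals = dataset.groupby(['dteday', 'holiday', 'workingday'])['cnt'].sum().reset_index()
avg_per_day = daily_totals.groupby(['holiday', 'workingday'])['cnt'].mean()
avg_per_day.index = ['Weekend', 'Working Day', 'Holiday']
print(avg_per_day)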

Let's find out how the rental counts are distributed between casual and registered users.

In [11]:
fig, axes = plt.subplots(1, 2, figsize=(10,4))

regUsers = (dataset.registered.groupby([dataset.holiday, 
                                        dataset.workingday]).sum() / 1000)
casUsers = (dataset.casual.groupby([dataset.holiday, 
                                    dataset.workingday]).sum() / 1000)
userType_dayType_df = pd.DataFrame({'Registered': regUsers, 'Casual': casUsers})
userType_dayType_df.plot.bar(ax=axes[0])
p = axes[0].set(title = 'Distribution of User-Type (Casual vs Registered Users)\n'
                        'on Working days, Weekends and Holidays\n', 
                xlabel = '', ylabel = 'Rental Count (in 1000s)')
tl = axes[0].set_xticklabels(dayTypes, rotation='horizontal')

regUsers = (dataset.registered.groupby(dataset.season).sum() / 1000)
casUsers = (dataset.casual.groupby(dataset.season).sum() / 1000)
userType_season_df = pd.DataFrame({'Registered': regUsers, 'Casual': casUsers})
userType_season_df.plot.bar(ax=axes[1])
p = axes[1].set(title = 'Distribution of User-Type (Casual vs Registered Users)\n'
                        'during the four seasons\n', 
                xlabel = '', ylabel = 'Rental Count (in 1000s)')
tl = axes[1].set_xticklabels(seasons, rotation='horizontal')

plt.tight_layout()
In [12]:
print("\nProportion of Casual Rentals on Weekends, Working Days and Holidays")
caslProp_dayType = pd.DataFrame(((userType_dayType_df.Casual / 
                                  userType_dayType_df.Registered) * 100).tolist(), 
                        index=dayTypes, columns=["Casual Rentals %"])
display(caslProp_dayType)

print("\nProportion of Casual Rentals in different seasons")
caslProp_seasons = pd.DataFrame(((userType_season_df.Casual / 
                                  userType_season_df.Registered) * 100).tolist(), 
                        index=seasons, columns=["Casual Rentals %"])
display(caslProp_seasons)
Proportion of Casual Rentals on Weekends, Working Days and Holidays
             Casual Rentals %
Weekend             46.914948
Working Day         15.247156
Holiday             39.872673

Proportion of Casual Rentals in different seasons
             Casual Rentals %
springer            14.759718
summer              28.461948
fall                27.075534
winter              18.232137

As expected, casual rentals are higher on weekends and holidays than on working days. With respect to seasons, the casual-rental trend follows the total-rental trend - higher in summer/fall and lower in winter/spring. In other words, casual-rental counts and total-rental counts are highly correlated. Speaking of which, it will be interesting to find out which other features have a high correlation with the total rental count.
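
As a quick numeric check of that statement (my own addition), we can compute the correlations of the two user-type counts with the total directly:

# Pearson correlation of casual and registered rentals with the total rental count
print(dataset[['casual', 'registered']].corrwith(dataset.cnt))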

In [13]:
plt.figure().set_size_inches(8, 5)
ax = dataset.corr().cnt.plot.bar()
plt.title('Pairwise correlation (Pearson) of columns')
plt.ylabel('Correlation Coefficient');

The features, "casual' and "registered", are components of "cnt", and hence, are highly correlated with cnt. Climate features - "temp"/"atemp" (Temperature) and "hum" (Humidity) have good correlations with cnt for obvious reasons - cycling is more pleasurable on sunny and less-humid days than on other days. "hr" is another highly correlated feature with rental traffic growing with increasing hours of the day.

Given these correlations, let's find out how good a Linear Regression model would be at predicting rental counts. But before we can do this, all the categoricals have to be converted to their one-hot vector equivalents.

In [103]:
# Label-Encode categoricals that don't start with a zero
dataset['season'] = LabelEncoder().fit_transform(dataset.season)
dataset['mnth'] = LabelEncoder().fit_transform(dataset.mnth)
dataset['weathersit'] = LabelEncoder().fit_transform(dataset.weathersit)

# One-Hot-Encode all categoricals
cat_cols = ['season', 'mnth', 'hr', 'weekday', 'weathersit']
cat_csrs = OneHotEncoder().fit_transform(dataset[cat_cols])
dataset.drop(cat_cols, axis=1, inplace=True)

# Segregate predictors and labels. Since 'casual' and 'registered' are directly 
# correlated to 'cnt' ('casual'+'registered'='cnt'), we will exclude them from 
# the feature set. 
X = dataset[dataset.columns.difference(['dteday', 'casual', 'registered', 
                                        'cnt', 'instant'])]
y = dataset['cnt']

# Stack them to create the feature matrix
X = hstack((X, cat_csrs), format='csr')

print('Shape of the Feature Matrix: {0}'.format(X.shape))
Shape of the Feature Matrix: (17379, 58)

Linear Regression Model

We now have a feature matrix that is ready for training the models. It contains the seven remaining numeric columns (yr, holiday, workingday, temp, atemp, hum, windspeed) plus one one-hot column per category level (4 seasons + 12 months + 24 hours + 7 weekdays + 4 weather situations = 51), which accounts for the 58 columns reported above.

Let's find out how LinearRegression performs on this feature set. We will split the feature matrix into train and test sets, perform K-Fold cross-validation (with 10 folds) on the training set and compute R2 scores for both the cross-validation predictions and the test set.

To make it interesting, let's calculate the scores at increasing split sizes and observe how the CV and test scores change with the size of the training set.

In [140]:
cv_score_lr
Out[140]:
array([ 0.5001337 ,  0.60624393,  0.62146945,  0.63154472,  0.63141294,
        0.63794557,  0.63665087,  0.63177766,  0.61829577,  0.60979509,
        0.60858148,  0.60529454,  0.58738359,  0.58033705,  0.55903025,
        0.56002477,  0.53534464,  0.5650372 ,  0.60387806])
In [149]:
from sklearn.cross_validation import cross_val_predict
from sklearn.metrics import r2_score

# Compute a list of increasing sample split sizes. 
# This is used to check how cross-validation and test scores change with sample sizes.
train_sizes = [1.0 - 1.0/float(i) for i in range(2, 21)]

# Lists for holding cross-validation and test scores
cv_score_lr = np.zeros(len(train_sizes))
test_score_lr = np.zeros(len(train_sizes))

# For each sample split size compute CV and test scores
for i, sz in enumerate(train_sizes):
    testIdx = int(len(dataset) * sz)
    X_train, X_test = X[:testIdx, :], X[testIdx:, :]
    y_train, y_test = y[:testIdx], y[testIdx:]

    lin_regr = LinearRegression()
    val_preds = cross_val_predict(lin_regr, X_train, y_train, cv=10)
    cv_score_lr[i] = r2_score(y_train, val_preds)

    lin_regr.fit(X_train, y_train)
    test_preds = lin_regr.predict(X_test)
    test_score_lr[i] = r2_score(y_test, test_preds)

# Plot CV and test scores
plt.figure().set_size_inches(8, 5)
plt.plot(cv_score_lr)
plt.plot(test_score_lr)
plt.ylim(0.4, 1.0)
plt.yticks(np.arange(0.4, 1.05, 0.05))
plt.legend(['CV Score', 'test Score'])
plt.title('CV and test R2 scores at increasing train:test split ratios')
plt.xlabel('Proportion of total samples used for training')
plt.ylabel('R2 score')
tick_lbls = ['{0:0.3f}'.format(sz) for sz in train_sizes]
plt.xticks(range(19), tick_lbls, rotation=60);

The CV score increases from 0.63 to 0.66 and plateaus at the 80% split size. This suggests that the model has learnt all it can from about 80% of the samples.

The test score increases from 0.50 and peaks at 0.64 at the 85.7% split size. It stays in the 0.63 to 0.64 range up to the 88.9% split size before starting to plummet as the test set shrinks below 10% of the samples. This may be due to a non-uniform distribution of samples, i.e. the last 10% of the samples differ from the first 90% the model has learnt from. A shuffled split may change this behaviour, but we won't do that here since we already know the best performance of this model; a sketch is shown below for completeness.
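
For completeness, here is what a shuffled split could look like - a sketch only, not something that was run as part of the analysis above. It reuses train_test_split, which was imported at the top:

# One shuffled 80:20 split instead of the chronological split used above.
# Note that shuffling leaks 'future' hours into training, so the score can be optimistic.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
shuffled_lr = LinearRegression().fit(X_tr, y_tr)
print('Test R2 with a shuffled split: {0:0.4f}'.format(r2_score(y_te, shuffled_lr.predict(X_te))))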

0.66 is not a very good score. This is probably because the dataset has non-linearities that LinearRegression cannot model. So, let's find out how well a non-linear model performs on this dataset.

Modeling nonlinearity using Decision Trees

Decision Trees are known to be fairly accurate and to pick up non-linearities in data pretty well. So, let's go green.

In [151]:
from sklearn.tree import DecisionTreeRegressor

train_sizes = [1.0 - 1.0/float(i) for i in range(2, 21)]

# Arrays for holding cross-validation and test scores
cv_score = np.zeros(len(train_sizes))
test_score = np.zeros(len(train_sizes))

# For each sample split size compute CV and test scores
for i, sz in enumerate(train_sizes):
    testIdx = int(len(dataset) * sz)
    X_train, X_test = X[:testIdx, :], X[testIdx:, :]
    y_train, y_test = y[:testIdx], y[testIdx:]

    tree_regr = DecisionTreeRegressor()
    val_preds = cross_val_predict(tree_regr, X_train, y_train, cv=10)
    cv_score[i] = r2_score(y_train, val_preds)

    tree_regr.fit(X_train, y_train)
    test_preds = tree_regr.predict(X_test)
    test_score[i] = r2_score(y_test, test_preds)

# Plot CV and test scores
plt.figure().set_size_inches(8, 5)
plt.plot(cv_score)
plt.plot(test_score)
plt.ylim(0.4, 1.0)
plt.yticks(np.arange(0.4, 1.05, 0.05))
plt.legend(['CV Score', 'test Score'])
plt.title('CV and test R2 scores at increasing train:test split ratios')
plt.xlabel('Proportion of total samples used for training')
plt.ylabel('R2 score')
tick_lbls = ['{0:0.3f}'.format(sz) for sz in train_sizes]
plt.xticks(range(19), tick_lbls, rotation=60);

Not bad. As expected, the Decision Tree performed much better than Linear Regression: the R2 score jumped from 0.64 (Linear Regression) to 0.78. But we want more from this algorithm. It has a bunch of parameters that can be tuned for better performance.

Tuning Decision Tree parameters

We will apply a Grid Search on the following parameters and determine values that give the best score.

  • max_depth: 3, 5, 10, None
  • min_samples_split: 25, 10, 5, 2
  • min_samples_leaf: 5, 3, 1

This results in a grid of 4 x 4 x 3 = 48 parameter combinations. Using GridSearchCV we can train a model for each parameter combination and record the mean cross-validation R2 score for each one. The combination with the best score is kept, and the refitted best estimator can then be used to predict on the test set.

In [173]:
split_sz = 0.8
testIdx = int(len(dataset) * split_sz)
X_train, X_test = X[:testIdx, :], X[testIdx:, :]
y_train, y_test = y[:testIdx], y[testIdx:]

param_grid = {'max_depth': [3, 5, 10, None], 
              'min_samples_split': [25, 10, 5, 2], 
              'min_samples_leaf':  [5, 3, 1]
             }

clf = GridSearchCV(DecisionTreeRegressor(), param_grid, cv=10)
clf.fit(X_train, y_train)

best_preds = clf.predict(X_test)
best_r2 = r2_score(y_test, best_preds)

print('Best parameters: {0}'.format(clf.best_params_))
print('R2 Score from the best parameters: {0:0.4f}'.format(best_r2))
Best parameters: {'min_samples_split': 25, 'max_depth': None, 'min_samples_leaf': 1}
R2 Score from the best parameters: 0.8103

There we go. The best score is 0.81, better than the 0.78 that the model with default parameter values returned. The parameter that made the difference was min_samples_split, with a value of 25 instead of the default of 2.
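
If you want to see how each of the 48 combinations fared, and not just the winner, the fitted GridSearchCV object keeps the per-combination mean CV scores. Here is a sketch using grid_scores_, the attribute exposed by the older sklearn.grid_search module used here (newer scikit-learn exposes cv_results_ instead):

# Rank the parameter combinations by mean cross-validation score and
# show the three best and the three worst.
ranked = sorted(clf.grid_scores_, key=lambda gs: gs.mean_validation_score, reverse=True)
for gs in ranked[:3] + ranked[-3:]:
    print('{0:0.4f}  {1}'.format(gs.mean_validation_score, gs.parameters))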

While the Decision Tree did its job by improving the score from 0.64 to 0.81, we are still a long way from the perfect score of 1.0. OK, it may be too ambitious to aim for a perfect score, but we can certainly aim for a score above 0.9, or even above 0.95.

So, how do we get there? The answer is Ensembling Methods.

Scoring high with Ensembling Methods

Following is an extract from sklearn's documentation. "The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator."

Gradient Tree Boosting is known to be a highly robust ensembling method with good predictive power. Its only listed disadvantage is poor scalability, due to the sequential nature of building the base estimators. Since our dataset is pretty small, this is not really a disadvantage for us. So, let's apply this method to our dataset and find out how well it performs.

Like we did with Decision Trees, we will begin by training a model with default parameters and then search its parameter space for the parameters that give the best score.

In [175]:
from sklearn.ensemble import GradientBoostingRegressor

gbr_model = GradientBoostingRegressor()
gbr_model.fit(X_train.toarray(), y_train)
gbr_preds = gbr_model.predict(X_test.toarray())
test_score_gbr = r2_score(y_test, gbr_preds)

print('R2 Score from the GBR model trained on default parameters: '
      '{0:0.4f}'.format(test_score_gbr))
R2 Score from the GBR model trained on default parameters: 0.6882

It's definitely time for parameter tuning. Let's perform a Grid Search on the following parameters.

  • max_depth: 3, 5, 10, None
  • n_estimators: 100, 300, 500
In [177]:
param_grid = {'max_depth': [3, 5, 10, None], 'n_estimators': [100, 300, 500]}

clf_gbr = GridSearchCV(GradientBoostingRegressor(), param_grid, cv=5)
clf_gbr.fit(X_train.toarray(), y_train)

best_preds_gbr = clf_gbr.predict(X_test.toarray())
best_r2_gbr = r2_score(y_test, best_preds_gbr)

print('Best parameters: {0}'.format(clf_gbr.best_params_))
print('R2 Score from the best parameters: {0:0.4f}'.format(best_r2_gbr))
Best parameters: {'n_estimators': 500, 'max_depth': 5}
R2 Score from the best parameters: 0.8799
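
Because boosting builds its trees sequentially, we can also watch the test score improve as more stages are added, using the estimator's staged_predict generator. This is a sketch of my own on top of the best estimator refit by the grid search:

# Test R2 after every 100 boosting stages of the best GBR model
best_gbr = clf_gbr.best_estimator_
staged_scores = [r2_score(y_test, preds)
                 for preds in best_gbr.staged_predict(X_test.toarray())]
for n in range(100, len(staged_scores) + 1, 100):
    print('{0:4d} trees: R2 = {1:0.4f}'.format(n, staged_scores[n - 1]))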

Conclusion

In this post we built three different types of models on a small and clean dataset: a linear model, a non-linear model and an ensemble model. We experienced the simplicity of linear models through a LinearRegression model, the power of non-linear models through a DecisionTree model and the promise of ensembling through a GradientBoostingRegressor model.

Whenever possible, one starts with a simple linear model, because it is simple to build and gives us a baseline score against which the performance of more complex, non-linear models can be compared. The LinearRegression model was able to explain two thirds of the variance (R2 = 0.66).

The Decision Tree model showed that it could explain 81% of the variance by learning some of the non-linearities in the dataset. Then came the GradientBoostingRegressor, whose army of weak models joined hands and displayed the power of teamwork by explaining 88% of the variance.

In the process we also did some parameter tuning to help the models do their best on our dataset.


That's it for this post. Thank you for taking the time to read it; I hope you enjoyed it as much as I enjoyed writing it. You may find my posts in the "Kaggle Series" equally interesting. In the Kaggle Series I adopt a systematic Data Science process to describe my solutions to the Kaggle competitions that I participated in. Following are the links to a couple of posts in the series.

- Vinay Huddar