In the first part of this series I briefly described my perspective of the Data Science Process and illustrated it with a case study from Kaggle's 'Predicting RedHat Business Value' competition. Since the post was getting too long and I didn't want to invite a TL;DR, I stopped after the "Exploratory Data Analysis" (EDA) stage with a promise to complete it in this post. So, here we are. Let's continue with the case study and develop it through the remaining stages of the Data Science Process:

  1. Machine Learning / Statistical Modeling
  2. Visualising / Reporting

Machine Learning / Statistical Modeling

This is the stage where the "learning" happens. Machine Learning algorithms and/or Statistical Modeling techniques are used to find patterns in the data and predict the outcome. I use "Machine Learning" and "Statistical Modeling" interchangeably in this post, although there are some differences between the two. Statistical modeling techniques formulate explicit relationships between variables; they deal with mathematical equations. Machine Learning algorithms, on the other hand, are not as concerned with mathematical formulations as they are with relating observations to outcomes, irrespective of the complexity of the relationship.
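
For example, Logistic Regression, which we will use shortly, is a statistical model with an explicit equation relating the features to the probability of a positive outcome:

$$P(y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-(\beta_0 + \boldsymbol{\beta}^\top \mathbf{x})}}$$

Learning, in this view, amounts to estimating the coefficients from data.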

Before we can start talking about algorithms/models and their parameters, we need to do some groundwork. Although the data was preprocessed and analysed in the earlier stages, it is still not completely ready for modeling. The first thing we need to do is get all the data into a single place, which for our case study means joining the People and Activity tables.

We will read the preprocessed data into memory using Pandas' read_csv() function, as we did at the beginning of the preprocessing stage.

In [32]:
import numpy as np
import pandas as pd
import seaborn as sns

import time
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split 
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder 
from scipy.sparse import csr_matrix, hstack

import xgboost as xgb
In [33]:
# Get Preprocessed Activity and People Data into memory
act_train_df = pd.read_csv('../processed_data/activity_table.csv',parse_dates=['date_act'])
people_df = pd.read_csv('../processed_data/people_table.csv', parse_dates=['date_ppl'])

Let's also get the test data into memory and preprocess it.

In [34]:
act_test_df = pd.read_csv('../input/act_test.csv', parse_dates=['date'])
In [36]:
def preprocess_act(act_df):
    # Impute missing data
    act_df = act_df.fillna(value = 'type 0')
    
    # Convert categorical string type columns to their integer equivalents
    cat_cols = act_df.columns.difference(['people_id', 'activity_id', 'outcome', 'date'])
    act_df[cat_cols] = act_df[cat_cols].apply(
        lambda x: x.str.split(' ').apply(lambda y: y[1])).astype(int)
    
    # Process activity_id - split the id into activity category (act1 / act2) and id number; we keep only the category
    act_id_parts = pd.DataFrame.from_records(
        act_df['activity_id'].apply(lambda x: x.split('_')).tolist(), 
        columns=['act_cat', 'act_id'], index=act_df.index)
    act_id_parts['act_cat'] = act_id_parts['act_cat'].map({'act1': 1, 'act2': 2})
    act_df = pd.concat([act_df, act_id_parts['act_cat']], axis=1)

    # Let's append '_act' to char_# & date cols to differentiate them from People 'char's
    char_old_name_new_name_map = {}
    for i in range(1, 11):
        char_old_name_new_name_map['char_' + str(i)] = 'char_' + str(i) + '_act'
    act_df.rename(columns=char_old_name_new_name_map, inplace = True)
    act_df.rename(columns={'date': 'date_act'}, inplace = True)

    # Convert all columns except activity_id, people_id, date_act and char_10 to np.int8 type.
    cast_cols = act_df.columns.difference(['activity_id', 'people_id', 'date_act', 'char_10_act'])
    act_df[cast_cols] = act_df[cast_cols].apply(lambda x: x.astype(np.int8))
    # Convert char_10 to np.int16 type
    act_df['char_10_act'] = act_df['char_10_act'].astype(np.int16)
    
    return act_df
    
# Preprocess test dataset
act_test_df = preprocess_act(act_test_df)
print('\nShape of act_test table: %s\n' % (act_test_df.shape, ))
print('First five rows and seven columns of act_test...')
act_test_df[act_test_df.columns[:7]].head()
Shape of act_test table: (498687, 15)

First five rows and seven columns of act_test...
Out[36]:
people_id activity_id date_act activity_category char_1_act char_2_act char_3_act
0 ppl_100004 act1_249281 2022-07-20 1 5 10 5
1 ppl_100004 act2_230855 2022-07-20 5 0 0 0
2 ppl_10001 act1_240724 2022-10-14 1 12 1 5
3 ppl_10001 act1_83552 2022-11-27 1 20 10 5
4 ppl_10001 act2_1043301 2022-10-15 5 0 0 0

Encoding categorical features

Most of the features in the dataset are categorical: the activity category and all activity characteristics are categorical, as are group_1 and char_2 to char_9 in the People table. The rest of the characteristic features in the People table are boolean. All categoricals have to be one-hot encoded before we can use them for training.
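
As a quick toy illustration (made-up data, not the competition tables), one-hot encoding replaces a categorical column with one binary indicator column per distinct value; pandas' get_dummies shows the idea:

# Toy illustration: one 0/1 indicator column per distinct value
toy = pd.DataFrame({'activity_category': ['type 1', 'type 2', 'type 1']})
print(pd.get_dummies(toy, columns=['activity_category']))
#    activity_category_type 1  activity_category_type 2
# 0                         1                         0
# 1                         0                         1
# 2                         1                         0

(The exact dtype of the indicator columns varies by pandas version.)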

Before we do one-hot encoding, let's create some features from the date columns. We get three features by splitting a date into year, month and day, and we will also add the day of the week. So we get 4 features from the activity date column and 4 from the people date column, a total of 8 new features.

In [11]:
# extract year, month, day of month and day of week from date
def create_date_features(df, append_str):
    date_col_name = 'date' + append_str
    df['year' + append_str] = df[date_col_name].apply(lambda x: x.year)
    df['month' + append_str] = df[date_col_name].apply(lambda x: x.month)
    df['day' + append_str] = df[date_col_name].apply(lambda x: x.day)
    df['day_week' + append_str] = df[date_col_name].apply(lambda x: x.dayofweek)
    
    df.drop(date_col_name, axis = 1, inplace = True)
    return df

act_train_df = create_date_features(act_train_df, '_act')
act_test_df = create_date_features(act_test_df, '_act')
people_df = create_date_features(people_df, '_ppl')

print('\nShape of act_train after adding date features: %s' % (act_train_df.shape, ))
print('Shape of act_test after adding date features: %s' % (act_test_df.shape, ))
print('Shape of People after adding date features: %s' % (people_df.shape, ))
Shape of act_train after adding date features: (2197291, 19)
Shape of act_test after adding date features: (498687, 18)
Shape of People after adding date features: (189118, 44)

One-hot encoding results in sparse matrices whose size determines the training time. The number of new columns added is directly proportional to the number of distinct values in each feature being encoded. So, any opportunity to reduce the number of distinct values in one or more features without affecting the accuracy of the predictions should be taken; it results in faster training.
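
A cheap way to estimate the encoded width before committing to it is to sum the distinct-value counts of the categorical columns (a sketch; cat_feats is the list of categorical features we build later in this post):

# Sketch: the one-hot-encoded width is roughly the sum of the
# distinct-value counts across the categorical features
# est_width = sum(whole_dataset[col].nunique() for col in cat_feats)
toy = pd.DataFrame({'a': [1, 2, 2, 3], 'b': [0, 1, 0, 1]})
print(sum(toy[c].nunique() for c in ['a', 'b']))  # 3 + 2 = 5 indicator columns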

We have such an opportunity in group_1. Of the 34K distinct values, more than 20K occur only once, which means these groups have only one member. Let's assume that the number of activities performed by one-person groups is insignificant compared to the other groups, and that they can be safely discarded without major impact on prediction accuracy. If we see a good score after training, our assumption holds. If the score is bad, or if we want to check whether adding the one-person groups back into the mix increases the score, we can restore them and retrain the model, which will take much longer than the previous training due to the increase in feature count.

In [13]:
# Trim group_1 - drop all single-occurrence groups from the people table
# this reduces the group count from 34k to 13.8k
grp_cnts = people_df.group_ppl.value_counts()
one_person_grps = grp_cnts[grp_cnts == 1].index
one_person_grps_lbl = people_df[people_df.group_ppl.isin(one_person_grps)].index
people_df.drop(one_person_grps_lbl, axis=0, inplace = True)

print('\nNumber of groups before trimming: %d' % (grp_cnts.shape[0], ))
print('Number of groups after trimming: %d'%(people_df.group_ppl.value_counts().shape[0],))
Number of groups before trimming: 34224
Number of groups after trimming: 13807

It's now time to stack the train and test sets and then merge them with the People table. We will use this stacked and merged entity for further processing.

In [21]:
# Create a marker that differentiates train samples from test samples. This will be useful later.
act_train_df['is_train'] = True
act_test_df['is_train'] = False

act_whole_df = pd.concat([act_train_df, act_test_df])
whole_dataset = pd.merge(act_whole_df, people_df, how = 'left', on = 'people_id')
whole_dataset.fillna(value = 0, inplace = True)

# Release memory that is no longer needed
del act_train_df
del act_test_df
import gc
gc.collect()

print('\nShape of the merged entity: %s' % (whole_dataset.shape, ))
Shape of the merged entity: (2695978, 63)

One Hot Encoding

Now we are ready for one-hot encoding. We will one-hot encode each feature and create a sparse matrix for each. We will then horizontally stack them with non-categorical features to create the final feature matrix that we can train the model on.

In [22]:
# Get the list of features and segregate categorical and non-categorical features
features = whole_dataset.select_dtypes(exclude = ['object']).columns.difference(
    ['char_10_act', 'is_train', 'outcome'])
cat_feats = features.difference(['char_' + str(i) for i in range(11, 38)] +
                                ['act_cat', 'char_10_ppl']) # Keep out boolean features
non_cat_feats = features.difference(cat_feats)

# Encode categoricals
enc = LabelEncoder()
whole_dataset[cat_feats] = whole_dataset[cat_feats].apply(lambda x: enc.fit_transform(x))

# Train and test samples have to be encoded separately since it would be difficult to 
# separate them from the encoded and sparsified whole_dataset later
X_train = whole_dataset[whole_dataset.is_train == True]
X_test = whole_dataset[whole_dataset.is_train == False]

# Build and stack CSR matrices
def build_csr(df):
    num_smpls = df.shape[0]
    row_nums, csr_data_col = range(num_smpls), np.ones(num_smpls)

    csr_mats = csr_matrix((csr_data_col, (row_nums, df[cat_feats[0]].values)), 
                         shape = (num_smpls, whole_dataset[cat_feats[0]].max() + 1))
    for col in cat_feats[1:]:
        csr = csr_matrix((csr_data_col, (row_nums, df[col].values)), 
                         shape = (num_smpls, whole_dataset[col].max() + 1))
        csr_mats = hstack([csr_mats, csr], format = 'csr')        
    return csr_mats

# Create train and test feature matrices
X_train_sparse = hstack((X_train[non_cat_feats], build_csr(X_train)))
X_test_sparse = hstack((X_test[non_cat_feats], build_csr(X_test)))

print('\nShape of one-Hot-Encoded training data: {0}'.format(X_train_sparse.shape))
print('Shape of one-Hot-Encoded test data: {0}'.format(X_test_sparse.shape))
Shape of one-Hot-Encoded training data: (2197291, 14388)
Shape of one-Hot-Encoded test data: (498687, 14388)

2.7 million samples and 14.4K features. That's a lot of data; our machine has some work to do.

Let's do some ML!

We will begin with a simple algorithm that suits the problem at hand. We are dealing with a binary classification problem, since the outcome can take only one of two values: 0 or 1. Logistic Regression is the standard algorithm for this kind of problem, so let's begin with it and see how far it takes us.

We will train a LogisticRegression model (from sklearn's linear_model module) with the default parameter values and see how that fares. The only parameter we set explicitly is random_state, which makes sure that every time we train this model the random number generator used internally by the algorithm returns the same sequence of values. This is required when we compare the outcomes of multiple model prediction runs.

We will use the hold-out-set cross-validation method, via sklearn's train_test_split function from its cross_validation module.

In [25]:
# Let's split the dataset into train and test sets
y = whole_dataset[whole_dataset.is_train == True].outcome
X_train, X_test, y_train, y_test = train_test_split(X_train_sparse, y, test_size = 0.25, 
                                                    random_state = 1)

# Train a LogisticRegression model
tic = time.time()
model_lr = LogisticRegression(random_state=1)
model_lr.fit(X_train, y_train)
toc = time.time()

print("\nTraining time: %dm:%ds" % ((toc-tic) / 60, (toc-tic) % 60))
Training time: 11m:19s

We have trained our first model. Let's see how it fares on the validation set. We will use the AUC (Area Under the ROC Curve) evaluation metric. Again, sklearn has a function for this in its metrics module, called roc_auc_score. Let's use that.

In [22]:
y_pred = model_lr.predict(X_test)
auc_1 = roc_auc_score(y_test, y_pred)

print("AUC Score: %.4f" % (auc_1, ))
AUC Score: 0.9277

Cool. Not a bad score at all for the first try.
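
As an aside, AUC is really a ranking metric, and sklearn's predict() returns hard 0/1 labels; scoring the positive-class probabilities instead usually gives a slightly different (often higher) AUC. A minimal sketch follows - note that all scores reported in this post were computed on the hard labels:

# predict_proba()[:, 1] is the predicted probability of the positive class
y_prob = model_lr.predict_proba(X_test)[:, 1]
print("AUC on probabilities: %.4f" % roc_auc_score(y_test, y_prob))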

Every model has a set of parameters that need to be tuned to achieve the best performance on the data at hand. This activity is called parameter tuning. For LogisticRegression the parameter to tune is the regularization value; we have to find the value that gives us the best score.

Parameter values are usually picked from logarithmic space, and the numpy function np.logspace comes in handy here. We will train the model with five different regularization values - 0.001, 0.01, 0.1, 1.0 and 10.0. We will then plot the AUC scores and pick the regularization value with the best score.

In [2]:
def train_model(reg_val):
    model = LogisticRegression(random_state=1, C=reg_val)
    model.fit(X_train, y_train)
    
    y_preds = model.predict(X_test)
    return model, roc_auc_score(y_test, y_preds)

reg_vals = np.logspace(-3, 1, num = 5)
auc_list = []
best_auc, best_model = auc_1, model_lr

for reg in reg_vals:
    tic = time.time()
    model, auc = train_model(reg)
    toc = time.time()
    print("Model for C=%.3f trained in %dm:%ds time, AUC Score = %.04f" % 
          (reg, (toc-tic) / 60, (toc-tic) % 60, auc))
    
    if auc > best_auc:
        best_auc = auc
        best_model = model
    
    auc_list.append(auc)

plt.plot(auc_list)
plt.xticks(range(5), reg_vals);
plt.xlabel('Regularization Value')
plt.ylabel('AUC Score')
plt.title('Regularization Value Vs AUC Plot');

Let us use 1.0 as the regularization value, since it is faster to train and comes very close to the best model score, which was obtained with a regularization value of 10. We will now train the model on the entire training set, including the samples that we had held out for cross-validation, and predict the outcomes for the test set - the data on which Kaggle computes the Public and Private Leaderboard scores.

In [26]:
model = LogisticRegression(C=1.0)
model.fit(X_train_sparse, y)

y_preds = model.predict(X_test_sparse)

output = pd.DataFrame({ 'activity_id' : whole_dataset[whole_dataset.is_train == False]['activity_id'], 'outcome': y_preds })
output.to_csv('log_regr_lr_1.csv', index = False)

The score on the test set was 0.91174. Not bad. This shows that the model generalizes well enough to predict the right outcomes on unseen data.

Non-Linear Methods

LogisticRegression is a linear model; non-linear relationships between the features and the outcome cannot be captured by it. Non-linear algorithms like Random Forest, Gradient Boosting Machine (GBM) and Neural Networks usually do a much better job of finding the intricate patterns hidden in data, and hence end up performing much better than linear algorithms.

It will be nice to see how two of the most widely used non-linear methods, GBM and Neural Networks, perform on our dataset. We will first fit a GBM model on the training data, get the predictions on the test data, submit them to Kaggle and check how it scores. We will then do the same with Neural Networks.

The purpose is to compare the performance of non-linear models with Logistic Regression, not to fit the best possible GBM/Neural-Network models, i.e. we will not do parameter tuning. Instead I will train a single model for each non-linear method and pick its parameters based on what has worked for me before.

Let's start with Gradient Boosting Machine (GBM).

Gradient Boosting with XGBoost

GBM is a tree-based ensembling method, i.e. it combines the predictions of multiple Decision Trees in such a way that the ensemble performs better than any of the individual trees that make it up. Random Forest is another tree-based ensembling method; it is similar to GBM but usually not as effective. GBM learns from its mistakes as it builds the ensemble one tree at a time; Random Forest doesn't. I won't say more about these methods here, since there is a ton of material on them out there.
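
To make the 'learns from its mistakes' idea concrete, here is a toy, regression-flavoured sketch of boosting (illustrative only, not the model we train below). With squared loss, the mistakes - the residuals - are exactly the negative gradient of the loss, which is where the name Gradient Boosting comes from:

# Toy sketch of gradient boosting with squared loss
from sklearn.tree import DecisionTreeRegressor

def toy_gbm_fit(X, y, n_trees=50, learning_rate=0.1):
    pred = np.full(y.shape, y.mean())   # start from a constant prediction
    trees = []
    for _ in range(n_trees):
        residuals = y - pred            # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        pred += learning_rate * tree.predict(X)  # nudge predictions toward y
        trees.append(tree)
    return trees, pred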

We will use the XGBoost implementation of GBM. XGBoost is a fast and scalable GBM library that supports distributed processing in the cloud. It is among the best in the business today and a popular choice among Kagglers. Like other algorithms it has parameters that must be tuned; the maximum depth of the trees and the number of trees are two of the important ones. However, we will not search the parameter space for this model, in the interest of keeping the post from getting too long. I will pick parameter values based on my experience.

In [ ]:
dtrain = xgb.DMatrix(X_train_sparse, label=y)
dtest = xgb.DMatrix(X_test_sparse)

param = {'max_depth': 10, 'eta': 0.02, 'silent': 1, 'objective': 'binary:logistic', 
         'nthread': -1, 'eval_metric': 'auc', 'subsample': 0.7, 'colsample_bytree': 0.7, 
         'min_child_weight': 0, 'booster': "gblinear"}

watchlist  = [(dtrain,'train')]
num_round = 300
model = xgb.train(param, dtrain, num_round, watchlist, early_stopping_rounds = 10)

ypred = model.predict(dtest)

output = pd.DataFrame({ 'activity_id' : whole_dataset[whole_dataset.is_train == False]['activity_id'], 'outcome': ypred })
output.to_csv('xgboost_preds.csv', index = False)

This model scored 0.97944, a 0.068 improvement over the LogisticRegression model.

Neural Network using Keras + Tensorflow

Like XGBoost, a neural network can learn complex non-linear relationships between the predictors and the labels. It is a highly versatile method that has been used in a wide range of domains.

A type of neural network called the Convolutional Neural Network (CNN) is used in visual recognition applications like image classification/labeling, video analysis etc. I plan to write a post on CNNs based on my experience of implementing the exercise problems of Stanford University's CS231n: Convolutional Neural Networks for Visual Recognition course. Watch out for that post.

The Recurrent Neural Network (RNN) is a second type of neural network. It has been used very successfully in Natural Language Processing (NLP) applications like sentiment analysis, information retrieval and language translation. CNNs, however, have also been used in NLP to some extent, so there's no clear boundary between these methods' application domains.

CNNs and RNNs may not be very suitable for our case study. A fully connected single-layer network should be good enough for our application. We will use the Keras python library with the TensorFlow backend to build and train the network.

In [ ]:
from keras.models import Sequential, load_model
from keras.layers import Dense
from keras.callbacks import ModelCheckpoint

# Create Neural Net
model = Sequential()
model.add(Dense(100, input_dim = X_train_sparse.shape[1], init = 'uniform', 
                activation = 'relu'))
model.add(Dense(1, init='uniform', activation='sigmoid'))

# Compile the model
model.compile(loss = 'binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model
checkpointer = ModelCheckpoint('./kerasNNModel_100interm.hdf5')

# hstack() returns COO matrices, which don't support row slicing;
# convert to CSR so the data can be fed to the network in chunks
input_samples, labels = X_train_sparse.tocsr(), y.values
test_data = X_test_sparse.tocsr()

chunk_size = 10000
num_chunks = input_samples.shape[0] // chunk_size  # integer division for range()
cum_time = 0
for chunk in range(num_chunks):
    start_idx, end_idx = chunk * chunk_size, (chunk + 1) * chunk_size
    tic = time.time()
    model.fit(input_samples[start_idx: end_idx].toarray(), labels[start_idx: end_idx],
              nb_epoch = 10, batch_size = 100, callbacks = [checkpointer])
    toc = time.time()
    cum_time += (toc - tic)
    print('Trained chunk {0}/{1} in {2}mins : {3}secs'.format(chunk, num_chunks, 
                                                              int(cum_time / 60), 
                                                              int(cum_time % 60)))

# Train the model on the remaining samples
start_idx, end_idx = num_chunks * chunk_size, input_samples.shape[0]
tic = time.time()
model.fit(input_samples[start_idx: end_idx].toarray(), labels[start_idx: end_idx],
          nb_epoch=10, batch_size=100, callbacks=[checkpointer])
toc = time.time()
cum_time += (toc - tic)
print('Trained chunk {0}/{1} in {2}mins : {3}secs'.format(num_chunks, num_chunks, 
                                                          int(cum_time / 60), 
                                                          int(cum_time % 60)))

# Make predictions chunk by chunk and stitch them back together
pred_chunks = []
for start_idx in range(0, test_data.shape[0], chunk_size):
    end_idx = min(start_idx + chunk_size, test_data.shape[0])
    pred_chunks.append(model.predict(test_data[start_idx: end_idx].toarray()))
y_preds = np.concatenate(pred_chunks).ravel()

# Create submission file
output = pd.DataFrame({'activity_id': whole_dataset[whole_dataset.is_train == False]['activity_id'], 'outcome': y_preds})
output.to_csv('neural_net.csv', index=False)

The code is considerably longer than the XGBoost model's because Keras doesn't handle sparse data very well. The data has to be converted to dense arrays before it can be fed to the network, and converting 2.7 million records of 14,388 features to a dense array would require well over a hundred gigabytes of memory, which is not feasible on a single machine. So, I had to train the network in small chunks (10,000 records at a time), and this additional logic increased the code size.
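
An alternative to the manual chunking would be to stream dense batches from the sparse matrix with a generator (a sketch, assuming the Keras 1.x fit_generator API that matches the code above):

# Sketch: yield dense (X, y) batches forever, as Keras generators must
def sparse_batch_generator(X, y, batch_size=100):
    n = X.shape[0]
    while True:
        for start in range(0, n, batch_size):
            end = min(start + batch_size, n)
            yield X[start:end].toarray(), y[start:end]

# model.fit_generator(sparse_batch_generator(input_samples, labels),
#                     samples_per_epoch=input_samples.shape[0], nb_epoch=10)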

Anyway, this model scored 0.95153 - better than the LogisticRegression score but worse than the XGBoost score. XGBoost seems to be a better fit for our problem.

Surprisingly, simplifying the network by removing the intermediate layer and connecting the input nodes directly to the output node resulted in a better score - 0.96310. So, bigger is not always better. It is always good to try out multiple architectures and choose the best among them, rather than fixing the network structure arbitrarily.
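
For reference, the simplified architecture is essentially a logistic regression expressed as a network; a sketch in the same Keras style as the code above:

# Single-layer network: inputs wired straight to one sigmoid output node
simple_model = Sequential()
simple_model.add(Dense(1, input_dim=X_train_sparse.shape[1],
                       init='uniform', activation='sigmoid'))
simple_model.compile(loss='binary_crossentropy', optimizer='adam',
                     metrics=['accuracy'])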

We have now observed first hand that non-linear models generally perform much better than linear models. We can try many more things to push the score further. I have listed some below.

  • Tune model parameters, like the two mentioned earlier and others such as learning_rate, subsample, colsample_bytree, colsample_bylevel, gamma, reg_lambda and reg_alpha
  • Add more features (Feature Engineering)
  • Try ensembling with a mix of different types of models (a minimal sketch follows this list)
  • Look for a leak, i.e. some direct relationship between one or more predictors and the label that reveals the outcome.
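
To illustrate the third item: the simplest form of ensembling is a weighted average of the models' predicted probabilities. A sketch, where lr_probs, xgb_probs and nn_probs are hypothetical arrays holding each model's test-set probabilities, and the weights are arbitrary (normally they would be chosen on a validation set):

# Hypothetical blend of the three models' test-set probabilities
blend = 0.2 * lr_probs + 0.5 * xgb_probs + 0.3 * nn_probs
output = pd.DataFrame({'activity_id': whole_dataset[whole_dataset.is_train == False]['activity_id'],
                       'outcome': blend})
output.to_csv('blend_preds.csv', index=False)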

The last item on the list, looking for a leak, isn't Data Science and doesn't add much value in the real world. It does help in winning machine learning competitions, though. In fact, the competition this case study is based on had a leak that yielded a greater-than-0.98 AUC score through data analysis alone, sans modeling.

Since the focus of this post is the Data Science Process and not Machine Learning I will stop here and move to the next stage.

Visualising / Reporting

We have come a long way, beginning with getting the data from the Kaggle server, then preprocessing it, exploring it and learning from it. It is now time to report / communicate our findings. There is not much we, as participants in a competition, can do with respect to visualization, reporting and communication. The most we can do is tweet our scores and brag about them if we manage to get a good one, assuming our only intention was to get the best score in the competition.

However, while everyone strives to perform well in the competition, intentions vary, and they can generally be mapped to the 'level' of the players. Grandmasters, Masters and Experts mostly play to win. Contributors generally have the dual purpose of learning and winning, while Novices are mostly learning and testing the waters. There are, of course, exceptions: I have seen quite a few Novices beat the Grandmasters and win competitions.

I am a 'Competitions Contributor' on Kaggle. While I generally end up in the top 20%, my intention, at least as of this writing, is to learn Data Science and share my learnings, as I am doing in this post.

For the case study, summarizing the scores of the machine learning models that we built would qualify as a reporting / communication task. The following table lists all the models we tried, along with their configurations and their submission scores.

Model | Score | Configuration
----- | ----- | -------------
LogisticRegression | 0.91174 | Regularization value C = 1.0
GBM - XGBoost | 0.97944 | 'max_depth': 10, 'eta': 0.02, 'silent': 1, 'objective': 'binary:logistic', 'nthread': -1, 'eval_metric': 'auc', 'subsample': 0.7, 'colsample_bytree': 0.7, 'min_child_weight': 0, 'booster': "gblinear"
Neural Network - 1 layer | 0.96310 | Input layer (14388 nodes) -> Output layer (1 node, activation = 'sigmoid'); loss = 'binary_crossentropy', optimizer = 'adam'
Neural Network - 2 layer | 0.95153 | Input layer (14388 nodes) -> Intermediate layer (100 nodes, activation = 'ReLU') -> Output layer (1 node, activation = 'sigmoid'); loss = 'binary_crossentropy', optimizer = 'adam'

That brings us to the end of this post. I hope you found it interesting and/or helpful. If you liked it, my "Kaggle Series" posts may also interest you: in that series I write about my solutions to the Kaggle competitions I participated in, following the same systematic, staged approach that you saw in this post. Following are the links to these posts.

Thank you for reading this post. I hope to see you in my other posts. :)

- Vinay Huddar