In the previous post I described the problem and the dataset, and performed a comprehensive EDA. In this post we will develop Machine Learning models that will learn from the seven weeks of sales data and predict product demands of the following two weeks.

We will begin by reading-in the preprocessed data from the previous post.

In [53]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import gc
import time

import xgboost as xgb

def get_preprocessed_data():
    # Read train data
    # Since the data is in gigabytes it helps to use data types of just the right size
    data_types = {'Week': np.int8, 'Agency_id': np.int16, 'Channel_id': np.int8, 
           'Route_id': np.int16, 'Client_id': np.int64,'Product_id': np.uint16, 
           'Units_sold': np.int16, 'Earnings': np.float64, 'Units_returned': np.int32,
           'Money_returned': np.float64, 'Adj_demand': np.int16}
    train_data = pd.read_csv('train_enc.csv', dtype=data_types)

    # Read test data
    data_types = {'Week': np.int8, 'Agency_id': np.int16, 'Channel_id': np.int8, 
           'Route_id': np.int16, 'Client_id': np.int64,'Product_id': np.uint16}
    test_data = pd.read_csv('test_enc.csv', dtype=data_types)
    
    return train_data, test_data

X_train, X_test = get_preprocessed_data()
X_train.head()
Out[53]:
Week Channel_id Route_id Units_sold Earnings Units_returned Money_returned Adj_demand Product_id Agency_id Client_id
0 0 6 1360 3 25.14 0 0 3 103 0 4755
1 0 6 1360 4 33.52 0 0 4 104 0 4755
2 0 6 1360 4 39.32 0 0 4 109 0 4755
3 0 6 1360 4 33.52 0 0 4 110 0 4755
4 0 6 1360 3 22.92 0 0 3 111 0 4755

Model Building

Our goal is to predict product demands based on historical sales data. This is a Regression probem. The predictions have to be made on a per customer and per product level, i.e. what would be the demand from a particular customer for a particular particular product given his purchase behavior of the past seven weeks. His purchase behavior reflects in the frequency of purchase - whether the product was bought in all the weeks or in every alternate weeks or only once a month, and in the quantum of purchase - how many items were ordered in each transaction and whether there was any pattern.

We will look at two solutions to this problem:

  1. Statistical Solution
  2. Machine Learning Solution

Statistical Solution

Towards the end of the EDA in the last post we saw that there was no definite pattern in the purchase behavior of customers. However, looking at the frequency and quantum of orders we could make rough predictions of upcoming orders. For example, if a customer has been purchasing a product in all the weeks and quantity of purchase has always been in the 20 to 30 range we could say that he would most probably buy that product in the coming weeks too and the order size would be in the range 20 to 30.

A simple solution

A simple way of formulating this behavior is to take the mean of all the past purchases of a product by a customer and use it as the predicted demand. This would capture the general purchasing behavior of a customer w.r.t a certain product.

A drawback

A drawback of this solution is that it doesn't make use of timing data. For example, if a customer has been steadily increasing the size of purchase over the weeks it makes sense for the next order to be higher than the previous one. The mean-based model, however, would predict a smaller order (compared to the last order) since the mean would be smaller than the last order.

A challenge

We also saw in the EDA that there is some churn in the customer mix week-over-week. We saw that, week over week, about 90% of the customers are repeat customers, a small percentage of the customers drop out and new ones get added. The solution has to address this volatility.

A way of addressing the challenge

An intuitive way of addressing this volatility is to look for purchase patterns of customers along the route of a customer if the customer doesn't have a history of purchasing a product, who's demand we are predicting. If no other customers along the route have purchased that product before, which is possible if the product is newly introduced in that route, the next best thing we can do is to look for purchase patterns of customers served by the same Agency that serves the customer we are predicting the demand for.

We will implement this solution and check out how close we can get to the ground truth. The score of this model will serve as a baseline score for the Machine Learning model.

In [60]:
import time
prod_mean_dem = X_train.groupby('Product_id')['Adj_demand'].mean().to_dict()
def predict_mean_dem(prod_id):
    X_train_prod = X_train[X_train.Product_id == prod_id]
    X_test_prod = X_test[X_test.Product_id == prod_id]

    client_prod_mean_dem = X_train_prod.groupby('Client_id')['Adj_demand'].mean().to_dict()
    route_prod_mean_dem = X_train_prod.groupby('Route_id')['Adj_demand'].mean().to_dict()
    agency_prod_mean_dem = X_train_prod.groupby('Agency_id')['Adj_demand'].mean().to_dict()

    for row_id in range(X_test_prod.shape[0]):
        row = X_test_prod.iloc[row_id]
        dem = 1  # default demand if there is no sales history for the product
        if client_prod_mean_dem.has_key(row.Client_id):
            dem = client_prod_mean_dem[row.Client_id]
        elif route_prod_mean_dem.has_key(row.Route_id):
            dem = route_prod_mean_dem[row.Route_id]
        elif agency_prod_mean_dem.has_key(row.Agency_id):
            dem = agency_prod_mean_dem[row.Agency_id]
        elif prod_mean_dem.has_key(prod_id):
            dem = prod_mean_dem[prod_id]
        pred_dems[row.id] = dem
        
pred_dems = np.zeros(X_test.shape[0])
tic = time.time()
for i, prod in enumerate(X_test.Product_id.unique()):
    predict_mean_dem(prod)
    if (i != 0) & (i%100 == 0):
        toc = time.time()
        print('Processed Products {0}:{1} in {2}m:{3}s'.format(i-100, i, (toc-tic)/60, (toc-tic)%60))
        tic = time.time()

pd.DataFrame({'id': X_test.id, 'Demanda_uni_equil': pred_dems}).to_csv('Subm_mean_soln_blog.csv', index=False)
Processed Products 0:100 in 11.9472861846m:56.8371710777s
Processed Products 100:200 in 3.63568364779m:38.1410188675s
Processed Products 200:300 in 1.63313018481m:37.9878110886s
Processed Products 300:400 in 0.576172216733m:34.570333004s
Processed Products 400:500 in 0.448251700401m:26.8951020241s
Processed Products 500:600 in 0.260913948218m:15.6548368931s
Processed Products 600:700 in 0.201130084197m:12.0678050518s
Processed Products 700:800 in 0.185407332579m:11.1244399548s
Processed Products 800:900 in 0.170010081927m:10.2006049156s
Processed Products 900:1000 in 0.154214433829m:9.25286602974s
Processed Products 1000:1100 in 0.144817614555m:8.68905687332s
Processed Products 1100:1200 in 0.140769934654m:8.44619607925s
Processed Products 1200:1300 in 0.136869800091m:8.21218800545s
Processed Products 1300:1400 in 0.134838501612m:8.09031009674s
Processed Products 1400:1500 in 0.135490584373m:8.12943506241s

It's a very simple solution. We lookup the mean sales of a product hierarchically in the following order: Client -> Route -> Agency -> Product, and take the mean of Adj_demand in the first level that we find a sales history in. For example, if the client had bought the product before we would take the mean of Adj_demand of those transactions.

If the client has not bought the product before, we would go looking for transactions of other clients along the same route for that product and take the mean of Adj_demand of those transactions. The same logic is applied at the next two levels (Agency and Product). If the product has no sales history at all, we simply predict a 1.

How well does the algorithm score?

This simplest of solutions scored 0.5467 and would have gotten us into top 60% of the Leader Board on Kaggle.:) So, this is our baseline score. Any model that we train should score better than this one.

Machine Learning Solution

Problem with the previous solution

There was no "learning" in the previous solution. And, the solution did not make use of timing data. The predictions for product sale volumes [1, 2, 3, 4, 5, 6, 7] and [7, 6, 5, 4, 3, 2, 1] would be the same, which is 4 - the mean. A better prediction would be 8 in the first case and 0 in the second.

An obvious solution

A simple way of addressing this problem is to fit a line for each unique and valid Client + Product combination in the dataset. A valid Client + Product combination is one that has atleast one transaction containing both of them.

Why the obvious solution may not work

We know that there are about 880,000 clients and 1800 products. For simplicity let's assume no transactions with new Client + Product pairs are added after the first week; i.e. all valid Client + Product combinations show up in the first week itself. I made this assumption to drive home the point that there would be more than 11 million lines to fit (first week has more than 11 million transaction records) if we take this approach. And, that would be a very costly affair, both w.r.t time and computation resrouces.

An alternative solution that works

A better alternative is to embed time-based product sales information into the feature set and feed it to a Machine Learning model to learn from it. So, how do we come up with features that capture time-based sales information? We can directly use weekly product sales (i.e. Adj_demand) as features. For example, second week transactions can have first week sales data as a feature. Third week transactions can have first two weeks sales data as two features. And, so on.

An implementation challenge in the alternative solution

While this makes perfect sense implementing it as described is not possible due to the limitations of the algorithms and tools. The problem arises due to the fact that transactions in different weeks will have different number of time-based features. Week X will have X number of features. Week-0 will have zero time-based features because there is no sales history before Week 0 and Week-6 will have 6 features (sales data of weeks 0 to 5).

A feature is a column in a Dataframe. And rows in a Dataframe are all of the same size. So, the design constraints of the Dataframe prevents different transactions from having different number of features.

A way to overcome the implementation challenge

One solution is to train the model only on the latter half of the weeks (Weeks 3, 4, 5 and 6), using Adj_demand of weeks prior to them as Lag features. I will let the following table speak for itself.

Lag-1 Lag-2 Lag-3
Week 3 transactions Week-2 Adj_demand Week-1 Adj_demand Week-0 Adj_demand
Week 4 transactions Week-3 Adj_demand Week-2 Adj_demand Week-1 Adj_demand
Week 5 transactions Week-4 Adj_demand Week-3 Adj_demand Week-2 Adj_demand
Week 6 transactions Week-5 Adj_demand Week-4 Adj_demand Week-3 Adj_demand
Week 7 transactions Week-6 Adj_demand Week-5 Adj_demand Week-4 Adj_demand
Week 8 transactions Week-7 Predictions Week-6 Adj_demand Week-5 Adj_demand

What we are doing is very simple Feature Engineeering. We will do more of this later. Let's add lag features now.

In [70]:
# Stack train and test data and remove the older dataframes
X_data = pd.concat([X_train, X_test], axis=0, copy=True)

del X_train
del X_test
gc.collect()

X_data.shape
Out[70]:
(81179715, 12)

I will be defining functions for holding most of the logic of this solution since we will be using the same logic twice - first for trying-out Three-Lag feature solution and then for implementing Five-Lag feature solution.

In [128]:
def add_lag_features(lag_depth):
    global X_data
    for lagnum in range(1, lag_depth+1):
        tic = time.time()

        lag_data = X_data[['Week', 'Client_id', 'Product_id', 'Adj_demand']]
        lag_data.Week = lag_data.Week + lagnum
        lag_data = lag_data.groupby(['Week', 'Client_id', 'Product_id']).mean().reset_index()

        lag_feature_name = 'Lag' + str(lagnum)
        lag_data.rename(columns={'Adj_demand': lag_feature_name}, inplace=True)
        X_data = pd.merge(X_data, lag_data, how='left', sort=True, copy=False)

        del lag_data
        gc.collect()
        print("Added %s feature in %d secs" % (lag_feature_name, time.time()-tic))
    X_data = X_data[X_data.columns - ['Units_sold', 'Units_returned', 'Earnings', 'Money_returned']]
    X_data = X_data[X_data.Week >= lag_depth]
In [72]:
def add_lag_features(lag_depth):
    for lagnum in range(1, lag_depth+1):
        tic = time.time()

        lag_data = X_data[['Week', 'Client_id', 'Product_id', 'Adj_demand']]
        lag_data.Week = lag_data.Week + lagnum
        lag_data = lag_data.groupby(['Week', 'Client_id', 'Product_id']).mean().reset_index()

        lag_feature_name = 'Lag' + str(lagnum)
        lag_data.rename(columns={'Adj_demand': lag_feature_name}, inplace=True)
        X_data = pd.merge(X_data, lag_data, how='left', sort=True, copy=False)

        del lag_data
        gc.collect()
        print("Added %s feature in %d secs" % (lag_feature_name, time.time()-tic))
    X_data = X_data[X_data.columns - ['Units_sold', 'Units_returned', 'Earnings', 'Money_returned']]
    X_data = X_data[X_data.Week >= lag_depth]
    
add_lag_features(3)
Added Lag1 feature in 183 secs
Added Lag2 feature in 161 secs
Added Lag3 feature in 177 secs
/Users/vinayhuddar/anaconda/lib/python2.7/site-packages/pandas/core/generic.py:2387: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value
In [75]:
print('X_data.shape: {0}\n'.format(X_data.shape))
X_data.head(3)
X_data.shape: (48389518, 11)

/Users/vinayhuddar/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:1: FutureWarning: using '-' to provide set differences with Indexes is deprecated, use .difference()
  if __name__ == '__main__':
Out[75]:
Adj_demand Agency_id Channel_id Client_id Lag1 Lag2 Lag3 Product_id Route_id Week id
32790197 15 392 1 26 15 16 20 614 2279 3 NaN
32790198 7 392 1 26 NaN NaN NaN 919 2279 3 NaN
32790199 10 498 1 26 10 30 15 970 1656 3 NaN

Train XGBoost on Lag features

We will fit an XGBRegressor model on Week-3 to Week-6 samples with Lag features. The parameters of the model were chosen after a brief search of the parameter space. In the interest of brevity of the post I will not be describing the parameter tuning process.

In [146]:
# Build and train an XGBRegressor
def train_XGBRegr():
    xgbRegr = xgb.XGBRegressor(max_depth=10, n_estimators=100, objective='reg:linear',
                               nthread=4, subsample=0.85, seed=1440)

    X_train = X_data[X_data.Week < 7]
    X_test = X_data[X_data.Week >= 7]

    X = X_train[X_train.columns.difference(['Week', 'Adj_demand', 'id'])]
    y = X_train.Adj_demand

    tic = time.time()
    xgbRegr.fit(X, y)
    toc = time.time()
    print('Model trained in %dm:%ds' % ((toc-tic)/60, (toc-tic)%60))
    
    return xgbRegr
In [94]:
train_XGBRegr()
Model trained in 44m:48s

We will now do the following.

  • predict Week-7 demands using the model that we trained above
  • add Week-7 predicted demands as Lag1 feature of Week 8, which are NAs otherwise
  • predict Week-8 demands
  • submit the stacked Week-7 and Week-8 predictions for scoring.
In [159]:
def predict_demands(xgbRegr): 
    X_train = X_data[X_data.Week < 7]
    X_test = X_data[X_data.Week >= 7]
    X_test['id'] = np.arange(X_test.shape[0])

    # Predict Week-7 demands
    X_week7 = X_test[X_test.Week == 7]
    preds_wk7 = xgbRegr.predict(X_week7[X_week7.columns - ['Week', 'Adj_demand', 'id']]) #['Lag1', 'Lag2', 'Lag3']]) #
    preds_wk7[preds_wk7 < 0] = 0
    X_week7.Adj_demand = preds_wk7

    # Process Week 8 data. Fill Lag1 data from Week 10 predictions and compute predictions
    X_week8 = X_test[X_test.Week == 8]
    # Compute Lag1 from Week 10 predictions
    wk8_mean_dems = X_week8.groupby(['Client_id', 'Product_id'])['Adj_demand'].mean().reset_index()
    wk8_mean_dems.rename(columns = {'Adj_demand': 'Lag1'}, inplace = True)
    X_week8.drop(['Lag1'], axis = 1, inplace = True)
    X_week8 = pd.merge(X_week8, wk8_mean_dems, how = 'left', sort = True, copy = False)

    # Predict Week-8 demands
    preds_wk8 = xgbRegr.predict(X_week8[X_week8.columns - ['Week', 'Adj_demand', 'id']]) #['Lag1', 'Lag2', 'Lag3']]) #
    preds_wk8[preds_wk8 < 0] = 0
    X_week8.Adj_demand = preds_wk8

    # Prepare the submission file
    preds = pd.merge(X_week7[['id', 'Adj_demand']], X_week8[['id', 'Adj_demand']], 
                           how='outer').sort_values(by='id')
    subm = pd.DataFrame({'id':range(6999251), 'Demanda_uni_equil': preds.Adj_demand.tolist()})
    
    # Clean up 
    del X_week7, preds_wk7, X_week8, preds_wk8, wk8_mean_dems, preds
    gc.collect()
    
    return subm
In [97]:
subm_df = predict_demands()
subm_df.to_csv('Lag3_with_ids_blog.csv', index=False)
/Users/vinayhuddar/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:3: FutureWarning: using &apos;-&apos; to provide set differences with Indexes is deprecated, use .difference()
  app.launch_new_instance()
/Users/vinayhuddar/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:12: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
/Users/vinayhuddar/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:19: FutureWarning: using &apos;-&apos; to provide set differences with Indexes is deprecated, use .difference()

The model scored: 0.57296. This is worse than the Lookup-based solution. This is testimony to our observations from EDA about the large variances in product sales data. When 'mean' performs better than a model looking for patterns it is a strong indication of high variance data.

A similar model with 5-Lag features should perform better than the 3-Lag one since it has more historical data to train on. We will have fewer samples to train on since only Weeks 5 and 6 will have all the 5 lag features. Even so we will have about 20 million samples, which is quite a handful.

Before we try-out the 5-lag feature model let's find out if we can engineer some features from the product table.

More Feature Engineering

Description of products has a lot more information than just their names. Following is an example.

                                   41,Bimbollos Ext sAjonjoli 6p 480g BIM 41

41 Bimbollos Ext sAjonjoli 6p 480g BIM 41
id Name Pieces Weight Brand id

The low hanging fruits among these are pieces and weights since they can be easily converted to numerics, are continuous variables and may turn out to be good predictors. Of the top five selling products three of them have 4 or 6 pieces and are 105g or 125g. The fourth one has 2 pieces and weighs 55g. This may be an indication that products with fewer pieces and lower weights sell more than others. So, let's add pieces and weights as two more features.

In [140]:
def add_product_features():
    global X_data
    
    # Get product table into memory
    products = pd.read_csv('../input/producto_tabla.csv')

    # Extract Weight and Pieces data from product table
    w = products.NombreProducto.str.extract('(\d+)(Kg|g)')
    products['Weight'] = w[0].astype('float') * w[1].map({'Kg':1000, 'g':1})
    products['Pieces'] =  products.NombreProducto.str.extract('(\d+)p ').astype('float')

    # Add Weight and Pieces features to the feature set
    X_data = pd.merge(X_data, products[['Producto_ID', 'Weight', 'Pieces']], how='left', 
                      left_on = 'Product_id', right_on = 'Producto_ID')
    X_data.drop('Producto_ID', axis = 1, inplace = True)
    X_data.Pieces.fillna(value=1, inplace=True)
    X_data.Weight.fillna(value=0, inplace=True)

    # Add Weight per Piece as an additional feature
    X_data['WtPerPc'] = X_data.Weight / X_data.Pieces
    
    # Rearrange columns
    X_data = X_data[['Week', 'Product_id', 'Client_id', 'Agency_id', 'Route_id', 
                     'Channel_id', 'Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5', 
                     'Weight', 'Pieces', 'WtPerPc', 'Adj_demand']]

The Five Lag Solution

Lag-1 Lag-2 Lag-3 Lag-4 Lag-5
Week 5 transactions Week-4 demand Week-3 demand Week-2 demand Week-1 demand Week-0 demand
Week 6 transactions Week-5 demand Week-4 demand Week-3 demand Week-2 demand Week-1 demand
Week 7 transactions Week-6 demand Week-5 demand Week-4 demand Week-3 demand Week-2 demand
Week 8 transactions Week-7 Preds Week-6 demand Week-5 demand Week-4 demand Week-3 demand

We will now recreate the feature matrix from initial preprocessed data by creating the five lag features and product features (pieces and weight).

In [ ]:
# Let's make space for the new data by deleting some variables.
del X_data , client_prod_dem, predict_mean_dem, prod_mean_dem
gc.collect()

X_train, X_test = get_preprocessed_data()
X_data = pd.concat([X_train, X_test], axis=0, copy=True)

# Add lag features
lag_depth = 5
add_lag_features(lag_depth)

# Add product features
add_product_features()

print('X_data.shape: {0}\n'.format(X_data.shape))
X_data.head(3)
Added Lag1 feature in 195 secs
Added Lag2 feature in 179 secs
Added Lag3 feature in 193 secs
Added Lag4 feature in 204 secs
In [149]:
xgbRegr = train_XGBRegr()
Model trained in 62m:42s
In [ ]:
subm_df = predict_demands(xgbRegr)
subm_df.to_csv('Lag5_with_prod_feats_blog.csv', index=False)

This model scored: 0.50499. A much better score compared to three lag model.

My final submission to Kaggle was the average of results from the three solutions: Statistical solution, XGBRegressor solution with three Lag features and XGBRegressor solution with five Lag features. And, it scored 0.46362 on the Public Leader Board and 0.48378 on the Private Leader Board. Averaging results from multiple well-constructed models generally improves the score a notch.

Summary

That was a lot of code for one post. Excuse me for that. I didn't want to create a third post.

In this two-post series we understood the business problem behind the competition and went through the Data Science Process to solve it.

After collecting and preprocessing the data we did a comprehensive Exploratory Data Analysis and made some interesting observations about the data.

Having understood the data well we implemented a very simple solution that gave us a good baseline score. We then did some Feature Engineering and implemented a second solution using XGBoost that was based on historical weekly product sales figures and product attributes.

Due to large variance in product sales data we saw that the first Machine Learning solution couldn't perform a lot better than the simple mean-based solution. But adding more lag features and product features resulted in a better scoring model.


I hope you enjoyed reading the post as much as I did writing it. I welcome you to read my other posts in the 'Kaggle Series', where I share my solutions to other competitions that I participated in. Following are the links:

The second link is to a post where I describe my perspective of the Data Science Process. I am sure you will find it interesting. I have used my solution to another Kaggle competition (RedHat Business Value Prediction) as a case study to describe the process.

Thank you for visiting.

- Vinay Huddar