In the previous post I described the problem and the dataset, and performed a comprehensive EDA. In this post we will develop Machine Learning models that will learn from the seven weeks of sales data and predict product demands of the following two weeks.
We will begin by reading in the preprocessed data from the previous post.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import gc
import time
import xgboost as xgb
def get_preprocessed_data():
    # Read train data
    # Since the data is in gigabytes it helps to use data types of just the right size
    data_types = {'Week': np.int8, 'Agency_id': np.int16, 'Channel_id': np.int8,
                  'Route_id': np.int16, 'Client_id': np.int64, 'Product_id': np.uint16,
                  'Units_sold': np.int16, 'Earnings': np.float64, 'Units_returned': np.int32,
                  'Money_returned': np.float64, 'Adj_demand': np.int16}
    train_data = pd.read_csv('train_enc.csv', dtype=data_types)
    # Read test data
    data_types = {'Week': np.int8, 'Agency_id': np.int16, 'Channel_id': np.int8,
                  'Route_id': np.int16, 'Client_id': np.int64, 'Product_id': np.uint16}
    test_data = pd.read_csv('test_enc.csv', dtype=data_types)
    return train_data, test_data
X_train, X_test = get_preprocessed_data()
X_train.head()
Model Building¶
Our goal is to predict product demands based on historical sales data. This is a Regression problem. The predictions have to be made on a per customer and per product level, i.e. what would be the demand from a particular customer for a particular product given his purchase behavior over the past seven weeks. His purchase behavior is reflected in the frequency of purchase - whether the product was bought in all the weeks, in every alternate week or only once a month - and in the quantum of purchase - how many items were ordered in each transaction and whether there was any pattern.
We will look at two solutions to this problem:
- Statistical Solution
- Machine Learning Solution
Statistical Solution¶
Towards the end of the EDA in the last post we saw that there was no definite pattern in the purchase behavior of customers. However, looking at the frequency and quantum of orders, we could make rough predictions of upcoming orders. For example, if a customer has been purchasing a product in all the weeks and the quantity of purchase has always been in the 20 to 30 range, we could say that he would most probably buy that product in the coming weeks too and that the order size would be in the range of 20 to 30.
A simple solution¶
A simple way of formulating this behavior is to take the mean of all the past purchases of a product by a customer and use it as the predicted demand. This would capture the general purchasing behavior of a customer w.r.t a certain product.
A drawback¶
A drawback of this solution is that it doesn't make use of timing data. For example, if a customer has been steadily increasing the size of purchase over the weeks it makes sense for the next order to be higher than the previous one. The mean-based model, however, would predict a smaller order (compared to the last order) since the mean would be smaller than the last order.
A challenge¶
We also saw in the EDA that there is some churn in the customer mix: week over week, about 90% of the customers are repeat customers, while a small percentage drop out and new ones get added. The solution has to address this volatility.
A way of addressing the challenge¶
An intuitive way of addressing this volatility is this: if the customer whose demand we are predicting has no history of purchasing the product, we look for purchase patterns of other customers along the same route. If no other customers along the route have purchased that product before, which is possible if the product was newly introduced on that route, the next best thing we can do is to look for purchase patterns of customers served by the same Agency that serves the customer we are predicting the demand for.
We will implement this solution and check out how close we can get to the ground truth. The score of this model will serve as a baseline score for the Machine Learning model.
prod_mean_dem = X_train.groupby('Product_id')['Adj_demand'].mean().to_dict()

def predict_mean_dem(prod_id):
    X_train_prod = X_train[X_train.Product_id == prod_id]
    X_test_prod = X_test[X_test.Product_id == prod_id]
    client_prod_mean_dem = X_train_prod.groupby('Client_id')['Adj_demand'].mean().to_dict()
    route_prod_mean_dem = X_train_prod.groupby('Route_id')['Adj_demand'].mean().to_dict()
    agency_prod_mean_dem = X_train_prod.groupby('Agency_id')['Adj_demand'].mean().to_dict()
    for row_id in range(X_test_prod.shape[0]):
        row = X_test_prod.iloc[row_id]
        dem = 1  # default demand if there is no sales history for the product
        if row.Client_id in client_prod_mean_dem:
            dem = client_prod_mean_dem[row.Client_id]
        elif row.Route_id in route_prod_mean_dem:
            dem = route_prod_mean_dem[row.Route_id]
        elif row.Agency_id in agency_prod_mean_dem:
            dem = agency_prod_mean_dem[row.Agency_id]
        elif prod_id in prod_mean_dem:
            dem = prod_mean_dem[prod_id]
        pred_dems[row.id] = dem

pred_dems = np.zeros(X_test.shape[0])
tic = time.time()
for i, prod in enumerate(X_test.Product_id.unique()):
    predict_mean_dem(prod)
    if (i != 0) and (i % 100 == 0):
        toc = time.time()
        print('Processed Products {0}:{1} in {2}m:{3}s'.format(
            i-100, i, int((toc-tic)//60), int((toc-tic)%60)))
        tic = time.time()
pd.DataFrame({'id': X_test.id, 'Demanda_uni_equil': pred_dems}).to_csv('Subm_mean_soln_blog.csv', index=False)
It's a very simple solution. We look up the mean sales of a product hierarchically, in the order Client -> Route -> Agency -> Product, and take the mean of Adj_demand at the first level where we find a sales history. For example, if the client had bought the product before we would take the mean of Adj_demand of those transactions.
If the client has not bought the product before, we would go looking for transactions of other clients along the same route for that product and take the mean of Adj_demand of those transactions. The same logic is applied at the next two levels (Agency and Product). If the product has no sales history at all, we simply predict a 1.
How well does the algorithm score?¶
This simplest of solutions scored 0.5467 and would have gotten us into the top 60% of the Leader Board on Kaggle. :) So, this is our baseline score. Any model that we train should score better than this one.
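For reference, the competition scores submissions with Root Mean Squared Logarithmic Error (RMSLE), where lower is better. Below is a minimal sketch of how the same metric could be computed locally on a held-out week; the arrays passed in the example are hypothetical stand-ins for actual and predicted demands.
def rmsle(y_true, y_pred):
    # Root Mean Squared Logarithmic Error - the metric used on the Leader Board
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))
# Hypothetical example: actual vs. predicted demands for a few transactions
print(rmsle([3, 10, 0, 25], [4, 8, 1, 20]))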
Machine Learning Solution¶
Problem with the previous solution¶
There was no "learning" in the previous solution. And, the solution did not make use of timing data. The predictions for product sale volumes [1, 2, 3, 4, 5, 6, 7] and [7, 6, 5, 4, 3, 2, 1] would be the same, which is 4 - the mean. A better prediction would be 8 in the first case and 0 in the second.
An obvious solution¶
A simple way of addressing this problem is to fit a line for each unique and valid Client + Product combination in the dataset. A valid Client + Product combination is one that has at least one transaction containing both of them. A rough sketch of the idea follows.
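The sketch below is purely illustrative (we will not run it at full scale): it fits a first-degree polynomial per Client + Product group with np.polyfit and extrapolates it one week ahead. Only the column names come from our dataset; the function itself is hypothetical.
def fit_line_per_client_product(train_df, predict_week):
    # Fit Adj_demand = a*Week + b for every valid Client + Product pair and
    # extrapolate to predict_week. Illustrative only: with millions of pairs
    # this loop would be far too slow to be practical.
    preds = {}
    for (client, product), grp in train_df.groupby(['Client_id', 'Product_id']):
        if len(grp) >= 2:  # need at least two points to fit a line
            a, b = np.polyfit(grp.Week, grp.Adj_demand, deg=1)
            preds[(client, product)] = max(a * predict_week + b, 0)
        else:              # fall back to the single observed demand
            preds[(client, product)] = grp.Adj_demand.iloc[0]
    return preds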
Why the obvious solution may not work¶
We know that there are about 880,000 clients and 1800 products. For simplicity, let's assume no transactions with new Client + Product pairs are added after the first week; i.e. all valid Client + Product combinations show up in the first week itself. I made this assumption to drive home the point that, if we take this approach, there would be more than 11 million lines to fit (the first week has more than 11 million transaction records). And that would be a very costly affair, both in terms of time and computational resources.
An alternative solution that works¶
A better alternative is to embed time-based product sales information into the feature set and feed it to a Machine Learning model to learn from it. So, how do we come up with features that capture time-based sales information? We can directly use weekly product sales (i.e. Adj_demand) as features. For example, second week transactions can have first week sales data as a feature. Third week transactions can have first two weeks sales data as two features. And, so on.
An implementation challenge in the alternative solution¶
While this makes perfect sense, implementing it as described is not possible due to the limitations of the algorithms and tools. The problem arises from the fact that transactions in different weeks will have different numbers of time-based features. Week X will have X time-based features: Week 0 will have zero because there is no sales history before Week 0, and Week 6 will have six (sales data of Weeks 0 to 5).
A feature is a column in a DataFrame, and all rows of a DataFrame have the same number of columns. So, the design constraints of the DataFrame prevent different transactions from having different numbers of features.
A way to overcome the implementation challenge¶
One solution is to train the model only on the latter half of the weeks (Weeks 3, 4, 5 and 6), using Adj_demand of weeks prior to them as Lag features. I will let the following table speak for itself.
| | Lag-1 | Lag-2 | Lag-3 |
|---|---|---|---|
| Week 3 transactions | Week-2 Adj_demand | Week-1 Adj_demand | Week-0 Adj_demand |
| Week 4 transactions | Week-3 Adj_demand | Week-2 Adj_demand | Week-1 Adj_demand |
| Week 5 transactions | Week-4 Adj_demand | Week-3 Adj_demand | Week-2 Adj_demand |
| Week 6 transactions | Week-5 Adj_demand | Week-4 Adj_demand | Week-3 Adj_demand |
| Week 7 transactions | Week-6 Adj_demand | Week-5 Adj_demand | Week-4 Adj_demand |
| Week 8 transactions | Week-7 Predictions | Week-6 Adj_demand | Week-5 Adj_demand |
What we are doing is very simple Feature Engineering. We will do more of this later. Let's add the lag features now.
# Stack train and test data and remove the older dataframes
X_data = pd.concat([X_train, X_test], axis=0, copy=True)
del X_train
del X_test
gc.collect()
X_data.shape
I will be defining functions to hold most of the logic of this solution, since we will be using the same logic twice - first to try out the Three-Lag feature solution and then to implement the Five-Lag feature solution.
def add_lag_features(lag_depth):
    global X_data
    for lagnum in range(1, lag_depth+1):
        tic = time.time()
        lag_data = X_data[['Week', 'Client_id', 'Product_id', 'Adj_demand']].copy()
        lag_data.Week = lag_data.Week + lagnum
        lag_data = lag_data.groupby(['Week', 'Client_id', 'Product_id']).mean().reset_index()
        lag_feature_name = 'Lag' + str(lagnum)
        lag_data.rename(columns={'Adj_demand': lag_feature_name}, inplace=True)
        X_data = pd.merge(X_data, lag_data, how='left', sort=True, copy=False)
        del lag_data
        gc.collect()
        print("Added %s feature in %d secs" % (lag_feature_name, time.time()-tic))
    # Drop the sales/returns columns and keep only the weeks that have all the lag features
    X_data = X_data[X_data.columns.difference(['Units_sold', 'Units_returned', 'Earnings', 'Money_returned'])]
    X_data = X_data[X_data.Week >= lag_depth]
add_lag_features(3)
print('X_data.shape: {0}\n'.format(X_data.shape))
X_data.head(3)
Train XGBoost on Lag features¶
We will fit an XGBRegressor model on Week-3 to Week-6 samples with Lag features. The parameters of the model were chosen after a brief search of the parameter space. In the interest of brevity I will not be describing the parameter tuning process in detail, but a rough sketch of the kind of search follows.
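The sketch below holds out Week 6 for validation and reuses the rmsle helper sketched earlier in this post. The parameter grid shown here is illustrative, not the exact grid I searched.
import itertools
def tune_xgb_params():
    # Try a small grid of XGBRegressor parameters, validating on Week 6 (illustrative grid)
    train = X_data[X_data.Week < 6]
    valid = X_data[X_data.Week == 6]
    feat_cols = train.columns.difference(['Week', 'Adj_demand', 'id'])
    best_score, best_params = np.inf, None
    for max_depth, subsample in itertools.product([6, 8, 10], [0.7, 0.85]):
        model = xgb.XGBRegressor(max_depth=max_depth, n_estimators=100,
                                 nthread=4, subsample=subsample, seed=1440)
        model.fit(train[feat_cols], train.Adj_demand)
        preds = model.predict(valid[feat_cols])
        preds[preds < 0] = 0
        score = rmsle(valid.Adj_demand, preds)
        if score < best_score:
            best_score, best_params = score, (max_depth, subsample)
    return best_params, best_score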
# Build and train an XGBRegressor
def train_XGBRegr():
    xgbRegr = xgb.XGBRegressor(max_depth=10, n_estimators=100, objective='reg:linear',
                               nthread=4, subsample=0.85, seed=1440)
    # Train only on the weeks that have Adj_demand (the test weeks are 7 and 8)
    X_train = X_data[X_data.Week < 7]
    X = X_train[X_train.columns.difference(['Week', 'Adj_demand', 'id'])]
    y = X_train.Adj_demand
    tic = time.time()
    xgbRegr.fit(X, y)
    toc = time.time()
    print('Model trained in %dm:%ds' % ((toc-tic)/60, (toc-tic)%60))
    return xgbRegr
xgbRegr = train_XGBRegr()
We will now do the following.
- predict Week-7 demands using the model that we trained above
- add Week-7 predicted demands as the Lag1 feature of Week 8, which is NA otherwise
- predict Week-8 demands
- submit the stacked Week-7 and Week-8 predictions for scoring.
def predict_demands(xgbRegr):
    X_test = X_data[X_data.Week >= 7].copy()
    X_test['id'] = np.arange(X_test.shape[0])
    feat_cols = X_test.columns.difference(['Week', 'Adj_demand', 'id'])
    # Predict Week-7 demands
    X_week7 = X_test[X_test.Week == 7].copy()
    preds_wk7 = xgbRegr.predict(X_week7[feat_cols])
    preds_wk7[preds_wk7 < 0] = 0
    X_week7.Adj_demand = preds_wk7
    # Process Week-8 data: fill the Lag1 feature from Week-7 predictions and compute predictions
    X_week8 = X_test[X_test.Week == 8].copy()
    # Compute Lag1 from Week-7 predictions
    wk7_mean_dems = X_week7.groupby(['Client_id', 'Product_id'])['Adj_demand'].mean().reset_index()
    wk7_mean_dems.rename(columns={'Adj_demand': 'Lag1'}, inplace=True)
    X_week8.drop(['Lag1'], axis=1, inplace=True)
    X_week8 = pd.merge(X_week8, wk7_mean_dems, how='left', sort=True, copy=False)
    # Predict Week-8 demands
    preds_wk8 = xgbRegr.predict(X_week8[feat_cols])
    preds_wk8[preds_wk8 < 0] = 0
    X_week8.Adj_demand = preds_wk8
    # Prepare the submission file
    preds = pd.merge(X_week7[['id', 'Adj_demand']], X_week8[['id', 'Adj_demand']],
                     how='outer').sort_values(by='id')
    subm = pd.DataFrame({'id': range(6999251), 'Demanda_uni_equil': preds.Adj_demand.tolist()})
    # Clean up
    del X_week7, preds_wk7, X_week8, preds_wk8, wk7_mean_dems, preds
    gc.collect()
    return subm
subm_df = predict_demands(xgbRegr)
subm_df.to_csv('Lag3_with_ids_blog.csv', index=False)
The model scored 0.57296. This is worse than the Lookup-based solution, and it bears out our observations from the EDA about the large variances in the product sales data. When the 'mean' performs better than a model looking for patterns, it is a strong indication of high-variance data.
A similar model with 5-Lag features should perform better than the 3-Lag one since it has more historical data to train on. We will have fewer samples to train on, since only Weeks 5 and 6 will have all five lag features. Even so, we will have about 20 million samples, which is quite a handful.
Before we try out the 5-lag feature model, let's find out if we can engineer some features from the product table.
More Feature Engineering¶
The description of a product holds a lot more information than just its name. Following is an example row from the product table.
41,Bimbollos Ext sAjonjoli 6p 480g BIM 41
| id | Name | Pieces | Weight | Brand | id |
|---|---|---|---|---|---|
| 41 | Bimbollos Ext sAjonjoli | 6p | 480g | BIM | 41 |
The low hanging fruits among these are pieces and weight, since they can easily be converted to numerics, are continuous variables and may turn out to be good predictors. Of the top five selling products, three have 4 or 6 pieces and weigh 105g or 125g. The fourth has 2 pieces and weighs 55g. This may be an indication that products with fewer pieces and lower weights sell more than others. So, let's add pieces and weight as two more features.
def add_product_features():
    global X_data
    # Get product table into memory
    products = pd.read_csv('../input/producto_tabla.csv')
    # Extract Weight and Pieces data from the product names
    w = products.NombreProducto.str.extract(r'(\d+)(Kg|g)')
    products['Weight'] = w[0].astype('float') * w[1].map({'Kg': 1000, 'g': 1})
    products['Pieces'] = products.NombreProducto.str.extract(r'(\d+)p ', expand=False).astype('float')
    # Add Weight and Pieces features to the feature set
    X_data = pd.merge(X_data, products[['Producto_ID', 'Weight', 'Pieces']], how='left',
                      left_on='Product_id', right_on='Producto_ID')
    X_data.drop('Producto_ID', axis=1, inplace=True)
    X_data.Pieces.fillna(value=1, inplace=True)
    X_data.Weight.fillna(value=0, inplace=True)
    # Add Weight per Piece as an additional feature
    X_data['WtPerPc'] = X_data.Weight / X_data.Pieces
    # Rearrange columns
    X_data = X_data[['Week', 'Product_id', 'Client_id', 'Agency_id', 'Route_id',
                     'Channel_id', 'Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5',
                     'Weight', 'Pieces', 'WtPerPc', 'Adj_demand']]
The Five Lag Solution¶
| | Lag-1 | Lag-2 | Lag-3 | Lag-4 | Lag-5 |
|---|---|---|---|---|---|
| Week 5 transactions | Week-4 demand | Week-3 demand | Week-2 demand | Week-1 demand | Week-0 demand |
| Week 6 transactions | Week-5 demand | Week-4 demand | Week-3 demand | Week-2 demand | Week-1 demand |
| Week 7 transactions | Week-6 demand | Week-5 demand | Week-4 demand | Week-3 demand | Week-2 demand |
| Week 8 transactions | Week-7 Preds | Week-6 demand | Week-5 demand | Week-4 demand | Week-3 demand |
We will now recreate the feature matrix from initial preprocessed data by creating the five lag features and product features (pieces and weight).
# Let's make space for the new data by deleting some variables.
del X_data, predict_mean_dem, prod_mean_dem, pred_dems
gc.collect()
X_train, X_test = get_preprocessed_data()
X_data = pd.concat([X_train, X_test], axis=0, copy=True)
# Add lag features
lag_depth = 5
add_lag_features(lag_depth)
# Add product features
add_product_features()
print('X_data.shape: {0}\n'.format(X_data.shape))
X_data.head(3)
xgbRegr = train_XGBRegr()
subm_df = predict_demands(xgbRegr)
subm_df.to_csv('Lag5_with_prod_feats_blog.csv', index=False)
This model scored 0.50499, a much better score than the 3-Lag model.
My final submission to Kaggle was the average of the results from the three solutions: the Statistical solution, the XGBRegressor solution with three Lag features and the XGBRegressor solution with five Lag features. It scored 0.46362 on the Public Leader Board and 0.48378 on the Private Leader Board. Averaging the results of multiple well-constructed models generally improves the score a notch.
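For completeness, here is a minimal sketch of how the three submission files produced in this post could be averaged; the output file name is hypothetical.
# Average the predictions of the three submissions produced above
files = ['Subm_mean_soln_blog.csv', 'Lag3_with_ids_blog.csv', 'Lag5_with_prod_feats_blog.csv']
subs = [pd.read_csv(f).sort_values('id').reset_index(drop=True) for f in files]
avg_subm = subs[0][['id']].copy()
avg_subm['Demanda_uni_equil'] = sum(s.Demanda_uni_equil for s in subs) / len(subs)
avg_subm.to_csv('Avg_of_three_solutions_blog.csv', index=False)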
Summary¶
That was a lot of code for one post. Excuse me for that. I didn't want to create a third post.
In this two-post series we understood the business problem behind the competition and went through the Data Science Process to solve it.
After collecting and preprocessing the data we did a comprehensive Exploratory Data Analysis and made some interesting observations about the data.
Having understood the data well we implemented a very simple solution that gave us a good baseline score. We then did some Feature Engineering and implemented a second solution using XGBoost that was based on historical weekly product sales figures and product attributes.
Due to the large variance in the product sales data, the first Machine Learning solution could not even match the simple mean-based solution. But adding more lag features and product features resulted in a better scoring model.
I hope you enjoyed reading the post as much as I did writing it. I welcome you to read my other posts in the 'Kaggle Series', where I share my solutions to other competitions that I participated in. Following are the links:
The second link is to a post where I describe my perspective on the Data Science Process. I am sure you will find it interesting. I have used my solution to another Kaggle competition (RedHat Business Value Prediction) as a case study to describe the process.
Thank you for visiting.