Predicting future demand from historical sales data is something businesses have been doing since time immemorial. It is a critical part of Inventory and Supply Chain Management.
A typical solution looks at sales data, for example the number of items sold, over some period in the past and uses statistics to project demand in the near future. Reasonably good solutions account for seasonal variations, and the more sophisticated ones might even take demographic data and market trends into account.
With the advent of Massive Computing and Big Data technologies, it has become possible to use Machine Learning to go beyond linear projections and make far more accurate predictions than ever before.
Grupo Bimbo, a Mexican multinational bakery product manufacturer, hosted a competition on Kaggle to develop a model that accurately forecasts inventory demand based on historical sales data. More precisely, the task was to forecast the demand for a product in a given week at a particular store.
The intention of this post is to explore the data shared in the competition, surface some interesting insights and describe the Machine Learning solution that got me into the top 15% of the leaderboard.
I will be following the systematic five-stage Data Science Process that I describe in the following two posts:
You may find the case-study based approach interesting: the case study is an actual business problem that was run as a Kaggle competition.
Let's start with the first stage of the process - Data Collection.
Data Collection¶
Following is an extract from the 'Data' page of the competition website.
"The dataset you are given consists of 9 weeks of sales transactions in Mexico. Every week, there are delivery trucks that deliver products to the vendors. Each transaction consists of sales and returns. Returns are the products that are unsold and expired. The demand for a product in a certain week is defined as the sales this week subtracted by the return next week."
There are five CSV files in the dataset: train.csv, test.csv, cliente_tabla.csv, producto_tabla.csv and town_state.csv. No, 'producto' and 'tabla' are not misspelled words. File names and field names within the CSVs have a Mexican flavor, much like their burritos.
train.csv and test.csv contain all the transaction data. The other three CSVs are supporting files that contain the names of clients, products and towns/states. Following are the fields in each.
| train.csv | test.csv | town_state.csv | cliente_tabla.csv | producto_tabla.csv |
| --- | --- | --- | --- | --- |
| Week number | Week number | Agency ID | Client ID | Product ID |
| Sales Depot ID | Sales Depot ID | Town Name | Client Name | Product Name |
| Sales Channel ID | Sales Channel ID | State Name | | |
| Route ID | Route ID | | | |
| Client ID | Client ID | | | |
| Product ID | Product ID | | | |
| Units sold this week | | | | |
| Money from sales this week | | | | |
| Unsold units returned next week | | | | |
| Money lost from unsold product returns | | | | |
| Adjusted Demand | | | | |
You will find the same information on the competition's 'Data' page. It's not as well tabulated, though.
Let's have a peek at each of these files.
train.csv and test.csv¶
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import LabelEncoder
import seaborn as sns
# Get data into memory.
# Since the data is in gigabytes it helps to use data types of just the right size
data_types = {'Semana': np.int8, 'Agencia_ID': np.int16, 'Canal_ID': np.int8,
'Ruta_SAK': np.int16, 'Cliente_ID': np.int64,'Producto_ID': np.uint16,
'Venta_uni_hoy': np.int16, 'Venta_hoy': np.float64, 'Dev_uni_proxima': np.int32,
'Dev_proxima': np.float64, 'Demanda_uni_equil': np.int16}
train_data = pd.read_csv('../input/train.csv', dtype = data_types)
# Convert Spanish names to their English equivalents.
field_names = {'Semana': 'Week', 'Agencia_ID': 'Depot_id', 'Canal_ID': 'Channel_id',
'Ruta_SAK': 'Route_id', 'Cliente_ID': 'Client_id', 'Producto_ID': 'Product_id',
'Venta_uni_hoy': 'Units_sold', 'Venta_hoy': 'Earnings',
'Dev_uni_proxima': 'Units_returned', 'Dev_proxima': 'Money_returned',
'Demanda_uni_equil': 'Adj_demand'}
train_data.rename(columns = field_names, inplace = True)
data_types = {'Semana': np.int8, 'Agencia_ID': np.int16, 'Canal_ID': np.int8,
'Ruta_SAK': np.int16, 'Cliente_ID': np.int64,'Producto_ID': np.uint16}
test_data = pd.read_csv('../input/test.csv', dtype = data_types)
field_names = {'Semana': 'Week', 'Agencia_ID': 'Depot_id', 'Canal_ID': 'Channel_id',
'Ruta_SAK': 'Route_id', 'Cliente_ID': 'Client_id', 'Producto_ID': 'Product_id'}
test_data.rename(columns = field_names, inplace = True)
print('Shape of train_data: {0}'.format(train_data.shape))
print('Shape of test_data: {0}\n'.format(test_data.shape))
print('Sample rows of train_data')
train_data.head(3)
print('Sample rows of test_data')
test_data.head(3)
Observations:
- 74 million transaction records in training set
- 7 million transaction records in test set
- 11 fields in training set; 7 in test set; sales and demand fields missing in test set
- Adj_demand has to be predicted for the test set records
Let's learn more about the columns by printing their summary statistics.
train_data[['Week', 'Depot_id', 'Channel_id', 'Route_id', 'Client_id', 'Product_id']].describe()
Observations:
- data covers 7 weeks, Week 3 to Week 9; the quantiles indicate that records are distributed fairly uniformly across the weeks
- not much can be said about Agencies (Depot_id) other than that their IDs range from 1110 to 25759
- there are 11 Sales channels (Channel_id), with the majority of transactions happening on Channel 1
- not much can be said about Clients (Client_id) other than that their IDs range from 26 to a 10-digit number
- not much can be said about Products (Product_id) other than that their IDs range from 41 to 49997
Let's look at the Demand columns.
train_data[train_data.columns[6:]].describe()
Observations:
- the maximum units sold in a transaction is 7200; minimum is zero and average is 7.3 units per transaction
- the maximum earnings from a single transaction is 647,360 pesos; minimum is zero and average is 68.5 pesos per transaction
- very few transactions have returns; the maximum number of units returned in a single transaction is 250,000
- Adjusted Demand very closely follows Units_sold, since returns are close to zero
Let's now have a look at test_data.
test_data.describe()
Observations:
- transactions for weeks 10 and 11 are provided in the test set
- the ID field distributions are not very different from those in train_data
The number of transactions in the test set is only about one third of the number in the train set. All weeks in the train set have between 10 and 11 million records, whereas the test-set weeks have about 3.5 million records each. It is not clear why the weekly test-set size is so much smaller; the answer lies in how the sampling was done, which is not disclosed to the participants. The samples could have been collected from the first two days of each week, or a small percentage of the transactions from each day may have been selected.
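These weekly counts are easy to confirm with something like:
# Records per week (train covers weeks 3 to 9, test covers weeks 10 and 11)
print(train_data.Week.value_counts().sort_index())
print(test_data.Week.value_counts().sort_index())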
Let's take a peek at the data from the other tables: producto_tabla, cliente_tabla, town_state
producto_tabla.csv¶
products = pd.read_csv('../input/producto_tabla.csv')
print('Shape of products_tabla: {0}'.format(products.shape))
print('Max product id: {0}'.format(products.Producto_ID.max()))
products.head()
Observations:
- there are 2592 products and their IDs are not sequential
- the product name (NombreProducto) captures much more than just the name; it also encodes the weight, piece count and a couple of other attributes (see the sketch below)
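A rough sketch of how the weight and piece count could be pulled out of NombreProducto with regular expressions; the exact naming convention isn't documented, so treat the patterns and the sample name as assumptions:
import re

def parse_product_name(name):
    # Pull out a weight like '62g', '1Kg' or '750ml' and a piece count like '4p'
    weight = re.search(r'(\d+)\s*(Kg|kg|g|ml)\b', name)
    pieces = re.search(r'(\d+)\s*p\b', name)
    return {'weight': weight.group(0) if weight else None,
            'pieces': int(pieces.group(1)) if pieces else None}

parse_product_name('Nito 1p 62g BIM 1278')   # {'weight': '62g', 'pieces': 1}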
cliente_tabla.csv¶
clients = pd.read_csv('../input/cliente_tabla.csv')
print('Shape of cliente_tabla: {0}'.format(clients.shape))
print('Max cliente id: {0}'.format(clients.Cliente_ID.max()))
clients.head()
Observations:
- there are 935362 clients and their IDs are not sequential
town_state.csv¶
town_state = pd.read_csv('../input/town_state.csv')
print('Shape of town_state: {0}'.format(town_state.shape))
print('Number of towns: {0}'.format(town_state.Town.nunique()))
print('Number of states: {0}'.format(town_state.State.nunique()))
print('Number of agencies: {0}'.format(town_state.Agencia_ID.nunique()))
print('Max agency id: {0}'.format(town_state.Agencia_ID.max()))
town_state.head()
Observations:
- this table maps Agencies to towns and states
- there are 790 agencies serving 260 towns in 33 states
- agency IDs are not sequential
Data Preprocessing¶
There is not much preprocessing to be done since the data is pretty clean and well structured. However, there is one thing that can be done with regard to the ID fields. We observed that none of the IDs are sequential; their ranges are much larger than their counts.
Label-encoding will make them sequential and will also be helpful for one-hot-encoding them later.
Encode Product ID
products['Product_ID_enc'] = LabelEncoder().fit_transform(products.Producto_ID)
train_data = train_data.merge(products[['Producto_ID', 'Product_ID_enc']], how='left',
left_on = 'Product_id', right_on = 'Producto_ID')
train_data.drop(['Producto_ID', 'Product_id'], axis = 1, inplace = True)
test_data = test_data.merge(products[['Producto_ID', 'Product_ID_enc']], how='left',
left_on = 'Product_id', right_on = 'Producto_ID')
test_data.drop(['Producto_ID', 'Product_id'], axis = 1, inplace = True)
Encode Depot ID
town_state['Agency_id_enc'] = LabelEncoder().fit_transform(town_state.Agencia_ID)
train_data = train_data.merge(town_state[['Agencia_ID', 'Agency_id_enc']], how='left',
left_on = 'Depot_id', right_on = 'Agencia_ID')
train_data.drop(['Agencia_ID', 'Depot_id'], axis = 1, inplace = True)
test_data = test_data.merge(town_state[['Agencia_ID', 'Agency_id_enc']], how='left',
left_on = 'Depot_id', right_on = 'Agencia_ID')
test_data.drop(['Agencia_ID', 'Depot_id'], axis = 1, inplace = True)
Encode Client ID
clients['Client_id_enc'] = LabelEncoder().fit_transform(clients.Cliente_ID)
train_data = train_data.merge(clients[['Cliente_ID', 'Client_id_enc']], how='left',
left_on = 'Client_id', right_on = 'Cliente_ID')
train_data.drop(['Cliente_ID', 'Client_id'], axis = 1, inplace = True)
test_data = test_data.merge(clients[['Cliente_ID', 'Client_id_enc']], how='left',
left_on = 'Client_id', right_on = 'Cliente_ID')
test_data.drop(['Cliente_ID', 'Client_id'], axis = 1, inplace = True)
Encode Week, Route id and Channel_id
To encode the remaining categoricals - Week, Route_id and Channel_id - we will stack the train and test data, fit a LabelEncoder on the combined values, and then use it to transform the train and test data separately.
enc_cols = ['Week', 'Route_id', 'Channel_id']
trns_data = train_data[enc_cols].append(test_data[enc_cols])
# Encoding Week is overkill since we know the values are 3 to 11.
# Subtracting 3 will do the job.
train_data['Week'] = train_data['Week'] - 3
test_data['Week'] = test_data['Week'] - 3
# Encode Route_id
route_enc = LabelEncoder().fit(trns_data.Route_id)
train_data['Route_id'] = route_enc.transform(train_data.Route_id)
test_data['Route_id'] = route_enc.transform(test_data.Route_id)
# Encode Channel_id. We can't apply the trick that we applied on Week because
# Channel 3 is missing.
channel_enc = LabelEncoder().fit(trns_data.Channel_id)
train_data['Channel_id'] = channel_enc.transform(train_data.Channel_id)
test_data['Channel_id'] = channel_enc.transform(test_data.Channel_id)
# We will rename the categoricals that we encoded earlier to remove '_enc'
train_data.rename(columns = {'Product_ID_enc': 'Product_id',
'Agency_id_enc': 'Agency_id',
'Client_id_enc': 'Client_id'}, inplace = True)
test_data.rename(columns = {'Product_ID_enc': 'Product_id',
'Agency_id_enc': 'Agency_id',
'Client_id_enc': 'Client_id'}, inplace = True)
train_data.head()
test_data.head()
We are now ready for EDA.
Exploratory Data Analysis¶
Let's take a look at the distribution of Week, Product, Channel, Route, Agency and Client.
# Stack train and test data to understand distributions on the full dataset
X_data = train_data.append(test_data)
fig, axes = plt.subplots(2, 3)
fig.set_size_inches(12, 8)
cols = ['Week', 'Product_id', 'Channel_id', 'Route_id', 'Agency_id', 'Client_id']
for i, col in enumerate(cols):
sns.distplot(X_data[col], kde = False, ax = axes[i//3, i%3]);
plt.tight_layout()
Observations:
- Week: train-set weeks (Weeks 0 to 6) are evenly distributed and have between 10 and 11 million samples each. Test-set weeks (Weeks 7 and 8) are also evenly distributed, but have fewer samples (about 3.5 million each)
- Products: The huge towers suggest that some products are hugely popular compared to others
- Channels: Channel-0 is predominantly used. Other channels are rarely used, if at all.
- Routes: There are a bunch of high-usage routes, some others that see moderate usage, and the rest see very low usage
- Agencies: Equating the number of transactions from an agency to its size, there are a handful of big agencies, quite a few medium-sized agencies and a bunch of small agencies.
- Clients: The graph suggests a roughly linear decline without any outliers. ID values reflect the size of the client, with some exceptions towards the end - clients with smaller IDs have more transactions than clients with larger IDs.
Before we look at sales and demand distributions, let's plot the weekly distributions of Products, Routes, Agencies and Clients, and find out if there are any patterns.
Distribution of Products across weeks¶
weekly_prod_distrib = X_data.groupby(['Week', 'Product_id']).Product_id.agg(['size'])['size'].unstack(level=0)
top_prods = set()
for i in range(9):
top_prods = top_prods.union(set(weekly_prod_distrib[i].sort_values(ascending = False)[:50].index.values))
top_prods = np.sort(list(top_prods)[:50])
fig, axes = plt.subplots(2, 1)
fig.set_size_inches(12, 8)
weekly_prod_distrib[list(range(7))].loc[top_prods].plot.bar(ax = axes[0])
axes[0].set_title('Distributions of top 50 selling products in train-set Weeks (0 to 6)')
axes[0].set(xlabel = 'Product Ids', ylabel = 'Sales Count')
weekly_prod_distrib[[7,8]].loc[top_prods].plot.bar(ax = axes[1])
axes[1].set_title('Distributions of top 50 selling products in test-set Weeks (7 and 8)')
axes[1].set(xlabel = 'Product Ids', ylabel = 'Sales Count');
#plt.tight_layout();
The distributions of the top 50 products in the training and test sets look pretty similar, although the vertical scales differ because of the difference in weekly sample sizes. This is a strong indicator that product demand in weeks 7 and 8 should be strongly correlated with product sales in the weeks for which we have sales data (weeks 0 to 6).
There are differences for some products, such as 114, 357, 600 and 1931, which other features might help explain.
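A rough way to quantify this similarity is to correlate the per-product transaction counts between weeks, something like:
# Correlation of per-product transaction counts: test weeks (7, 8) vs train weeks (0 to 6).
# High values support the idea that demand follows recent sales.
week_corr = weekly_prod_distrib.corr()
print(week_corr.loc[[7, 8], list(range(7))])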
Let's look at how clients have been behaving over the weeks. Comparing the distribution of transactions w.r.t clients week over week will tell us whether clients have been consistent with their orders or whether there's a lot of volatility.
Distribution of clients across weeks¶
fig, axes = plt.subplots(3, 3)
fig.set_size_inches(10, 8)
for i in range(9):
sns.distplot(X_data[X_data.Week == i].Client_id, kde = False, ax = axes[i//3, i%3]);
axes[i//3, i%3].set_title('Distributions of clients in Week {0}'.format(i))
fig.tight_layout()
The distribution of clients is remarkably similar across weeks. What this means is that sales to clients didn't change much over the 9-week window; in other words, the number of transactions per client remained consistent week over week.
Analysing weekly client transactions w.r.t products ordered¶
Let's go a level deeper and get a sense of the nature of orders from clients. We will pick some clients and try to understand the number of different products they ordered, the frequency of their orders and whether the orders are repeat orders (i.e. orders with the same products that were ordered earlier) or new orders (i.e. orders with products that were different from earlier orders).
rand_clients = np.random.choice(X_data.Client_id, 9, replace = False)
rand_client_trx = X_data[X_data.Client_id.isin(rand_clients)].copy()
rand_client_trx['Product_id'] = LabelEncoder().fit_transform(rand_client_trx.Product_id)
rand_client_trx['Client_id'] = LabelEncoder().fit_transform(rand_client_trx.Client_id)
fig, axes = plt.subplots(3,3)
fig.set_size_inches(10, 10)
for i in range(9):
recs = rand_client_trx[rand_client_trx.Client_id == i]
sns.stripplot(x="Week", y="Product_id", data=recs, ax=axes[i/3,i%3])
axes[i/3,i%3].set_title('Client - {0}'.format(i))
plt.tight_layout()
Wow, that's a colorful set of plots. Each plot shows the products bought by one client each week. Product IDs have been re-encoded over just the products bought by these 9 clients, and client IDs have also been re-encoded; the plots don't show how many units of each product were bought.
Generally, we see that clients buy some products repeatedly, week over week, and also try out different products in different weeks. The dot patterns within each plot repeat across weeks. For example, Client-0's plot has dots in the 0 to 40 product-ID range and some in the 90 to 140 range. What this means is that this client places repeat orders for products in the 0 to 40 range and new orders for a changing mix of products in the 90 to 140 range. This client doesn't order any products between 40 and 90 or above 140.
Each client exhibits a different ordering pattern, both in the mix of products and in the number of products ordered, but there is a pattern nonetheless.
Client-1 and Client-2 show the two extremes. Client-1 orders very few products and orders nothing in weeks 1, 2 and 3, whereas Client-2 orders a large number of products, most of them repeat orders, and keeps a consistent pattern across all nine weeks.
Some clients try out different products while keeping a consistent set of repeat orders. Clients 3 and 4 show this behavior: they have repeat orders in the 0 to 30 range and try out different products in different weeks across the rest of the range.
What we can infer from this is that, while the orders do have some patterns, they are not very consistent. Hence, demand predictions may not be very accurate.
Gauging the volatility in the dataset¶
The competition's data page mentions that the sets of clients/products/agencies/routes differ slightly from week to week. Before we start exploring weekly demand distributions, let's find out how many clients/products/agencies/routes repeat week over week, how many get dropped and how many get added.
weekly_clients = X_data[['Week', 'Client_id']].groupby('Week').apply(lambda x: x.Client_id.unique())
weekly_products = X_data[['Week', 'Product_id']].groupby('Week').apply(lambda x: x.Product_id.unique())
weekly_agencies = X_data[['Week', 'Agency_id']].groupby('Week').apply(lambda x: x.Agency_id.unique())
weekly_routes = X_data[['Week', 'Route_id']].groupby('Week').apply(lambda x: x.Route_id.unique())
client_overlap = [[0, len(weekly_clients[0]), 0]]
product_overlap = [[0, len(weekly_products[0]), 0]]
agency_overlap = [[0, len(weekly_agencies[0]), 0]]
route_overlap = [[0, len(weekly_routes[0]), 0]]
for i in range(1, 9):
# Clients
repeat = np.intersect1d(weekly_clients[i], weekly_clients[i-1], assume_unique=True)
new = len(set(weekly_clients[i]) - set(repeat))
missing = len(set(weekly_clients[i-1]) - set(repeat))
client_overlap.append([missing, len(repeat), new])
# Products
repeat = np.intersect1d(weekly_products[i], weekly_products[i-1], assume_unique=True)
new = len(set(weekly_products[i]) - set(repeat))
missing = len(set(weekly_products[i-1]) - set(repeat))
product_overlap.append([missing, len(repeat), new])
# Agencies
repeat = np.intersect1d(weekly_agencies[i], weekly_agencies[i-1], assume_unique=True)
new = len(set(weekly_agencies[i]) - set(repeat))
missing = len(set(weekly_agencies[i-1]) - set(repeat))
agency_overlap.append([missing, len(repeat), new])
# Routes
repeat = np.intersect1d(weekly_routes[i], weekly_routes[i-1], assume_unique=True)
new = len(set(weekly_routes[i]) - set(repeat))
missing = len(set(weekly_routes[i-1]) - set(repeat))
route_overlap.append([missing, len(repeat), new])
fig, axes = plt.subplots(2, 2)
fig.set_size_inches(10, 8)
inds = ['Week'+str(i) for i in range(9)]
cols = ['Missing', 'Repeat', 'New']
pd.DataFrame(client_overlap, columns = cols, index = inds).plot.bar(ax = axes[0, 0],
stacked = True)
axes[0, 0].set_title('Week over week Clientele overlap')
pd.DataFrame(product_overlap, columns = cols, index = inds).plot.bar(ax = axes[0, 1],
stacked = True)
axes[0, 1].set_title('Week over week Products overlap')
pd.DataFrame(agency_overlap, columns = cols, index = inds).plot.bar(ax = axes[1, 0],
stacked = True)
axes[1, 0].set_title('Week over week Agencies overlap')
pd.DataFrame(route_overlap, columns = cols, index = inds).plot.bar(ax = axes[1, 1],
stacked = True)
axes[1, 1].set_title('Week over week Routes overlap');
plt.tight_layout();
del weekly_clients, weekly_products, weekly_agencies, weekly_routes
del client_overlap, product_overlap, agency_overlap, route_overlap
import gc
gc.collect()
Clients, Products and Routes show a 5% to 10% variation week over week. Agencies are pretty consistent.
Weekly demand distributions¶
There are five sales-related fields: Units_sold, Earnings, Units_returned, Money_returned and Adj_demand. Earnings and Money_returned are the money equivalents of Units_sold and Units_returned, which makes them redundant.
We saw earlier that transactions with returns are very few (about 3% of the dataset). Hence their impact on the adjusted demand (Adj_demand = Units_sold - Units_returned) is insignificant, and as a consequence Adj_demand is nearly identical to Units_sold.
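A quick check along these lines backs up both of those claims:
# Fraction of transactions with returned units, and how often Adj_demand differs from Units_sold
returns_frac = (train_data.Units_returned > 0).mean()
mismatch_frac = (train_data.Adj_demand != train_data.Units_sold).mean()
print('Transactions with returns: {0:.1%}'.format(returns_frac))
print('Adj_demand != Units_sold: {0:.1%}'.format(mismatch_frac))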
We will only look at the distribution of Adj_demand. Visualizing product sales data for 74 million transactions, about 1800 products and 0.88 million clients can get complicated. To keep it simple, I will randomly pick 6 products and 5 clients within each of those products, and show how their demand varies week over week. This is similar to what we did in the weekly order analysis.
np.random.seed(127)
rand_products = np.random.choice(train_data.Product_id, 6, replace = True)
rand_product_trx = train_data[train_data.Product_id.isin(rand_products)].copy()
rand_product_trx['Product_id'] = LabelEncoder().fit_transform(rand_product_trx.Product_id)
fig, axes = plt.subplots(3,2)
fig.set_size_inches(10, 12)
for i in range(6):
recs = rand_product_trx[rand_product_trx.Product_id == i].copy()
clnts = np.random.choice(recs.Client_id, 5, replace = True)
recs = recs[recs.Client_id.isin(clnts)].copy()
recs['Client_id'] = LabelEncoder().fit_transform(recs.Client_id)
t = recs.groupby(['Week', 'Client_id']).apply(lambda x: x.Adj_demand.values[0]).unstack(level = 1).fillna(value = 0)
t.plot(marker='o', linestyle=':', ax=axes[i//2, i%2])
axes[i//2, i%2].set_title('Product - {0}'.format(i))
plt.tight_layout()
Each plot tells a different story. Most clients don't show consistent patterns, but some do - for example Client-0 in Product-0, Client-0 in Product-2, Client-1 in Product-3 and Client-2 in Product-4.
It will be interesting to see how successful the machine learning models are at dealing with these variations.
I will end this post here and develop Machine Learning models for this dataset in the follow-up post: Grupo Bimbo Demand Prediction - Part 2. Please continue reading.