Predicting future demand from historical sales data is something businesses have been doing since time immemorial. It is a critical part of Inventory and Supply Chain Management.
A typical solution looks at sales data, for example the number of items sold, over some period in the past and uses statistics to project demand in the near future. Reasonably good solutions account for seasonal variations, and the more sophisticated ones might even take demographic data and market trends into account.
With the advent of Massive Computing and Big Data technologies, it has become possible to use Machine Learning to go beyond linear projections and make far more accurate predictions than ever before.
Grupo Bimbo, a Mexican multinational bakery product manufacturer, hosted a competition on Kaggle to develop a model that accurately forecasts inventory demand based on historical sales data. More precisely, the task was to forecast the demand for a product in a given week at a particular store.
The intention of this post is to explore the data shared in the competition, surface some interesting insights and describe the Machine Learning solution that got me into the top 15% of the leaderboard.
I will be following the systematic five-stage Data Science Process that I describe in the following two posts:
You may find the case-study based approach interesting: the case study is an actual business problem that was run as a Kaggle competition.
Let's start with the first stage of the process - Data Collection.
Data Collection¶
Following is an extract from the 'Data' page of the competition website.
"The dataset you are given consists of 9 weeks of sales transactions in Mexico. Every week, there are delivery trucks that deliver products to the vendors. Each transaction consists of sales and returns. Returns are the products that are unsold and expired. The demand for a product in a certain week is defined as the sales this week subtracted by the return next week."
There are five CSV files in the dataset: train.csv, test.csv, cliente_tabla.csv, producto_tabla.csv and town_state.csv. No, 'producto' and 'tabla' are not misspelled words. File names and field names within the CSVs have a Mexican flavor, much like their burritos.
train.csv and test.csv contain all the transaction data. The other three CSVs are supporting files that contain the names of clients, products and towns/states. Following are the fields in each.
| train.csv | test.csv | town_state.csv | cliente_tabla.csv | producto_tabla.csv |
| --- | --- | --- | --- | --- |
| Week number | Week number | Agency ID | Client ID | Product ID |
| Sales Depot ID | Sales Depot ID | Town Name | Client Name | Product Name |
| Sales Channel ID | Sales Channel ID | State Name | | |
| Route ID | Route ID | | | |
| Client ID | Client ID | | | |
| Product ID | Product ID | | | |
| Units sold this week | | | | |
| Money from sales this week | | | | |
| Unsold units returned next week | | | | |
| Money lost from unsold product returns | | | | |
| Adjusted Demand | | | | |
You will find the same information on the competition's 'Data' page. It's not as well tabulated, though.
Let's have a peek at each of these files.
train.csv and test.csv¶
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import LabelEncoder
import seaborn as sns
# Get data into memory.
# Since the data is in gigabytes it helps to use data types of just the right size
data_types = {'Semana': np.int8, 'Agencia_ID': np.int16, 'Canal_ID': np.int8,
'Ruta_SAK': np.int16, 'Cliente_ID': np.int64,'Producto_ID': np.uint16,
'Venta_uni_hoy': np.int16, 'Venta_hoy': np.float64, 'Dev_uni_proxima': np.int32,
'Dev_proxima': np.float64, 'Demanda_uni_equil': np.int16}
train_data = pd.read_csv('../input/train.csv', dtype = data_types)
# Convert Spanish names to their English equivalents.
field_names = {'Semana': 'Week', 'Agencia_ID': 'Depot_id', 'Canal_ID': 'Channel_id',
'Ruta_SAK': 'Route_id', 'Cliente_ID': 'Client_id', 'Producto_ID': 'Product_id',
'Venta_uni_hoy': 'Units_sold', 'Venta_hoy': 'Earnings',
'Dev_uni_proxima': 'Units_returned', 'Dev_proxima': 'Money_returned',
'Demanda_uni_equil': 'Adj_demand'}
train_data.rename(columns = field_names, inplace = True)
data_types = {'Semana': np.int8, 'Agencia_ID': np.int16, 'Canal_ID': np.int8,
'Ruta_SAK': np.int16, 'Cliente_ID': np.int64,'Producto_ID': np.uint16}
test_data = pd.read_csv('../input/test.csv', dtype = data_types)
field_names = {'Semana': 'Week', 'Agencia_ID': 'Depot_id', 'Canal_ID': 'Channel_id',
'Ruta_SAK': 'Route_id', 'Cliente_ID': 'Client_id', 'Producto_ID': 'Product_id'}
test_data.rename(columns = field_names, inplace = True)
print('Shape of train_data: {0}'.format(train_data.shape))
print('Shape of test_data: {0}\n'.format(test_data.shape))
print('Sample rows of train_data')
train_data.head(3)
print('Sample rows of test_data')
test_data.head(3)
Observations:
- 74 million transaction records in training set
- 7 million transaction records in test set
- 11 fields in training set; 7 in test set; sales and demand fields missing in test set
- Adj_demand has to be predicted for the test set records
Let's learn more about the columns by printing their summary statistics.
train_data[['Week', 'Depot_id', 'Channel_id', 'Route_id', 'Client_id', 'Product_id']].describe()
Observations:
- data covers 7 weeks, Week 3 to Week 9; the quantiles indicate that records are distributed fairly uniformly across the weeks
- not much can be said about Agencies (Depot_id) other than that their IDs range from 1110 to 25759
- there are 11 Sales channels (Channel_id), with the majority of transactions happening on Channel 1
- not much can be said about Clients (Client_id) other than that their IDs range from 26 to a 10-digit number
- not much can be said about Products (Product_id) other than that their IDs range from 41 to 49997
Let's look at the Demand columns.
train_data[train_data.columns[6:]].describe()
Observations:
- the maximum units sold in a transaction is 7200; minimum is zero and average is 7.3 units per transaction
- the maximum earnings from a single transaction is 647,360 pesos; minimum is zero and average is 68.5 pesos per transaction
- very few transactions have returns; the maximum number of units returned in a single transaction is 250,000
- Adjusted Demand very closely follows Units_sold, since returns are close to zero
Let's now have a look at test_data.
test_data.describe()
Observations:
- transactions for weeks 10 and 11 are provided in the test set
- the ID field distributions are not very different from those in train_data
The number of transactions in the test set is only about one third of the number in the train set. All weeks in the train set have between 10 and 11 million records, whereas the test-set weeks have about 3.5 million records each. It is not clear why the weekly test-set size is so much smaller; the answer lies in how the sampling was done, which is not disclosed to the participants. The samples could have been collected from the first two days of each week, or a small percentage of the transactions from each day may have been selected.
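These weekly counts are easy to confirm with something like:
# Records per week (train covers weeks 3 to 9, test covers weeks 10 and 11)
print(train_data.Week.value_counts().sort_index())
print(test_data.Week.value_counts().sort_index())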
Let's take a peek at the data from the other tables: producto_tabla, cliente_tabla, town_state
producto_tabla.csv¶
products = pd.read_csv('../input/producto_tabla.csv')
print('Shape of products_tabla: {0}'.format(products.shape))
print('Max product id: {0}'.format(products.Producto_ID.max()))
products.head()
Observations:
- there are 2592 products and their IDs are not sequential
- the product name (NombreProducto) captures much more than just the name; it also encodes the weight, piece count and a couple of other attributes (see the sketch below)
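A rough sketch of how the weight and piece count could be pulled out of NombreProducto with regular expressions; the exact naming convention isn't documented, so treat the patterns and the sample name as assumptions:
import re

def parse_product_name(name):
    # Pull out a weight like '62g', '1Kg' or '750ml' and a piece count like '4p'
    weight = re.search(r'(\d+)\s*(Kg|kg|g|ml)\b', name)
    pieces = re.search(r'(\d+)\s*p\b', name)
    return {'weight': weight.group(0) if weight else None,
            'pieces': int(pieces.group(1)) if pieces else None}

parse_product_name('Nito 1p 62g BIM 1278')   # {'weight': '62g', 'pieces': 1}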
cliente_tabla.csv¶
clients = pd.read_csv('../input/cliente_tabla.csv')
print('Shape of cliente_tabla: {0}'.format(clients.shape))
print('Max cliente id: {0}'.format(clients.Cliente_ID.max()))
clients.head()
Observations:
- there are 935362 clients and their IDs are not sequential
town_state.csv¶
town_state = pd.read_csv('../input/town_state.csv')
print('Shape of town_state: {0}'.format(town_state.shape))
print('Number of towns: {0}'.format(town_state.Town.nunique()))
print('Number of states: {0}'.format(town_state.State.nunique()))
print('Number of agencies: {0}'.format(town_state.Agencia_ID.nunique()))
print('Max agency id: {0}'.format(town_state.Agencia_ID.max()))
town_state.head()
Observations:
- this table maps Agencies to towns and states
- there are 790 agencies serving 260 towns in 33 states
- agency IDs are not sequential
Data Preprocessing¶
There is not much preprocessing to be done since the data is pretty clean and well structured. However, there is one thing that can be done with regard to the ID fields. We observed that none of the IDs are sequential; their ranges are much larger than their counts.
Label-encoding will make them sequential and will also be helpful for one-hot-encoding them later.
Encode Product ID
products['Product_ID_enc'] = LabelEncoder().fit_transform(products.Producto_ID)
train_data = train_data.merge(products[['Producto_ID', 'Product_ID_enc']], how='left',
left_on = 'Product_id', right_on = 'Producto_ID')
train_data.drop(['Producto_ID', 'Product_id'], axis = 1, inplace = True)
test_data = test_data.merge(products[['Producto_ID', 'Product_ID_enc']], how='left',
left_on = 'Product_id', right_on = 'Producto_ID')
test_data.drop(['Producto_ID', 'Product_id'], axis = 1, inplace = True)
Encode Depot ID
town_state['Agency_id_enc'] = LabelEncoder().fit_transform(town_state.Agencia_ID)
train_data = train_data.merge(town_state[['Agencia_ID', 'Agency_id_enc']], how='left',
left_on = 'Depot_id', right_on = 'Agencia_ID')
train_data.drop(['Agencia_ID', 'Depot_id'], axis = 1, inplace = True)
test_data = test_data.merge(town_state[['Agencia_ID', 'Agency_id_enc']], how='left',
left_on = 'Depot_id', right_on = 'Agencia_ID')
test_data.drop(['Agencia_ID', 'Depot_id'], axis = 1, inplace = True)
Encode Client ID
clients['Client_id_enc'] = LabelEncoder().fit_transform(clients.Cliente_ID)
train_data = train_data.merge(clients[['Cliente_ID', 'Client_id_enc']], how='left',
left_on = 'Client_id', right_on = 'Cliente_ID')
train_data.drop(['Cliente_ID', 'Client_id'], axis = 1, inplace = True)
test_data = test_data.merge(clients[['Cliente_ID', 'Client_id_enc']], how='left',
left_on = 'Client_id', right_on = 'Cliente_ID')
test_data.drop(['Cliente_ID', 'Client_id'], axis = 1, inplace = True)
Encode Week, Route id and Channel_id
To encode the remaining categoricals - Week, Route_id and Channel_id - we will stack the train and test data, fit a LabelEncoder on the combined values, and then use it to transform the train and test data separately.
enc_cols = ['Week', 'Route_id', 'Channel_id']
trns_data = train_data[enc_cols].append(test_data[enc_cols])
# Encoding Week is overkill since we know the values are 3 to 11.
# Subtracting 3 will do the job.
train_data['Week'] = train_data['Week'] - 3
test_data['Week'] = test_data['Week'] - 3
# Encode Route_id
route_enc = LabelEncoder().fit(trns_data.Route_id)
train_data['Route_id'] = route_enc.transform(train_data.Route_id)
test_data['Route_id'] = route_enc.transform(test_data.Route_id)
# Encode Channel_id. We can't apply the trick that we applied on Week because
# Channel 3 is missing.
channel_enc = LabelEncoder().fit(trns_data.Channel_id)
train_data['Channel_id'] = channel_enc.transform(train_data.Channel_id)
test_data['Channel_id'] = channel_enc.transform(test_data.Channel_id)
# We will rename the categoricals that we encoded earlier to remove '_enc'
train_data.rename(columns = {'Product_ID_enc': 'Product_id',
'Agency_id_enc': 'Agency_id',
'Client_id_enc': 'Client_id'}, inplace = True)
test_data.rename(columns = {'Product_ID_enc': 'Product_id',
'Agency_id_enc': 'Agency_id',
'Client_id_enc': 'Client_id'}, inplace = True)
train_data.head()
test_data.head()
We are now ready for EDA.
Exploratory Data Analysis¶
Let's take a look at the distribution of Week, Product, Channel, Route, Agency and Client.
# Stack train and test data to understand distributions on the full dataset
X_data = train_data.append(test_data)
fig, axes = plt.subplots(2, 3)
fig.set_size_inches(12, 8)
cols = ['Week', 'Product_id', 'Channel_id', 'Route_id', 'Agency_id', 'Client_id']
for i, col in enumerate(cols):
sns.distplot(X_data[col], kde = False, ax = axes[i//3, i%3]);
plt.tight_layout()
Observations:
- Week: train-set weeks (Weeks 0 to 6) are evenly distributed and have between 10 and 11 million samples each. Test-set weeks (Weeks 7 and 8) are also evenly distributed, but have fewer samples (about 3.5 million each)
- Products: The huge towers suggest that some products are hugely popular compared to others
- Channels: Channel-0 is predominantly used. Other channels are rarely used, if at all.
- Routes: There are a bunch of high-usage routes, some others that see moderate usage, and the rest see very low usage
- Agencies: Equating the number of transactions from an agency to its size, there are a handful of big agencies, quite a few medium-sized agencies and a bunch of small agencies.
- Clients: The graph suggests a roughly linear decline without any outliers. ID values reflect the size of the client, with some exceptions towards the end - clients with smaller IDs have more transactions than clients with larger IDs.
Before we look at sales and demand distributions, let's plot the weekly distributions of Products, Routes, Agencies and Clients, and find out if there are any patterns.
Distribution of Products across weeks¶
weekly_prod_distrib = X_data.groupby(['Week', 'Product_id']).Product_id.agg(['size'])['size'].unstack(level=0)
top_prods = set()
for i in range(9):
top_prods = top_prods.union(set(weekly_prod_distrib[i].sort_values(ascending = False)[:50].index.values))
top_prods = np.sort(list(top_prods)[:50])
fig, axes = plt.subplots(2, 1)
fig.set_size_inches(12, 8)
weekly_prod_distrib[list(range(7))].loc[top_prods].plot.bar(ax = axes[0])
axes[0].set_title('Distributions of top 50 selling products in train-set Weeks (0 to 6)')
axes[0].set(xlabel = 'Product Ids', ylabel = 'Sales Count')
weekly_prod_distrib[[7,8]].loc[top_prods].plot.bar(ax = axes[1])
axes[1].set_title('Distributions of top 50 selling products in test-set Weeks (7 and 8)')
axes[1].set(xlabel = 'Product Ids', ylabel = 'Sales Count');
#plt.tight_layout();
The distributions of the top 50 products in the training and test sets look pretty similar, although the vertical scales differ because of the difference in weekly sample sizes. This is a strong indicator that product demand in weeks 7 and 8 should be strongly correlated with product sales in the weeks for which we have sales data (weeks 0 to 6).
There are differences for some products, such as 114, 357, 600 and 1931, which other features might help explain.
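A rough way to quantify this similarity is to correlate the per-product transaction counts between weeks, something like:
# Correlation of per-product transaction counts: test weeks (7, 8) vs train weeks (0 to 6).
# High values support the idea that demand follows recent sales.
week_corr = weekly_prod_distrib.corr()
print(week_corr.loc[[7, 8], list(range(7))])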
Let's look at how clients have been behaving over the weeks. Comparing the distribution of transactions w.r.t clients week over week will tell us whether clients have been consistent with their orders or whether there's a lot of volatility.
Distribution of clients across weeks¶
fig, axes = plt.subplots(3, 3)
fig.set_size_inches(10, 8)
for i in range(9):
sns.distplot(X_data[X_data.Week == i].Client_id, kde = False, ax = axes[i//3, i%3]);
axes[i//3, i%3].set_title('Distributions of clients in Week {0}'.format(i))
fig.tight_layout()
The distribution of clients is remarkably similar across weeks. What this means is that sales to clients didn't change much over the 9-week window; in other words, the number of transactions per client remained consistent week over week.
Analysing weekly client transactions w.r.t products ordered¶
Let's go a level deeper and get a sense of the nature of orders from clients. We will pick some clients and try to understand the number of different products they ordered, the frequency of their orders and whether the orders are repeat orders (i.e. orders with the same products that were ordered earlier) or new orders (i.e. orders with products that were different from earlier orders).
rand_clients = np.random.choice(X_data.Client_id, 9, replace = False)
rand_client_trx = X_data[X_data.Client_id.isin(rand_clients)].copy()
rand_client_trx['Product_id'] = LabelEncoder().fit_transform(rand_client_trx.Product_id)
rand_client_trx['Client_id'] = LabelEncoder().fit_transform(rand_client_trx.Client_id)
fig, axes = plt.subplots(3,3)
fig.set_size_inches(10, 10)
for i in range(9):
recs = rand_client_trx[rand_client_trx.Client_id == i]
sns.stripplot(x="Week", y="Product_id", data=recs, ax=axes[i/3,i%3])
axes[i/3,i%3].set_title('Client - {0}'.format(i))
plt.tight_layout()
Wow, that's a colorful set of plots. Each plot shows the products bought by one client each week. Product IDs have been re-encoded over just the products bought by these 9 clients, and client IDs have also been re-encoded; the plots don't show how many units of each product were bought.
Generally, we see that clients buy some products repeatedly, week over week, and also try out different products in different weeks. The dot patterns within each plot repeat across weeks. For example, Client-0's plot has dots in the 0 to 40 product-ID range and some in the 90 to 140 range. What this means is that this client places repeat orders for products in the 0 to 40 range and new orders for a changing mix of products in the 90 to 140 range. This client doesn't order any products between 40 and 90 or above 140.
Each client exhibits a different ordering pattern, both in the mix of products and in the number of products ordered, but there is a pattern nonetheless.
Client-1 and Client-2 show the two extremes. Client-1 orders very few products and orders nothing in weeks 1, 2 and 3, whereas Client-2 orders a large number of products, most of them repeat orders, and keeps a consistent pattern across all nine weeks.
Some clients try out different products while keeping a consistent set of repeat orders. Clients 3 and 4 show this behavior: they have repeat orders in the 0 to 30 range and try out different products in different weeks across the rest of the range.
What we can infer from this is that, while the orders do have some patterns, they are not very consistent. Hence, demand predictions may not be very accurate.
Gauging the volatility in the dataset¶
The competition's data page mentions that the sets of clients/products/agencies/routes differ slightly from week to week. Before we start exploring weekly demand distributions, let's find out how many clients/products/agencies/routes repeat week over week, how many get dropped and how many get added.
weekly_clients = X_data[['Week', 'Client_id']].groupby('Week').apply(lambda x: x.Client_id.unique())
weekly_products = X_data[['Week', 'Product_id']].groupby('Week').apply(lambda x: x.Product_id.unique())
weekly_agencies = X_data[['Week', 'Agency_id']].groupby('Week').apply(lambda x: x.Agency_id.unique())
weekly_routes = X_data[['Week', 'Route_id']].groupby('Week').apply(lambda x: x.Route_id.unique())
client_overlap = [[0, len(weekly_clients[0]), 0]]
product_overlap = [[0, len(weekly_products[0]), 0]]
agency_overlap = [[0, len(weekly_agencies[0]), 0]]
route_overlap = [[0, len(weekly_routes[0]), 0]]
for i in range(1, 9):
# Clients
repeat = np.intersect1d(weekly_clients[i], weekly_clients[i-1], assume_unique=True)
new = len(set(weekly_clients[i]) - set(repeat))
missing = len(set(weekly_clients[i-1]) - set(repeat))
client_overlap.append([missing, len(repeat), new])
# Products
repeat = np.intersect1d(weekly_products[i], weekly_products[i-1], assume_unique=True)
new = len(set(weekly_products[i]) - set(repeat))
missing = len(set(weekly_products[i-1]) - set(repeat))
product_overlap.append([missing, len(repeat), new])
# Agencies
repeat = np.intersect1d(weekly_agencies[i], weekly_agencies[i-1], assume_unique=True)
new = len(set(weekly_agencies[i]) - set(repeat))
missing = len(set(weekly_agencies[i-1]) - set(repeat))
agency_overlap.append([missing, len(repeat), new])
# Routes
repeat = np.intersect1d(weekly_routes[i], weekly_routes[i-1], assume_unique=True)
new = len(set(weekly_routes[i]) - set(repeat))
missing = len(set(weekly_routes[i-1]) - set(repeat))
route_overlap.append([missing, len(repeat), new])
fig, axes = plt.subplots(2, 2)
fig.set_size_inches(10, 8)
inds = ['Week'+str(i) for i in range(9)]
cols = ['Missing', 'Repeat', 'New']
pd.DataFrame(client_overlap, columns = cols, index = inds).plot.bar(ax = axes[0, 0],
stacked = True)
axes[0, 0].set_title('Week over week Clientele overlap')
pd.DataFrame(product_overlap, columns = cols, index = inds).plot.bar(ax = axes[0, 1],
stacked = True)
axes[0, 1].set_title('Week over week Products overlap')
pd.DataFrame(agency_overlap, columns = cols, index = inds).plot.bar(ax = axes[1, 0],
stacked = True)
axes[1, 0].set_title('Week over week Agencies overlap')
pd.DataFrame(route_overlap, columns = cols, index = inds).plot.bar(ax = axes[1, 1],
stacked = True)
axes[1, 1].set_title('Week over week Routes overlap');
plt.tight_layout();
del weekly_clients, weekly_products, weekly_agencies, weekly_routes
del client_overlap, product_overlap, agency_overlap, route_overlap
import gc
gc.collect()
Clients, Products and Routes show a 5% to 10% variation week over week. Agencies are pretty consistent.
Weekly demand distributions¶
There are five sales-related fields: Units_sold, Earnings, Units_returned, Money_returned and Adj_demand. Earnings and Money_returned are the money equivalents of Units_sold and Units_returned, which makes them redundant.
We saw earlier that transactions with returns are very few (about 3% of the dataset). Hence their impact on the adjusted demand (Adj_demand = Units_sold - Units_returned) is insignificant, and as a consequence Adj_demand is nearly identical to Units_sold.
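A quick check along these lines backs up both of those claims:
# Fraction of transactions with returned units, and how often Adj_demand differs from Units_sold
returns_frac = (train_data.Units_returned > 0).mean()
mismatch_frac = (train_data.Adj_demand != train_data.Units_sold).mean()
print('Transactions with returns: {0:.1%}'.format(returns_frac))
print('Adj_demand != Units_sold: {0:.1%}'.format(mismatch_frac))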
We will only look at the distribution of Adj_demand. Visualizing product sales data for 74 million transactions, about 1800 products and 0.88 million clients can get complicated. To keep it simple, I will randomly pick 6 products and 5 clients within each of those products, and show how their demand varies week over week. This is similar to what we did in the weekly order analysis.
np.random.seed(127)
rand_products = np.random.choice(train_data.Product_id, 6, replace = True)
rand_product_trx = train_data[train_data.Product_id.isin(rand_products)].copy()
rand_product_trx['Product_id'] = LabelEncoder().fit_transform(rand_product_trx.Product_id)
fig, axes = plt.subplots(3,2)
fig.set_size_inches(10, 12)
for i in range(6):
recs = rand_product_trx[rand_product_trx.Product_id == i].copy()
clnts = np.random.choice(recs.Client_id, 5, replace = True)
recs = recs[recs.Client_id.isin(clnts)].copy()
recs['Client_id'] = LabelEncoder().fit_transform(recs.Client_id)
t = recs.groupby(['Week', 'Client_id']).apply(lambda x: x.Adj_demand.values[0]).unstack(level = 1).fillna(value = 0)
t.plot(marker='o', linestyle=':', ax=axes[i//2, i%2])
axes[i//2, i%2].set_title('Product - {0}'.format(i))
plt.tight_layout()
Each plot tells a different story. Most clients don't show consistent patterns, but some do - for example Client-0 in Product-0, Client-0 in Product-2, Client-1 in Product-3 and Client-2 in Product-4.
It will be interesting to see how successful the machine learning models are at dealing with these variations.
I will end this post here and develop Machine Learning models for this dataset in the follow-up post: Grupo Bimbo Demand Prediction - Part 2. Please continue reading.