Predicting future demand based on historical sales data is one of those things that businesses have been doing for time immemorial. It is a critical part of the Inventory and Supply Chain Management.

A typical solution would look at the sales data, for example the number of items sold, for a certain period in the past and use statistics to project the demand in the near future. Reasonably good solutions would include seasonal variations and the more sophisticated ones might even take into account the demographics data and the market trend.

With the advent of Massive Computing and Big Data technologies it has now become possible to use Machine Learning to go beyond linear predictions and make much more accurate predictions than it has ever been possible.

Groupo Bimbo, a Mexican multinational bakery product manufacturing company, hosted a competition in Kaggle to develop a model to accurately forecast inventory demand based on historical sales data. More exactly, the requirement was to forecast the demand of a product for a given week, at a particular store.

The intention of this post is to explore the data shared in this competition, find some interesting insights and describe the Machine Learning solution that got me into the top 15% of Leader Board.

I will be following the systematic five staged Data Science Process that I describe in the following two posts:

You may find the case-study based approach interesting. The case-study is based on an actual business problem and it was a Kaggle competition.

Let's start with the first stage of the process - Data Collection.

Data Collection

Following is an extract from the competition websites 'Data' page.

"The dataset you are given consists of 9 weeks of sales transactions in Mexico. Every week, there are delivery trucks that deliver products to the vendors. Each transaction consists of sales and returns. Returns are the products that are unsold and expired. The demand for a product in a certain week is defined as the sales this week subtracted by the return next week."

There are five CSV files in the dataset: train.csv, test.csv, client_tabla.csv, producto_tabla.csv and town_state.csv. No, 'Producto' and 'tabla' are not misspelled words. File names and field names within csvs have a Mexican flavor, much like their Burritos.

train.csv and test.csv have all the transaction data. The other three CSVs are supporting files, that contain names of clientes, products and towns/states. Following are the fields in them.

train.csv test.csv town_state.csv client_tabla.csv producto_tabla.csv
Week number Week number Agency ID Client ID Product ID
Sales Depot ID Sales Depot ID Town Name Client Name Product Name
Sales Channel ID Sales Channel ID State Name
Route ID Route ID
Client ID Client ID
Product ID Product ID
Units sold this week
Money from sales this week
Unsold units returned next week
Money lost from unsold product returns
Adjusted Demand

You will find the same information in the competitions 'Data' page. It's not as well tabulated, though.

Let's have a peek at each of these files.

train.csv and test.csv

In [3]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import LabelEncoder

import seaborn as sns

# Get data into memory. 
# Since the data is in gigabytes it helps to use data types of just the right size
data_types = {'Semana': np.int8, 'Agencia_ID': np.int16, 'Canal_ID': np.int8, 
       'Ruta_SAK': np.int16, 'Cliente_ID': np.int64,'Producto_ID': np.uint16, 
       'Venta_uni_hoy': np.int16, 'Venta_hoy': np.float64, 'Dev_uni_proxima': np.int32,
       'Dev_proxima': np.float64, 'Demanda_uni_equil': np.int16}
train_data = pd.read_csv('../input/train.csv', dtype = data_types)
# Convert Spanish names to their English equivalents.
field_names = {'Semana': 'Week', 'Agencia_ID': 'Depot_id', 'Canal_ID': 'Channel_id', 
       'Ruta_SAK': 'Route_id', 'Cliente_ID': 'Client_id', 'Producto_ID': 'Product_id',
       'Venta_uni_hoy': 'Units_sold', 'Venta_hoy': 'Earnings', 
       'Dev_uni_proxima': 'Units_returned', 'Dev_proxima': 'Money_returned',
       'Demanda_uni_equil': 'Adj_demand'}
train_data.rename(columns = field_names, inplace = True)

data_types = {'Semana': np.int8, 'Agencia_ID': np.int16, 'Canal_ID': np.int8, 
       'Ruta_SAK': np.int16, 'Cliente_ID': np.int64,'Producto_ID': np.uint16}
test_data = pd.read_csv('../input/test.csv', dtype = data_types)
field_names = {'Semana': 'Week', 'Agencia_ID': 'Depot_id', 'Canal_ID': 'Channel_id', 
       'Ruta_SAK': 'Route_id', 'Cliente_ID': 'Client_id', 'Producto_ID': 'Product_id'}
test_data.rename(columns = field_names, inplace = True)

print('Shape of train_data: {0}'.format(train_data.shape))
print('Shape of test_data: {0}\n'.format(test_data.shape))
print('Sample rows of train_data')
train_data.head(3)
Shape of train_data: (74180464, 11)
Shape of test_data: (6999251, 7)

Sample rows of train_data
Out[3]:
Week Depot_id Channel_id Route_id Client_id Product_id Units_sold Earnings Units_returned Money_returned Adj_demand
0 3 1110 7 3301 15766 1212 3 25.14 0 0 3
1 3 1110 7 3301 15766 1216 4 33.52 0 0 4
2 3 1110 7 3301 15766 1238 4 39.32 0 0 4
In [4]:
print('Sample rows of test_data')
test_data.head(3)
Sample rows of test_data
Out[4]:
id Week Depot_id Channel_id Route_id Client_id Product_id
0 0 11 4037 1 2209 4639078 35305
1 1 11 2237 1 1226 4705135 1238
2 2 10 2045 1 2831 4549769 32940

Observations:

  • 74 million transaction records in training set
  • 7 million transaction records in test set
  • 11 fields in training set; 7 in test set; sales and demand fields missing in test set
  • Adj_demand has to be predicted for the test set records

Let's know more about the columns by printing thier summary statistics.

In [5]:
train_data[['Week', 'Depot_id', 'Channel_id', 'Route_id', 'Client_id', 'Product_id']].describe()
Out[5]:
Week Depot_id Channel_id Route_id Client_id Product_id
count 74180464.000000 74180464.000000 74180464.000000 74180464.000000 7.418046e+07 74180464.000000
mean 5.950021 2536.508561 1.383181 2114.855291 1.802119e+06 20840.813993
std 2.013175 4075.123651 1.463266 1487.744180 2.349577e+06 18663.919031
min 3.000000 1110.000000 1.000000 1.000000 2.600000e+01 41.000000
25% 4.000000 1312.000000 1.000000 1161.000000 3.567670e+05 1242.000000
50% 6.000000 1613.000000 1.000000 1286.000000 1.193385e+06 30549.000000
75% 8.000000 2036.000000 1.000000 2802.000000 2.371091e+06 37426.000000
max 9.000000 25759.000000 11.000000 9991.000000 2.015152e+09 49997.000000

Observations:

  • data is for 7 weeks: Week 3 to week 9; quantiles indicate that records are uniformly distributed across all the weeks
  • not much can be said about Agencies (Depot_id) other than that thier Ids range from 1110 to 25759
  • there are 11 Sales channels (Channel_id) with majority of the transactions happening on Channel 1
  • not much can be said about Clients (Client_id) other than that thier Ids range from 26 to some 10 digit number
  • not much can be said about Products (Product_id) other than that thier Ids range from 41 to some 49997

Let's look at the Demand columns.

In [6]:
train_data[train_data.columns[6:]].describe()
Out[6]:
Units_sold Earnings Units_returned Money_returned Adj_demand
count 74180464.000000 74180464.000000 74180464.000000 74180464.000000 74180464.000000
mean 7.310163 68.544523 0.130258 1.243248 7.224564
std 21.967337 338.979516 29.323204 39.215523 21.771193
min 0.000000 0.000000 0.000000 0.000000 0.000000
25% 2.000000 16.760000 0.000000 0.000000 2.000000
50% 3.000000 30.000000 0.000000 0.000000 3.000000
75% 7.000000 56.100000 0.000000 0.000000 6.000000
max 7200.000000 647360.000000 250000.000000 130760.000000 5000.000000

Observations:

  • the maximum units sold in a transaction is 7200; minimum is zero and average is 7.3 units per transaction
  • the maximum earnings from a single transaction is 647,360 pesos; minimum is zero and average is 68.5 pesos per transaction
  • very few transactions have returns, with the maximum number of returns per transaction is 250,000
  • Adjusted Demand very closely follows Units_sold, since returns are close to zero

Let's now have a look at test_data.

In [7]:
test_data.describe()
Out[7]:
id Week Depot_id Channel_id Route_id Client_id Product_id
count 6999251.00000 6999251.000000 6999251.000000 6999251.000000 6999251.000000 6.999251e+06 6999251.000000
mean 3499625.00000 10.494462 2504.462764 1.401874 2138.014097 1.819128e+06 22163.069696
std 2020509.86883 0.499969 4010.228289 1.513404 1500.391868 2.938910e+06 18698.156308
min 0.00000 10.000000 1110.000000 1.000000 1.000000 2.600000e+01 41.000000
25% 1749812.50000 10.000000 1311.000000 1.000000 1159.000000 3.558290e+05 1242.000000
50% 3499625.00000 10.000000 1612.000000 1.000000 1305.000000 1.200109e+06 31507.000000
75% 5249437.50000 11.000000 2034.000000 1.000000 2804.000000 2.387881e+06 40930.000000
max 6999250.00000 11.000000 25759.000000 11.000000 9950.000000 2.015152e+09 49997.000000

Observations:

  • transactions for weeks 10 and 11 are provided in the test set
  • all the Id fields aren't very different from train_data

The number of transactions in the test set only about one third the number of transactions in the train set. All weeks in the train set have between 10 and 11 million records per week, whereas weeks in test set have about 3.5 million records. It is not clear why the weekly test set size is much smaller than the training set. The answer lies in the way the sampling was done, which is not disclosed to the participants. The samples could have been collected from the first two days of the week or a small percent of the transactions from each day was selected.

Let's take a peek at the data from the other tables: producto_tabla, cliente_tabla, town_state

producto_tabla.csv

In [8]:
products = pd.read_csv('../input/producto_tabla.csv')
print('Shape of products_tabla: {0}'.format(products.shape))
print('Max product id: {0}'.format(products.Producto_ID.max()))
products.head()
Shape of products_tabla: (2592, 2)
Max product id: 49997
Out[8]:
Producto_ID NombreProducto
0 0 NO IDENTIFICADO 0
1 9 Capuccino Moka 750g NES 9
2 41 Bimbollos Ext sAjonjoli 6p 480g BIM 41
3 53 Burritos Sincro 170g CU LON 53
4 72 Div Tira Mini Doradita 4p 45g TR 72

Observations:

  • there are 2592 products and their ids are not sequential
  • product name (NombreProducto) captures much more than just the name; there is weight, piece count and a couple of other data

cliente_tabla.csv

In [9]:
clients = pd.read_csv('../input/cliente_tabla.csv')
print('Shape of cliente_tabla: {0}'.format(clients.shape))
print('Max cliente id: {0}'.format(clients.Cliente_ID.max()))
clients.head()
Shape of cliente_tabla: (935362, 2)
Max cliente id: 2015152015
Out[9]:
Cliente_ID NombreCliente
0 0 SIN NOMBRE
1 1 OXXO XINANTECATL
2 2 SIN NOMBRE
3 3 EL MORENO
4 4 SDN SER DE ALIM CUERPO SA CIA DE INT

Observations:

  • there are 935362 clients and their ids are not sequential

town_state.csv

In [10]:
town_state = pd.read_csv('../input/town_state.csv')
print('Shape of town_state: {0}'.format(town_state.shape))
print('Number of towns: {0}'.format(town_state.Town.nunique()))
print('Number of states: {0}'.format(town_state.State.nunique()))
print('Number of agencies: {0}'.format(town_state.Agencia_ID.nunique()))
print('Max agency id: {0}'.format(town_state.Agencia_ID.max()))
town_state.head()
Shape of town_state: (790, 3)
Number of towns: 260
Number of states: 33
Number of agencies: 790
Max agency id: 25769
Out[10]:
Agencia_ID Town State
0 1110 2008 AG. LAGO FILT MÉXICO, D.F.
1 1111 2002 AG. AZCAPOTZALCO MÉXICO, D.F.
2 1112 2004 AG. CUAUTITLAN ESTADO DE MÉXICO
3 1113 2008 AG. LAGO FILT MÉXICO, D.F.
4 1114 2029 AG.IZTAPALAPA 2 MÉXICO, D.F.

Observations:

  • this table maps Agencies to towns and states
  • there are 790 agencies serving 260 towns in 33 states
  • agency ods are not sequential

Data Preprocessing

There is not much preprocessing to be done since the data is pretty clean and well structured. However, there is one thing that be done with regards to the id fields. We oberved that none of the ids are sequential. Their range is much larger than thier counts.

Label-Encoding will make them sequential and will also be helpful in one-hot-encoding them later.

Encode Product ID

In [11]:
products['Product_ID_enc'] = LabelEncoder().fit_transform(products.Producto_ID)
train_data = train_data.merge(products[['Producto_ID', 'Product_ID_enc']], how='left',
                              left_on = 'Product_id', right_on = 'Producto_ID')
train_data.drop(['Producto_ID', 'Product_id'], axis = 1, inplace = True)

test_data = test_data.merge(products[['Producto_ID', 'Product_ID_enc']], how='left',
                              left_on = 'Product_id', right_on = 'Producto_ID')
test_data.drop(['Producto_ID', 'Product_id'], axis = 1, inplace = True)

Encode Depot ID

In [12]:
town_state['Agency_id_enc'] = LabelEncoder().fit_transform(town_state.Agencia_ID)
train_data = train_data.merge(town_state[['Agencia_ID', 'Agency_id_enc']], how='left',
                              left_on = 'Depot_id', right_on = 'Agencia_ID')
train_data.drop(['Agencia_ID', 'Depot_id'], axis = 1, inplace = True)

test_data = test_data.merge(town_state[['Agencia_ID', 'Agency_id_enc']], how='left',
                              left_on = 'Depot_id', right_on = 'Agencia_ID')
test_data.drop(['Agencia_ID', 'Depot_id'], axis = 1, inplace = True)

Encode Client ID

In [13]:
clients['Client_id_enc'] = LabelEncoder().fit_transform(clients.Cliente_ID)
train_data = train_data.merge(clients[['Cliente_ID', 'Client_id_enc']], how='left',
                              left_on = 'Client_id', right_on = 'Cliente_ID')
train_data.drop(['Cliente_ID', 'Client_id'], axis = 1, inplace = True)

test_data = test_data.merge(clients[['Cliente_ID', 'Client_id_enc']], how='left',
                              left_on = 'Client_id', right_on = 'Cliente_ID')
test_data.drop(['Cliente_ID', 'Client_id'], axis = 1, inplace = True)

Encode Week, Route id and Channel_id

For encoding the remaining categoricals - Week, Route id and Channel_id, we will stack train and test data, fit a LabelEncoder and then use it to tranform train and test data separately.

In [14]:
enc_cols = ['Week', 'Route_id', 'Channel_id']
trns_data = train_data[enc_cols].append(test_data[enc_cols])

# Encoding Week is an overkill since we know that values are 3 to 11. 
# Subtracting 3 will do the job.
train_data['Week'] = train_data['Week'] - 3
test_data['Week'] = test_data['Week'] - 3

# Encode Route_id
week_enc = LabelEncoder().fit(trns_data.Route_id)
train_data['Route_id'] = week_enc.transform(train_data.Route_id)
test_data['Route_id'] = week_enc.transform(test_data.Route_id)

# Encode Channel_id. We can't apply the trick that we applied on Week because 
# Channel 3 is missing.
channel_enc = LabelEncoder().fit(trns_data.Channel_id)
train_data['Channel_id'] = week_enc.transform(train_data.Channel_id)
test_data['Channel_id'] = week_enc.transform(test_data.Channel_id)

# We will rename the categoricals that we encoded earlier to remove '_enc'
train_data.rename(columns = {'Product_ID_enc': 'Product_id', 
                             'Agency_id_enc': 'Agency_id',
                             'Client_id_enc': 'Client_id'}, inplace = True)
test_data.rename(columns = {'Product_ID_enc': 'Product_id', 
                             'Agency_id_enc': 'Agency_id',
                             'Client_id_enc': 'Client_id'}, inplace = True)
train_data.head()
Out[14]:
Week Channel_id Route_id Units_sold Earnings Units_returned Money_returned Adj_demand Product_id Agency_id Client_id
0 0 6 1360 3 25.14 0 0 3 103 0 4755
1 0 6 1360 4 33.52 0 0 4 104 0 4755
2 0 6 1360 4 39.32 0 0 4 109 0 4755
3 0 6 1360 4 33.52 0 0 4 110 0 4755
4 0 6 1360 3 22.92 0 0 3 111 0 4755
In [13]:
test_data.head()
Out[13]:
id Week Channel_id Route_id Product_id Agency_id Client_id
0 0 8 0 983 1319 520 826441
1 1 8 0 377 109 441 863370
2 2 7 0 1138 915 385 777487
3 3 8 0 1728 2058 59 871528
4 4 8 0 284 114 53 287300

We are now ready for EDA.

Exploratory Data Analysis

Let's take a look at the distribution of Week, Product, Channel, Route, Agency and Client.

In [25]:
# Stack train and test data to understand distributions on the full dataset
X_data = train_data.append(test_data)

fig, axes = plt.subplots(2, 3)
fig.set_size_inches(12, 8)

cols = ['Week', 'Product_id', 'Channel_id', 'Route_id', 'Agency_id', 'Client_id']
for i, col in enumerate(cols):
    sns.distplot(X_data[col], kde = False, ax = axes[i/3, i%3]);
plt.tight_layout()

Observations:

  • Week: train-set weeks (Weeks 0 to 6) are evenly distributed and have between 10 to 11 million samples each. test-set weeks (Weeks 7 and 8) are also evenly distributed, but have fewer samples (about 3.5 million each)
  • Products: The huge towers suggest that there are some products that are hugely popular compared to others
  • Channels: Channel-0 is predominantly used. Other channels are rarely used, if at all.
  • Routes: There are a bunch of high usage routes, some others that see moderate usage and the rest very low usage
  • Agencies: Equating the number of transactions from an Agency to its size, there are a handful of big Agencies, quite a few medium-sized agencies and a bunch of small agencies.
  • Clients: The graph suggests a linear distribution of clients without any outliers. Id values reflect the size of the client with some exceptions towards the end - clients with smaller id values have more transactions compared to clients with bigger id values.

Before we look at sales and demand distributions, let's plot the weekly distributions of Products, Routes, Agencies and Clients, and find out if there are any patterns.

Distribution of Products across weeks

In [22]:
weekly_prod_distrib = X_data.groupby(['Week', 'Product_id']).Product_id.agg(['size'])['size'].unstack(level=0)
top_prods = set()
for i in range(9):
    top_prods = top_prods.union(set(weekly_prod_distrib[i].sort_values(ascending = False)[:50].index.values))
top_prods = np.sort(list(top_prods)[:50])

fig, axes = plt.subplots(2, 1)
fig.set_size_inches(12, 8)

weekly_prod_distrib[range(7)].ix[top_prods].plot.bar(ax = axes[0])
axes[0].set_title('Distributions of top 50 selling products in train-set Weeks (0 to 6)')
axes[0].set(xlabel = 'Product Ids', ylabel = 'Sales Count')
weekly_prod_distrib[[7,8]].ix[top_prods].plot.bar(ax = axes[1])
axes[1].set_title('Distributions of top 50 selling products in test-set Weeks (7 and 8)')
axes[1].set(xlabel = 'Product Ids', ylabel = 'Sales Count');
#plt.tight_layout();
Out[22]:
[, ]

The distributions of top 50 products in training and test sets look pretty similar, although the vertical scales are different because of the difference in the weekly sample sizes. This is a strong indicator that product demands in weeks 7 and 8 must be strongly correlated to product sales in weeks for which we have sales data (weeks 0 to 6).

There are differences in some products like 114, 357, 600, 1931, which other features might help answer.

Let's look at how clients have been behaving over the weeks. Comparing the distribution of transactions w.r.t clients week over week will tell us if clients have been consistent with thier orders or whether there's a lot of volatility.

Distribution of clients across weeks

In [17]:
fig, axes = plt.subplots(3, 3)
fig.set_size_inches(10, 8)

for i in range(9):
    sns.distplot(X_data[X_data.Week == i].Client_id, kde = False, ax = axes[i/3, i%3]);
    axes[i/3, i%3].set_title('Distributions of clients in Week {0}'.format(i))
fig.tight_layout()

The distribution of Clients is remarkably similar across weeks. What this means is that sales to clients didn't change much over the 9 week window. In other words, the number of transactions with clients remained consistent week over week.

Analysing weekly client transactions w.r.t products ordered

Let's go a level deeper and get a sense of the nature of orders from clients. We will pick some clients and try to understand the number of different products they ordered, the frequency of their orders and whether the orders are repeat orders (i.e. orders with the same products that were ordered earlier) or new orders (i.e. orders with products that were different from earlier orders).

In [70]:
rand_clients = np.random.choice(X_data.Client_id, 9, replace = False)
rand_client_trx = X_data[X_data.Client_id.isin(rand_clients)].copy()
rand_client_trx['Product_id'] = LabelEncoder().fit_transform(rand_client_trx.Product_id)
rand_client_trx['Client_id'] = LabelEncoder().fit_transform(rand_client_trx.Client_id)

fig, axes = plt.subplots(3,3)
fig.set_size_inches(10, 10)
for i in range(9):
    recs = rand_client_trx[rand_client_trx.Client_id == i]
    sns.stripplot(x="Week", y="Product_id", data=recs, ax=axes[i/3,i%3])
    axes[i/3,i%3].set_title('Client - {0}'.format(i))
plt.tight_layout()

Wow, that's a colorful plot. Each plot shows products bought by clients each week. Product ids are label-encoded for products bought by all the 9 clients. The plot doesn't show how many of each products were bought. Client ids have also been label-encoded.

Generally, we see that clients have been buying some products repeatedly, week over week, and also try out different products in different weeks. We see that the dotted lines within each plot show a pattern that repeats across weeks. For example, Client-0's plot have dots in the 0 to 40 product id range and some in the 90 to 140 range. What this means is that this customer's orders are repeat orders for products in the range 0 to 40 and new orders for a mix of products in the 90 to 140 range. He doesn't order any products in the ranges 40 to 90 and 140 and above.

Each client exhibits a different ordering pattern, both w.r.t the mix of patterns and w.r.t the number of products, but there is a pattern nonetheless.

Client-1 and Client-2 show us the two extremes. Client-1 orders very few products and doesn't order anything in weeks 1, 2 and 3. Whereas, Client-2 orders a large number of products, most of them repeat orders, and has a consistent pattern across all the nine weeks.

Some clients seem to try out different products, while have a consistent set of repeat orders. Clients 3 and 4 show this behavior. They have repeat orders in range 0 to 30 and try out different products in different weeks in the rest of the range.

What we can infer from this is that, while the orders do have some patterns, they are not very consistent. Hence, demand predictions may not be very accurate.

Gauging the volatility in the dataset

It is mentioned in the data page of the competition that clients/products/agencies/routes slightly differ from week to week. Before we start exploring weekly demand distributions let's find out how many clients/products/agencies/routes repeat week over week, how many get dropped off and how many get added.

In [42]:
weekly_clients = X_data[['Week', 'Client_id']].groupby('Week').apply(lambda x: x.Client_id.unique())
weekly_products = X_data[['Week', 'Product_id']].groupby('Week').apply(lambda x: x.Product_id.unique())
weekly_agencies = X_data[['Week', 'Agency_id']].groupby('Week').apply(lambda x: x.Agency_id.unique())
weekly_routes = X_data[['Week', 'Route_id']].groupby('Week').apply(lambda x: x.Route_id.unique())
client_overlap = [[0, len(weekly_clients[0]), 0]]
product_overlap = [[0, len(weekly_products[0]), 0]]
agency_overlap = [[0, len(weekly_agencies[0]), 0]]
route_overlap = [[0, len(weekly_routes[0]), 0]]

for i in range(1, 9):
    # Clients
    repeat = np.intersect1d(weekly_clients[i], weekly_clients[i-1], assume_unique=True)
    new = len(set(weekly_clients[i]) - set(repeat))
    missing = len(set(weekly_clients[i-1]) - set(repeat))
    client_overlap.append([missing, len(repeat), new])
    # Products
    repeat = np.intersect1d(weekly_products[i], weekly_products[i-1], assume_unique=True)
    new = len(set(weekly_products[i]) - set(repeat))
    missing = len(set(weekly_products[i-1]) - set(repeat))
    product_overlap.append([missing, len(repeat), new])
    # Agencies
    repeat = np.intersect1d(weekly_agencies[i], weekly_agencies[i-1], assume_unique=True)
    new = len(set(weekly_agencies[i]) - set(repeat))
    missing = len(set(weekly_agencies[i-1]) - set(repeat))
    agency_overlap.append([missing, len(repeat), new])
    # Routes
    repeat = np.intersect1d(weekly_routes[i], weekly_routes[i-1], assume_unique=True)
    new = len(set(weekly_routes[i]) - set(repeat))
    missing = len(set(weekly_routes[i-1]) - set(repeat))
    route_overlap.append([missing, len(repeat), new])

fig, axes = plt.subplots(2, 2)
fig.set_size_inches(10, 8)

inds = ['Week'+str(i) for i in range(9)]
cols = ['Missing', 'Repeat', 'New']
pd.DataFrame(client_overlap, columns = cols, index = inds).plot.bar(ax = axes[0, 0], 
                                                                    stacked = True)
axes[0, 0].set_title('Week over week Clientele overlap')
pd.DataFrame(product_overlap, columns = cols, index = inds).plot.bar(ax = axes[0, 1], 
                                                                    stacked = True)
axes[0, 1].set_title('Week over week Products overlap')
pd.DataFrame(agency_overlap, columns = cols, index = inds).plot.bar(ax = axes[1, 0], 
                                                                    stacked = True)
axes[1, 0].set_title('Week over week Agencies overlap')
pd.DataFrame(route_overlap, columns = cols, index = inds).plot.bar(ax = axes[1, 1], 
                                                                    stacked = True)
axes[1, 1].set_title('Week over week Routes overlap');
plt.tight_layout();

del weekly_clients, weekly_products, weekly_agencies, weekly_routes, 
del client_overlap, product_overlap, agency_overlap, route_overlap
import gc
gc.collect()

Clients, Products and Routes show a 5% to 10% variation week over week. Agencies are pretty consistent.

Weekly demand distributions

There are five sales related fields: Units_sold, Earnings, Units_returned, Money_returned and Adj_demand. Earnings and Money_returned are money equivalents of Units_sold and Units_returned, thereby making them redundant data.

We saw earlier that 'returns' transactions are very few (about 3% of the dataset). Hence, their impact on the overall units sold (Adj_demand = Units_sold - Units_returned) is insignificant. And as a consequence, Adj_demand is nearly identical to Units_sold.

We will only look at the distribution of Adj_demand. Visualizing products sales data of 74 million transactions, 1800 products and 0.88 million customers may get complicated. To keep it simple I will randomly pick 6 products, 5 clients within each of these products and show how their demands vary week over week. This is similar to what we did with weekly order analysis.

In [139]:
np.random.seed(127)
rand_products = np.random.choice(train_data.Product_id, 6, replace = True)
rand_product_trx = train_data[train_data.Product_id.isin(rand_products)].copy()
rand_product_trx['Product_id'] = LabelEncoder().fit_transform(rand_product_trx.Product_id)

fig, axes = plt.subplots(3,2)
fig.set_size_inches(10, 12)
for i in range(6):
    recs = rand_product_trx[rand_product_trx.Product_id == i].copy()
    clnts = np.random.choice(recs.Client_id, 5, replace = True)
    recs = recs[recs.Client_id.isin(clnts)].copy()
    recs['Client_id'] = LabelEncoder().fit_transform(recs.Client_id)
    t = recs.groupby(['Week', 'Client_id']).apply(lambda x: x.Adj_demand.values[0]).unstack(level = 1).fillna(value = 0)
    t.plot(marker='o', linestyle=':', ax=axes[i/2,i%2])
    axes[i/2,i%2].set_title('Product - {0}'.format(i))
plt.tight_layout()

Each plot tells a different story. Most clients don't show any consistent patterns. But there are some that do have some patterns, for example Client-0 in Product-0, Client-0 in Product-2, Client-1 in Product-3 and Client-2 in Product-4.

It will be interesting to see how succesful the machine learning models will be in dealing with these variations.

I will end this post here and start developing Machine Learning models for this dataset in this follow-up post: Grupo Bimbo Demand Prediction - Part2. Please continue reading.

- Vinay Huddar