This post is the second and final part of the TalkingData Mobile User Demographics two-part series. The first part can be found here: TalkingData Mobile User Demographics - Part 1.

In this post I will describe the following three models and compare their performance using the evaluation metric the competition used for scoring (multi-class log loss; the formula follows the list).

  1. Multi-Class LogisticRegression Classifier
  2. Ensemble of XGBoost models
  3. A Neural Network solution
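
For reference, the metric is the multi-class logarithmic loss, the same quantity that sklearn's log_loss computes and that we will use for cross-validation later in this post:

    logloss = -(1/N) * sum_i sum_j y_ij * log(p_ij)

where N is the number of devices, y_ij is 1 if device i belongs to demographic group j (0 otherwise), and p_ij is the predicted probability of that assignment.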

There is some groundwork to be done before we can start on the models. We cannot feed the data downloaded from the competition website directly into the models, nor can we use the preprocessed data we prepared in the previous post, because neither is encoded yet.

The data provided is categorical in nature: gender, age group, phone brand, device model, and app categories are all categorical. The way to learn from categorical data is to convert it into equivalent one-hot-encoded features and pass those through one or more of the aforementioned models.
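
As a toy illustration (made-up data, not from the competition), one-hot encoding turns a single categorical column into one binary column per category:

import pandas as pd

# Three devices with a made-up categorical 'brand' column
toy = pd.DataFrame({'brand': ['xiaomi', 'samsung', 'xiaomi']})
print(pd.get_dummies(toy.brand))
# Two columns (samsung, xiaomi); each row has exactly one non-zero entry

In this post we will use sklearn's encoders and sparse matrices rather than get_dummies, because our full feature space runs to tens of thousands of columns.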

In [2]:
import numpy as np
import pandas as pd
import time
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from scipy.sparse import csr_matrix, hstack
import xgboost as xgb
In [4]:
# Get CSVs into memory and perform basic preprocessing
demo_train = pd.read_csv('../Data/gender_age_train.csv', index_col='device_id')
demo_test = pd.read_csv('../Data/gender_age_test.csv', index_col='device_id')
brand_model = pd.read_csv('../Data/phone_brand_device_model.csv')

We will now encode the following categorical features and create one-hot-encoded features from them.

  1. Phone Brands
  2. Device Models
  3. Apps
  4. App Categories

The one-hot-encoded feature matrix creation and the logistic regression part of the solution are inspired by Dune Dweller's solution.

Phone Brands and Device Models

We will do the following to create one-hot-encoded vectors for phone brands and device models.

  • Use sklearn's LabelEncoder to encode the original feature
  • Merge the encoded feature with demographic data
  • Use sklearn's OneHotEncoder to create the one-hot-encoded vectors
In [48]:
# Remove duplicate devices from the phone table
brand_model.drop_duplicates(subset='device_id', inplace=True)
brand_model.set_index(keys='device_id', drop=True, inplace=True)

# Label-encode brands and models. Model names are not unique across brands,
# so prefix each model with its brand before label-encoding.
brand_model['brand_enc'] = LabelEncoder().fit_transform(brand_model.phone_brand)
brand_model['model_enc'] = brand_model.phone_brand.str.cat(brand_model.device_model)
brand_model['model_enc'] = LabelEncoder().fit_transform(brand_model.model_enc)

# Merge encoded brands and models into demo_train and demo_test;
# pandas aligns the rows on the shared device_id index
demo_train['brand_enc'] = brand_model.brand_enc
demo_train['model_enc'] = brand_model.model_enc
demo_test['brand_enc'] = brand_model.brand_enc
demo_test['model_enc'] = brand_model.model_enc

print('demo_train')
print(demo_train.head(3))

print('\ndemo_test')
print(demo_test.head(3))
demo_train
                     gender  age   group  brand_enc  model_enc
device_id                                                     
-8076087639492063270      M   35  M32-38         51        843
-2897161552818060146      M   35  M32-38         51        843
-8260683887967679142      M   35  M32-38         51        843

demo_test
                      brand_enc  model_enc
device_id                                 
 1002079943728939269         51        857
-1547860181818787117         51        860
 7374582448058474277         31        717
In [151]:
# One-hot-encode brand and model features. Fit the encoder on train and test
# combined so both matrices share the same feature columns.
demo_data = pd.concat([demo_train, demo_test])
enc = OneHotEncoder(handle_unknown='ignore').fit(demo_data[['brand_enc', 'model_enc']])

Xtrain_brand_model_csr = enc.transform(demo_train[['brand_enc', 'model_enc']])
Xtest_brand_model_csr = enc.transform(demo_test[['brand_enc', 'model_enc']])

print('Shape of one-hot-encoded brands and models of train data: {0}'.format(Xtrain_brand_model_csr.shape))
print('Shape of one-hot-encoded brands and models of test data: {0}'.format(Xtest_brand_model_csr.shape))
Shape of one-hot-encoded brands and models of train data: (74645, 1798)
Shape of one-hot-encoded brands and models of test data: (112071, 1798)
In [125]:
events = pd.read_csv('../Data/events.csv', index_col = 'event_id', parse_dates = ['timestamp'])
print('events.shape: {0}'.format(events.shape))
print(events.head())

app_events = pd.read_csv('../Data/app_events.csv', usecols=['event_id', 'app_id', 'is_active'],
                         dtype={'event_id': np.int64, 'app_id': np.int64, 'is_active': np.int8})
print('\napp_events.shape: {0}'.format(app_events.shape))
print(app_events.head())
events.shape: (3252950, 4)
                    device_id           timestamp  longitude  latitude
event_id                                                              
1           29182687948017175 2016-05-01 00:55:25     121.38     31.24
2        -6401643145415154744 2016-05-01 00:54:12     103.65     30.97
3        -4833982096941402721 2016-05-01 00:08:05     106.60     29.70
4        -6815121365017318426 2016-05-01 00:06:40     104.27     23.28
5        -5373797595892518570 2016-05-01 00:07:18     115.88     28.66

app_events.shape: (32473067, 3)
   event_id               app_id  is_active
0         2  5927333115845830913          1
1         2 -5720078949152207372          0
2         2 -1633887856876571208          0
3         2  -653184325010919369          1
4         2  8693964245073640147          1
In [130]:
# Label-encode the app ids
app_encoder = LabelEncoder().fit(app_events.app_id)
app_events['app_enc'] = app_encoder.transform(app_events.app_id)
napps = len(app_encoder.classes_)

# Attach apps to devices through events, then keep one row per (device, app) pair
device_apps = events[['device_id']].merge(app_events[['event_id', 'app_enc']],
                                          how='left', left_index=True, right_on='event_id')
device_apps = device_apps.groupby(['device_id', 'app_enc']).app_enc.agg(['size']).reset_index()
device_apps = device_apps.set_index(['device_id'])[['app_enc']]

# These row ids will be required for creating csr_matrices; assigning them to
# device_apps aligns rows on the shared device_id index
demo_train['train_row'] = np.arange(demo_train.shape[0])
demo_test['test_row'] = np.arange(demo_test.shape[0])
device_apps['train_row'] = demo_train['train_row']
device_apps['test_row'] = demo_test['test_row']

print(device_apps.shape)
device_apps.head()
(2369025, 3)
Out[130]:
                      app_enc  train_row  test_row
device_id
-9222956879900151005      548      21594       NaN
-9222956879900151005     1096      21594       NaN
-9222956879900151005     1248      21594       NaN
-9222956879900151005     1545      21594       NaN
-9222956879900151005     1664      21594       NaN

Now we are ready to create the one-hot-encoded feature matrix for the 'app' feature. We will do this for both train and test.

In [141]:
# Each (row id, encoded app id) pair becomes a 1 in the sparse matrix
d = device_apps.dropna(subset=['train_row'])
Xtrain_app_csr = csr_matrix((np.ones(d.shape[0]), (d.train_row, d.app_enc)),
                            shape=(demo_train.shape[0], napps))

d = device_apps.dropna(subset=['test_row'])
Xtest_app_csr = csr_matrix((np.ones(d.shape[0]), (d.test_row, d.app_enc)),
                           shape=(demo_test.shape[0], napps))

print('Apps data: train shape {}, test shape {}'.format(Xtrain_app_csr.shape, Xtest_app_csr.shape))
Apps data: train shape (74645, 19237), test shape (112071, 19237)
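
If the csr_matrix((data, (row, col))) constructor looks opaque: it places data[k] at position (row[k], col[k]), so a vector of ones combined with (device row, app column) pairs is exactly a multi-hot encoding. A tiny made-up illustration:

from scipy.sparse import csr_matrix
import numpy as np

# Two devices, three apps: device 0 has apps 0 and 2, device 1 has app 1
rows = np.array([0, 0, 1])
cols = np.array([0, 2, 1])
m = csr_matrix((np.ones(3), (rows, cols)), shape=(2, 3))
print(m.toarray())
# [[1. 0. 1.]
#  [0. 1. 0.]]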

Let's apply the same set of transformations to create the one-hot-encoded feature matrix for the 'label' (app category) feature.

In [142]:
app_labels = pd.read_csv('../Data/app_labels.csv', dtype={'app_id': np.int64, 'label_id': np.int16})
# Keep only apps that appear in app_events, so app_encoder can transform them
app_labels = app_labels.loc[app_labels.app_id.isin(app_events.app_id.unique())]
app_labels['app_enc'] = app_encoder.transform(app_labels.app_id)

label_encoder = LabelEncoder().fit(app_labels.label_id)
app_labels['label_enc'] = label_encoder.transform(app_labels.label_id)
nlabels = len(label_encoder.classes_)

# Map devices to labels through their apps, then keep one row per (device, label) pair
device_labels = device_apps[['app_enc']].reset_index().merge(app_labels[['app_enc', 'label_enc']])
device_labels = device_labels.groupby(['device_id', 'label_enc'])['app_enc'].agg(['size']).reset_index()
device_labels = device_labels.set_index('device_id')[['label_enc']]
device_labels['train_row'] = demo_train['train_row']
device_labels['test_row'] = demo_test['test_row']

device_labels.head()
Out[142]:
                      label_enc  train_row  test_row
device_id
-9222956879900151005        117      21594       NaN
-9222956879900151005        120      21594       NaN
-9222956879900151005        126      21594       NaN
-9222956879900151005        138      21594       NaN
-9222956879900151005        147      21594       NaN
In [148]:
d = device_labels.dropna(subset=['train_row'])
Xtrain_label_csr = csr_matrix((np.ones(d.shape[0]), (d.train_row, d.label_enc)),
                              shape=(demo_train.shape[0], nlabels))

d = device_labels.dropna(subset=['test_row'])
Xtest_label_csr = csr_matrix((np.ones(d.shape[0]), (d.test_row, d.label_enc)),
                             shape=(demo_test.shape[0], nlabels))

print('Labels data: train shape {}, test shape {}'.format(Xtrain_label_csr.shape, Xtest_label_csr.shape))
Labels data: train shape (74645, 492), test shape (112071, 492)

We are now ready to concatenate all the one-hot-encoded features. This is the data that we will be feeding into the learning algorithms.

In [152]:
Xtrain = hstack((Xtrain_brand_model_csr, Xtrain_app_csr, Xtrain_label_csr), format='csr')
Xtest =  hstack((Xtest_brand_model_csr, Xtest_app_csr, Xtest_label_csr), format='csr')

print('All features: train shape {}, test shape {}'.format(Xtrain.shape, Xtest.shape))
All features: train shape (74645, 21527), test shape (112071, 21527)
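
As a quick sanity check, the column counts add up: 1798 + 19237 + 492 = 21527, so no feature columns were lost in the horizontal stack.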

We are now ready for model selection and tuning. Let's begin with a linear solution: a logistic regression model. We will use cross-validation to find the regularization value that gives the best score and then predict on the test data.

Logistic Regression Solution

As I mentioned earlier, this solution is taken from Dune Dweller's script.

The target, demo_train.group, is categorical. Let's encode it the same way we did the predictors.

In [154]:
target_encoder = LabelEncoder().fit(demo_train.group)
y = target_encoder.transform(demo_train.group)
nclasses = len(target_encoder.classes_)
In [165]:
def score(clf, random_state=0):
    # 5-fold stratified CV: collect out-of-fold predicted probabilities,
    # then score them with multi-class log loss
    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)
    pred = np.zeros((y.shape[0], nclasses))
    for itrain, itest in kf.split(Xtrain, y):
        Xtr, Xte = Xtrain[itrain, :], Xtrain[itest, :]
        ytr = y[itrain]
        clf.fit(Xtr, ytr)
        pred[itest, :] = clf.predict_proba(Xte)
    return log_loss(y, pred)
In [166]:
Cs = np.logspace(-3,0,4)
res = []
for C in Cs:
    res.append(score(LogisticRegression(C = C)))
plt.semilogx(Cs, res,'-o');
print(res)
[2.3454585286199126, 2.2860489534757469, 2.2947290278037582, 2.4322723671037414]

Judging by the plot, the best value of C lies somewhere between 0.01 and 0.1.
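
If we wanted to pin C down further, a finer sweep over that interval would look like the sketch below (hypothetical, not run here):

# Hypothetical finer sweep between 0.01 and 0.1
Cs_fine = np.logspace(-2, -1, 5)
res_fine = [score(LogisticRegression(C=C)) for C in Cs_fine]
plt.semilogx(Cs_fine, res_fine, '-o')

In practice, C = 0.02 already scores well: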

In [167]:
score(LogisticRegression(C=0.02))
Out[167]:
2.2807562510743371

Predict on test data

In [177]:
# Retrain on the full training set with the chosen C and write the submission
clf = LogisticRegression(C=0.02)
clf.fit(Xtrain, y)
pred = pd.DataFrame(clf.predict_proba(Xtest), index=demo_test.index, columns=target_encoder.classes_)
pred.to_csv('logreg_subm.csv', index=True)

The submission scored 2.3466. Not a great score, but a baseline nonetheless. Linear models generally serve as a good baseline for more complex non-linear models.

XGBoost Solution

Let us now feed the feature matrix that we used for training the LogisticRegression model to an XGBoost model and compare its performance against the linear model.

In [176]:
# Create an XGBoost multi-class classifier: default learning rate (0.1),
# 80% row subsampling, 90% column subsampling, 500 trees of depth 5
xgb_model = xgb.XGBClassifier(max_depth=5, subsample=0.8, objective='multi:softprob',
                              colsample_bytree=0.9, min_child_weight=0,
                              n_estimators=500, seed=17)

# Fit the model
xgb_model.fit(Xtrain, y)

# Predict on test data
preds = xgb_model.predict_proba(Xtest)

preds_df = pd.DataFrame(preds, columns=target_encoder.classes_)
preds_df['device_id'] = demo_test.reset_index()['device_id']
preds_df.set_index('device_id', inplace=True)
preds_df.to_csv('xgboost_preds.csv', index = True)

The submission scored 2.2698. A pretty good improvement over the linear model.

There are more things we could try: model ensembling, feature engineering, a neural network solution, and so on.
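
As a taste of the ensembling idea, here is a minimal seed-averaging sketch (illustrative, not the exact ensemble I used; the seed list is arbitrary): train the same XGBoost configuration several times with different random seeds and average the predicted probabilities, which reduces the variance of the individual models.

# Seed-averaging sketch: arbitrary seeds, same configuration as above
seeds = [17, 42, 99]
preds_sum = np.zeros((Xtest.shape[0], nclasses))
for seed in seeds:
    model = xgb.XGBClassifier(max_depth=5, subsample=0.8, objective='multi:softprob',
                              colsample_bytree=0.9, min_child_weight=0,
                              n_estimators=500, seed=seed)
    model.fit(Xtrain, y)
    preds_sum += model.predict_proba(Xtest)

ensemble_preds = pd.DataFrame(preds_sum / len(seeds),
                              index=demo_test.index,
                              columns=target_encoder.classes_)
ensemble_preds.to_csv('xgboost_ensemble_preds.csv', index=True)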

In fact, the solution that got me my best score was a neural network implemented with Keras. I am not describing it in this post for two reasons: first, I couldn't manage to get Keras working with Jupyter, and second, the post would become too long and invite a TL;DR.

Thanks for reading. I hope you found the post interesting and that it was helpful to you in some way. If so, you may find my other posts in the 'Kaggle Series' worth reading too. Following are the other posts in the series.

- Vinay Huddar