User Segmentation is an important requirement for Targeted Digital Marketing and Customized User Experience. While a lot of new ways of Customer Segmentation are being used lately, thanks to the explosion of online social media, the traditional way of segmenting users based on Demographics is still very relevant. Knowing just the gender and age/age-group of a person can tell a lot about their interests, inclinations and their propensity to buy certain kinds of products.

Not all users are equally willing to share their gender and age/age-group. Knowing the value of this kind of data, companies try various ways of getting their hands on it. It is generally not a good idea to make it mandatory for users to provide this data at the time of registration or sign-up, since that carries the risk of losing a potential user/customer. So, companies resort to non-invasive ways, like predicting the demographic data of users who have not provided this information by matching their usage patterns with those of users who have.

This was exactly the challenge of the competition titled TalkingData Mobile User Demographics. TalkingData is a major Chinese mobile app/game analytics company with its user count in the millions. Given demographic and app download/usage data for some of their users, the challenge was to develop a model that can best predict the gender and age-group of their other users. You may learn more about the competition from the Kaggle website here: TalkingData Mobile User Demographics

In this post I will share some interesting observations/insights from the data and describe my solution. I will follow the systematic approach of a five-stage Data Science Process that I describe (with a case study) in the following two posts:

Let's start with Data Collection

Data Collection

Like most competitions in Kaggle, the data was provided in a set of CSV files. The following information has been extracted from the competition's data page.

"The Data is collected from TalkingData SDK integrated within mobile apps TalkingData serves under the service term between TalkingData and mobile app developers...

The data schema can be represented in the following chart:"

File descriptions

gender_age_train.csv, gender_age_test.csv - the training and test set
- group: this is the target variable you are going to predict
events.csv, app_events.csv - when a user uses TalkingData SDK, the event gets logged in this data. Each event has an event id, location (lat/long), and the event corresponds to a list of apps in app_events.
- timestamp: when the user is using an app with TalkingData SDK
app_labels.csv - apps and their labels, the label_id's can be used to join with label_categories
label_categories.csv - apps' labels and their categories in text
phone_brand_device_model.csv - device ids, brand, and models
- phone_brand: note that the brands are in Chinese: 三星 samsung, 天语 Ktouch etc
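
To make the relationships between these files concrete, here is a minimal sketch of how the tables chain together through their shared keys. The tiny DataFrames below are made up for illustration (they are not rows from the competition data); only the column names follow the descriptions above.

```python
import pandas as pd

# Tiny made-up tables that mimic the competition schema (values are illustrative only)
demo = pd.DataFrame({'device_id': [1, 2], 'group': ['M32-38', 'F23-']})
events = pd.DataFrame({'event_id': [10, 11, 12], 'device_id': [1, 1, 2]})
app_events = pd.DataFrame({'event_id': [10, 11, 12], 'app_id': [100, 101, 100]})
app_labels = pd.DataFrame({'app_id': [100, 101], 'label_id': [2, 3]})
label_categories = pd.DataFrame({'label_id': [2, 3],
                                 'category': ['game-game type', 'game-Game themes']})

# Join path: device -> events -> apps -> labels -> categories
joined = (demo.merge(events, on='device_id')
              .merge(app_events, on='event_id')
              .merge(app_labels, on='app_id')
              .merge(label_categories, on='label_id'))

print(joined[['device_id', 'group', 'app_id', 'category']])
```

On the real files a join like this can become very large (app_events alone has tens of millions of rows), so it is usually worth aggregating before merging.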

Data Preprocessing

We will take a look at the data one table at a time and look for any preprocessing requirements, like datatype conversions, unit conversions, missing data imputation etc.

Let's import some required libraries.

In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import StandardScaler

import seaborn as sns

gender_age_train.csv and gender_age_test.csv

In [35]:

demo_train = pd.read_csv('../Data/gender_age_train.csv')
demo_test = pd.read_csv('../Data/gender_age_test.csv')

print('\nShape of gender_age_train.csv: {0}'.format(demo_train.shape))
print('Shape of gender_age_test.csv: {0}\n'.format(demo_test.shape))

print('Sample data of gender_age_train.csv:\n{0}'.format(demo_train.head(3)))
print('\nSample data of gender_age_test.csv:\n{0}'.format(demo_test.head(3)))

print('\nDatatypes of features...\n{0}'.format(demo_train.dtypes))

Shape of gender_age_train.csv: (74645, 4)
Shape of gender_age_test.csv: (112071, 1)

Sample data of gender_age_train.csv:
             device_id gender  age   group
0 -8076087639492063270      M   35  M32-38
1 -2897161552818060146      M   35  M32-38
2 -8260683887967679142      M   35  M32-38

Sample data of gender_age_test.csv:
             device_id
0  1002079943728939269
1 -1547860181818787117
2  7374582448058474277

Datatypes of features...
device_id     int64
gender       object
age           int64
group        object
dtype: object

There are about 75K training samples with device_id, gender, age and group data. The group column is formed from gender and age. There are about 112K test samples with only device_id. The model that we develop has to predict groups for these test samples.

The ratio of train:test sample size in this dataset is 40:60. It is not so common to have more test samples than train samples as it is in this dataset.

gender and group are string types. We will be encoding them to their numerical equivalents when we create machine learning models. For now, we will keep them in their current form for visualising and understanding purposes.
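
As a rough sketch of what that encoding could look like, scikit-learn's LabelEncoder maps each distinct string to an integer and can map it back again. The sample values below are illustrative, and LabelEncoder is just one reasonable choice, not necessarily the one used in the modeling post.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative group labels (not taken from the real file)
groups = pd.Series(['M32-38', 'F23-', 'M32-38', 'F43+'])

encoder = LabelEncoder()
encoded = encoder.fit_transform(groups)       # strings -> integer class ids
decoded = encoder.inverse_transform(encoded)  # integer class ids -> strings

print(encoder.classes_)  # the classes, in sorted order
print(encoded)
```

Keeping the fitted encoder around matters because the predictions have to be mapped back to the original group names for the submission.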

phone_brand_device_model

In [36]:

phone_brand = pd.read_csv('../Data/phone_brand_device_model.csv')

print('\nShape of phone_brand_device_model.csv: {0}\n'.format(phone_brand.shape))
print('Sample data of phone_brand_device_model.csv:\n{0}'.format(phone_brand.head(3)))
print('\nDatatypes of features...\n{0}'.format(phone_brand.dtypes))

Shape of phone_brand_device_model.csv: (187245, 3)

Sample data of phone_brand_device_model.csv:
             device_id phone_brand device_model
0 -8890648629457979026          小米           红米
1  1277779817574759137          小米         MI 2
2  5137427614288105724          三星    Galaxy S4

Datatypes of features...
device_id        int64
phone_brand     object
device_model    object
dtype: object

This table has brand and model data. The names of many brands and devices are in Chinese. This should not be a concern as far as model training is concerned, since these will be encoded to their numerical equivalents. However, it will be helpful to know their English names so we can get a peek into the Chinese mobile market.

One of the participants had shared an English version of phone_brand_device_model.csv. We will use it for EDA.

In [38]:

phone_brand_english = pd.read_csv('../Data/english_phone_brand_device_model.csv')

print('\nShape of phone_brand_device_model.csv: {0}\n'.format(phone_brand_english.shape))
print('Sample data of phone_brand_device_model.csv:\n{0}'.format(phone_brand_english.head(3)))

Shape of phone_brand_device_model.csv: (187245, 3)

Sample data of phone_brand_device_model.csv:
             device_id phone_brand device_model
0 -8890648629457979026      xiaomi           ??
1  1277779817574759137      xiaomi         MI 2
2  5137427614288105724     samsung    Galaxy S4

events.csv

In [40]:

events = pd.read_csv('../Data/events.csv')
print('\nShape of events.csv: {0}\n'.format(events.shape))
print('Sample data of events.csv:\n{0}'.format(events.head(3)))
print('\nDatatypes of features...\n{0}'.format(events.dtypes))

Shape of events.csv: (3252950, 5)

Sample data of events.csv:
   event_id            device_id            timestamp  longitude  latitude
0         1    29182687948017175  2016-05-01 00:55:25     121.38     31.24
1         2 -6401643145415154744  2016-05-01 00:54:12     103.65     30.97
2         3 -4833982096941402721  2016-05-01 00:08:05     106.60     29.70

Datatypes of features...
event_id       int64
device_id      int64
timestamp     object
longitude    float64
latitude     float64
dtype: object

This table has time and position information. device_id is provided for merging it with gender_age_train/gender_age_test data.

timestamp is of string type. Let's convert it to datetime64 type.

In [41]:

events['timestamp'] = pd.to_datetime(events.timestamp)
print('Datatype of timestamp: {0}'.format(events.timestamp.dtype))

Datatype of timestamp: datetime64[ns]

app_events.csv

In [21]:

app_events = pd.read_csv('../Data/app_events.csv')
print('\nShape of app_events.csv: {0}\n'.format(app_events.shape))
print('Sample data of app_events.csv:\n{0}'.format(app_events.head()))
print('\nDatatypes of app_events...\n{0}'.format(app_events.dtypes))

Shape of app_events.csv: (32473067, 4)

Sample data of app_events.csv:
   event_id               app_id  is_installed  is_active
0         2  5927333115845830913             1          1
1         2 -5720078949152207372             1          0
2         2 -1633887856876571208             1          0
3         2  -653184325010919369             1          1
4         2  8693964245073640147             1          1

Datatypes of app_events...
event_id        int64
app_id          int64
is_installed    int64
is_active       int64
dtype: object

It looks like is_installed and is_active are boolean even though they are of type int64. Let's check this, and if they are indeed boolean we will convert their types to int8.

In [22]:

print('is_installed unique values and their counts...\n{0}\n'.format(app_events.is_installed.value_counts()))
print('is_active unique values and their counts...\n{0}'.format(app_events.is_active.value_counts()))

is_installed unique values and their counts...
1    32473067
Name: is_installed, dtype: int64

is_active unique values and their counts...
0    19740071
1    12732996
Name: is_active, dtype: int64

Hmm...is_installed has a single value in it (=1). Let's remove it, since a constant column doesn't add any value. And is_active is indeed boolean. We will convert it to int8 type.

In [25]:

app_events.drop(['is_installed'], axis = 1, inplace = True)
app_events['is_active'] = app_events['is_active'].astype(np.int8)

print('\nShape of app_events.csv: {0}\n'.format(app_events.shape))
print('Sample data of app_events.csv:\n{0}'.format(app_events.head(3)))
print('\nDatatypes of app_events...\n{0}'.format(app_events.dtypes))

Shape of app_events.csv: (32473067, 3)

Sample data of app_events.csv:
   event_id               app_id  is_active
0         2  5927333115845830913          1
1         2 -5720078949152207372          0
2         2 -1633887856876571208          0

Datatypes of app_events...
event_id     int64
app_id       int64
is_active     int8
dtype: object

app_labels.csv

In [28]:

app_labels = pd.read_csv('../Data/app_labels.csv')
print('\nShape of app_labels.csv: {0}\n'.format(app_labels.shape))
print('Sample data of app_labels.csv:\n{0}'.format(app_labels.head(3)))
print('\nDatatypes of app_labels...\n{0}\n'.format(app_labels.dtypes))

Shape of app_labels.csv: (459943, 2)

Sample data of app_labels.csv:
                app_id  label_id
0  7324884708820027918       251
1 -4494216993218550286       251
2  6058196446775239644       406

Datatypes of app_labels...
app_id      int64
label_id    int64
dtype: object

Let's check the range of label_id and convert it to a smaller datatype if the range is small.

In [31]:

print('label_id.min: {0}\nlabel_id.max: {1}'.format(app_labels.label_id.min(), app_labels.label_id.max()))

label_id.min: 2
label_id.max: 1021

int16 is good enough for this range. Let's change the type.

In [33]:

app_labels['label_id'] = app_labels['label_id'].astype(np.int16)
print('Datatype of label_id: {0}'.format(app_labels.label_id.dtype))

Datatype of label_id: int16
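
The choice of int16 can be double-checked against the limits NumPy reports for each integer type. The helper below, smallest_int_dtype, is a hypothetical name introduced only for this sketch; it is not part of pandas or NumPy.

```python
import numpy as np

def smallest_int_dtype(min_value, max_value):
    """Return the narrowest signed integer dtype that can hold [min_value, max_value]."""
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        if info.min <= min_value and max_value <= info.max:
            return dtype
    raise ValueError('range does not fit in a 64-bit signed integer')

print(smallest_int_dtype(2, 1021))  # label_id range reported above
print(smallest_int_dtype(0, 1))     # is_active only takes the values 0 and 1
```

int8 tops out at 127, so it cannot hold 1021, while int16 goes up to 32767.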

label_categories.csv

In [29]:

label_categories = pd.read_csv('../Data/label_categories.csv')
print('\nShape of label_categories.csv: {0}\n'.format(label_categories.shape))
print('Sample data of label_categories.csv:\n{0}'.format(label_categories.head(3)))
print('\nDatatypes of label_categories...\n{0}\n'.format(label_categories.dtypes))

Shape of label_categories.csv: (930, 2)

Sample data of label_categories.csv:
   label_id          category
0         1               NaN
1         2    game-game type
2         3  game-Game themes

Datatypes of label_categories...
label_id     int64
category    object
dtype: object

Let's convert label_id to int16 type as we did in the previous table. Looks like there are some missing values in 'category' column. We can impute them if and when needed. For now we will let them be.

In [34]:

label_categories['label_id'] = label_categories['label_id'].astype(np.int16)
print('Datatype of label_id: {0}'.format(label_categories.label_id.dtype))

Datatype of label_id: int16
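
If the missing categories ever need handling, a minimal sketch could look like the following. The three rows mirror the sample shown above, and the 'unknown' placeholder is purely an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Small stand-in for label_categories with one missing category
label_categories_demo = pd.DataFrame({
    'label_id': [1, 2, 3],
    'category': [np.nan, 'game-game type', 'game-Game themes'],
})

# Count the missing values, then fill them with a placeholder string
n_missing = label_categories_demo['category'].isna().sum()
filled = label_categories_demo['category'].fillna('unknown')

print('Missing categories: {0}'.format(n_missing))
print(filled)
```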

Data Preprocessing Summary

There wasn't much preprocessing required for this dataset, since it is pretty simple and well formed. The few things that we did are...

events: converted the timestamp column from string type to datetime64 type
app_events: removed the constant column is_installed and converted is_active from int64 to int8 type
app_labels: converted label_id from int64 to int16
label_categories: converted label_id from int64 to int16

Let's move on to the more interesting part - Exploratory Data Analysis

Exploratory Data Analysis

Let's try to glean from the dataset as much information/insight as we can about the different user groups before we pass it through the machine learning models. We can try to learn things like phone and brand preferences, app download/usage patterns etc.

Let's begin by understanding the distribution of train-set samples across genders and across different user groups.

Visualizing sample distribution across ages, genders and age-groups

Let's begin by answering this question: What is the proportion of male and female users in the training dataset?

In [73]:

plt.figure().set_size_inches(8, 1)
demo_train.gender.value_counts().plot.barh()
plt.title('Sample distribution across genders')
plt.xlabel('Sample Count')
plt.ylabel('Gender');

The answer is 64 : 36, i.e. the training dataset has 64% samples of male users and 36% samples of female users.
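
The same ratio can be read off numerically instead of from the chart. Here is a small sketch on a made-up Series with the same 64:36 split; on the real data the Series would be demo_train.gender.

```python
import pandas as pd

# Made-up gender column with a 64:36 split, standing in for demo_train.gender
gender = pd.Series(['M'] * 64 + ['F'] * 36)

# normalize=True returns fractions instead of raw counts
share = gender.value_counts(normalize=True) * 100
print(share.round(1))
```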

The next question would be: How is the sample distribution across demographic groups?

In [94]:

plt.figure().set_size_inches(8, 4)
demo_train.groupby('group').group.count().plot.bar()
plt.title('Sample distribution across demographic groups')
plt.xlabel('Group')
plt.ylabel('Sample Count');

The heights of the bars reflect the skewed male:female sample ratio. Setting aside this difference, the distributions of male and female groups look similar. There are, however, some small differences: F23- is bigger than F24-26 whereas M22- is smaller than M23-26, and F43+ is smaller than F29-32 whereas M39+ is bigger than M29-31.

Let us now explore the distribution across ages within each gender.

In [69]:

fig, axes = plt.subplots(1, 2)
fig.set_size_inches(12, 4)
demo_train[demo_train.gender == 'F'].age.hist(bins = 50, ax = axes[0])
axes[0].set(title = 'Sample Distribution, Gender = Female', xlabel = 'age', ylabel = 'sample count')
demo_train[demo_train.gender == 'M'].age.hist(bins = 50, ax = axes[1])
axes[1].set(title = 'Sample Distribution, Gender = Male', xlabel = 'age', ylabel = 'sample count');

The distributions of samples across ages are similar for both genders. Note that the y-axis scales are different in the two plots.

Visualizing Phone Brands Distribution

Let's take a look at the distribution of phone brands. We can first look at the full distribution and then look at its distribution in train and test samples separately. This will tell us if the distribution of samples between train and test is even or skewed.

Before we look at the plots let's find out how many brands and models are included in the dataset.

In [75]:

phone_brand_english.apply(lambda x: x.nunique())

Out[75]:

device_id       186716
phone_brand         81
device_model      1559
dtype: int64

So, the dataset has 81 brands and these brands have 1559 models among them.
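
One detail worth noting in these counts: the table has 187245 rows but only 186716 unique device_id values, so 529 rows repeat a device_id seen earlier. A left merge on device_id would repeat the matching rows, so it may be worth checking for duplicates first. Below is a minimal sketch on made-up rows; whether the real repeats carry identical brand and model values is not checked here, and keeping the first occurrence is just one possible policy.

```python
import pandas as pd

# Made-up table with one repeated device_id
phone = pd.DataFrame({
    'device_id': [1, 2, 2, 3],
    'phone_brand': ['xiaomi', 'samsung', 'samsung', 'huawei'],
    'device_model': ['MI 2', 'Galaxy S4', 'Galaxy S4', 'Mate 7'],
})

# Count rows whose device_id already appeared earlier in the table
n_dup = phone.duplicated('device_id').sum()

# Keep a single row per device (here: the first one seen)
phone_unique = phone.drop_duplicates('device_id', keep='first')

print('Repeated device_id rows: {0}'.format(n_dup))
print(phone_unique)
```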

In [97]:

plt.figure().set_size_inches(12, 5)
phone_brand_english.groupby('phone_brand').phone_brand.count().plot.bar()
plt.title('Phone Brand Distribution')
plt.xlabel('Brands')
plt.ylabel('Device Count');

We can see the market leaders clearly - Xiaomi, Samsung, Huawei. And then there are the followers - OPPO, Vivo, Meizu, Coolpad and Lenovo. Most of the sample space is made up of these brands.

Now, let's check the distribution across train and test samples.

In [207]:

demo_phone_train = demo_train.merge(phone_brand_english, on = 'device_id', how = 'left')
demo_phone_test = demo_test.merge(phone_brand_english, on = 'device_id', how = 'left')

plt.figure().set_size_inches(11, 5)

demo_phone_train.groupby('phone_brand').phone_brand.count().plot(label = 'train')
demo_phone_test.groupby('phone_brand').phone_brand.count().plot(label = 'test')
plt.title('Phone Brand Distribution in Train and Test Samples')
plt.xlabel('Brands')
plt.ylabel('Device Count');
plt.legend();

Ok, the distribution of phone brands between train and test samples is almost identical and mirrors the overall trend.
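
Since the train and test sets differ in size, comparing brand shares rather than raw counts gives a more direct check of this. The following is a rough sketch on made-up brand columns; on the real data the inputs would be the phone_brand columns of demo_phone_train and demo_phone_test.

```python
import pandas as pd

# Made-up brand columns standing in for the train and test merges
train_brands = pd.Series(['xiaomi', 'xiaomi', 'samsung', 'huawei'])
test_brands = pd.Series(['xiaomi', 'samsung', 'samsung', 'huawei', 'huawei', 'xiaomi'])

# Share of each brand within its own sample, aligned on the brand name
shares = pd.DataFrame({
    'train': train_brands.value_counts(normalize=True),
    'test': test_brands.value_counts(normalize=True),
}).fillna(0)

# Largest gaps first
shares['abs_diff'] = (shares['train'] - shares['test']).abs()
print(shares.sort_values('abs_diff', ascending=False))
```

A brand that appears in only one of the two samples shows up with a share of 0 in the other, thanks to fillna(0).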

Now, let's take a look at the distribution of phone brands across genders and age-groups. We can do this with only train samples since age and gender info is provided only in train samples.

In [209]:

gender_wise_brand_distrib = demo_phone_train.groupby('gender').phone_brand.value_counts().unstack(level=0)

plt.figure().set_size_inches(11,5)
gender_wise_brand_distrib['M'].plot(label = 'Male')
gender_wise_brand_distrib['F'].plot(label = 'Female')
plt.xlabel('Phone Brands')
plt.ylabel('Device Count')
plt.title('Gender based Brand Distribution')
plt.legend();

Here again, the trends are similar but with small differences. Samsung's market share is much closer to Xiaomi's among female users than among male users. OPPO and Vivo have better acceptance among females than among males.

In [204]:

group_wise_brand_distrib = demo_phone_train.groupby('group').phone_brand.value_counts().unstack(level=0)

groups = group_wise_brand_distrib.columns
# Keep only the brands with more than 500 devices among male users
brands = gender_wise_brand_distrib.M[gender_wise_brand_distrib.M > 500].index
# .loc replaces the .ix indexer, which has been removed from pandas
group_wise_brand_distrib_top_brands = group_wise_brand_distrib.loc[brands]

# Standardize each group's counts so that groups of different sizes are comparable
group_wise_brand_distrib_top_brands = pd.DataFrame(StandardScaler().fit_transform(group_wise_brand_distrib_top_brands),
                                                   columns = group_wise_brand_distrib_top_brands.columns, index = brands)

num_groups = len(groups)
half = num_groups // 2  # columns are sorted, so female groups come first and male groups second
fig, axes = plt.subplots(3, 2)
fig.set_size_inches(11, 3*3)
for i in range(half):
    ax = axes[i // 2, i % 2]  # integer division picks the subplot row
    group_wise_brand_distrib_top_brands[groups[i]].plot(ax=ax)
    group_wise_brand_distrib_top_brands[groups[i + half]].plot(ax=ax)
    ax.set_xlabel('Phone Brands')
    ax.set_ylabel('Normalized Device Count')
    ax.set_title('Top Brands Distribution - {0} & {1}'.format(groups[i], groups[i + half]))
    ax.legend()

fig.tight_layout()

While the trend lines generally overlap, we do see some small differences. The model must identify such differences in order to do a good job of classification.
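
For readers wondering what the normalization in these plots does: StandardScaler rescales each column (here, each demographic group) to zero mean and unit variance, so a small group and a large group can be drawn on the same axis. A minimal sketch with made-up counts shows that it matches the usual z-score formula.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up device counts per brand for two groups of very different size
counts = pd.DataFrame({'F23-': [100.0, 50.0, 30.0], 'M22-': [400.0, 150.0, 200.0]},
                      index=['xiaomi', 'samsung', 'huawei'])

scaled = pd.DataFrame(StandardScaler().fit_transform(counts),
                      index=counts.index, columns=counts.columns)

# The same thing by hand: subtract the column mean, divide by the population std (ddof=0)
manual = (counts - counts.mean()) / counts.std(ddof=0)

print(scaled.round(3))
print(np.allclose(scaled.values, manual.values))
```

Note that this scales within each group, so the plots compare the shape of each group's brand profile, not the absolute counts.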

Visualizing 'events' data distribution

Let's begin by visualizing how events are distributed across time.

Events distribution with respect to Time

In [147]:

from datetime import datetime
print ("Min Time: %s\nMax Time: %s\n" % (events.timestamp.min(), events.timestamp.max()))
print ("Percentage of events from 1st May to 7th May: %.2f %%" % 
       (np.sum((events.timestamp > datetime(2016, 5, 1)) & (events.timestamp < datetime(2016, 5, 8)))/float(len(events))*100))

Min Time: 2016-04-30 23:52:24
Max Time: 2016-05-08 00:00:08

Percentage of events from 1st May to 7th May: 99.97 %

All events, with a few exceptions, were recorded between 1st May and 7th May. The exceptions were recorded a few minutes before 1st May and a few seconds after 7th May, as the Min and Max times suggest.

Let's take a look at how events are distributed across days and hours. We will extract day and hour information from timestamp and add them as features to the events dataframe.

In [143]:

events['day'] = events.timestamp.dt.day
day_distrib = events.day.value_counts().sort_index()

events['hour'] = events.timestamp.dt.hour
hourly_distrib = events.hour.value_counts().sort_index()

events.head(3)

Out[143]:

	event_id	device_id	timestamp	longitude	latitude	day	hour
0	1	29182687948017175	2016-05-01 00:55:25	121.38	31.24	1	0
1	2	-6401643145415154744	2016-05-01 00:54:12	103.65	30.97	1	0
2	3	-4833982096941402721	2016-05-01 00:08:05	106.60	29.70	1	0

Now, we can plot day-wise and hourly distributions of events.

In [146]:

fig, axes = plt.subplots(1, 2)
fig.set_size_inches(12, 4)

#day_distrib.plot.bar(ax=axes[0])
xticklabels = ['5 / 1', '5 / 2', '5 / 3', '5 / 4', '5 / 5', '5 / 6', '5 / 7', '5 / 8', '4 / 30']
axes[0].bar(range(len(xticklabels)), day_distrib)
axes[0].set_xticks(range(len(xticklabels)))  # pin one tick per bar so the labels line up
axes[0].set_xticklabels(xticklabels)
axes[0].set_xlabel('Day')
axes[0].set_ylabel('Events Count')
axes[0].set_title('Day-wise Distribution')

#hourly_distrib.plot.bar(ax=axes[1])
axes[1].bar(range(24), hourly_distrib)
axes[1].set_xlabel('Hour')
axes[1].set_ylabel('Events Count')
axes[1].set_title('Hourly Distribution');

Day-wise distribution looks pretty flat. This shows how mobile phones have become a part of our daily lives - it doesn't matter if it's a weekend or weekday, or if it's a Monday or a Friday. We use them all the time.

We see an interesting pattern in the hourly distribution. The x-axis stretches across the 24 hours of the day, from 12am to 11pm. We see a sharp dip in the number of events in the wee hours of the morning from 1am to 5am, while people are sleeping. And then it increases sharply from 5am to 8am as people wake up and start their day. The traffic peaks at 10am and then there's a small trough in mid-afternoon (3pm - 4pm) before it starts picking up again as evening sets in. We see the second peak at 9pm before it starts plummeting as night sets in.

We now know that the only thing that can stop people from using their mobile phones is...sleep. :)

It will be interesting to see if there is any difference in the usage pattern between genders.

In [187]:

gender_age_events = events.merge(demo_train, how='left', left_on='device_id', right_on='device_id')

day_gender_distrib = gender_age_events.groupby('gender')['day'].value_counts().unstack(level=0).fillna(value= 0)
days_idx = ['5 / 1', '5 / 2', '5 / 3', '5 / 4', '5 / 5', '5 / 6', '5 / 7', '5 / 8', '4 / 30']
day_gender_distrib = pd.DataFrame(StandardScaler().fit_transform(day_gender_distrib),
                                  index = days_idx, columns = day_gender_distrib.columns)

hour_gender_distrib = gender_age_events.groupby('gender')['hour'].value_counts().unstack(level=0).fillna(value= 0)
hour_gender_distrib = pd.DataFrame(StandardScaler().fit_transform(hour_gender_distrib), 
                                   columns = hour_gender_distrib.columns)

fig, axes = plt.subplots(1, 2)
fig.set_size_inches(10, 5)

ax = axes[0]
day_gender_distrib['M'].iloc[:-2].plot(ax=ax)
day_gender_distrib['F'].iloc[:-2].plot(ax=ax)
ax.set(title = 'Day-wise Event Distribution', xlabel = 'Day', ylabel = 'Normalized Count')
ax.legend()

ax = axes[1]
hour_gender_distrib['M'].plot(ax=ax)
hour_gender_distrib['F'].plot(ax=ax)
ax.set(title = 'Hourly Event Distribution', xlabel = 'Hour', ylabel = 'Normalized Count')
ax.legend();

fig.tight_layout()

There is a clear difference in the day-wise distribution of events between the sexes. Traffic from male users peaks in the early part of the week (Tuesday and Wednesday), while traffic from female users peaks in the later part of the week (Thursday and Friday).
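
The weekday names used here can be recovered directly from the dates. This short sketch maps the seven days of the observation window to their weekday names with pandas.

```python
import pandas as pd

# The seven days covered by the events data
days = pd.date_range('2016-05-01', '2016-05-07', freq='D')
weekday_names = days.day_name()

for day, name in zip(days, weekday_names):
    print(day.date(), name)
```

So '5 / 1' on the plots is a Sunday, and '5 / 3' and '5 / 4' are the Tuesday and Wednesday mentioned above.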

While the hourly event distributions of the sexes are similar, there are some differences. Traffic from male users peaks during morning hours and traffic from female users peaks during evening hours. Also, more male users use their mobile phones at midnight than female users.

Since the data is only for one week we cannot draw any concrete conclusions.

Now, let's check out how events are distributed with respect to locations.

Events distribution with respect to Location

This part of the code is influenced by one of the posts on the competition's website. While I have made some minor tweaks, full credit goes to 'BeyondBeneath'. You can find the original post here

In [156]:

from mpl_toolkits.basemap import Basemap

# Set up plot
events_sample = events.sample(n=100000)
plt.figure(1, figsize=(12,6))

# Mercator of World
m1 = Basemap(projection='merc', llcrnrlat=-60, urcrnrlat=65, llcrnrlon=-180,
             urcrnrlon=180, lat_ts=0, resolution='c')

m1.fillcontinents(color='#191919',lake_color='#000000') # dark grey land, black lakes
m1.drawmapboundary(fill_color='#000000')                # black background
m1.drawcountries(linewidth=0.1, color="w")              # thin white line for country borders

# Plot the data
mxy = m1(events_sample["longitude"].tolist(), events_sample["latitude"].tolist())
m1.scatter(mxy[0], mxy[1], s=3, c="#1292db", lw=0, alpha=1, zorder=5)

plt.title("Global view of events")
plt.show()

It's clear that almost all the data comes from China, with some stray events in other countries. There are a bunch of events at (0, 0). These can be considered missing data.
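
If those (0, 0) points are to be treated as missing, one possible way is to replace them with NaN so they do not distort any location-based features. This is only a sketch on made-up rows, and it assumes that an exact (0, 0) pair always means "no location recorded".

```python
import numpy as np
import pandas as pd

# Made-up events with one (0, 0) location
events_demo = pd.DataFrame({'longitude': [121.38, 0.0, 106.60],
                            'latitude': [31.24, 0.0, 29.70]})

# Flag rows where both coordinates are exactly zero
no_location = (events_demo['longitude'] == 0) & (events_demo['latitude'] == 0)

# Mark those coordinates as missing instead of keeping a fake point in the Atlantic
events_demo.loc[no_location, ['longitude', 'latitude']] = np.nan

print('Events without a location: {0}'.format(no_location.sum()))
print(events_demo)
```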

Let's zoom-in on China.

In [157]:

# Sample it down to only the China region
lon_min, lon_max = 75, 135
lat_min, lat_max = 15, 55

idx_china = (events["longitude"]>lon_min) & (events["longitude"]<lon_max) &\
            (events["latitude"]>lat_min) & (events["latitude"]<lat_max)

events_china = events[idx_china].sample(n=100000)

# Mercator of China
plt.figure(2, figsize=(12,6))

m2 = Basemap(projection='merc', llcrnrlat=lat_min, urcrnrlat=lat_max, llcrnrlon=lon_min,
             urcrnrlon=lon_max, lat_ts=35, resolution='i')

m2.fillcontinents(color='#191919',lake_color='#000000') # dark grey land, black lakes
m2.drawmapboundary(fill_color='#000000')                # black background
m2.drawcountries(linewidth=0.1, color="w")              # thin white line for country borders

# Plot the data
mxy = m2(events_china["longitude"].tolist(), events_china["latitude"].tolist())
m2.scatter(mxy[0], mxy[1], s=5, c="#1292db", lw=0, alpha=0.05, zorder=5)

plt.title("China view of events")
plt.show()

This closely matches the distribution of population across China. There are a couple of other interesting charts in the original post. For example, the one that shows the distribution of events from male and female users in Beijing. Male users seem to be more spread out compared to female users.

I could continue the analysis with apps and their categories, but to keep this post from getting too long I will not do that now. We can revisit it later if required.

Since this post is getting long, I will stop here and continue with the Modeling part in a sequel post:

TalkingData Mobile User Demographics - Part 2

Please continue reading.

TalkingData Mobile User Demographics - Part 1

Thu 22 September 2016 Kaggle Series Kaggle / EDA / XGBoost