User segmentation is an important requirement for targeted digital marketing and customized user experiences. While many new ways of segmenting customers have emerged lately, thanks to the explosion of online social media, the traditional approach of segmenting users by demographics is still very relevant. Knowing just the gender and age/age-group of a person can tell a lot about their interests, inclinations and their propensity to buy certain kinds of products.
Not all users are equally willing to share their gender and age/age-group. Knowing the value of this kind of data, companies try various ways of getting their hands on it. It is generally not a good idea to make it mandatory for users to provide this data at the time of registration or sign-up, since that carries the risk of losing a potential user/customer. So companies resort to non-invasive approaches, such as predicting the demographics of users who have not provided this information by matching their usage patterns with those of users who have.
This was exactly the challenge of the competition titled TalkingData Mobile User Demographics. TalkingData is a major Chinese mobile app/game analytics company with millions of users. Given demographic and app download/usage data for some of their users, the challenge was to develop a model that best predicts the gender and age-group of the remaining users. You can learn more about the competition on the Kaggle website here: TalkingData Mobile User Demographics
In this post I will share some interesting observations/insights from the data and describe my solution. I will follow the systematic approach of the five-stage Data Science Process that I describe (with a case study) in the following two posts:
Let's start with Data Collection
Data Collection¶
Like most competitions in Kaggle, the data was provided in a set of CSV files. The following information has been extracted from the competition's data page.
"The Data is collected from TalkingData SDK integrated within mobile apps TalkingData serves under the service term between TalkingData and mobile app developers...
The data schema can be represented in the following chart:"
File descriptions¶
- gender_age_train.csv, gender_age_test.csv - the training and test set
- group: this is the target variable you are going to predict
- events.csv, app_events.csv - when a user uses TalkingData SDK, the event gets logged in this data. Each event has an event id, location (lat/long), and the event corresponds to a list of apps in app_events.
- timestamp: when the user is using an app with TalkingData SDK
- app_labels.csv - apps and their labels, the label_id's can be used to join with label_categories
- label_categories.csv - apps' labels and their categories in text
- phone_brand_device_model.csv - device ids, brand, and models
- phone_brand: note that the brands are in Chinese: 三星 samsung, 天语 Ktouch etc
Data Preprocessing¶
We will take a look at the data one table at a time and look for any preprocessing requirements, like datatype conversions, unit conversions, missing data imputation etc.
Let's import some required libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import StandardScaler
import seaborn as sns
gender_age_train.csv and gender_age_test.csv
demo_train = pd.read_csv('../Data/gender_age_train.csv')
demo_test = pd.read_csv('../Data/gender_age_test.csv')
print('\nShape of gender_age_train.csv: {0}'.format(demo_train.shape))
print('Shape of gender_age_test.csv: {0}\n'.format(demo_test.shape))
print('Sample data of gender_age_train.csv:\n{0}'.format(demo_train.head(3)))
print('\nSample data of gender_age_test.csv:\n{0}'.format(demo_test.head(3)))
print('\nDatatypes of features...\n{0}'.format(demo_train.dtypes))
There are about 75K training samples with device_id, gender, age and group data. The group column is formed from gender and age. There are about 112K test samples with only device_id. The model that we develop has to predict groups for these test samples.
The train:test split in this dataset is roughly 40:60. Having more test samples than train samples, as we do here, is not very common.
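As a quick sanity check on that split, the proportions can be computed directly from the two dataframes loaded above (just a sketch):
# Quick check of the train:test split
n_train, n_test = len(demo_train), len(demo_test)
total = n_train + n_test
print('train: {0} ({1:.0f}%), test: {2} ({3:.0f}%)'.format(
    n_train, 100.0 * n_train / total, n_test, 100.0 * n_test / total))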
gender and group are string types. We will encode them to numerical equivalents when we create the machine learning models. For now, we will keep them in their current form for visualization and understanding purposes.
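For reference, here is a minimal sketch of what that encoding might look like with scikit-learn's LabelEncoder; this is only an illustration, the actual encoding will happen in the modeling stage.
from sklearn.preprocessing import LabelEncoder

# Sketch: encode the string-valued target into integers for model training.
# The fitted encoder also lets us map predictions back to group names.
group_encoder = LabelEncoder().fit(demo_train.group)
y = group_encoder.transform(demo_train.group)
print(group_encoder.classes_)   # e.g. ['F23-' 'F24-26' ... 'M39+']
print(y[:5])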
phone_brand_device_model
phone_brand = pd.read_csv('../Data/phone_brand_device_model.csv')
print('\nShape of phone_brand_device_model.csv: {0}\n'.format(phone_brand.shape))
print('Sample data of phone_brand_device_model.csv:\n{0}'.format(phone_brand.head(3)))
print('\nDatatypes of features...\n{0}'.format(phone_brand.dtypes))
This table has brand and model data. The names of many brands and devices are in Chinese. This should not be a concern as far as model training is concerned, since they will be encoded to numerical equivalents anyway. However, it will be helpful to know their English names so we can get a peek into the Chinese mobile market.
One of the participants shared an English version of phone_brand_device_model.csv. We will use it for EDA.
phone_brand_english = pd.read_csv('../Data/english_phone_brand_device_model.csv')
print('\nShape of english_phone_brand_device_model.csv: {0}\n'.format(phone_brand_english.shape))
print('Sample data of english_phone_brand_device_model.csv:\n{0}'.format(phone_brand_english.head(3)))
events.csv
events = pd.read_csv('../Data/events.csv')
print('\nShape of events.csv: {0}\n'.format(events.shape))
print('Sample data of events.csv:\n{0}'.format(events.head(3)))
print('\nDatatypes of features...\n{0}'.format(events.dtypes))
This table has time and position information. device_id is provided for merging it with gender_age_train/gender_age_test data.
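As a quick side check (just a sketch, not part of the preprocessing), we can see what fraction of the training devices actually appear in events:
# Sketch: what fraction of the training devices have at least one logged event?
devices_with_events = events.device_id.unique()
coverage = demo_train.device_id.isin(devices_with_events).mean()
print('Fraction of train devices with events: {0:.2f}'.format(coverage))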
timestamp is of string type. Let's convert it to datetime64 type.
events['timestamp'] = pd.to_datetime(events.timestamp)
print('Datatype of timestamp: {0}'.format(events.timestamp.dtype))
app_events.csv
app_events = pd.read_csv('../Data/app_events.csv')
print('\nShape of app_events.csv: {0}\n'.format(app_events.shape))
print('Sample data of app_events.csv:\n{0}'.format(app_events.head()))
print('\nDatatypes of app_events...\n{0}'.format(app_events.dtypes))
It looks like is_installed and is_active are boolean even though they are of type int64. Let's check this, and if they are indeed boolean, we will convert them to int8.
print('is_installed unique values and their counts...\n{0}\n'.format(app_events.is_installed.value_counts()))
print('is_active unique values and their counts...\n{0}'.format(app_events.is_active.value_counts()))
Hmm...is_installed has a single value (=1). Let's drop it, since a constant column doesn't add any value. And is_active is indeed boolean, so we will convert it to int8.
app_events.drop(['is_installed'], axis = 1, inplace = True)
app_events['is_active'] = app_events['is_active'].astype(np.int8)
print('\nShape of app_events.csv: {0}\n'.format(app_events.shape))
print('Sample data of app_events.csv:\n{0}'.format(app_events.head(3)))
print('\nDatatypes of app_events...\n{0}'.format(app_events.dtypes))
app_labels.csv
app_labels = pd.read_csv('../Data/app_labels.csv')
print('\nShape of app_labels.csv: {0}\n'.format(app_labels.shape))
print('Sample data of app_labels.csv:\n{0}'.format(app_labels.head(3)))
print('\nDatatypes of app_labels...\n{0}\n'.format(app_labels.dtypes))
Let's check the range of label_id and convert it to a smaller datatype if the range is small.
print('label_id.min: {0}\nlabel_id.max: {1}'.format(app_labels.label_id.min(), app_labels.label_id.max()))
int16 is good enough for this range. Let's change the type.
app_labels['label_id'] = app_labels['label_id'].astype(np.int16)
print('Datatype of label_id: {0}'.format(app_labels.label_id.dtype))
label_categories.csv
label_categories = pd.read_csv('../Data/label_categories.csv')
print('\nShape of label_categories.csv: {0}\n'.format(label_categories.shape))
print('Sample data of label_categories.csv:\n{0}'.format(label_categories.head(3)))
print('\nDatatypes of label_categories...\n{0}\n'.format(label_categories.dtypes))
Let's convert label_id to int16, as we did for the previous table. It also looks like there are some missing values in the category column. We can impute them if and when needed; for now we will let them be.
label_categories['label_id'] = label_categories['label_id'].astype(np.int16)
print('Datatype of label_id: {0}'.format(label_categories.label_id.dtype))
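For reference, a quick count of those missing entries (just a sketch; no imputation is done here):
# Sketch: count the missing values in the 'category' column
n_missing = label_categories.category.isnull().sum()
print('Missing categories: {0} of {1}'.format(n_missing, len(label_categories)))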
Data Preprocessing Summary¶
There wasn't much preprocessing required, since the dataset is pretty simple and well formed. The few things that we did are...
- events: converted the timestamp column from string type to datetime64 type
- app_events: removed the constant column is_installed and converted is_active from int64 to int8
- app_labels: converted label_id from int64 to int16
- label_categories: converted label_id from int64 to int16
Let's move on to the more interesting part - Exploratory Data Analysis
Exploratory Data Analysis¶
Let's try to glean from the dataset as much information/insight as we can about the different user groups before we pass it through the machine learning models. We can try to learn things like phone and brand preferences, app download/usage patterns etc.
Let's begin by understanding the distribution of train-set samples across genders and across different user groups.
Visualizing sample distribution across ages, genders and age-groups¶
Let's begin by answering this question: What is the proportion of male and female users in the training dataset?
plt.figure().set_size_inches(8, 1)
demo_train.gender.value_counts().plot.barh()
plt.title('Sample distribution across genders')
plt.xlabel('Sample Count')
plt.ylabel('Gender');
The answer is 64:36, i.e. 64% of the training samples come from male users and 36% from female users.
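The exact proportions can be read off directly with a normalized value_counts (a one-line sketch):
# Sketch: exact gender proportions in the training set
print(demo_train.gender.value_counts(normalize=True).round(2))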
The next question would be: How is the sample distribution across demographic groups?
plt.figure().set_size_inches(8, 4)
demo_train.groupby('group').group.count().plot.bar()
plt.title('Sample distribution across demographic groups')
plt.xlabel('Group')
plt.ylabel('Sample Count');
The height of the bars reflects the skewed male:female sample ratio. Setting that difference aside, the distributions of the male and female groups look similar. There are, however, some small differences: F23- is bigger than F24-26 whereas M22- is smaller than M23-26, and F43+ is smaller than F29-32 whereas M39+ is bigger than M29-31.
Let us now explore the distribution across ages within each gender.
fig, axes = plt.subplots(1, 2)
fig.set_size_inches(12, 4)
demo_train[demo_train.gender == 'F'].age.hist(bins = 50, ax = axes[0])
axes[0].set(title = 'Sample Distribution, Gender = Female', xlabel = 'age', ylabel = 'sample count')
demo_train[demo_train.gender == 'M'].age.hist(bins = 50, ax = axes[1])
axes[1].set(title = 'Sample Distribution, Gender = Male', xlabel = 'age', ylabel = 'sample count');
The distributions of samples across ages are similar for both genders. Note that the axis scales differ between the two plots.
Visualizing Phone Brands Distribution¶
Let's take a look at the distribution of phone brands. We can first look at the full distribution and then look at its distribution in train and test samples separately. This will tell us if the distribution of samples between train and test is even or skewed.
Before we look at the plots let's find out how many brands and models are included in the dataset.
phone_brand_english.apply(lambda x: x.nunique())
So, the dataset has 81 brands and these brands have 1559 models among them.
plt.figure().set_size_inches(12, 5)
phone_brand_english.groupby('phone_brand').phone_brand.count().plot.bar()
plt.title('Phone Brand Distribution')
plt.xlabel('Brands')
plt.ylabel('Device Count');
We can clearly see the market leaders - Xiaomi, Samsung and Huawei - followed by OPPO, Vivo, Meizu, Coolpad and Lenovo. These brands make up most of the sample space.
Now, let's check the distribution across train and test samples.
demo_phone_train = demo_train.merge(phone_brand_english, on = 'device_id', how = 'left')
demo_phone_test = demo_test.merge(phone_brand_english, on = 'device_id', how = 'left')
plt.figure().set_size_inches(11, 5)
demo_phone_train.groupby('phone_brand').phone_brand.count().plot(label = 'train')
demo_phone_test.groupby('phone_brand').phone_brand.count().plot(label = 'test')
plt.title('Phone Brand Distribution in Train and Test Samples')
plt.xlabel('Brands')
plt.ylabel('Device Count');
plt.legend();
Ok, the distribution of phone brands between train and test samples is almost identical and mirrors the overall trend.
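To put a number on that visual similarity, we could compare the normalized brand shares directly; a rough sketch:
# Sketch: largest absolute differences in brand share between train and test
train_share = demo_phone_train.phone_brand.value_counts(normalize=True)
test_share = demo_phone_test.phone_brand.value_counts(normalize=True)
share_diff = train_share.subtract(test_share, fill_value=0).abs().sort_values(ascending=False)
print(share_diff.head(5))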
Now, let's take a look at the distribution of phone brands across genders and age-groups. We can do this only with the train samples, since age and gender information is available only for them.
gender_wise_brand_distrib = demo_phone_train.groupby('gender').phone_brand.value_counts().unstack(level=0)
plt.figure().set_size_inches(11,5)
gender_wise_brand_distrib['M'].plot(label = 'Male')
gender_wise_brand_distrib['F'].plot(label = 'Female')
plt.xlabel('Phone Brands')
plt.ylabel('Device Count')
plt.title('Gender based Brand Distribution')
plt.legend();
Here again, the trends are similar but with small differences. Samsung's share is much closer to Xiaomi's among female users than among male users, and OPPO and Vivo enjoy better acceptance among females than males.
group_wise_brand_distrib = demo_phone_train.groupby('group').phone_brand.value_counts().unstack(level=0)
groups = group_wise_brand_distrib.columns
brands = gender_wise_brand_distrib.M[gender_wise_brand_distrib.M > 500].index
group_wise_brand_distrib_top_brands = group_wise_brand_distrib.loc[brands]
group_wise_brand_distrib_top_brands = pd.DataFrame(StandardScaler().fit_transform(group_wise_brand_distrib_top_brands),
columns = group_wise_brand_distrib_top_brands.columns, index = brands)
num_groups = len(groups)
fig, axes = plt.subplots(3, 2)
fig.set_size_inches(11, 3*3)
for i in range(num_groups // 2):
    ax = axes[i // 2, i % 2]
    group_wise_brand_distrib_top_brands[groups[i]].plot(ax=ax)
    group_wise_brand_distrib_top_brands[groups[i+6]].plot(ax=ax)
    ax.set_xlabel('Phone Brands')
    ax.set_ylabel('Normalized Device Count')
    ax.set_title('Top Brands Distribution - {0} & {1}'.format(groups[i], groups[i+6]))
    ax.legend()
fig.tight_layout()
While the trend lines generally overlap, we do see some small differences. The model must identify such differences in order to do a good job of classification.
Visualizing 'events' data distribution¶
Let's begin by visualizing how events are distributed across time.
Events distribution with respect to Time
from datetime import datetime
print ("Min Time: %s\nMax Time: %s\n" % (events.timestamp.min(), events.timestamp.max()))
print ("Percentage of events from 1st May to 7th May: %.2f %%" %
(np.sum((events.timestamp > datetime(2016, 5, 1)) & (events.timestamp < datetime(2016, 5, 8)))/float(len(events))*100))
All events, with a few exceptions, were recorded between 1st May and 7th May. The exceptions were recorded a few minutes before 1st May and a few minutes after 7th May, as the min and max times suggest.
Let's take a look at how events are distributed across days and hours. We will extract day and hour information from timestamp and add them as features to the events dataframe.
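A quick way to isolate those stray records, reusing the same window as above (just a sketch):
# Sketch: inspect the handful of events outside the 1st-7th May window
in_window = (events.timestamp > datetime(2016, 5, 1)) & (events.timestamp < datetime(2016, 5, 8))
print('Events outside the window: {0}'.format((~in_window).sum()))
print(events.loc[~in_window, 'timestamp'].sort_values().head())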
events['day'] = events.timestamp.apply(lambda x: x.day)
day_distrib = events.day.value_counts().sort_index()
events['hour'] = events.timestamp.apply(lambda x: x.hour)
hourly_distrib = events.hour.value_counts().sort_index()
events.head(3)
Now, we can plot day-wise and hourly distributions of events.
fig, axes = plt.subplots(1, 2)
fig.set_size_inches(12, 4)
#day_distrib.plot.bar(ax=axes[0])
xticklabels = ['5 / 1', '5 / 2', '5 / 3', '5 / 4', '5 / 5', '5 / 6', '5 / 7', '5 / 8', '4 / 30']
axes[0].bar(range(len(xticklabels)), day_distrib)
axes[0].set_xticks(range(len(xticklabels)))
axes[0].set_xticklabels(xticklabels)
axes[0].set_xlabel('Day')
axes[0].set_ylabel('Events Count')
axes[0].set_title('Day-wise Distribution')
#hourly_distrib.plot.bar(ax=axes[1])
axes[1].bar(range(24), hourly_distrib)
axes[1].set_xlabel('Hour')
axes[1].set_ylabel('Events Count')
axes[1].set_title('Hourly Distribution');
The day-wise distribution looks pretty flat. This shows how mobile phones have become a part of our daily lives - it doesn't matter whether it's a weekend or a weekday, a Monday or a Friday; we use them all the time.
We see an interesting pattern in the hourly distribution. The x-axis runs from midnight to 11 pm. There is a sharp dip in the number of events in the wee hours of the morning, from 1 am to 5 am, while people are sleeping, followed by a sharp rise from 5 am to 8 am as people wake up and start their day. Traffic peaks at 10 am, dips slightly in the mid-afternoon (3 pm - 4 pm), and picks up again as evening sets in. A second peak appears at 9 pm before traffic plummets as night sets in.
We now know that the only thing that can stop people from using their mobile phones is...sleep. :)
It will be interesting to see if there is any difference in the usage pattern between genders.
gender_age_events = events.merge(demo_train, how='left', left_on='device_id', right_on='device_id')
day_gender_distrib = gender_age_events.groupby('gender')['day'].value_counts().unstack(level=0).fillna(value= 0)
days_idx = ['5 / 1', '5 / 2', '5 / 3', '5 / 4', '5 / 5', '5 / 6', '5 / 7', '5 / 8', '4 / 30']
day_gender_distrib = pd.DataFrame(StandardScaler().fit_transform(day_gender_distrib),
index = days_idx, columns = day_gender_distrib.columns)
hour_gender_distrib = gender_age_events.groupby('gender')['hour'].value_counts().unstack(level=0).fillna(value= 0)
hour_gender_distrib = pd.DataFrame(StandardScaler().fit_transform(hour_gender_distrib),
columns = hour_gender_distrib.columns)
fig, axes = plt.subplots(1, 2)
fig.set_size_inches(10, 5)
ax = axes[0]
day_gender_distrib['M'].iloc[:-2].plot(ax=ax)
day_gender_distrib['F'].iloc[:-2].plot(ax=ax)
ax.set(title = 'Day-wise Event Distribution', xlabel = 'Day', ylabel = 'Normalized Count')
ax.legend()
ax = axes[1]
hour_gender_distrib['M'].plot(ax=ax)
hour_gender_distrib['F'].plot(ax=ax)
ax.set(title = 'Hourly Event Distribution', xlabel = 'Hour', ylabel = 'Normalized Count')
ax.legend();
fig.tight_layout()
There is a clear difference in the day-wise distribution of events between the sexes: traffic from male users peaks in the early part of the week (Tuesday and Wednesday), while traffic from female users peaks in the later part of the week (Thursday and Friday).
While the hourly event distributions of the two sexes are similar, there are some differences. Traffic from male users peaks during the morning hours, traffic from female users peaks in the evening, and more male users than female users are on their phones around midnight.
Since the data covers only one week, we cannot draw any firm conclusions.
Now, let's check out how events are distributed with respect to location.
Events distribution with respect to Location
This part of the code is influenced by one of the posts on the competition's website. While I have made some minor tweaks, full credit goes to 'BeyondBeneath'. You can find the original post here
from mpl_toolkits.basemap import Basemap
# Set up plot
events_sample = events.sample(n=100000)
plt.figure(1, figsize=(12,6))
# Mercator of World
m1 = Basemap(projection='merc', llcrnrlat=-60, urcrnrlat=65, llcrnrlon=-180,
urcrnrlon=180, lat_ts=0, resolution='c')
m1.fillcontinents(color='#191919',lake_color='#000000') # dark grey land, black lakes
m1.drawmapboundary(fill_color='#000000') # black background
m1.drawcountries(linewidth=0.1, color="w") # thin white line for country borders
# Plot the data
mxy = m1(events_sample["longitude"].tolist(), events_sample["latitude"].tolist())
m1.scatter(mxy[0], mxy[1], s=3, c="#1292db", lw=0, alpha=1, zorder=5)
plt.title("Global view of events")
plt.show()
It's clear that almost all the data comes from China, with some stray events in other countries. There is also a bunch of events at (0, 0); these can be treated as missing data.
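A quick count of those placeholder coordinates (a sketch):
# Sketch: count events whose coordinates are the (0, 0) placeholder
at_origin = (events.longitude == 0) & (events.latitude == 0)
print('Events at (0, 0): {0} ({1:.1f}% of all events)'.format(
    at_origin.sum(), 100.0 * at_origin.mean()))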
Let's zoom-in on China.
# Sample it down to only the China region
lon_min, lon_max = 75, 135
lat_min, lat_max = 15, 55
idx_china = (events["longitude"]>lon_min) & (events["longitude"]<lon_max) &\
(events["latitude"]>lat_min) & (events["latitude"]<lat_max)
events_china = events[idx_china].sample(n=100000)
# Mercator of China
plt.figure(2, figsize=(12,6))
m2 = Basemap(projection='merc', llcrnrlat=lat_min, urcrnrlat=lat_max, llcrnrlon=lon_min,
urcrnrlon=lon_max, lat_ts=35, resolution='i')
m2.fillcontinents(color='#191919',lake_color='#000000') # dark grey land, black lakes
m2.drawmapboundary(fill_color='#000000') # black background
m2.drawcountries(linewidth=0.1, color="w") # thin white line for country borders
# Plot the data
mxy = m2(events_china["longitude"].tolist(), events_china["latitude"].tolist())
m2.scatter(mxy[0], mxy[1], s=5, c="#1292db", lw=0, alpha=0.05, zorder=5)
plt.title("China view of events")
plt.show()
This closely matches the distribution of population across China. There are a couple of other interesting charts in the original post, for example one showing the distribution of events from male and female users in Beijing; male users seem to be more spread out than female users.
I could continue the data analysis with respect to apps and their categories, but in the interest of keeping this post from getting too long I will not do that now. We can do that later if required.
I will stop here and continue with the Modeling part in a sequel post:¶
TalkingData Mobile User Demographics - Part 2
Please continue reading.