Introduction
This is the fourth post (and quite possibly the last) of my Minimal Data Science blog series. The previous posts can be found here:
- lenguyenthedat.github.io/minimal-data-science-1-starcraft
- lenguyenthedat.github.io/minimal-data-science-2-avazu
- lenguyenthedat.github.io/minimal-data-science-3-mnist-neuralnet
In this post, we will be looking into a recent Data Science challenge hosted in Singapore by Dextra in August 2015: the MINDEF Data Analytics Challenge.
The task is to predict resignation probabilities within the military for 8000 personnel, given a training set of 15000 personnel with data such as demographic, work, and family-related information.
Note: The source code as well as the original datasets for this series will also be updated at this Github repository of mine.
Choosing my model
Normally, for such a data set, tree-based ensemble models work surprisingly well. After running a quick test with various models, Random Forest and Gradient Boosting Machine clearly stood out.
An illustration of different algorithms and their accuracies. Source: scikit-learn.org
The implementation of the algorithm also plays a significant role. I chose dmlc's XGBoost, since it is by far one of the best implementations of Gradient Boosting Machine and has powered the winning solutions of many recent challenges.
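As a rough sketch of that quick comparison (the file name, target column, and preprocessing below are assumptions for illustration, not the actual competition setup), cross-validated AUC already shows the tree ensembles pulling ahead of a linear baseline:

```python
# Quick model shoot-out sketch -- file name, target column, and preprocessing
# are placeholder assumptions, not the actual competition data.
import pandas as pd
import xgboost as xgb
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

train = pd.read_csv('train.csv')                    # hypothetical file name
y = train['resigned']                               # hypothetical 0/1 target column
X = pd.get_dummies(train.drop('resigned', axis=1))  # one-hot encode categoricals
X = X.fillna(X.mean())                              # crude null handling for this baseline test

models = {
    'logistic_regression': LogisticRegression(max_iter=1000),
    'random_forest': RandomForestClassifier(n_estimators=300, random_state=0),
    'gradient_boosting': GradientBoostingClassifier(random_state=0),
    'xgboost': xgb.XGBClassifier(n_estimators=300, learning_rate=0.05, max_depth=6),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
    print('%-20s AUC: %.4f (+/- %.4f)' % (name, scores.mean(), scores.std()))
```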
Feature engineering
The very first step of feature engineering is to figure out roughly how important the original features are:
Full set of original feature importances (with names censored) as provided by XGBoost
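A quick way to produce such a ranking is XGBoost's scikit-learn wrapper. This is only a sketch, not the exact script behind the chart, and it assumes `X` and `y` are the prepared training features and labels from the snippet above:

```python
# Sketch: ranking the original features by importance with XGBoost.
# X is assumed to be a pandas DataFrame of features, y the label column.
import xgboost as xgb

clf = xgb.XGBClassifier(n_estimators=200, max_depth=5, learning_rate=0.1)
clf.fit(X, y)

# Pair each feature with its importance score and sort descending.
importances = sorted(zip(X.columns, clf.feature_importances_),
                     key=lambda item: item[1], reverse=True)
for feature, score in importances:
    print('%-30s %.4f' % (feature, score))

# xgb.plot_importance(clf) draws the same ranking as a bar chart.
```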
Below is the list of feature engineering tasks that were done for the data set (a short pandas sketch of some of these transformations follows the list):
- Combining `promotion` and `gender`.
- Combining `age` and `gender`.
- Splitting `rank_grouping` - this feature is actually a combination of level and type of employment.
- Capping `salary_increment` - for this particular data set, some personnel have more than a 10x or even 100x increment in salary, which indicates either noise or outliers.
- Fixing `min_child_age` - originally, if there is no kid, `min_child_age` was set to 0, which is a wrong value to use. We can't use `null`, `mean()`, or `mode()` for this value either, and should instead consider this age to be much higher than for those who actually have kids.
- Removing noisy features.
- Dealing with `null` data. I simply used `mean()` for numerical features and `mode()` for categorical features.
- `NATIONAL SERVICEMEN` always get classified as resigned - this is actually more of a "hard-coded" fact than feature engineering.
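Here is a minimal pandas sketch of a few of the transformations above. Column names such as `promotion`, `gender`, `rank_grouping`, `salary_increment`, and `min_child_age` come from the list; the exact encodings, the `rank_grouping` format, the 99 placeholder age, and the 99th-percentile cap are assumptions for illustration:

```python
# Rough sketch of some of the feature engineering steps above.
# Column names follow the list; thresholds and encodings are assumptions.
import pandas as pd

def engineer_features(df):
    df = df.copy()

    # Combine promotion/age with gender into interaction features.
    df['promotion_gender'] = df['promotion'].astype(str) + '_' + df['gender'].astype(str)
    df['age_gender'] = df['age'].astype(str) + '_' + df['gender'].astype(str)

    # Split rank_grouping into type and level of employment,
    # assuming values look like "OFFICER-3".
    parts = df['rank_grouping'].astype(str).str.split('-', n=1, expand=True)
    df['rank_type'] = parts[0]
    df['rank_level'] = parts[1]

    # Cap extreme salary increments that look like noise or outliers.
    cap = df['salary_increment'].quantile(0.99)
    df['salary_increment'] = df['salary_increment'].clip(upper=cap)

    # min_child_age == 0 really means "no child": push it above any real child age.
    df.loc[df['min_child_age'] == 0, 'min_child_age'] = 99

    # Fill remaining nulls: mean for numerical, mode for categorical features.
    for col in df.columns:
        if df[col].dtype.kind in 'if':
            df[col] = df[col].fillna(df[col].mean())
        else:
            df[col] = df[col].fillna(df[col].mode()[0])
    return df
```

The "hard-coded" `NATIONAL SERVICEMEN` rule is more naturally applied at prediction time, by overwriting the predicted probability for that group after the model has scored the test set.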
Parameter tuning
Choosing the correct parameters is essential to building strong models. With regard to tuning XGBoost, there is not much to say beyond following the guide in XGBoost's param_tuning.md notes.
As for my best single model, below were the parameters I used:
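Purely as an illustration of the shape of such a configuration, a hypothetical XGBoost setup for this kind of binary classification task could look like the following. Every number here is a placeholder assumption, not the actual tuned setting:

```python
# Illustrative placeholder configuration only -- not the actual tuned values.
import xgboost as xgb

params = {
    'objective': 'binary:logistic',  # predict a resignation probability
    'eval_metric': 'auc',
    'eta': 0.05,                     # learning rate
    'max_depth': 6,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'min_child_weight': 1,
}

dtrain = xgb.DMatrix(X, label=y)     # X, y: prepared features and labels
model = xgb.train(params, dtrain, num_boost_round=500)
```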
Cross-Validation for local testing
Most data science challenges will have a limit on the total number of submissions you can make. K-fold cross-validation (CV) helps solve that problem by simply splitting the training data into k parts for k iterations of validation. K-fold cross-validation also helps with over-fitting, which seemed to be quite an important issue in this competition.
An illustration of 3-fold cross validation
When your CV setup is strong enough, trust it, and only submit a model when its local score is relatively high. A low local score but a high public leaderboard score could be a sign of over-fitting and should be looked into very carefully.
Lastly, scikit-learn does provide a cross_validation module that can be used quite straightforwardly:
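For example, a 5-fold setup with XGBoost's scikit-learn wrapper could look like the sketch below. Note that in recent scikit-learn releases the same functionality lives in `sklearn.model_selection` rather than the older `cross_validation` module; `X` and `y` are assumed to be the prepared training features and labels:

```python
# Sketch: 5-fold cross-validation for local testing.
import numpy as np
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []

for fold, (train_idx, valid_idx) in enumerate(cv.split(X, y)):
    clf = xgb.XGBClassifier(n_estimators=300, learning_rate=0.05, max_depth=6)
    clf.fit(X.iloc[train_idx], y.iloc[train_idx])
    preds = clf.predict_proba(X.iloc[valid_idx])[:, 1]
    score = roc_auc_score(y.iloc[valid_idx], preds)
    fold_scores.append(score)
    print('Fold %d AUC: %.4f' % (fold + 1, score))

print('Local CV AUC: %.4f (+/- %.4f)' % (np.mean(fold_scores), np.std(fold_scores)))
```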
Blending and stacking
Ensembling might not be used much in practical projects due to its complexity, but it is very important for winning Kaggle-like competitions. Recently and historically, many competition winners have blended hundreds of their models and solutions together to get the final best result.
Blending also helps prevent over-fitting. For more information, you can learn more about this topic in MLWave's Kaggle Ensembling Guide.
Do keep in mind that, in order for blending to work efficiently, or to get the highest score, we first need single models that work extremely well. For this project, I simply chose to blend a list of differently-configured XGBoost models (with either different hyper-parameters or different feature sets) with kaggle_avg.py.
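The idea behind such a script is plain prediction averaging. A minimal sketch of that idea (not the actual kaggle_avg.py; the file and column names are hypothetical) could look like this:

```python
# Sketch of blending by simple averaging of submission files.
# File and column names are hypothetical placeholders.
import pandas as pd

submission_files = ['xgb_model_1.csv', 'xgb_model_2.csv', 'xgb_model_3.csv']

blended = None
for path in submission_files:
    sub = pd.read_csv(path)  # expected columns: id, resignation_probability
    if blended is None:
        blended = sub.copy()
    else:
        blended['resignation_probability'] += sub['resignation_probability']

blended['resignation_probability'] /= len(submission_files)
blended.to_csv('blended_submission.csv', index=False)
```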
Luck
Admittedly, luck is always involved in data science challenges, though there are ways to minimize it, as discussed above. For this competition, I only managed to finish #5 on the public leaderboard, but ended up #1 on the private leaderboard with some very close runner-up scores (only ~0.2% better than the second position).
Lastly:
Below is some great advice from the pro himself, Owen Zhang, talking about these very topics in more depth: the over-fitting problem, Gradient Boosting Machine's parameter tuning, feature engineering, and more: