This is the first post in a data science blog series that I’m writing. My goal for this series is not only to share and teach, but also to keep personal notes while learning and working as a Data Scientist. If you are reading this blog, please feel free to send me any feedback or questions you might have.
Note: The source code and original datasets for this series will also be kept up to date in this GitHub repository of mine.
Preparation
For this project, I’m going to use this SkillCraft data set. (You can also download it, or take a quick look at it, from this ShareCSV URL. Many thanks to Ken Tran and Huy Nguyen for such a neat tool!)
Input: StarCraft 2 dataset (CSV) with 20 different attributes.
Output: A classification / prediction model to determine League Index (more information and context can be found here).
We can see that accuracy drops sharply for Leagues #1 and #7. That is simply because they are the two classes with the fewest members.
With a little over 3000 entries in our dataset, we simply don’t have enough examples of those classes to “learn” them.
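To make the link between class size and per-class accuracy concrete, here is a minimal sketch that counts the members of each class and computes accuracy per class. The labels below are made up for illustration (they are not the SkillCraft data), and `per_class_accuracy` is a hypothetical helper name, not something from the original code.

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred):
    """Return {label: fraction of that label's samples predicted correctly}."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / totals[label] for label in sorted(totals)}


# Toy labels, invented for illustration (NOT the SkillCraft data):
# Leagues 1 and 7 have only two samples each, League 4 has six.
y_true = [1, 1, 4, 4, 4, 4, 4, 4, 7, 7]
y_pred = [4, 1, 4, 4, 4, 4, 4, 3, 4, 7]

class_counts = Counter(y_true)
accuracy = per_class_accuracy(y_true, y_pred)

for label in sorted(class_counts):
    print(f"League {label}: n={class_counts[label]}, accuracy={accuracy[label]:.2f}")
```

With so few samples in a class, a single wrong prediction moves that class's accuracy a lot, which is why it helps to look at the counts next to the scores.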
With the (relatively confident) Random Forest result above, we can already see the benefit of using data science in a simple classification problem.
There are, of course, a few ways to improve our result:
Larger dataset. 3000 entries is simply not enough for this model.
Better dataset. We would love to get a dataset with an even distribution of League members.
For Random Forest, improving the quality and quantity of our feature set, or increasing the number of trees, should give us a better result at some performance cost.
Consider different classifiers (described below).
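The trade-off between number of trees and training cost mentioned in the list above can be explored with a small experiment. This is only a sketch under assumptions: it uses synthetic data from scikit-learn's `make_classification` as a stand-in for the real dataset, and the tree counts (10, 50, 200) are arbitrary values I picked for illustration.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the SkillCraft dataset).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

tree_counts = [10, 50, 200]  # arbitrary values chosen for illustration
scores = {}
for n_trees in tree_counts:
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    start = time.perf_counter()
    forest.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    # score() returns mean accuracy on the held-out test split
    scores[n_trees] = (forest.score(X_test, y_test), elapsed)
    print(f"{n_trees:>4} trees: accuracy={scores[n_trees][0]:.3f}, fit time={elapsed:.3f}s")
```

On real data, the point where extra trees stop paying for their training time is something to measure rather than assume.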
##Take it to another level##
We’ve now used a proper classification technique (Random Forest), but that’s not all there is to data science. Choosing the right technique for the right task is not only an interesting problem, but also a necessary one.
In this part, we are going to compare various classifiers and pick the best one for the task:
Mass-define classifiers:
Train them:
Evaluation:
As we can see, Linear Discriminant Analysis (LDA) clearly won, with a relatively low training time and the best (lowest) root-mean-square deviation (RMSD)!
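The three steps above (define, train, evaluate) can be sketched roughly as follows. This is a minimal illustration under assumptions, not the exact code behind the numbers in this post: the data is synthetic (from `make_classification`), the set of classifiers is my own pick of common scikit-learn estimators, and `rmsd` is a hypothetical helper. The ranking it prints on synthetic data says nothing about the SkillCraft result.

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: NOT the SkillCraft dataset).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Mass-define classifiers (an illustrative selection).
classifiers = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "k-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "LDA": LinearDiscriminantAnalysis(),
}


def rmsd(y_true, y_pred):
    """Root-mean-square deviation between true and predicted class indices."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


# Train them, timing each fit.
results = {}
for name, clf in classifiers.items():
    start = time.perf_counter()
    clf.fit(X_train, y_train)
    train_seconds = time.perf_counter() - start
    results[name] = {
        "train_seconds": train_seconds,
        "rmsd": rmsd(y_test, clf.predict(X_test)),
    }

# Evaluation: lower RMSD is better.
for name, r in sorted(results.items(), key=lambda kv: kv[1]["rmsd"]):
    print(f"{name:<20} RMSD={r['rmsd']:.3f}  train={r['train_seconds']:.3f}s")
```

RMSD treats the class label as an ordered number, which suits League Index because predicting a neighboring league is a smaller error than predicting one several ranks away.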