In [1]:
import os, time, gc
import pandas as pd, numpy as np
from tqdm import tqdm
In [2]:
os.listdir('input')
Out[2]:
In [3]:
%%time
tr = pd.read_csv("input/train_V2.csv")
te = pd.read_csv("input/test_V2.csv")
In [4]:
tr.head()
Out[4]:
In [5]:
tr.columns
Out[5]:
In [6]:
tr.describe()
Out[6]:
In [7]:
tr.info()
In [8]:
tr.dtypes.value_counts()
Out[8]:
In [9]:
tr.select_dtypes(include=['float']).head()
Out[9]:
In [10]:
tr.select_dtypes(include=['object']).head()
Out[10]:
Pure Feature Model¶
For machine learning problems, I generally start by running a model such as lgb, xgb, or catboost on the raw features.
Before running the machine, let's do some EDA.
Plotting¶
The describe() method shows the basic statistics, but raw numbers alone are not very intuitive.
Priority (features most correlated with the target, i.e. winPlacePerc)¶
- damageDealt
- killPlace
- killPoints
- kills
- killStreaks
- matchDuration: duration of the match in seconds (is this per match or per user?)
- matchType: different game types can have different target distributions. -> I could build a separate model per matchType; I assume this is a reasonable way to get more accurate predictions.
- maxPlace: for solo matches, the number of players in the match (max 100); for duo, the possible max is 50, i.e. the total number of groups. But as the description says, it does not always match numGroups.
- rideDistance
- roadKills
- swimDistance
- teamKills
- vehicleDestroys
- walkDistance
- weaponsAcquired: the number of weapons picked up
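Before trusting this eyeballed priority list, a quick numeric check helps: rank the features by absolute Pearson correlation with winPlacePerc. A minimal sketch below uses a small synthetic stand-in for `tr`, so the numbers are illustrative only; on the real data you would pass `tr.select_dtypes(exclude=['object'])` instead.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `tr`: walkDistance drives the target, kills adds a bit.
rng = np.random.default_rng(0)
n = 1000
walk = rng.uniform(0, 5000, n)
kills = rng.poisson(1, n).astype(float)
df = pd.DataFrame({
    'walkDistance': walk,
    'kills': kills,
    'winPlacePerc': (walk / 5000 + 0.1 * kills + rng.normal(0, 0.2, n)).clip(0, 1),
})

# Rank features by |Pearson correlation| with the target.
corr = df.corr()['winPlacePerc'].drop('winPlacePerc')
ranked = corr.abs().sort_values(ascending=False)
print(ranked)
```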
In [11]:
import matplotlib.pyplot as plt
import seaborn as sns
Plot a few features just to check linearity¶
In [12]:
%%time
sns.jointplot(x="winPlacePerc", y="damageDealt", data=tr, ratio=3, color="r")
plt.suptitle("damageDealt linearity")
plt.show()
In [13]:
%%time
sns.jointplot(x="winPlacePerc", y="kills", data=tr, ratio=3, color="r")
plt.suptitle("kills linearity")
plt.show()
As expected, the distribution is right-skewed.
Plot all the numeric features to see their correlation with the target¶
In [14]:
i = 0
features = tr.select_dtypes(exclude=['object']).columns[:-1]
# sns.set_style('whitegrid')
print("num of numeric features: {}".format(len(features)))
In [15]:
%%time
fig = plt.figure(figsize=(20, 16))  # 6x4 grid of subplots
for i, feat in enumerate(features):
    plt.subplot(6, 4, i + 1)
    plt.scatter(tr['winPlacePerc'], tr[feat], marker='+')
    plt.title(feat, fontsize=9)
plt.tight_layout()
plt.show()
From the plots, we can notice the facts below.¶
1. Checking the outliers¶
headshotKills, damageDealt, kills, killStreaks, longestKill, rankPoints, revives, roadKills, swimDistance, teamKills, walkDistance, weaponsAcquired
2. Some linearities¶
> boosts, damageDealt, DBNOs, headshotKills, heals, killPlace, kills, killStreaks(?), longestKill, rideDistance, roadKills, swimDistance, vehicleDestroys, walkDistance¶
Plot all the numeric features by matchType¶
In [16]:
tr['matchType'].value_counts()
Out[16]:
In [17]:
print("number of matchType: {}".format(tr['matchType'].nunique()))
print("number of features: {}".format(len(features)))
In [18]:
mt_ls = tr['matchType'].unique()
In [19]:
%%time
for mt in mt_ls:
    print("="*30 + " {} ".format(mt).upper() + "="*30 + "\n")
    sub = tr[tr['matchType'] == mt]
    fig = plt.figure(figsize=(20, 8))  # 6x4 grid of subplots
    for i, feat in enumerate(features):
        plt.subplot(6, 4, i + 1)
        plt.scatter(sub['winPlacePerc'], sub[feat], marker='+', c='black')
        plt.title(feat, fontsize=9)
    plt.tight_layout()
    plt.show()
In [20]:
mt_ls.sort()
mt_ls
Out[20]:
Compare matchType¶
In [21]:
%%time
for feat in features:
    print("="*60 + " {} ".format(feat).upper() + "="*60 + "\n")
    fig = plt.figure(figsize=(20, 8))  # 4x4 grid of subplots
    for i, mt in enumerate(mt_ls):
        plt.subplot(4, 4, i + 1)
        sub = tr[tr['matchType'] == mt]
        plt.scatter(sub['winPlacePerc'], sub[feat], marker='+', c='black')
        plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()
    plt.show()
As the plots above show, the distributions differ by matchType, and each feature's linearity with the target also differs by matchType. We can exploit this characteristic to engineer effective features.
Or we could build a model per matchType. --> the data can be too small for some matchType models
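The per-matchType idea can be sketched as below. This is a toy illustration on synthetic data, with np.polyfit standing in for lgb; only the group-then-fit-then-route-predictions pattern carries over to the real pipeline.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: three match types, target linear in walkDistance.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    'matchType': rng.choice(['solo', 'duo', 'squad'], 300),
    'walkDistance': rng.uniform(0, 5000, 300),
})
df['winPlacePerc'] = (df['walkDistance'] / 5000).clip(0, 1)

# Fit one simple model per matchType (degree-1 polynomial as a placeholder).
models = {}
for mt, grp in df.groupby('matchType'):
    models[mt] = np.polyfit(grp['walkDistance'], grp['winPlacePerc'], 1)

# Route each row to its own matchType's model at prediction time.
preds = df.apply(
    lambda r: np.polyval(models[r['matchType']], r['walkDistance']), axis=1)
print(len(models), float(np.abs(preds - df['winPlacePerc']).mean()))
```

The same routing works with any estimator: keep a dict keyed by matchType and predict with the matching model.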
Outliers¶
Tabular datasets often contain outliers.
Since this dataset comes from an online battle royale game, some records come from "bad users".
In my experience, these are cheaters and abusers. The former play to win, or just for fun, using unauthorized programs; the latter play just for fun, trolling their teammates no matter how the game goes.
- Having culled the columns that might contain outliers from the plots, I should delve into the columns below:
headshotKills, damageDealt, kills, killStreaks, longestKill, rankPoints, revives, roadKills, swimDistance, teamKills, walkDistance, weaponsAcquired
- I suppose only the teamKills feature relates to the abusers.
Handling Outliers¶
To deal with outliers, there are several approaches.
1. Simply eliminate them before training the model. This is the easiest way to handle outliers. However, if removing them shrinks the dataset too much, that becomes a problem. Also, the test set may contain outliers of its own, so we should consider this carefully.
2. Keep them in the dataset and predict them instead. Another approach I have tried is to predict the outliers themselves; you can think of this as detecting cheaters and abusers. In this case we build a binary classification model whose target is "outlier". How do we know whether a row is an outlier? That labeling is up to a human.
Here we end up with models of two kinds: regression and classification. For the regression model, we exclude the outliers, then train and predict. For the classification model, we label outliers as a binary target and train on all the data; with a probability threshold somewhere around 90-100%, any row whose predicted probability exceeds the threshold is designated an outlier.
3. Just train on everything, if the dataset is too small, or if eliminating outliers drops the model's performance below the baseline.
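Option 1 (and the labeling needed for option 2) can be sketched with a simple mean + 3*std cutoff on a single column. Synthetic kill counts stand in for the real data here, and the cutoff rule is my own assumption, not the competition's definition of a cheater.

```python
import numpy as np
import pandas as pd

# Synthetic kill counts: mostly Poisson(1), plus a few cheater-like rows.
rng = np.random.default_rng(2)
kills = rng.poisson(1, 1000)
kills[:5] = 60                      # inject implausible kill counts
s = pd.Series(kills, name='kills')

# Flag anything beyond mean + 3*std as an outlier (assumed rule).
cutoff = s.mean() + 3 * s.std()
outliers = s > cutoff
print(int(outliers.sum()), 'rows flagged as outliers')

# Option 1: drop them.  Option 2: keep `outliers` as the binary label.
clean = s[~outliers]
```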