
[PUBG] Detecting Outliers

by YGSEO 2020. 6. 2.

Detecting Outliers


1. Head Shot
For most players, a high headshot rate is almost impossible. Even professional online FPS players rarely exceed 30%.
headshot_rate = headshotKills / kills
2. Damagedealt
3. kills
4. killstreaks
5. longestkill
As far as I know, a 1 km kill is very hard to achieve.
6. rankpoints(elo-like ranking)
7. revives
8. roadkills
9. swimdistance
10. teamkills
For detecting abusers.
11. walkdistance
12. weaponsacquired

In [1]:
import os, time, gc
import pandas as pd, numpy as np
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
os.listdir('input')
Out[2]:
['sample_submission_V2.csv', 'test_V2.csv', 'train_V2.csv']
In [3]:
%%time
tr = pd.read_csv("input/train_V2.csv")
te = pd.read_csv("input/test_V2.csv")
Wall time: 12.7 s
In [4]:
def missing_values_table(df):
    """Summarise missing values per column."""
    mis_val = df.isnull().sum()  # total missing values per column
    mis_val_pct = 100 * mis_val / len(df)  # percentage of missing values
    mis_val_df = pd.concat([mis_val, mis_val_pct], axis=1)  # combine both into one table
    mis_val_df_cols = mis_val_df.rename(columns={0: 'Missing Values', 1: '% of Total Values'})  # rename the columns
    # Keep only columns with missing values, sorted by percentage (descending)
    mis_val_df_cols = mis_val_df_cols[mis_val_df_cols.iloc[:, 1] != 0].sort_values('% of Total Values', ascending=False).round(1)
    # Print some summary information
    print("Dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_df_cols.shape[0]) + " cols having missing values.")
    return mis_val_df_cols  # dataframe with the missing-value information
In [5]:
missing_values_table(tr)
Dataframe has 29 columns.
There are 1 cols having missing values.
Out[5]:
Missing Values % of Total Values
winPlacePerc 1 0.0
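Only one training row lacks the target. A minimal sketch of dropping such a row before plotting, shown on a toy frame rather than the real `tr` (the `demo` frame and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training frame; the values are illustrative only.
demo = pd.DataFrame({
    "kills": [0, 2, 5],
    "winPlacePerc": [0.10, np.nan, 0.95],
})

# Keep only rows where the target is present.
demo_clean = demo.dropna(subset=["winPlacePerc"]).reset_index(drop=True)
print(len(demo), "->", len(demo_clean))
```

On the real data the equivalent call would be `tr = tr.dropna(subset=['winPlacePerc'])`.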
In [6]:
missing_values_table(te)
Dataframe has 28 columns.
There are 0 cols having missing values.
Out[6]:
Missing Values % of Total Values

1. Headshot

In [7]:
%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="headshotKills", data=tr, ratio=3, color="darkolivegreen")
plt.title("headshotKills and Target")
plt.show()
<Figure size 360x360 with 0 Axes>
Wall time: 26.4 s
In [8]:
mt_ls = tr['matchType'].unique()
mt_ls.sort()
mt_ls
Out[8]:
array(['crashfpp', 'crashtpp', 'duo', 'duo-fpp', 'flarefpp', 'flaretpp',
       'normal-duo', 'normal-duo-fpp', 'normal-solo', 'normal-solo-fpp',
       'normal-squad', 'normal-squad-fpp', 'solo', 'solo-fpp', 'squad',
       'squad-fpp'], dtype=object)
In [9]:
%%time

print("="*60+" {} ".format('headshotKills').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['headshotKills'], marker='+', c='red')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();
============================================================ HEADSHOTKILLS ============================================================

<Figure size 432x288 with 0 Axes>
Wall time: 24.6 s
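The same 4x4 grid of per-matchType scatter plots is repeated for every feature in this post. A possible helper that draws it in one call could look like the sketch below; the function name `scatter_by_match_type` and its parameters are my own, and the `demo` frame is toy data, not the real `tr`.

```python
import matplotlib.pyplot as plt
import pandas as pd


def scatter_by_match_type(df, feature, target="winPlacePerc", color="red", ncols=4):
    """Draw one target-vs-feature scatter panel per matchType value."""
    match_types = sorted(df["matchType"].unique())
    nrows = -(-len(match_types) // ncols)  # ceiling division
    fig, axes = plt.subplots(nrows, ncols, figsize=(20, 2 * nrows), squeeze=False)
    for ax, mt in zip(axes.ravel(), match_types):
        subset = df[df["matchType"] == mt]  # filter once per panel
        ax.scatter(subset[target], subset[feature], marker="+", c=color)
        ax.set_title(mt.upper(), fontsize=9)
    # Hide panels left over when the grid has more cells than match types.
    for ax in axes.ravel()[len(match_types):]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig, axes


# Toy data for illustration only.
demo = pd.DataFrame({
    "matchType": ["solo", "duo", "solo"],
    "winPlacePerc": [0.1, 0.5, 0.9],
    "kills": [0, 1, 2],
})
fig, axes = scatter_by_match_type(demo, "kills")
```

With the real data, `scatter_by_match_type(tr, 'headshotKills')` followed by `plt.show()` should give a grid comparable to the one above. Filtering each match type once per panel also avoids building the same boolean mask twice.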

Head Shot Rate

In [10]:
tr['HSR'] = tr['headshotKills']/tr['kills']
In [11]:
tr['HSR'].describe()
Out[11]:
count    1.917244e+06
mean     2.391822e-01
std      3.532459e-01
min      0.000000e+00
25%      0.000000e+00
50%      0.000000e+00
75%      5.000000e-01
max      1.000000e+00
Name: HSR, dtype: float64
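The `count` above covers only rows with a defined rate. A player with zero kills also has zero headshot kills, so the division is 0/0 and pandas stores NaN, which `describe()` skips. The sketch below shows this on toy data; filling those rows with 0 is an assumption of mine, not something the original cell does.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
demo = pd.DataFrame({"headshotKills": [0, 1, 3, 0], "kills": [0, 1, 10, 4]})

# Plain division: 0 / 0 becomes NaN, so zero-kill players drop out of describe().
demo["HSR"] = demo["headshotKills"] / demo["kills"]

# Optional alternative (an assumption): treat zero-kill players as having rate 0.
demo["HSR_filled"] = demo["HSR"].fillna(0.0)

print(demo)
```

Keeping the NaN values means zero-kill players are ignored by the summary statistics, while filling with 0 pulls the mean down. Which one is appropriate depends on how the feature is used later.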
In [12]:
%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="HSR", data=tr, ratio=3, color="r")
plt.title("HSR and Target")
plt.show()
<Figure size 360x360 with 0 Axes>
Wall time: 11.5 s
In [13]:
%%time

print("="*60+" {} ".format('HSR').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['HSR'], marker='+', c='red')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();
============================================================ HSR ============================================================

<Figure size 432x288 with 0 Axes>
Wall time: 22.5 s
As you can see, many players have a headshot rate of 0.5 or higher, which is implausibly high.
Let's clean the data and see how the plots change.

In [14]:
tr['HSR_clean'] = np.where(tr['HSR']>=0.5, 0, tr['HSR'])
In [15]:
tr['HSR_clean'].describe()
Out[15]:
count    1.917244e+06
mean     3.679607e-02
std      9.939526e-02
min      0.000000e+00
25%      0.000000e+00
50%      0.000000e+00
75%      0.000000e+00
max      4.864865e-01
Name: HSR_clean, dtype: float64
Headshot rates near 0.5 are still quite high.
I should also see what happens if I lower the threshold to 0.3.
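One way to compare candidate thresholds before committing to one is to count how many rows each would flag. This is only a sketch: `flagged_share` is a hypothetical helper and `demo_rate` is toy data standing in for `tr['HSR']`.

```python
import numpy as np
import pandas as pd


def flagged_share(rate, thresholds):
    """Count and share of rows with rate >= each threshold (NaN rows never match)."""
    rows = []
    for t in thresholds:
        n = int((rate >= t).sum())
        rows.append({"threshold": t, "flagged": n, "share": n / len(rate)})
    return pd.DataFrame(rows).set_index("threshold")


# Toy rates for illustration; on the real data this would be tr['HSR'].
demo_rate = pd.Series([np.nan, 0.0, 0.2, 0.35, 0.5, 1.0])
summary = flagged_share(demo_rate, [0.5, 0.4, 0.3])
print(summary)
```

The share is taken over all rows, including those with an undefined rate, which matches how the outlier ratio is computed against `tr.shape[0]` later in this post.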
In [16]:
%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="HSR_clean", data=tr, ratio=3, color="r")
plt.ylim((0,1))
plt.title("HSR and Target")
plt.show()
<Figure size 360x360 with 0 Axes>
Wall time: 10.3 s
In [17]:
%%time
print("="*60+" {} ".format('HSR_clean').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['HSR_clean'], marker='+', c='red')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();
============================================================ HSR_CLEAN ============================================================

<Figure size 432x288 with 0 Axes>
Wall time: 22.4 s
The upper lines disappeared, and the plots look quite normal.
I could clean the data further, but I will stop here.
How many outliers are there in the headshot rate?
In [18]:
tr['HSR_outnum'] = np.where(tr['HSR']>=0.5, 1, 0)
print("# of outliers in HSR : {} & {:0.4f}".format(tr['HSR_outnum'].sum(),tr['HSR_outnum'].sum()/tr.shape[0]))
# of outliers in HSR : 503601 & 0.1132
This doesn't look too large, but we have to check the loss by matchType.
In [19]:
loss_pc = []
loss = []
for t in tr['matchType'].unique():
#     print("{} has loss of {:0.4f}".format(t.upper(), tr[(tr['matchType']==t) & (tr['HSR_outnum']==1)].shape[0]/tr.shape[0]))
    loss_pc.append(tr[(tr['matchType']==t) & (tr['HSR_outnum']==1)].shape[0]/tr.shape[0])
    loss.append(tr[(tr['matchType']==t) & (tr['HSR_outnum']==1)].shape[0])

df = pd.DataFrame(data=loss_pc)
df.index = tr['matchType'].unique().tolist()
df.columns = ['loss_perc']
df['loss'] = loss
df.sort_values(by='loss_perc', inplace=True, ascending=False)
df
Out[19]:
loss_perc loss
squad-fpp 0.044944 199865
duo-fpp 0.025520 113488
solo-fpp 0.015500 68929
squad 0.014635 65083
duo 0.007304 32479
solo 0.004513 20069
normal-squad-fpp 0.000461 2049
normal-duo-fpp 0.000129 574
crashfpp 0.000072 318
normal-solo-fpp 0.000056 250
flaretpp 0.000051 228
normal-squad 0.000022 97
flarefpp 0.000018 80
normal-solo 0.000012 52
normal-duo 0.000005 23
crashtpp 0.000004 17
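This table shows that squad-fpp has the most suspected cheaters (199,865 flagged rows). Note that loss_perc is computed against all rows of tr, so it also reflects how many rows each match type has.
The losses are smaller than I expected, so I think the outliers can safely be removed.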
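The loop above can also be written with `groupby`. The sketch below keeps the same `loss` and `loss_perc` definitions and adds a third column, `loss_within_type`, which is my own addition: the share of flagged rows within each match type. The helper name and the toy `demo` frame are illustrative only.

```python
import pandas as pd


def loss_by_match_type(df, flag_col):
    """Flagged rows per matchType, as a share of all rows and of that match type."""
    grouped = df.groupby("matchType")[flag_col]
    out = pd.DataFrame({
        "loss": grouped.sum(),
        "loss_perc": grouped.sum() / len(df),   # same definition as the loop above
        "loss_within_type": grouped.mean(),     # share of that match type's own rows
    })
    return out.sort_values("loss_perc", ascending=False)


# Toy data for illustration only.
demo = pd.DataFrame({
    "matchType": ["solo", "solo", "squad", "squad", "squad", "duo"],
    "HSR_outnum": [1, 0, 1, 1, 0, 0],
})
report = loss_by_match_type(demo, "HSR_outnum")
print(report)
```

On the real data this would be called as `loss_by_match_type(tr, 'HSR_outnum')`. The within-type share helps separate "this match type has many rows" from "this match type has an unusually high outlier rate".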

2. Damage Dealt

In [20]:
%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="damageDealt", data=tr, ratio=3, color="darkturquoise")
plt.title("damageDealt and Target")
plt.show()
<Figure size 360x360 with 0 Axes>
Wall time: 26.6 s
In [21]:
%%time

print("="*60+" {} ".format('damageDealt').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['damageDealt'], marker='+', c='darkturquoise')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();
============================================================ DAMAGEDEALT ============================================================

<Figure size 432x288 with 0 Axes>
Wall time: 25 s
The same feature as box plots:
In [22]:
%%time

print("="*60+" {} ".format('damageDealt').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.boxplot(tr[tr['matchType']==mt]['damageDealt'], vert=False)
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();
============================================================ DAMAGEDEALT ============================================================

<Figure size 432x288 with 0 Axes>
Wall time: 8.39 s
Unlike the headshot rate, there is no separate upper line here. But we have to look closely at the axis scale of each plot. Taking the solo matches as the baseline, the maximum damageDealt is around 2k, while some match types reach over 4k or even 6k.

I can't be sure that the 4k and 6k values are outliers, but treating values of 4k and above as outliers seems like a reasonable choice, so let's see the result.
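Reading maxima off sixteen small plots is error-prone, so a numeric summary per match type may help when choosing a cutoff. The sketch below is one possible way to do it; the helper name `upper_tail_summary` and the chosen quantiles are assumptions, and `demo` is toy data.

```python
import pandas as pd


def upper_tail_summary(df, feature, quantiles=(0.99, 0.999)):
    """Per-matchType maximum and upper quantiles of one feature."""
    grouped = df.groupby("matchType")[feature]
    summary = pd.DataFrame({"max": grouped.max()})
    for q in quantiles:
        summary[f"q{q}"] = grouped.quantile(q)
    return summary.sort_values("max", ascending=False)


# Toy data for illustration only.
demo = pd.DataFrame({
    "matchType": ["solo"] * 3 + ["normal-solo"] * 3,
    "damageDealt": [0.0, 100.0, 2000.0, 0.0, 500.0, 6000.0],
})
summary = upper_tail_summary(demo, "damageDealt")
print(summary)
```

A large gap between the upper quantiles and the maximum for a match type suggests a handful of extreme rows rather than a generally higher damage level.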
In [23]:
tr['DD_clean'] = np.where(tr['damageDealt']>=4000, 0, tr['damageDealt'])
In [24]:
tr['DD_clean'].describe()
Out[24]:
count    4.446966e+06
mean     1.306822e+02
std      1.702982e+02
min      0.000000e+00
25%      0.000000e+00
50%      8.424000e+01
75%      1.860000e+02
max      3.987000e+03
Name: DD_clean, dtype: float64
In [25]:
%%time

print("="*60+" {} ".format('DD_clean').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['DD_clean'], marker='+', c='darkturquoise')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();
============================================================ DD_CLEAN ============================================================

<Figure size 432x288 with 0 Axes>
Wall time: 25.4 s
That looks better.
Check the loss:
In [26]:
tr['DD_outnum'] = np.where(tr['damageDealt']>=4000, 1, 0)
print("# of outliers in DD : {} & {:0.4f}".format(tr['DD_outnum'].sum(),tr['DD_outnum'].sum()/tr.shape[0]))
# of outliers in DD : 32 & 0.0000
In [27]:
loss_pc = []
loss = []
for t in tr['matchType'].unique():
#     print("{} has loss of {:0.4f}".format(t.upper(), tr[(tr['matchType']==t) & (tr['HSR_outnum']==1)].shape[0]/tr.shape[0]))
    loss_pc.append(tr[(tr['matchType']==t) & (tr['DD_outnum']==1)].shape[0]/tr.shape[0])
    loss.append(tr[(tr['matchType']==t) & (tr['DD_outnum']==1)].shape[0])

df = pd.DataFrame(data=loss_pc)
df.index = tr['matchType'].unique().tolist()
df.columns = ['loss_perc']
df['loss'] = loss
df.sort_values(by='loss_perc', inplace=True, ascending=False)
df
Out[27]:
loss_perc loss
normal-solo-fpp 3.373086e-06 15
normal-squad-fpp 2.473597e-06 11
normal-duo-fpp 1.124362e-06 5
normal-squad 2.248724e-07 1
squad-fpp 0.000000e+00 0
duo 0.000000e+00 0
solo-fpp 0.000000e+00 0
squad 0.000000e+00 0
duo-fpp 0.000000e+00 0
solo 0.000000e+00 0
crashfpp 0.000000e+00 0
flaretpp 0.000000e+00 0
flarefpp 0.000000e+00 0
normal-duo 0.000000e+00 0
crashtpp 0.000000e+00 0
normal-solo 0.000000e+00 0
There were only a few outliers: 32 rows, all in the normal-* match types.

3. kills

In [28]:
%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="kills", data=tr, ratio=3, color="blue")
plt.title("kills and Target")
plt.show()
<Figure size 360x360 with 0 Axes>
Wall time: 25.9 s
In [29]:
%%time

print("="*60+" {} ".format('kills').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['kills'], marker='+', c='blue')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();
============================================================ KILLS ============================================================

<Figure size 432x288 with 0 Axes>
Wall time: 25 s
This is surprising: I expected many outliers, but there don't seem to be many.
A squad match can reach 60 kills, but a solo match?
I have seen some players get around 40 kills in a solo match. But 60?
I can't confidently flag any outliers in kills.
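When a plot alone cannot settle the question, looking at the most extreme rows directly can help. A minimal sketch, assuming a hypothetical helper `top_rows` and toy data in `demo`:

```python
import pandas as pd


def top_rows(df, feature, match_type_pattern, n=5):
    """Rows with the n largest feature values among match types containing the pattern."""
    mask = df["matchType"].str.contains(match_type_pattern, regex=False)
    return df[mask].nlargest(n, feature)


# Toy data for illustration only.
demo = pd.DataFrame({
    "matchType": ["solo", "solo-fpp", "squad", "normal-solo"],
    "kills": [3, 40, 60, 12],
})
top_solo = top_rows(demo, "kills", "solo", n=2)
print(top_solo)
```

On the real data, `top_rows(tr, 'kills', 'solo')` would list the highest-kill rows from every solo variant, so their other columns (damage, distance, match duration) can be checked for consistency.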

4. killstreaks

In [30]:
%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="killStreaks", data=tr, ratio=3, color="tomato")
plt.title("killStreaks and Target")
plt.show()
<Figure size 360x360 with 0 Axes>
Wall time: 26.7 s
In [31]:
%%time

print("="*60+" {} ".format('killStreaks').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['killStreaks'], marker='+', c='tomato')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();
============================================================ KILLSTREAKS ============================================================

<Figure size 432x288 with 0 Axes>
Wall time: 25.3 s
One plot stands out: NORMAL-SQUAD-FPP. Most plots show maximum values of around 10. High streaks are plausible in solo matches, but SQUAD-FPP's maximum does not exceed 8, unlike NORMAL-SQUAD-FPP's.
Therefore, I decided that, for the NORMAL-SQUAD-FPP matchType only, killStreaks > 10 is considered an outlier.
In [32]:
tr['KS_clean'] = np.where(tr['killStreaks']>10, 0, tr['killStreaks'])
tr['KS_outnum'] = np.where(tr['killStreaks']>10, 1, 0)
print("# of outliers in KS : {} & {:0.4f}".format(tr['KS_outnum'].sum(),tr['KS_outnum'].sum()/tr.shape[0]))
# of outliers in KS : 23 & 0.0000
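The cell above applies the `killStreaks > 10` rule to every match type. If the rule should apply to NORMAL-SQUAD-FPP only, the mask needs a second condition. A sketch on toy data (the `demo` frame is illustrative, not the real `tr`):

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
demo = pd.DataFrame({
    "matchType": ["solo", "normal-squad-fpp", "normal-squad-fpp", "squad-fpp"],
    "killStreaks": [12, 15, 4, 7],
})

# Flag only rows that are in the chosen match type AND above the threshold.
is_outlier = (demo["matchType"] == "normal-squad-fpp") & (demo["killStreaks"] > 10)
demo["KS_outnum"] = is_outlier.astype(int)
demo["KS_clean"] = np.where(is_outlier, 0, demo["killStreaks"])
print(demo)
```

With this mask, the solo row with a streak of 12 is left untouched, while the same value in NORMAL-SQUAD-FPP would be flagged.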
In [33]:
loss_pc = []
loss = []
for t in tr['matchType'].unique():
#     print("{} has loss of {:0.4f}".format(t.upper(), tr[(tr['matchType']==t) & (tr['HSR_outnum']==1)].shape[0]/tr.shape[0]))
    loss_pc.append(tr[(tr['matchType']==t) & (tr['KS_outnum']==1)].shape[0]/tr.shape[0])
    loss.append(tr[(tr['matchType']==t) & (tr['KS_outnum']==1)].shape[0])

df = pd.DataFrame(data=loss_pc)
df.index = tr['matchType'].unique().tolist()
df.columns = ['loss_perc']
df['loss'] = loss
df.sort_values(by='loss_perc', inplace=True, ascending=False)
df
Out[33]:
loss_perc loss
solo 4.047704e-06 18
normal-squad-fpp 8.994897e-07 4
normal-solo-fpp 2.248724e-07 1
squad-fpp 0.000000e+00 0
duo 0.000000e+00 0
solo-fpp 0.000000e+00 0
squad 0.000000e+00 0
duo-fpp 0.000000e+00 0
crashfpp 0.000000e+00 0
flaretpp 0.000000e+00 0
flarefpp 0.000000e+00 0
normal-duo-fpp 0.000000e+00 0
normal-duo 0.000000e+00 0
normal-squad 0.000000e+00 0
crashtpp 0.000000e+00 0
normal-solo 0.000000e+00 0
Apart from the SOLO matches (18 rows), there are 5 outliers.

5. longestkill

In [34]:
%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="longestKill", data=tr, ratio=3, color="indigo")
plt.title("longestKill and Target")
plt.show()
<Figure size 360x360 with 0 Axes>
Wall time: 26.4 s
In [35]:
%%time

print("="*60+" {} ".format('longestKill').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['longestKill'], marker='+', c='indigo')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();
============================================================ LONGESTKILL ============================================================

<Figure size 432x288 with 0 Axes>
Wall time: 25.2 s
A 1 km kill is possible.
Thus, I can't say that these are outliers.
I need more evidence to cull them.

6. rankpoints

In [36]:
%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="rankPoints", data=tr, ratio=3, color="rosybrown")
plt.title("rankPoints and Target")
plt.show()
<Figure size 360x360 with 0 Axes>
Wall time: 26.9 s
In [37]:
%%time

print("="*60+" {} ".format('rankPoints').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['rankPoints'], marker='+', c='rosybrown')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();
============================================================ RANKPOINTS ============================================================

<Figure size 432x288 with 0 Axes>
Wall time: 25.4 s
Hard to tell

7. revives

In [38]:
%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="revives", data=tr, ratio=3, color="dimgrey")
plt.title("revives and Target")
plt.show()
<Figure size 360x360 with 0 Axes>
Wall time: 26.8 s
In [39]:
%%time

print("="*60+" {} ".format('revives').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['revives'], marker='+', c='dimgrey')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();
============================================================ REVIVES ============================================================

<Figure size 432x288 with 0 Axes>
Wall time: 25.4 s
Hard to tell

8. roadkills

In [40]:
%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="roadKills", data=tr, ratio=3, color="navy")
plt.title("roadKills and Target")
plt.show()
<Figure size 360x360 with 0 Axes>
Wall time: 25.7 s
In [41]:
%%time

print("="*60+" {} ".format('roadKills').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['roadKills'], marker='+', c='navy')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();
============================================================ ROADKILLS ============================================================

<Figure size 432x288 with 0 Axes>
Wall time: 25.4 s
Let's compute a road-kill rate.
In [42]:
tr['RK_rate'] = tr['roadKills']/tr['kills']
In [43]:
%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="RK_rate", data=tr, ratio=3, color="navy")
plt.title("RK_rate and Target")
plt.show()
<Figure size 360x360 with 0 Axes>
Wall time: 11.4 s
In [44]:
%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="kills", y="RK_rate", data=tr, ratio=3, color="navy")
plt.title("RK_rate and kills")
plt.show()
<Figure size 360x360 with 0 Axes>
Wall time: 11.3 s
Hard to tell

9. teamKills

In [45]:
%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="teamKills", data=tr, ratio=3, color="violet")
plt.title("teamKills and Target")
plt.show()
<Figure size 360x360 with 0 Axes>
Wall time: 26.9 s
In [46]:
%%time

print("="*60+" {} ".format('teamKills').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['teamKills'], marker='+', c='violet')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();
============================================================ TEAMKILLS ============================================================

<Figure size 432x288 with 0 Axes>
Wall time: 25.7 s
As you can see, solo matches also have team kills, which suggests that self-kills count as team kills. Squad matches can therefore have multiple team kills.
But how can I tell whether these team kills happened accidentally or intentionally?

10. swimdistance

In [47]:
%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="swimDistance", data=tr, ratio=3, color="steelblue")
plt.title("swimDistance and Target")
plt.show()
<Figure size 360x360 with 0 Axes>
Wall time: 26.8 s
In [48]:
%%time

print("="*60+" {} ".format('swimDistance').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['swimDistance'], marker='+', c='steelblue')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();
============================================================ SWIMDISTANCE ============================================================

<Figure size 432x288 with 0 Axes>
Wall time: 25.4 s
Hard to tell

11. walkdistance

In [49]:
%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="walkDistance", data=tr, ratio=3, color="darkcyan")
plt.title("walkDistance and Target")
plt.show()
<Figure size 360x360 with 0 Axes>
Wall time: 26.8 s
In [50]:
%%time

print("="*60+" {} ".format('walkDistance').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['walkDistance'], marker='+', c='darkcyan')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();
============================================================ WALKDISTANCE ============================================================

<Figure size 432x288 with 0 Axes>
Wall time: 25.6 s
Take the Erangel map as an example.
In the left image, the yellow diagonal line is about 7.5 km long.
Erangel is one of the biggest maps in PUBG.
In the right image, each yellow grid cell is 1 km wide and 1 km high.

Therefore, a player who walked more than 7.5 km (walkDistance > 7500) is a possible outlier.
In [51]:
tr['WD_clean'] = np.where(tr['walkDistance']>7500, 0, tr['walkDistance'])
tr['WD_outnum'] = np.where(tr['walkDistance']>7500, 1, 0)
print("# of outliers in walkDistance : {} & {:0.4f}".format(tr['WD_outnum'].sum(),tr['WD_outnum'].sum()/tr.shape[0]))
# of outliers in walkDistance : 1579 & 0.0004
In [52]:
%%time

print("="*60+" {} ".format('walkDistance_CLEAN').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['WD_clean'], marker='+', c='darkcyan')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();
============================================================ WALKDISTANCE_CLEAN ============================================================

<Figure size 432x288 with 0 Axes>
Wall time: 25.9 s
In [53]:
loss_pc = []
loss = []
for t in tr['matchType'].unique():
#     print("{} has loss of {:0.4f}".format(t.upper(), tr[(tr['matchType']==t) & (tr['HSR_outnum']==1)].shape[0]/tr.shape[0]))
    loss_pc.append(tr[(tr['matchType']==t) & (tr['WD_outnum']==1)].shape[0]/tr.shape[0])
    loss.append(tr[(tr['matchType']==t) & (tr['WD_outnum']==1)].shape[0])

df = pd.DataFrame(data=loss_pc)
df.index = tr['matchType'].unique().tolist()
df.columns = ['loss_perc']
df['loss'] = loss
df.sort_values(by='loss_perc', inplace=True, ascending=False)
df
Out[53]:
loss_perc loss
squad-fpp 1.585351e-04 705
duo-fpp 7.353328e-05 327
solo-fpp 4.227601e-05 188
squad 4.002729e-05 178
duo 2.203750e-05 98
solo 1.259286e-05 56
normal-duo-fpp 2.698469e-06 12
normal-squad-fpp 2.023852e-06 9
flaretpp 1.124362e-06 5
flarefpp 2.248724e-07 1
crashfpp 0.000000e+00 0
normal-solo-fpp 0.000000e+00 0
normal-duo 0.000000e+00 0
normal-squad 0.000000e+00 0
crashtpp 0.000000e+00 0
normal-solo 0.000000e+00 0
After removing the outliers, the plots are cut off at the top.

12. ridedistance

In [54]:
%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="rideDistance", data=tr, ratio=3, color="palevioletred")
plt.title("rideDistance and Target")
plt.show()
<Figure size 360x360 with 0 Axes>
Wall time: 27.1 s
In [55]:
%%time

print("="*60+" {} ".format('rideDistance').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['rideDistance'], marker='+', c='palevioletred')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();
============================================================ RIDEDISTANCE ============================================================

<Figure size 432x288 with 0 Axes>
Wall time: 25.7 s
Hard to tell

13. weaponsacquired

In [56]:
%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="weaponsAcquired", data=tr, ratio=3, color="mediumpurple")
plt.title("weaponsAcquired and Target")
plt.show()
<Figure size 360x360 with 0 Axes>
Wall time: 27 s
In [57]:
%%time

print("="*60+" {} ".format('weaponsAcquired').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['weaponsAcquired'], marker='+', c='mediumpurple')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();
============================================================ WEAPONSACQUIRED ============================================================

<Figure size 432x288 with 0 Axes>
Wall time: 25.5 s
In [58]:
plt.figure(figsize=(20,5))
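sns.histplot(tr['weaponsAcquired'], bins=100, stat='density', kde=True)  # histplot replaces distplot, which is deprecated in recent seaborn releases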
plt.xticks(np.arange(0, 300, step=10))
plt.show()
According to a game-stats website I checked, top-ranked players acquire around 5 to 6 weapons.
But here, many match types show far more acquired weapons than that.
I'd like to treat more than 20 acquired weapons as an outlier.
In [59]:
tr['WP_clean'] = np.where(tr['weaponsAcquired']>20, 0, tr['weaponsAcquired'])
tr['WP_outnum'] = np.where(tr['weaponsAcquired']>20, 1, 0)
print("# of outliers in weaponsAcquired : {} & {:0.4f}".format(tr['WP_outnum'].sum(),tr['WP_outnum'].sum()/tr.shape[0]))
# of outliers in weaponsAcquired : 3162 & 0.0007
In [60]:
%%time

print("="*60+" {} ".format('weaponsAcquired_CLEAN').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['WP_clean'], marker='+', c='mediumpurple')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();
============================================================ WEAPONSACQUIRED_CLEAN ============================================================

<Figure size 432x288 with 0 Axes>
Wall time: 26.5 s

14. heals

In [61]:
%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="heals", data=tr, ratio=3, color="maroon")
plt.title("heals and Target")
plt.show()
<Figure size 360x360 with 0 Axes>
Wall time: 26 s
In [62]:
%%time

print("="*60+" {} ".format('heals').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['heals'], marker='+', c='maroon')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();
============================================================ HEALS ============================================================

<Figure size 432x288 with 0 Axes>
Wall time: 25.4 s
I usually play this game as a camper, i.e. avoiding enemies and focusing on surviving, and I generally use around 20 heals when I finish near the top 10.
So around 50 heals is quite suspicious.
I'd like to remove players who healed more than 40 times.
In [63]:
tr['HL_clean'] = np.where(tr['heals']>40, 0, tr['heals'])
tr['HL_outnum'] = np.where(tr['heals']>40, 1, 0)
print("# of outliers in HEALS : {} & {:0.4f}".format(tr['HL_outnum'].sum(),tr['HL_outnum'].sum()/tr.shape[0]))
# of outliers in HEALS : 115 & 0.0000
In [64]:
%%time

print("="*60+" {} ".format('heals_CLEAN').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['HL_clean'], marker='+', c='maroon')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();
============================================================ HEALS_CLEAN ============================================================

<Figure size 432x288 with 0 Axes>
Wall time: 26.2 s

15. boosts

In [65]:
%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="boosts", data=tr, ratio=3, color="lightseagreen")
plt.title("boosts and Target")
plt.show()
<Figure size 360x360 with 0 Axes>
Wall time: 26.9 s
In [66]:
%%time

print("="*60+" {} ".format('boosts').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['boosts'], marker='+', c='lightseagreen')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();
============================================================ BOOSTS ============================================================

<Figure size 432x288 with 0 Axes>
Wall time: 25.6 s
More than 20 boosts seems unusual.
In [67]:
tr['BS_clean'] = np.where(tr['boosts']>20, 0, tr['boosts'])
tr['BS_outnum'] = np.where(tr['boosts']>20, 1, 0)
print("# of outliers in BOOSTS : {} & {:0.4f}".format(tr['BS_outnum'].sum(),tr['BS_outnum'].sum()/tr.shape[0]))
# of outliers in BOOSTS : 10 & 0.0000
In [68]:
%%time

print("="*60+" {} ".format('BS_clean').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['BS_clean'], marker='+', c='lightseagreen')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();
============================================================ BS_CLEAN ============================================================

<Figure size 432x288 with 0 Axes>
Wall time: 26.5 s
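The single-feature thresholds used so far can be collected in one place so that they are applied consistently. The sketch below copies the cutoffs from the cells above (note that damageDealt uses `>=` while the others use `>`); the `RULES` list, the helper `outlier_flags` and the toy `demo` frame are my own naming. The headshot-rate rule is left out because it depends on the derived `HSR` column.

```python
import pandas as pd

# Thresholds taken from the cells above: (column, comparison, cutoff).
RULES = [
    ("damageDealt", ">=", 4000),
    ("killStreaks", ">", 10),
    ("walkDistance", ">", 7500),
    ("weaponsAcquired", ">", 20),
    ("heals", ">", 40),
    ("boosts", ">", 20),
]


def outlier_flags(df, rules=RULES):
    """Return one boolean column per rule, True where the row breaks that rule."""
    flags = pd.DataFrame(index=df.index)
    for col, op, cutoff in rules:
        flags[col] = (df[col] >= cutoff) if op == ">=" else (df[col] > cutoff)
    return flags


# Toy data for illustration only.
demo = pd.DataFrame({
    "damageDealt": [100.0, 4000.0, 50.0],
    "killStreaks": [1, 2, 10],
    "walkDistance": [500.0, 7500.0, 100.0],
    "weaponsAcquired": [3, 20, 5],
    "heals": [0, 2, 41],
    "boosts": [0, 1, 20],
})
flags = outlier_flags(demo)
any_flag = flags.any(axis=1)
print(flags.sum())
```

Whether to drop flagged rows (`df[~any_flag]`) or overwrite their values with 0, as the `*_clean` columns do, is a modelling choice. Overwriting keeps the row count but turns extreme values into zeros.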

Detecting Outliers with Conditions

  • zero walk distance but kills
  • winPlacePerc == 1?
  • road kills vs. rideDistance (low ride distance but many road kills)
  • weaponsAcquired == 0 & high win placement
  • heals == 0 & high win placement

1. Not moving and Kill

Compute the total distance moved.
In [69]:
tr['totalDistance'] = tr[tr.filter(like='Distance',axis=1).columns].sum(axis=1)
tr.filter(like='Distance',axis=1).head()
Out[69]:
rideDistance swimDistance walkDistance totalDistance
0 0.0000 0.00 244.80 244.8000
1 0.0045 11.04 1434.00 1445.0445
2 0.0000 0.00 161.80 161.8000
3 0.0000 0.00 202.70 202.7000
4 0.0000 0.00 49.75 49.7500
Not moving means the player joined the match and then left the game after the match actually started.
In [70]:
tr[tr['totalDistance']==0].shape[0]/tr.shape[0], tr[tr['totalDistance']==0].shape[0]
Out[70]:
(0.021895827402323292, 97370)
In [71]:
tr[tr['walkDistance']==0].shape[0]/tr.shape[0]
Out[71]:
0.022397967513131424
In [72]:
tr['notmovingkill'] = np.where((tr['totalDistance']==0)&(tr['kills']!=0), 1, 0)
tr[tr['notmovingkill']==1].shape
Out[72]:
(1535, 47)
In [73]:
tr['notwalkingkill'] = np.where((tr['walkDistance']==0)&(tr['kills']!=0), 1, 0)
tr[tr['notwalkingkill']==1].shape
Out[73]:
(1549, 48)
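The two flags differ by 14 rows (1549 vs. 1535). Presumably these are players who made kills without walking but with some ride or swim distance. The sketch below shows how such rows could be isolated; the `toy` frame and its values are made up for illustration, and only the column names come from the dataset.

```python
import pandas as pd

# Toy rows standing in for tr; values are made up for illustration.
toy = pd.DataFrame({
    "rideDistance": [0.0, 1200.0, 0.0, 0.0],
    "swimDistance": [0.0, 0.0, 0.0, 15.0],
    "walkDistance": [0.0, 0.0, 350.0, 0.0],
    "kills":        [2, 1, 3, 0],
})

# Sum every column whose name contains "Distance", as in cell 69.
toy["totalDistance"] = toy.filter(like="Distance", axis=1).sum(axis=1)

not_moving_kill = (toy["totalDistance"] == 0) & (toy["kills"] != 0)
not_walking_kill = (toy["walkDistance"] == 0) & (toy["kills"] != 0)

# Rows flagged only by the walk-based rule: kills made while riding or swimming.
only_walk_rule = toy[not_walking_kill & ~not_moving_kill]
print(only_walk_rule)
```

A kill from a vehicle is plausible in normal play, so the total-distance rule is the stricter of the two.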
In [74]:
tr[tr['notmovingkill']==1][tr.columns[3:-13]]
Out[74]:
assists boosts damageDealt DBNOs headshotKills heals killPlace killPoints kills killStreaks ... walkDistance weaponsAcquired winPoints winPlacePerc HSR HSR_clean HSR_outnum DD_clean DD_outnum KS_clean
1824 0 0 593.000 0 0 3 18 0 6 3 ... 0.0 8 0 0.8571 0.000000 0.000000 0 593.000 0 3
6673 2 0 346.600 0 0 6 33 0 3 1 ... 0.0 22 0 0.6000 0.000000 0.000000 0 346.600 0 1
11892 2 0 1750.000 0 4 5 3 0 20 6 ... 0.0 13 0 0.8947 0.200000 0.200000 0 1750.000 0 6
14631 0 0 157.800 0 0 0 69 1000 1 1 ... 0.0 7 1500 0.0000 0.000000 0.000000 0 157.800 0 1
15591 0 0 100.000 0 1 0 37 0 1 1 ... 0.0 10 0 0.3000 1.000000 0.000000 1 100.000 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4440232 0 0 4.316 0 0 0 61 1000 1 1 ... 0.0 7 1500 0.8889 0.000000 0.000000 0 4.316 0 1
4440898 0 0 90.830 0 0 4 42 0 1 1 ... 0.0 8 0 0.0000 0.000000 0.000000 0 90.830 0 1
4440927 2 2 909.100 7 2 16 26 1000 6 2 ... 0.0 7 1500 0.6000 0.333333 0.333333 0 909.100 0 2
4441511 6 2 696.400 9 2 0 18 1000 9 2 ... 0.0 16 1500 0.9000 0.222222 0.222222 0 696.400 0 2
4446682 0 0 41.950 0 0 0 48 0 1 1 ... 0.0 4 0 0.9434 0.000000 0.000000 0 41.950 0 1

1535 rows × 32 columns

2. Plot Target

In [75]:
tr[tr['winPlacePerc'].isnull()]
Out[75]:
Id groupId matchId assists boosts damageDealt DBNOs headshotKills heals killPlace ... WD_outnum WP_clean WP_outnum HL_clean HL_outnum BS_clean BS_outnum totalDistance notmovingkill notwalkingkill
2744604 f70c74418bb064 12dfbede33f92b 224a123c53e008 0 0 0.0 0 0 0 1 ... 0 0 0 0 0 0 0 0.0 0 0

1 rows × 48 columns

In [76]:
tr.drop(2744604, inplace=True)
In [77]:
plt.figure(figsize=(20,5))
sns.distplot(tr['winPlacePerc'], bins=100)
# plt.xticks(np.arange(0, 300, step=10))
plt.show()
How many players have a winPlacePerc of 100%?
In [78]:
tr[tr['winPlacePerc']==1].shape[0],tr[tr['winPlacePerc']==1].shape[0]/tr.shape[0]
Out[78]:
(127573, 0.02868765551336698)

3. Road kill and RideDistance

In [79]:
tr[(tr['rideDistance']==0)&tr['roadKills']!=0].shape
Out[79]:
(180, 48)
In [80]:
tr[(tr['rideDistance']==0)&tr['roadKills']!=0]['roadKills'].describe()
Out[80]:
count    180.0
mean       1.0
std        0.0
min        1.0
25%        1.0
50%        1.0
75%        1.0
max        1.0
Name: roadKills, dtype: float64
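Road kills with zero ride distance are definitely outliers.
Note that `&` binds more tightly than `!=` in Python, so the filter above compares the result of the `&` with 0. The intended two-condition filter needs parentheses around `tr['roadKills']!=0`, and its counts may differ from the ones shown.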
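A minimal sketch of the road-kill filter with each comparison in its own parentheses, so that `&` combines two boolean masks. The `toy` frame and its values are hypothetical; only the column names come from the dataset.

```python
import pandas as pd

# Toy rows standing in for tr; values are made up for illustration.
toy = pd.DataFrame({
    "rideDistance": [0.0, 0.0, 500.0, 0.0],
    "roadKills":    [1, 0, 2, 2],
})

# Both comparisons are parenthesized, so & combines two boolean masks.
mask = (toy["rideDistance"] == 0) & (toy["roadKills"] != 0)
suspects = toy[mask]
print(suspects.shape[0])
```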

4. Weapon Acquired and win

In [81]:
tr[(tr['weaponsAcquired']==0)&tr['winPlacePerc']==0].shape[0],tr[(tr['weaponsAcquired']==0)&tr['winPlacePerc']==0].shape[0]/tr.shape[0]
Out[81]:
(4315529, 0.97044366213811)
In [82]:
tr[(tr['weaponsAcquired']==0)&tr['winPlacePerc']!=0].shape[0],tr[(tr['weaponsAcquired']==0)&tr['winPlacePerc']!=0].shape[0]/tr.shape[0]
Out[82]:
(131436, 0.029556337861890075)
Note: in cells 81 and 82 the second comparison is not parenthesized. Because `&` binds more tightly than `==` and `!=`, those expressions compare the result of the `&` with 0 instead of combining two conditions, so the 97% figure is not the share of zero-weapon players who placed last. The cells from 83 onward parenthesize both conditions.
In [83]:
tr[(tr['weaponsAcquired']==0)&(tr['winPlacePerc']!=0)]['winPlacePerc'].describe()
Out[83]:
count    131436.000000
mean          0.158867
std           0.175013
min           0.010100
25%           0.041700
50%           0.097800
75%           0.197800
max           1.000000
Name: winPlacePerc, dtype: float64
In [84]:
tr[(tr['weaponsAcquired']==0)&(tr['winPlacePerc']>0.2)].shape[0]
Out[84]:
32060
In [85]:
tr[(tr['weaponsAcquired']==0)&(tr['winPlacePerc']>0.5)].shape[0]
Out[85]:
7726
Acquiring no weapons means the player had to survive with bare hands only.
This suggests two things. A low winPlacePerc with zero weapons means the player was probably not really playing.
A high winPlacePerc with zero weapons points to cheating.
To cull the cheaters, I have to choose a winPlacePerc threshold for zero-weapon players.
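One way to compare candidate thresholds is to count the zero-weapon players above each cutoff in one pass. This is a sketch on made-up data: the `toy` frame and the `cutoffs` list are assumptions, not values chosen by the notebook.

```python
import pandas as pd

# Toy rows standing in for tr; values are made up for illustration.
toy = pd.DataFrame({
    "weaponsAcquired": [0, 0, 0, 0, 2, 5],
    "winPlacePerc":    [0.05, 0.45, 0.85, 1.0, 0.9, 0.3],
})

zero_weapon = toy[toy["weaponsAcquired"] == 0]

# Count zero-weapon players strictly above each candidate cutoff.
cutoffs = [0.2, 0.5, 0.8]
counts = {c: int((zero_weapon["winPlacePerc"] > c).sum()) for c in cutoffs}
print(counts)
```

The lower the cutoff, the more legitimate passive players get flagged along with the suspicious ones, so the choice is a trade-off rather than a fixed rule.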
In [86]:
tr[(tr['weaponsAcquired']==0)&(tr['winPlacePerc']==1)].shape[0]
Out[86]:
201
These 201 players are definitely cheaters.
In [87]:
tr[(tr['weaponsAcquired']==0)&(tr['winPlacePerc']>0.5)]['winPlacePerc'].describe()
Out[87]:
count    7726.000000
mean        0.701615
std         0.139747
min         0.505100
25%         0.577800
50%         0.673900
75%         0.808500
max         1.000000
Name: winPlacePerc, dtype: float64
In [88]:
tr[(tr['weaponsAcquired']==0)&(tr['winPlacePerc']>0.8)].shape[0]
Out[88]:
2022
I assume these 2022 players are very likely cheaters.

5. Heals and Boost and win

In [89]:
tr[tr['heals']==0]['winPlacePerc'].describe()
Out[89]:
count    2.648197e+06
mean     3.328857e-01
std      2.671897e-01
min      0.000000e+00
25%      1.075000e-01
50%      2.759000e-01
75%      5.155000e-01
max      1.000000e+00
Name: winPlacePerc, dtype: float64
Some players win the game without using a single heal.
In [90]:
tr[(tr['heals']==0)&(tr['winPlacePerc']!=0)]['winPlacePerc'].describe()
Out[90]:
count    2.430631e+06
mean     3.626823e-01
std      2.587930e-01
min      1.010000e-02
25%      1.481000e-01
50%      3.077000e-01
75%      5.385000e-01
max      1.000000e+00
Name: winPlacePerc, dtype: float64
In [91]:
tr[(tr['heals']==0)&(tr['winPlacePerc']>0.5)]['winPlacePerc'].describe()
Out[91]:
count    675016.000000
mean          0.714444
std           0.141200
min           0.505100
25%           0.592600
50%           0.693900
75%           0.824700
max           1.000000
Name: winPlacePerc, dtype: float64
In [92]:
tr[(tr['heals']==0)&(tr['winPlacePerc']==1)].shape[0]
Out[92]:
20889
These 20889 players won without using any heals.
That can happen with luck, or in squad mode when teammates carry the match.
Even so, 20889 seems like a large number.
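To check how much of this comes from team modes, the zero-heal winners could be broken down by `matchType`. The sketch below uses a hypothetical `toy` frame with made-up rows; only the column names come from the dataset.

```python
import pandas as pd

# Toy rows standing in for tr; values are made up for illustration.
toy = pd.DataFrame({
    "matchType":    ["solo", "squad", "squad", "duo", "solo", "squad"],
    "heals":        [0, 0, 0, 0, 3, 0],
    "winPlacePerc": [1.0, 1.0, 1.0, 0.4, 1.0, 0.2],
})

# Winners who never healed, counted per match type.
zero_heal_winners = toy[(toy["heals"] == 0) & (toy["winPlacePerc"] == 1)]
by_mode = zero_heal_winners["matchType"].value_counts()
print(by_mode)
```

If most zero-heal winners sit in squad or duo modes, teammates are a likely explanation; a large solo count would be harder to explain that way.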
In [93]:
tr[(tr['heals']==0)&(tr['winPlacePerc']>0.8)].shape[0], tr[(tr['heals']==0)&(tr['winPlacePerc']>0.8)].shape[0]/tr.shape[0]
Out[93]:
(196752, 0.04424410806021635)
Heals and Boosts
In [94]:
tr['HnB'] = tr[['heals','boosts']].sum(axis=1)
In [95]:
tr[(tr['HnB']==0)&(tr['winPlacePerc']==1)].shape[0]
Out[95]:
7075
These 7075 players are definitely cheaters.
No heals and no boosts: how did these players win the game?
In [96]:
tr[(tr['HnB']==0)&(tr['winPlacePerc']>0.5)]['winPlacePerc'].describe()
Out[96]:
count    392074.000000
mean          0.683201
std           0.131476
min           0.505100
25%           0.571400
50%           0.655200
75%           0.775500
max           1.000000
Name: winPlacePerc, dtype: float64
In [97]:
tr[(tr['HnB']==0)&(tr['winPlacePerc']>0.8)]['winPlacePerc'].shape[0]
Out[97]:
81343
I can say they are cheaters.
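The conditions in this section could be collected as boolean flag columns so that rows matching any rule can be counted or dropped together. This is a sketch under assumptions: the flag names and the `toy` frame are hypothetical, and the 0.8 cutoff is simply the one explored above.

```python
import pandas as pd

# Toy rows standing in for tr; values are made up for illustration.
toy = pd.DataFrame({
    "totalDistance":   [0.0, 900.0, 1500.0, 2000.0],
    "kills":           [3, 0, 1, 4],
    "weaponsAcquired": [4, 0, 3, 5],
    "heals":           [1, 0, 0, 2],
    "boosts":          [0, 0, 0, 1],
    "winPlacePerc":    [0.5, 0.9, 1.0, 0.95],
})

high = toy["winPlacePerc"] > 0.8  # assumed cutoff for a "high" placement

# One boolean column per rule (hypothetical flag names).
flags = pd.DataFrame({
    "not_moving_kill": (toy["totalDistance"] == 0) & (toy["kills"] != 0),
    "no_weapon_high":  (toy["weaponsAcquired"] == 0) & high,
    "no_hnb_high":     ((toy["heals"] + toy["boosts"]) == 0) & high,
})

# A row is suspect if any rule fires.
toy["suspect"] = flags.any(axis=1)
print(flags.sum())
print(int(toy["suspect"].sum()))
```

Keeping the rules as separate columns makes it easy to see how many rows each one contributes before deciding which to apply.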