Detecting Outliers¶

1. Head Shot¶

For most players, a high headshot rate is nearly impossible to sustain. Even professional online FPS players rarely exceed 30%.
headshot_rate = headshotKills / kills

2. Damagedealt¶

3. kills¶

4. killstreaks¶

5. longestkill¶

As far as I know, a 1 km kill is very hard to achieve.

6. rankpoints(elo-like ranking)¶

7. revives¶

8. roadkills¶

9. teamkills¶

For detecting abusers.

10. swimdistance¶

11. walkdistance¶

12. ridedistance¶

13. weaponsacquired¶

14. heals¶

15. boosts¶

import os, time, gc
import pandas as pd, numpy as np
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns

os.listdir('input')

['sample_submission_V2.csv', 'test_V2.csv', 'train_V2.csv']

%%time
tr = pd.read_csv("input/train_V2.csv")
te = pd.read_csv("input/test_V2.csv")

Wall time: 12.7 s

def missing_values_table(df):  # Calculate missing values by column
    mis_val = df.isnull().sum() # Total missing values
    mis_val_pct = 100 * df.isnull().sum() / len(df)# Percentage of missing values
    mis_val_df = pd.concat([mis_val, mis_val_pct], axis=1)# Make a table with the results
    mis_val_df_cols = mis_val_df.rename(columns = {0 : 'Missing Values', 1 : '% of Total Values'})# Rename the columns
    mis_val_df_cols = mis_val_df_cols[mis_val_df_cols.iloc[:,1] != 0].sort_values('% of Total Values', ascending=False).round(1)# Sort the table by percentage of missing descending
    print ("Dataframe has " + str(df.shape[1]) + " columns.\n" 
           "There are " + str(mis_val_df_cols.shape[0]) + " cols having missing values.")# Print some summary information
    return mis_val_df_cols # Return the dataframe with missing information

missing_values_table(tr)

Dataframe has 29 columns.
There are 1 cols having missing values.

missing_values_table(te)

Dataframe has 28 columns.
There are 0 cols having missing values.

1. Headshot¶

%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="headshotKills", data=tr, ratio=3, color="darkolivegreen")
plt.title("headshotKills and Target")
plt.show()

<Figure size 360x360 with 0 Axes>

Wall time: 26.4 s

mt_ls = tr['matchType'].unique()
mt_ls.sort()
mt_ls

array(['crashfpp', 'crashtpp', 'duo', 'duo-fpp', 'flarefpp', 'flaretpp',
       'normal-duo', 'normal-duo-fpp', 'normal-solo', 'normal-solo-fpp',
       'normal-squad', 'normal-squad-fpp', 'solo', 'solo-fpp', 'squad',
       'squad-fpp'], dtype=object)

%%time

print("="*60+" {} ".format('headshotKills').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['headshotKills'], marker='+', c='red')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();

============================================================ HEADSHOTKILLS ============================================================

<Figure size 432x288 with 0 Axes>

Wall time: 24.6 s
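The same 4x4 grid is drawn for every feature in this notebook, so it can be wrapped in a helper. A sketch, assuming a frame with `matchType` and `winPlacePerc` columns; the function name `scatter_by_match_type` and its parameters are hypothetical, and the toy values are made up:

```python
import math

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd


def scatter_by_match_type(df, feature, target="winPlacePerc", ncols=4, color="red"):
    """Scatter `feature` against `target` in one panel per matchType."""
    match_types = sorted(df["matchType"].unique())
    nrows = math.ceil(len(match_types) / ncols)
    fig, axes = plt.subplots(nrows, ncols, figsize=(20, 2 * nrows), squeeze=False)
    for ax, mt in zip(axes.ravel(), match_types):
        sub = df[df["matchType"] == mt]      # filter once per panel
        ax.scatter(sub[target], sub[feature], marker="+", c=color)
        ax.set_title(mt.upper(), fontsize=9)
    for ax in axes.ravel()[len(match_types):]:
        ax.set_visible(False)                # hide unused panels
    fig.tight_layout()
    return fig


# Toy frame with three match types (illustrative values only).
toy = pd.DataFrame({
    "matchType": ["solo", "duo", "squad", "solo"],
    "winPlacePerc": [0.1, 0.5, 0.9, 1.0],
    "headshotKills": [0, 1, 2, 3],
})
fig = scatter_by_match_type(toy, "headshotKills", ncols=2)
```

Filtering each match type once per panel also avoids building the same boolean mask twice, as the cells in this notebook do.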

Head Shot Rate¶

tr['HSR'] = tr['headshotKills']/tr['kills']

tr['HSR'].describe()

count    1.917244e+06
mean     2.391822e-01
std      3.532459e-01
min      0.000000e+00
25%      0.000000e+00
50%      0.000000e+00
75%      5.000000e-01
max      1.000000e+00
Name: HSR, dtype: float64
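The count above (about 1.92 million) is well below the roughly 4.45 million rows in `tr`, because a player with zero kills yields 0/0, which pandas stores as NaN and `describe()` skips. A minimal sketch on a toy frame (values are made up for illustration; `HSR_filled` is a hypothetical column name):

```python
import pandas as pd

# Toy data: two players with kills, two without (illustrative values only).
toy = pd.DataFrame({"headshotKills": [1, 0, 0, 2], "kills": [2, 0, 0, 2]})

# 0/0 evaluates to NaN, so zero-kill players get no headshot rate.
toy["HSR"] = toy["headshotKills"] / toy["kills"]

n_missing = int(toy["HSR"].isna().sum())   # rows with an undefined HSR
n_counted = int(toy["HSR"].count())        # rows that describe() would report

# Optional: treat "no kills" as a rate of 0 instead of NaN.
toy["HSR_filled"] = toy["HSR"].fillna(0.0)
```

Whether to fill with 0 or keep NaN is a modelling choice; the notebook keeps NaN.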

%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="HSR", data=tr, ratio=3, color="r")
plt.title("HSR and Target")
plt.show()

<Figure size 360x360 with 0 Axes>

Wall time: 11.5 s

%%time

print("="*60+" {} ".format('HSR').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['HSR'], marker='+', c='red')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();

============================================================ HSR ============================================================

<Figure size 432x288 with 0 Axes>

Wall time: 22.5 s

As you can see, many players have a headshot rate of 0.5 or higher, which is implausibly high.
Let's clean the data and see how the plots change.

tr['HSR_clean'] = np.where(tr['HSR']>=0.5, 0, tr['HSR'])

tr['HSR_clean'].describe()

count    1.917244e+06
mean     3.679607e-02
std      9.939526e-02
min      0.000000e+00
25%      0.000000e+00
50%      0.000000e+00
75%      0.000000e+00
max      4.864865e-01
Name: HSR_clean, dtype: float64

Headshot rates near 0.5 are still quite high.
I should see what happens if I lower the threshold to 0.3.
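One way to compare thresholds before committing to one is to count how many rows each would flag. A small sketch on made-up rates; the candidate values 0.3, 0.4 and 0.5 and the name `flagged_share` are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Toy headshot rates; NaN stands for players with zero kills.
hsr = pd.Series([0.0, 0.2, 0.35, 0.5, 1.0, np.nan])

# Share of all rows that each candidate threshold would flag.
# Comparisons with NaN are False, so zero-kill players are never flagged.
thresholds = [0.3, 0.4, 0.5]
flagged_share = {t: float((hsr >= t).mean()) for t in thresholds}
```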

%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="HSR_clean", data=tr, ratio=3, color="r")
plt.ylim((0,1))
plt.title("HSR and Target")
plt.show()

<Figure size 360x360 with 0 Axes>

Wall time: 10.3 s

%%time
print("="*60+" {} ".format('HSR_clean').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['HSR_clean'], marker='+', c='red')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();

============================================================ HSR_CLEAN ============================================================

<Figure size 432x288 with 0 Axes>

Wall time: 22.4 s

The upper lines disappeared, and the plots look quite normal.
I could clean the data further,
but I will stop here.

How many outliers are in the headshot rates?¶

tr['HSR_outnum'] = np.where(tr['HSR']>=0.5, 1, 0)
print("# of outliers in HSR : {} & {:0.4f}".format(tr['HSR_outnum'].sum(),tr['HSR_outnum'].sum()/tr.shape[0]))

# of outliers in HSR : 503601 & 0.1132

About 11% of the rows does not look large, but we have to check the loss by matchType.

loss_pc = []
loss = []
for t in tr['matchType'].unique():
#     print("{} has loss of {:0.4f}".format(t.upper(), tr[(tr['matchType']==t) & (tr['HSR_outnum']==1)].shape[0]/tr.shape[0]))
    loss_pc.append(tr[(tr['matchType']==t) & (tr['HSR_outnum']==1)].shape[0]/tr.shape[0])
    loss.append(tr[(tr['matchType']==t) & (tr['HSR_outnum']==1)].shape[0])

df = pd.DataFrame(data=loss_pc)
df.index = tr['matchType'].unique().tolist()
df.columns = ['loss_perc']
df['loss'] = loss
df.sort_values(by='loss_perc', inplace=True, ascending=False)
df
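The loop above can also be written as a single groupby. A sketch on a toy frame, assuming the flag column already exists; the values are illustrative only:

```python
import pandas as pd

toy = pd.DataFrame({
    "matchType": ["solo", "solo", "duo", "squad", "squad", "squad"],
    "HSR_outnum": [1, 0, 0, 1, 1, 0],
})

# Count flagged rows per matchType, then express them as a share of all rows.
loss_table = (
    toy.groupby("matchType")["HSR_outnum"]
       .sum()
       .to_frame("loss")
)
loss_table["loss_perc"] = loss_table["loss"] / len(toy)
loss_table = loss_table.sort_values("loss_perc", ascending=False)
```

As in the loop, `loss_perc` is relative to the whole frame, not to the number of rows in each matchType.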

From this dataframe, we can see that squad-fpp has the most rows flagged as possible cheaters.
The losses are not as large as I expected, so I think I can safely eliminate these outliers.
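One caveat, offered as an assumption rather than a finding from this data: a player whose single kill happens to be a headshot has an HSR of 1.0 and is flagged by the `>= 0.5` rule. A sketch that flags a high rate only when there are enough kills; `min_kills` and `HSR_outnum_strict` are hypothetical names that the notebook does not use:

```python
import numpy as np
import pandas as pd

# Toy data (illustrative values only).
toy = pd.DataFrame({"headshotKills": [1, 6, 1, 0], "kills": [1, 10, 8, 0]})
toy["HSR"] = toy["headshotKills"] / toy["kills"]

min_kills = 5  # hypothetical cut-off, chosen only for illustration

# Flag a row only if the rate is high AND it rests on enough kills.
toy["HSR_outnum_strict"] = np.where(
    (toy["HSR"] >= 0.5) & (toy["kills"] >= min_kills), 1, 0
)
```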

2. Damage Dealt¶

%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="damageDealt", data=tr, ratio=3, color="darkturquoise")
plt.title("damageDealt and Target")
plt.show()

<Figure size 360x360 with 0 Axes>

Wall time: 26.6 s

%%time

print("="*60+" {} ".format('damageDealt').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['damageDealt'], marker='+', c='darkturquoise')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();

============================================================ DAMAGEDEALT ============================================================

<Figure size 432x288 with 0 Axes>

Wall time: 25 s

Same with boxplot

%%time

print("="*60+" {} ".format('damageDealt').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.boxplot(tr[tr['matchType']==mt]['damageDealt'], vert=False)
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();

============================================================ DAMAGEDEALT ============================================================

<Figure size 432x288 with 0 Axes>

Wall time: 8.39 s

Unlike headshots, there is no upper line here. But we have to look closely at the y-axes of the plots. Taking solo matches as the baseline, the maximum damageDealt is around 2k, while some match types have y-axes that go above 4k and even 6k.

I can't be sure that the 4k and 6k values are outliers. Still, treating values of 4k and above as outliers seems a reasonable decision to me, so let's see the result.

tr['DD_clean'] = np.where(tr['damageDealt']>=4000, 0, tr['damageDealt'])

tr['DD_clean'].describe()

count    4.446966e+06
mean     1.306822e+02
std      1.702982e+02
min      0.000000e+00
25%      0.000000e+00
50%      8.424000e+01
75%      1.860000e+02
max      3.987000e+03
Name: DD_clean, dtype: float64

%%time

print("="*60+" {} ".format('DD_clean').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['DD_clean'], marker='+', c='darkturquoise')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();

============================================================ DD_CLEAN ============================================================

<Figure size 432x288 with 0 Axes>

Wall time: 25.4 s

Feels comfortable

Check the loss¶

tr['DD_outnum'] = np.where(tr['damageDealt']>=4000, 1, 0)
print("# of outliers in DD : {} & {:0.4f}".format(tr['DD_outnum'].sum(),tr['DD_outnum'].sum()/tr.shape[0]))

# of outliers in DD : 32 & 0.0000
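The clean/flag pair built with `np.where` recurs for several features, so it can be factored into one function. A sketch under that assumption; `flag_outliers` and its `inclusive` parameter are hypothetical names, and the toy values are made up:

```python
import numpy as np
import pandas as pd


def flag_outliers(df, column, threshold, inclusive=False):
    """Return (clean, flag) Series for values above `threshold`.

    Flagged values are replaced by 0 in `clean`, mirroring the np.where
    calls in this notebook; `inclusive=True` uses >= instead of >.
    """
    mask = df[column] >= threshold if inclusive else df[column] > threshold
    clean = pd.Series(np.where(mask, 0, df[column]), index=df.index)
    flag = mask.astype(int)
    return clean, flag


toy = pd.DataFrame({"damageDealt": [100.0, 3999.0, 4000.0, 6500.0]})
toy["DD_clean"], toy["DD_outnum"] = flag_outliers(
    toy, "damageDealt", 4000, inclusive=True
)
share = toy["DD_outnum"].sum() / len(toy)
```

Note that replacing a flagged value with 0 keeps the row in the data; dropping the row would be a different choice.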

loss_pc = []
loss = []
for t in tr['matchType'].unique():
#     print("{} has loss of {:0.4f}".format(t.upper(), tr[(tr['matchType']==t) & (tr['HSR_outnum']==1)].shape[0]/tr.shape[0]))
    loss_pc.append(tr[(tr['matchType']==t) & (tr['DD_outnum']==1)].shape[0]/tr.shape[0])
    loss.append(tr[(tr['matchType']==t) & (tr['DD_outnum']==1)].shape[0])

df = pd.DataFrame(data=loss_pc)
df.index = tr['matchType'].unique().tolist()
df.columns = ['loss_perc']
df['loss'] = loss
df.sort_values(by='loss_perc', inplace=True, ascending=False)
df

There were only a few outliers.

3. kills¶

%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="kills", data=tr, ratio=3, color="blue")
plt.title("kills and Target")
plt.show()

<Figure size 360x360 with 0 Axes>

Wall time: 25.9 s

%%time

print("="*60+" {} ".format('kills').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['kills'], marker='+', c='blue')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();

============================================================ KILLS ============================================================

<Figure size 432x288 with 0 Axes>

Wall time: 25 s

This is quite surprising: I expected many outliers, but there do not seem to be many.
A squad match can reach 60 kills, but solo?
I have seen some players get around 40 kills in a solo match. But 60?
I can't confidently identify any outliers in kills.

4. killstreaks¶

%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="killStreaks", data=tr, ratio=3, color="tomato")
plt.title("killStreaks and Target")
plt.show()

<Figure size 360x360 with 0 Axes>

Wall time: 26.7 s

%%time

print("="*60+" {} ".format('killStreaks').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['killStreaks'], marker='+', c='tomato')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();

============================================================ KILLSTREAKS ============================================================

<Figure size 432x288 with 0 Axes>

Wall time: 25.3 s

One plot stands out: NORMAL-SQUAD-FPP. In most of the plots the maximum value is around 10, which is possible even in solo, whereas the maximum for SQUAD-FPP does not exceed 8.
Therefore, I decided that a killStreaks value above 10 is an outlier. This rule is meant for the NORMAL-SQUAD-FPP matchType, although the code below applies the threshold to every matchType.

tr['KS_clean'] = np.where(tr['killStreaks']>10, 0, tr['killStreaks'])
tr['KS_outnum'] = np.where(tr['killStreaks']>10, 1, 0)
print("# of outliers in KS : {} & {:0.4f}".format(tr['KS_outnum'].sum(),tr['KS_outnum'].sum()/tr.shape[0]))

# of outliers in KS : 23 & 0.0000
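To restrict the rule to a single matchType, the threshold can be combined with a matchType condition. A sketch on toy data; the column name `KS_outnum_nsf` is hypothetical and the values are made up:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "matchType": ["normal-squad-fpp", "normal-squad-fpp", "solo", "squad-fpp"],
    "killStreaks": [12, 3, 11, 2],
})

# Flag only rows that are both in the chosen matchType and above the threshold.
is_target_type = toy["matchType"] == "normal-squad-fpp"
toy["KS_outnum_nsf"] = np.where(is_target_type & (toy["killStreaks"] > 10), 1, 0)
```

Here the solo row with 11 kill streaks is left unflagged, unlike with the global threshold.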

loss_pc = []
loss = []
for t in tr['matchType'].unique():
#     print("{} has loss of {:0.4f}".format(t.upper(), tr[(tr['matchType']==t) & (tr['HSR_outnum']==1)].shape[0]/tr.shape[0]))
    loss_pc.append(tr[(tr['matchType']==t) & (tr['KS_outnum']==1)].shape[0]/tr.shape[0])
    loss.append(tr[(tr['matchType']==t) & (tr['KS_outnum']==1)].shape[0])

df = pd.DataFrame(data=loss_pc)
df.index = tr['matchType'].unique().tolist()
df.columns = ['loss_perc']
df['loss'] = loss
df.sort_values(by='loss_perc', inplace=True, ascending=False)
df

Excluding SOLO matches, there are 5 outliers.

5. longestkill¶

%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="longestKill", data=tr, ratio=3, color="indigo")
plt.title("longestKill and Target")
plt.show()

<Figure size 360x360 with 0 Axes>

Wall time: 26.4 s

%%time

print("="*60+" {} ".format('longestKill').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['longestKill'], marker='+', c='indigo')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();

============================================================ LONGESTKILL ============================================================

<Figure size 432x288 with 0 Axes>

Wall time: 25.2 s

A 1 km kill is possible.
Thus, I can't say that there are outliers here.
I need more evidence to cull them out.

6. rankpoints¶

%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="rankPoints", data=tr, ratio=3, color="rosybrown")
plt.title("rankPoints and Target")
plt.show()

<Figure size 360x360 with 0 Axes>

Wall time: 26.9 s

%%time

print("="*60+" {} ".format('rankPoints').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['rankPoints'], marker='+', c='rosybrown')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();

============================================================ RANKPOINTS ============================================================

<Figure size 432x288 with 0 Axes>

Wall time: 25.4 s

Hard to tell

7. revives¶

%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="revives", data=tr, ratio=3, color="dimgrey")
plt.title("revives and Target")
plt.show()

<Figure size 360x360 with 0 Axes>

Wall time: 26.8 s

%%time

print("="*60+" {} ".format('revives').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['revives'], marker='+', c='dimgrey')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();

============================================================ REVIVES ============================================================

<Figure size 432x288 with 0 Axes>

Wall time: 25.4 s

Hard to tell

8. roadkills¶

%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="roadKills", data=tr, ratio=3, color="navy")
plt.title("roadKills and Target")
plt.show()

<Figure size 360x360 with 0 Axes>

Wall time: 25.7 s

%%time

print("="*60+" {} ".format('roadKills').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['roadKills'], marker='+', c='navy')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();

============================================================ ROADKILLS ============================================================

<Figure size 432x288 with 0 Axes>

Wall time: 25.4 s

Let's compute a roadkill rate.

tr['RK_rate'] = tr['roadKills']/tr['kills']

%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="RK_rate", data=tr, ratio=3, color="navy")
plt.title("RK_rate and Target")
plt.show()

<Figure size 360x360 with 0 Axes>

Wall time: 11.4 s

%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="kills", y="RK_rate", data=tr, ratio=3, color="navy")
plt.title("RK_rate and kills")
plt.show()

<Figure size 360x360 with 0 Axes>

Wall time: 11.3 s

Hard to tell

9. teamKills¶

%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="teamKills", data=tr, ratio=3, color="violet")
plt.title("teamKills and Target")
plt.show()

<Figure size 360x360 with 0 Axes>

Wall time: 26.9 s

%%time

print("="*60+" {} ".format('teamKills').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['teamKills'], marker='+', c='violet')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();

============================================================ TEAMKILLS ============================================================

<Figure size 432x288 with 0 Axes>

Wall time: 25.7 s

As you can see, Solo also has teamkills, which suggests that a self-kill counts as a teamkill. Thus, Squad players can have multiple teamkills.
But how can I detect whether these teamkills happened accidentally or intentionally?

10. swimdistance¶

%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="swimDistance", data=tr, ratio=3, color="steelblue")
plt.title("swimDistance and Target")
plt.show()

<Figure size 360x360 with 0 Axes>

Wall time: 26.8 s

%%time

print("="*60+" {} ".format('swimDistance').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['swimDistance'], marker='+', c='steelblue')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();

============================================================ SWIMDISTANCE ============================================================

<Figure size 432x288 with 0 Axes>

Wall time: 25.4 s

Hard to tell

11. walkdistance¶

%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="walkDistance", data=tr, ratio=3, color="darkcyan")
plt.title("walkDistance and Target")
plt.show()

<Figure size 360x360 with 0 Axes>

Wall time: 26.8 s

%%time

print("="*60+" {} ".format('walkDistance').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['walkDistance'], marker='+', c='darkcyan')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();

============================================================ WALKDISTANCE ============================================================

<Figure size 432x288 with 0 Axes>

Wall time: 25.6 s

Consider a map, for example Erangel.
In the left image, the yellow diagonal line is about 7.5 km.
This is one of the biggest maps in PUBG.
In the right image, each yellow grid cell is 1 km in width and height.

Therefore, a player who walked more than 7.5 km is a possible outlier.

tr['WD_clean'] = np.where(tr['walkDistance']>7500, 0, tr['walkDistance'])
tr['WD_outnum'] = np.where(tr['walkDistance']>7500, 1, 0)
print("# of outliers in walkDistance : {} & {:0.4f}".format(tr['WD_outnum'].sum(),tr['WD_outnum'].sum()/tr.shape[0]))

# of outliers in walkDistance : 1579 & 0.0004

%%time

print("="*60+" {} ".format('walkDistance_CLEAN').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['WD_clean'], marker='+', c='darkcyan')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();

============================================================ WALKDISTANCE_CLEAN ============================================================

<Figure size 432x288 with 0 Axes>

Wall time: 25.9 s

loss_pc = []
loss = []
for t in tr['matchType'].unique():
#     print("{} has loss of {:0.4f}".format(t.upper(), tr[(tr['matchType']==t) & (tr['HSR_outnum']==1)].shape[0]/tr.shape[0]))
    loss_pc.append(tr[(tr['matchType']==t) & (tr['WD_outnum']==1)].shape[0]/tr.shape[0])
    loss.append(tr[(tr['matchType']==t) & (tr['WD_outnum']==1)].shape[0])

df = pd.DataFrame(data=loss_pc)
df.index = tr['matchType'].unique().tolist()
df.columns = ['loss_perc']
df['loss'] = loss
df.sort_values(by='loss_perc', inplace=True, ascending=False)
df

After removing the outliers, the plots are cut off at the top.

12. ridedistance¶

%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="rideDistance", data=tr, ratio=3, color="palevioletred")
plt.title("rideDistance and Target")
plt.show()

<Figure size 360x360 with 0 Axes>

Wall time: 27.1 s

%%time

print("="*60+" {} ".format('rideDistance').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['rideDistance'], marker='+', c='palevioletred')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();

============================================================ RIDEDISTANCE ============================================================

<Figure size 432x288 with 0 Axes>

Wall time: 25.7 s

Hard to tell

13. weaponsacquired¶

%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="weaponsAcquired", data=tr, ratio=3, color="mediumpurple")
plt.title("weaponsAcquired and Target")
plt.show()

<Figure size 360x360 with 0 Axes>

Wall time: 27 s

%%time

print("="*60+" {} ".format('weaponsAcquired').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['weaponsAcquired'], marker='+', c='mediumpurple')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();

============================================================ WEAPONSACQUIRED ============================================================

<Figure size 432x288 with 0 Axes>

Wall time: 25.5 s

plt.figure(figsize=(20,5))
sns.distplot(tr['weaponsAcquired'], bins=100)
plt.xticks(np.arange(0, 300, step=10))
plt.show()

According to a game stats website I checked, top-ranked players collect around 5 to 6 weapons.
But here, many match types show far more acquired weapons.
I'd like to treat players who acquired more than 20 weapons as outliers.

tr['WP_clean'] = np.where(tr['weaponsAcquired']>20, 0, tr['weaponsAcquired'])
tr['WP_outnum'] = np.where(tr['weaponsAcquired']>20, 1, 0)
print("# of outliers in weaponsAcquired : {} & {:0.4f}".format(tr['WP_outnum'].sum(),tr['WP_outnum'].sum()/tr.shape[0]))

# of outliers in weaponsAcquired : 3162 & 0.0007

%%time

print("="*60+" {} ".format('weaponsAcquired_CLEAN').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['WP_clean'], marker='+', c='mediumpurple')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();

============================================================ WEAPONSACQUIRED_CLEAN ============================================================

<Figure size 432x288 with 0 Axes>

Wall time: 26.5 s

14. heals¶

%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="heals", data=tr, ratio=3, color="maroon")
plt.title("heals and Target")
plt.show()

<Figure size 360x360 with 0 Axes>

Wall time: 26 s

%%time

print("="*60+" {} ".format('heals').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['heals'], marker='+', c='maroon')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();

============================================================ HEALS ============================================================

<Figure size 432x288 with 0 Axes>

Wall time: 25.4 s

I usually play this game like a camper, i.e. avoiding confrontation and focusing on survival, and I generally use around 20 heals when I survive into the top 10.
Around 50 heals is quite suspicious.
I'd like to remove players who heal more than 40 times.

tr['HL_clean'] = np.where(tr['heals']>40, 0, tr['heals'])
tr['HL_outnum'] = np.where(tr['heals']>40, 1, 0)
print("# of outliers in HEALS : {} & {:0.4f}".format(tr['HL_outnum'].sum(),tr['HL_outnum'].sum()/tr.shape[0]))

# of outliers in HEALS : 115 & 0.0000

%%time

print("="*60+" {} ".format('heals_CLEAN').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['HL_clean'], marker='+', c='maroon')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();

============================================================ HEALS_CLEAN ============================================================

<Figure size 432x288 with 0 Axes>

Wall time: 26.2 s

15. boosts¶

%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="boosts", data=tr, ratio=3, color="lightseagreen")
plt.title("boosts and Target")
plt.show()

<Figure size 360x360 with 0 Axes>

Wall time: 26.9 s

%%time

print("="*60+" {} ".format('boosts').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['boosts'], marker='+', c='lightseagreen')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();

============================================================ BOOSTS ============================================================

<Figure size 432x288 with 0 Axes>

Wall time: 25.6 s

The number of boosts over 20 seems unusual.

tr['BS_clean'] = np.where(tr['boosts']>20, 0, tr['boosts'])
tr['BS_outnum'] = np.where(tr['boosts']>20, 1, 0)
print("# of outliers in BOOSTS : {} & {:0.4f}".format(tr['BS_outnum'].sum(),tr['BS_outnum'].sum()/tr.shape[0]))

# of outliers in BOOSTS : 10 & 0.0000

%%time

print("="*60+" {} ".format('BS_clean').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)

for i, mt in enumerate(mt_ls):
#         i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['BS_clean'], marker='+', c='lightseagreen')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29]) 
    plt.title(mt.upper(), fontsize=9)
    plt.tight_layout()

plt.show();

============================================================ BS_CLEAN ============================================================

<Figure size 432x288 with 0 Axes>

Wall time: 26.5 s

Detecting Outlier w/ Condition¶

zero walk distance and kills
winPlacePerc == 1?
road kills and rideDistance (low ride distance and many road kills)
weaponsAcquired == 0 & high probability of winning
heals == 0 & high probability of winning

1. Not moving and Kill¶

Compute the total distance moved

tr['totalDistance'] = tr[tr.filter(like='Distance',axis=1).columns].sum(axis=1)
tr.filter(like='Distance',axis=1).head()
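`filter(like='Distance')` matches every column whose name contains "Distance", which includes `totalDistance` itself once it exists, so re-running the cell would add the previous total into the new one. A sketch with an explicit column list (the three names are the distance columns used in this notebook; the toy values are made up) avoids that:

```python
import pandas as pd

toy = pd.DataFrame({
    "rideDistance": [0.0, 1200.0],
    "swimDistance": [0.0, 50.0],
    "walkDistance": [0.0, 800.0],
})

distance_cols = ["rideDistance", "swimDistance", "walkDistance"]
toy["totalDistance"] = toy[distance_cols].sum(axis=1)

# Re-running the same line gives the same result, because the list is fixed.
toy["totalDistance"] = toy[distance_cols].sum(axis=1)

# By contrast, the pattern match now also picks up the derived column.
matched = toy.filter(like="Distance", axis=1).columns.tolist()
```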

Not moving means the player joined the match and left the game after the match actually started.

tr[tr['totalDistance']==0].shape[0]/tr.shape[0], tr[tr['totalDistance']==0].shape[0]

(0.021895827402323292, 97370)

tr[tr['walkDistance']==0].shape[0]/tr.shape[0]

0.022397967513131424

tr['notmovingkill'] = np.where((tr['totalDistance']==0)&(tr['kills']!=0), 1, 0)
tr[tr['notmovingkill']==1].shape

(1535, 47)

tr['notwalkingkill'] = np.where((tr['walkDistance']==0)&(tr['kills']!=0), 1, 0)
tr[tr['notwalkingkill']==1].shape

(1549, 48)

tr[tr['notmovingkill']==1][tr.columns[3:-13]]

2. Plot Target¶

tr[tr['winPlacePerc'].isnull()]

tr.drop(2744604, inplace=True)
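The drop above names one specific row index. A sketch of an equivalent that does not depend on the index value, shown on a toy frame with made-up values:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Id": ["a", "b", "c"], "winPlacePerc": [0.5, np.nan, 1.0]})

# Drop every row whose target is missing, without naming a specific index.
toy = toy.dropna(subset=["winPlacePerc"])
```

This also covers the case where more than one row has a missing target.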

plt.figure(figsize=(20,5))
sns.distplot(tr['winPlacePerc'], bins=100)
# plt.xticks(np.arange(0, 300, step=10))
plt.show()

How many players have a winPlacePerc of 100%?

tr[tr['winPlacePerc']==1].shape[0],tr[tr['winPlacePerc']==1].shape[0]/tr.shape[0]

(127573, 0.02868765551336698)

3. Road kill and RideDistance¶

tr[(tr['rideDistance']==0)&tr['roadKills']!=0].shape

(180, 48)

tr[(tr['rideDistance']==0)&tr['roadKills']!=0]['roadKills'].describe()

count    180.0
mean       1.0
std        0.0
min        1.0
25%        1.0
50%        1.0
75%        1.0
max        1.0
Name: roadKills, dtype: float64

Definitely outliers.
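In the two filters above, `tr['roadKills']!=0` is not wrapped in parentheses. Python's `&` binds more tightly than `!=`, so the expression is read as `((rideDistance == 0) & roadKills) != 0`, which may not select the intended rows. A sketch of the intended filter on toy data (made-up values), with each condition named before combining:

```python
import pandas as pd

toy = pd.DataFrame({
    "rideDistance": [0.0, 0.0, 0.0, 350.0],
    "roadKills": [0, 1, 2, 1],
})

# Name each condition, then combine them; the names act like parentheses.
no_ride = toy["rideDistance"] == 0
has_road_kill = toy["roadKills"] != 0
suspicious = toy[no_ride & has_road_kill]
```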

4. Weapon Acquired and win¶

tr[(tr['weaponsAcquired']==0)&tr['winPlacePerc']==0].shape[0],tr[(tr['weaponsAcquired']==0)&tr['winPlacePerc']==0].shape[0]/tr.shape[0]

(4315529, 0.97044366213811)

tr[(tr['weaponsAcquired']==0)&(tr['winPlacePerc']!=0)].shape[0],tr[(tr['weaponsAcquired']==0)&(tr['winPlacePerc']!=0)].shape[0]/tr.shape[0]

(131436, 0.029556337861890075)
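The first filter in this section, `(tr['weaponsAcquired']==0)&tr['winPlacePerc']==0`, has no parentheses around the second comparison, so `&` is applied before `==`. The two recorded counts add up to 4315529 + 131436 = 4446965, the full row count, which suggests the first figure is the complement of the second rather than the number of zero-weapon players with a winPlacePerc of 0. A sketch using the `.eq()` and `.ne()` methods, which sidestep the precedence question, on made-up values:

```python
import pandas as pd

toy = pd.DataFrame({
    "weaponsAcquired": [0, 0, 0, 3, 5],
    "winPlacePerc": [0.0, 0.0, 0.4, 0.0, 0.9],
})

no_weapons = toy["weaponsAcquired"].eq(0)

# Zero weapons and last place.
n_zero_wp = int((no_weapons & toy["winPlacePerc"].eq(0)).sum())
# Zero weapons and any placement above last.
n_nonzero_wp = int((no_weapons & toy["winPlacePerc"].ne(0)).sum())
```

With correct masks, the two counts sum to the number of zero-weapon rows, not to the size of the whole frame.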

tr[(tr['weaponsAcquired']==0)&(tr['winPlacePerc']!=0)]['winPlacePerc'].describe()

count    131436.000000
mean          0.158867
std           0.175013
min           0.010100
25%           0.041700
50%           0.097800
75%           0.197800
max           1.000000
Name: winPlacePerc, dtype: float64

tr[(tr['weaponsAcquired']==0)&(tr['winPlacePerc']>0.2)].shape[0]

32060

tr[(tr['weaponsAcquired']==0)&(tr['winPlacePerc']>0.5)].shape[0]

7726

Acquiring no weapons means the player has to survive with bare hands only.
This points to two things. A low winPlacePerc with zero weapons means the player is not actually playing.
A high winPlacePerc with zero weapons points to a cheater.
To cull the cheaters, I have to choose a winPlacePerc threshold for the zero-weapon players.
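
One way to compare candidate thresholds is to count the zero-weapon players above each of them in a single pass. This is a minimal sketch on made-up data; `toy`, `thresholds`, and `counts` are illustrative names, and the cutoffs 0.2, 0.5, and 0.8 are simply the ones tried in this section:

```python
import pandas as pd

# Toy stand-in for tr; the values are made up for illustration.
toy = pd.DataFrame({
    "weaponsAcquired": [0, 0, 0, 0, 3, 5],
    "winPlacePerc":    [0.05, 0.55, 0.85, 1.0, 0.9, 0.3],
})

# Target values of the players who never picked up a weapon.
zero_weapon = toy.loc[toy["weaponsAcquired"] == 0, "winPlacePerc"]

# Count how many of them finish above each candidate threshold.
thresholds = [0.2, 0.5, 0.8]
counts = pd.Series(
    {t: int((zero_weapon > t).sum()) for t in thresholds},
    name="zero_weapon_players_above",
)
print(counts)
```

The table only shows how many rows each cutoff would flag. Which cutoff actually separates cheaters from lucky or carried players remains a judgment call.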

tr[(tr['weaponsAcquired']==0)&(tr['winPlacePerc']==1)].shape[0]

201

These 201 players are definitely cheaters.

tr[(tr['weaponsAcquired']==0)&(tr['winPlacePerc']>0.5)]['winPlacePerc'].describe()

count    7726.000000
mean        0.701615
std         0.139747
min         0.505100
25%         0.577800
50%         0.673900
75%         0.808500
max         1.000000
Name: winPlacePerc, dtype: float64

tr[(tr['weaponsAcquired']==0)&(tr['winPlacePerc']>0.8)].shape[0]

2022

I assume these 2022 players are very likely cheaters.

5. Heals and Boost and win¶

tr[tr['heals']==0]['winPlacePerc'].describe()

count    2.648197e+06
mean     3.328857e-01
std      2.671897e-01
min      0.000000e+00
25%      1.075000e-01
50%      2.759000e-01
75%      5.155000e-01
max      1.000000e+00
Name: winPlacePerc, dtype: float64

Some players have zero heals but still win the game.

tr[(tr['heals']==0)&(tr['winPlacePerc']!=0)]['winPlacePerc'].describe()

count    2.430631e+06
mean     3.626823e-01
std      2.587930e-01
min      1.010000e-02
25%      1.481000e-01
50%      3.077000e-01
75%      5.385000e-01
max      1.000000e+00
Name: winPlacePerc, dtype: float64

tr[(tr['heals']==0)&(tr['winPlacePerc']>0.5)]['winPlacePerc'].describe()

count    675016.000000
mean          0.714444
std           0.141200
min           0.505100
25%           0.592600
50%           0.693900
75%           0.824700
max           1.000000
Name: winPlacePerc, dtype: float64

tr[(tr['heals']==0)&(tr['winPlacePerc']==1)].shape[0]

20889

These 20889 players used zero heals but still won.
A win without healing is possible with some luck, or in squad mode when teammates carry the game.
Even so, I suppose 20889 is quite a large number.
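
Since teammates can explain a zero-heal win in team modes, it may help to split these winners by match type before judging them. This is a minimal sketch on made-up data, assuming the frame has the dataset's matchType column; `toy` and `by_mode` are illustrative names:

```python
import pandas as pd

# Toy stand-in for tr; the values are made up for illustration.
toy = pd.DataFrame({
    "matchType":    ["solo", "solo", "squad", "squad", "squad", "duo"],
    "heals":        [0, 2, 0, 0, 4, 0],
    "winPlacePerc": [1.0, 1.0, 1.0, 1.0, 0.4, 0.7],
})

# Winners who never used a healing item.
winners_no_heal = toy[(toy["heals"] == 0) & (toy["winPlacePerc"] == 1)]

# How those winners are distributed across match types.
by_mode = winners_no_heal["matchType"].value_counts()
print(by_mode)
```

A zero-heal winner in a solo mode has no teammate to rely on, so those rows are arguably the more suspicious ones.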

tr[(tr['heals']==0)&(tr['winPlacePerc']>0.8)].shape[0], tr[(tr['heals']==0)&(tr['winPlacePerc']>0.8)].shape[0]/tr.shape[0]

(196752, 0.04424410806021635)

Heals n Boosts¶

tr['HnB'] = tr[['heals','boosts']].sum(axis=1)

tr[(tr['HnB']==0)&(tr['winPlacePerc']==1)].shape[0]

7075

These 7075 are definitely cheaters.
No heals. No boosts. How did those players win the game?

tr[(tr['HnB']==0)&(tr['winPlacePerc']>0.5)]['winPlacePerc'].describe()

count    392074.000000
mean          0.683201
std           0.131476
min           0.505100
25%           0.571400
50%           0.655200
75%           0.775500
max           1.000000
Name: winPlacePerc, dtype: float64

tr[(tr['HnB']==0)&(tr['winPlacePerc']>0.8)]['winPlacePerc'].shape[0]

81343

I can say they are cheaters.
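
Before dropping rows flagged this way, it can be useful to see how many rows each match type would lose. This is a minimal sketch on made-up data; the column layout loosely mirrors the loss and loss_perc tables in this post, but the code behind those tables is not shown here, so treating loss_perc as the flagged count divided by all rows is an assumption:

```python
import pandas as pd

# Toy stand-in for tr; the values are made up for illustration.
toy = pd.DataFrame({
    "matchType":    ["solo", "squad", "squad", "duo", "squad"],
    "HnB":          [0, 0, 3, 0, 0],
    "winPlacePerc": [0.9, 0.95, 0.85, 0.3, 1.0],
})

# Rows that would be removed: no heals or boosts, yet a high placement.
flagged = (toy["HnB"] == 0) & (toy["winPlacePerc"] > 0.8)

# Flagged rows per match type, and their share of all rows (assumed definition).
loss = flagged.groupby(toy["matchType"]).sum().astype(int)
loss_table = pd.DataFrame({
    "loss_perc": loss / len(toy),
    "loss": loss,
}).sort_values("loss", ascending=False)
print(loss_table)
```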

	loss_perc	loss
squad-fpp	0.044944	199865
duo-fpp	0.025520	113488
solo-fpp	0.015500	68929
squad	0.014635	65083
duo	0.007304	32479
solo	0.004513	20069
normal-squad-fpp	0.000461	2049
normal-duo-fpp	0.000129	574
crashfpp	0.000072	318
normal-solo-fpp	0.000056	250
flaretpp	0.000051	228
normal-squad	0.000022	97
flarefpp	0.000018	80
normal-solo	0.000012	52
normal-duo	0.000005	23
crashtpp	0.000004	17

	assists	boosts	damageDealt	DBNOs	headshotKills	heals	killPlace	killPoints	kills	killStreaks	...	walkDistance	weaponsAcquired	winPoints	winPlacePerc	HSR	HSR_clean	HSR_outnum	DD_clean	DD_outnum	KS_clean
1824	0	0	593.000	0	0	3	18	0	6	3	...	0.0	8	0	0.8571	0.000000	0.000000	0	593.000	0	3
6673	2	0	346.600	0	0	6	33	0	3	1	...	0.0	22	0	0.6000	0.000000	0.000000	0	346.600	0	1
11892	2	0	1750.000	0	4	5	3	0	20	6	...	0.0	13	0	0.8947	0.200000	0.200000	0	1750.000	0	6
14631	0	0	157.800	0	0	0	69	1000	1	1	...	0.0	7	1500	0.0000	0.000000	0.000000	0	157.800	0	1
15591	0	0	100.000	0	1	0	37	0	1	1	...	0.0	10	0	0.3000	1.000000	0.000000	1	100.000	0	1
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
4440232	0	0	4.316	0	0	0	61	1000	1	1	...	0.0	7	1500	0.8889	0.000000	0.000000	0	4.316	0	1
4440898	0	0	90.830	0	0	4	42	0	1	1	...	0.0	8	0	0.0000	0.000000	0.000000	0	90.830	0	1
4440927	2	2	909.100	7	2	16	26	1000	6	2	...	0.0	7	1500	0.6000	0.333333	0.333333	0	909.100	0	2
4441511	6	2	696.400	9	2	0	18	1000	9	2	...	0.0	16	1500	0.9000	0.222222	0.222222	0	696.400	0	2
4446682	0	0	41.950	0	0	0	48	0	1	1	...	0.0	4	0	0.9434	0.000000	0.000000	0	41.950	0	1

	loss_perc	loss
normal-solo-fpp	3.373086e-06	15
normal-squad-fpp	2.473597e-06	11
normal-duo-fpp	1.124362e-06	5
normal-squad	2.248724e-07	1
squad-fpp	0.000000e+00	0
duo	0.000000e+00	0
solo-fpp	0.000000e+00	0
squad	0.000000e+00	0
duo-fpp	0.000000e+00	0
solo	0.000000e+00	0
crashfpp	0.000000e+00	0
flaretpp	0.000000e+00	0
flarefpp	0.000000e+00	0
normal-duo	0.000000e+00	0
crashtpp	0.000000e+00	0
normal-solo	0.000000e+00	0

	loss_perc	loss
solo	4.047704e-06	18
normal-squad-fpp	8.994897e-07	4
normal-solo-fpp	2.248724e-07	1
squad-fpp	0.000000e+00	0
duo	0.000000e+00	0
solo-fpp	0.000000e+00	0
squad	0.000000e+00	0
duo-fpp	0.000000e+00	0
crashfpp	0.000000e+00	0
flaretpp	0.000000e+00	0
flarefpp	0.000000e+00	0
normal-duo-fpp	0.000000e+00	0
normal-duo	0.000000e+00	0
normal-squad	0.000000e+00	0
crashtpp	0.000000e+00	0
normal-solo	0.000000e+00	0

	loss_perc	loss
squad-fpp	1.585351e-04	705
duo-fpp	7.353328e-05	327
solo-fpp	4.227601e-05	188
squad	4.002729e-05	178
duo	2.203750e-05	98
solo	1.259286e-05	56
normal-duo-fpp	2.698469e-06	12
normal-squad-fpp	2.023852e-06	9
flaretpp	1.124362e-06	5
flarefpp	2.248724e-07	1
crashfpp	0.000000e+00	0
normal-solo-fpp	0.000000e+00	0
normal-duo	0.000000e+00	0
normal-squad	0.000000e+00	0
crashtpp	0.000000e+00	0
normal-solo	0.000000e+00	0

	rideDistance	swimDistance	walkDistance	totalDistance
0	0.0000	0.00	244.80	244.8000
1	0.0045	11.04	1434.00	1445.0445
2	0.0000	0.00	161.80	161.8000
3	0.0000	0.00	202.70	202.7000
4	0.0000	0.00	49.75	49.7500
