Detecting Outliers¶
1. Head Shot¶
For most players, a high headshot rate is very hard to sustain. Even professional online FPS players rarely exceed 30%.
headshot_rate = headshotKills / kills
2. Damagedealt¶
3. kills¶
4. killstreaks¶
5. longestkill¶
As far as I know, a 1 km kill is very hard to achieve.
6. rankpoints(elo-like ranking)¶
7. revives¶
8. roadkills¶
9. swimdistance¶
10. teamkills¶
For detecting abusers.
11. walkdistance¶
12. weaponsacquired¶
In [1]:
import os, time, gc
import pandas as pd, numpy as np
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
os.listdir('input')
Out[2]:
In [3]:
%%time
tr = pd.read_csv("input/train_V2.csv")
te = pd.read_csv("input/test_V2.csv")
In [4]:
def missing_values_table(df):
    # Function to calculate missing values by column
    mis_val = df.isnull().sum()  # Total missing values
    mis_val_pct = 100 * df.isnull().sum() / len(df)  # Percentage of missing values
    mis_val_df = pd.concat([mis_val, mis_val_pct], axis=1)  # Make a table with the results
    mis_val_df_cols = mis_val_df.rename(columns={0: 'Missing Values', 1: '% of Total Values'})  # Rename the columns
    # Keep only columns with missing values, sorted by percentage of missing descending
    mis_val_df_cols = mis_val_df_cols[mis_val_df_cols.iloc[:, 1] != 0].sort_values('% of Total Values', ascending=False).round(1)
    # Print some summary information
    print("Dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_df_cols.shape[0]) + " cols having missing values.")
    return mis_val_df_cols  # Return the dataframe with missing information
In [5]:
missing_values_table(tr)
Out[5]:
In [6]:
missing_values_table(te)
Out[6]:
1. Headshot¶
In [7]:
%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="headshotKills", data=tr, ratio=3, color="darkolivegreen")
plt.title("headshotKills and Target")
plt.show()
In [8]:
mt_ls = tr['matchType'].unique()
mt_ls.sort()
mt_ls
Out[8]:
In [9]:
%%time
print("="*60+" {} ".format('headshotKills').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)
for i, mt in enumerate(mt_ls):
    # i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['headshotKills'], marker='+', c='red')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29])
    plt.title(mt.upper(), fontsize=9)
plt.tight_layout()
plt.show();
Head Shot Rate¶
In [10]:
tr['HSR'] = tr['headshotKills']/tr['kills']
In [11]:
tr['HSR'].describe()
Out[11]:
In [12]:
%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="HSR", data=tr, ratio=3, color="r")
plt.title("HSR and Target")
plt.show()
In [13]:
%%time
print("="*60+" {} ".format('HSR').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)
for i, mt in enumerate(mt_ls):
    # i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['HSR'], marker='+', c='red')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29])
    plt.title(mt.upper(), fontsize=9)
plt.tight_layout()
plt.show();
As you can see, many players have a headshot rate above 0.5, which is implausibly high.
Let's clean the data and see how the plots change.
In [14]:
tr['HSR_clean'] = np.where(tr['HSR']>=0.5, 0, tr['HSR'])
In [15]:
tr['HSR_clean'].describe()
Out[15]:
Headshot rates close to 0.5 are still quite high.
I should also check what happens if I lower the threshold to 0.3.
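To compare candidate thresholds before settling on one, a small sweep like the one below could help. This is only a sketch on made-up rates: the `flagged_share` helper and the toy values are my own assumptions, not part of the notebook, and on the real data you would pass `tr['HSR']` instead.

```python
import pandas as pd

# Toy headshot rates (made up for illustration); NaN stands for players with zero kills.
hsr = pd.Series([0.0, 0.1, 0.25, 0.3, 0.45, 0.5, 1.0, float("nan")])

def flagged_share(rates, threshold):
    """Return (count, share) of rows with rate >= threshold; NaN rows are never flagged."""
    mask = rates >= threshold
    return int(mask.sum()), float(mask.mean())

# Hypothetical candidate thresholds, from the current 0.5 down to 0.3.
for threshold in (0.5, 0.4, 0.3):
    count, share = flagged_share(hsr, threshold)
    print(f"threshold={threshold}: {count} rows flagged ({share:.4f} of all rows)")
```

Lowering the threshold can only flag more rows, so the interesting question is how quickly the share grows between 0.5 and 0.3.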
In [16]:
%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="HSR_clean", data=tr, ratio=3, color="r")
plt.ylim((0,1))
plt.title("HSR and Target")
plt.show()
In [17]:
%%time
print("="*60+" {} ".format('HSR_clean').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)
for i, mt in enumerate(mt_ls):
    # i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['HSR_clean'], marker='+', c='red')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29])
    plt.title(mt.upper(), fontsize=9)
plt.tight_layout()
plt.show();
The upper lines disappeared, and the plots look fairly normal.
I could clean the data further,
but I will stop here.
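One possible further cleaning step, which the notebook does not take, is to require a minimum number of kills before trusting the rate, since a player with one kill and one headshot already has a rate of 1.0. The sketch below uses made-up rows, and `min_kills` is a hypothetical cutoff chosen only for illustration.

```python
import pandas as pd

# Toy rows (made up); column names follow the competition data.
toy = pd.DataFrame({"kills": [0, 1, 10, 4], "headshotKills": [0, 1, 2, 3]})

# Turn zero kills into NaN first, so 0/0 is an explicit missing value.
kills = toy["kills"].where(toy["kills"] > 0)
toy["HSR"] = toy["headshotKills"] / kills

# Hypothetical guard: only flag a high rate when it is backed by enough kills.
min_kills = 3
toy["HSR_outnum"] = ((toy["kills"] >= min_kills) & (toy["HSR"] >= 0.5)).astype(int)
print(toy)
```

Here only the last row is flagged: the player with a single headshot kill keeps a rate of 1.0 but is not treated as an outlier.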
How many outliers are there in the headshot rates?¶
In [18]:
tr['HSR_outnum'] = np.where(tr['HSR']>=0.5, 1, 0)
print("# of outliers in HSR : {} & {:0.4f}".format(tr['HSR_outnum'].sum(),tr['HSR_outnum'].sum()/tr.shape[0]))
This doesn't look large, but we have to check the loss by matchType.
In [19]:
loss_pc = []
loss = []
for t in tr['matchType'].unique():
    # print("{} has loss of {:0.4f}".format(t.upper(), tr[(tr['matchType']==t) & (tr['HSR_outnum']==1)].shape[0]/tr.shape[0]))
    loss_pc.append(tr[(tr['matchType']==t) & (tr['HSR_outnum']==1)].shape[0]/tr.shape[0])
    loss.append(tr[(tr['matchType']==t) & (tr['HSR_outnum']==1)].shape[0])
df = pd.DataFrame(data=loss_pc)
df.index = tr['matchType'].unique().tolist()
df.columns = ['loss_perc']
df['loss'] = loss
df.sort_values(by='loss_perc', inplace=True, ascending=False)
df
Out[19]:
From this dataframe we can see that squad-fpp has the most flagged players, i.e. suspected cheaters.
The losses are smaller than I expected, so I think I can safely remove these outliers.
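The per-matchType loss table can also be built with a `groupby` instead of an explicit loop. The sketch below is an alternative, not the notebook's own code: `loss_by_match_type` is a hypothetical helper and the rows are made up, but the output columns keep the names `loss_perc` and `loss` used above.

```python
import pandas as pd

def loss_by_match_type(df, flag_col, group_col="matchType"):
    """Count flagged rows per group and express them as a share of all rows."""
    loss = df.groupby(group_col)[flag_col].sum()
    out = pd.DataFrame({"loss_perc": loss / len(df), "loss": loss})
    return out.sort_values("loss_perc", ascending=False)

# Toy rows (made up); a 1 in HSR_outnum marks a flagged row.
toy = pd.DataFrame({
    "matchType": ["solo", "solo", "squad-fpp", "squad-fpp", "squad-fpp", "duo"],
    "HSR_outnum": [0, 1, 1, 1, 0, 0],
})
result = loss_by_match_type(toy, "HSR_outnum")
print(result)
```

The same helper would work for the other flag columns (for example `DD_outnum`) by changing `flag_col`.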
2. Damage Dealt¶
In [20]:
%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="damageDealt", data=tr, ratio=3, color="darkturquoise")
plt.title("damageDealt and Target")
plt.show()
In [21]:
%%time
print("="*60+" {} ".format('damageDealt').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)
for i, mt in enumerate(mt_ls):
    # i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['damageDealt'], marker='+', c='darkturquoise')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29])
    plt.title(mt.upper(), fontsize=9)
plt.tight_layout()
plt.show();
The same data as boxplots:
In [22]:
%%time
print("="*60+" {} ".format('damageDealt').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)
for i, mt in enumerate(mt_ls):
    # i += 1
    plt.subplot(4,4,i+1)
    plt.boxplot(tr[tr['matchType']==mt]['damageDealt'], vert=False)
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29])
    plt.title(mt.upper(), fontsize=9)
plt.tight_layout()
plt.show();
Unlike the headshot plots, there is no upper line here, but we have to look closely at the plots, in particular at the y-axis. Taking solo matches as the baseline, the maximum damageDealt is around 2k, yet some plots have y-axes that go above 4k or even 6k.
I can't be sure that the 4k and 6k values are outliers, but treating values of 4k and above as outliers seems reasonable to me, so let's see what that gives.
In [23]:
tr['DD_clean'] = np.where(tr['damageDealt']>=4000, 0, tr['damageDealt'])
In [24]:
tr['DD_clean'].describe()
Out[24]:
In [25]:
%%time
print("="*60+" {} ".format('DD_clean').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)
for i, mt in enumerate(mt_ls):
    # i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['DD_clean'], marker='+', c='darkturquoise')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29])
    plt.title(mt.upper(), fontsize=9)
plt.tight_layout()
plt.show();
This looks more reasonable.
Check the loss¶
In [26]:
tr['DD_outnum'] = np.where(tr['damageDealt']>=4000, 1, 0)
print("# of outliers in DD : {} & {:0.4f}".format(tr['DD_outnum'].sum(),tr['DD_outnum'].sum()/tr.shape[0]))
In [27]:
loss_pc = []
loss = []
for t in tr['matchType'].unique():
    # print("{} has loss of {:0.4f}".format(t.upper(), tr[(tr['matchType']==t) & (tr['HSR_outnum']==1)].shape[0]/tr.shape[0]))
    loss_pc.append(tr[(tr['matchType']==t) & (tr['DD_outnum']==1)].shape[0]/tr.shape[0])
    loss.append(tr[(tr['matchType']==t) & (tr['DD_outnum']==1)].shape[0])
df = pd.DataFrame(data=loss_pc)
df.index = tr['matchType'].unique().tolist()
df.columns = ['loss_perc']
df['loss'] = loss
df.sort_values(by='loss_perc', inplace=True, ascending=False)
df
Out[27]:
There were only a few outliers.
3. kills¶
In [28]:
%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="kills", data=tr, ratio=3, color="blue")
plt.title("kills and Target")
plt.show()
In [29]:
%%time
print("="*60+" {} ".format('kills').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)
for i, mt in enumerate(mt_ls):
    # i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['kills'], marker='+', c='blue')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29])
    plt.title(mt.upper(), fontsize=9)
plt.tight_layout()
plt.show();
This is quite surprising: I expected many outliers, but there don't seem to be many.
A squad match can reach 60 kills, but solo?
I have seen players get around 40 kills in a solo match. But 60?
I can't confidently call any of the kills values outliers.
4. killstreaks¶
In [30]:
%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="killStreaks", data=tr, ratio=3, color="tomato")
plt.title("killStreaks and Target")
plt.show()
In [31]:
%%time
print("="*60+" {} ".format('killStreaks').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)
for i, mt in enumerate(mt_ls):
    # i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['killStreaks'], marker='+', c='tomato')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29])
    plt.title(mt.upper(), fontsize=9)
plt.tight_layout()
plt.show();
One plot stands out: NORMAL-SQUAD-FPP. In most plots the maximum value is around 10. That is plausible for solo matches, but in SQUAD-FPP the maximum does not exceed 8.
Therefore, for the NORMAL-SQUAD-FPP matchType only, I treat killStreaks > 10 as an outlier. Note that the cell below applies the threshold to every matchType.
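If the rule really should apply to a single match type, the flag needs a second condition on `matchType`. Here is a minimal sketch on made-up rows. It assumes the value is stored in lowercase as `'normal-squad-fpp'` (the plot titles are upper-cased versions of the stored values), so please check the output of `tr['matchType'].unique()` first.

```python
import numpy as np
import pandas as pd

# Toy rows (made up); column names follow the competition data.
toy = pd.DataFrame({
    "matchType": ["normal-squad-fpp", "normal-squad-fpp", "solo", "squad-fpp"],
    "killStreaks": [12, 3, 11, 2],
})

# Assumed spelling of the match type; verify it against the real data.
is_target_mode = toy["matchType"] == "normal-squad-fpp"
is_outlier = is_target_mode & (toy["killStreaks"] > 10)

toy["KS_outnum"] = is_outlier.astype(int)
toy["KS_clean"] = np.where(is_outlier, 0, toy["killStreaks"])
print(toy)
```

With this mask, the solo player with a kill streak of 11 is left untouched, while the same threshold applied to all rows would flag them too.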
In [32]:
tr['KS_clean'] = np.where(tr['killStreaks']>10, 0, tr['killStreaks'])
tr['KS_outnum'] = np.where(tr['killStreaks']>10, 1, 0)
print("# of outliers in KS : {} & {:0.4f}".format(tr['KS_outnum'].sum(),tr['KS_outnum'].sum()/tr.shape[0]))
In [33]:
loss_pc = []
loss = []
for t in tr['matchType'].unique():
    # print("{} has loss of {:0.4f}".format(t.upper(), tr[(tr['matchType']==t) & (tr['HSR_outnum']==1)].shape[0]/tr.shape[0]))
    loss_pc.append(tr[(tr['matchType']==t) & (tr['KS_outnum']==1)].shape[0]/tr.shape[0])
    loss.append(tr[(tr['matchType']==t) & (tr['KS_outnum']==1)].shape[0])
df = pd.DataFrame(data=loss_pc)
df.index = tr['matchType'].unique().tolist()
df.columns = ['loss_perc']
df['loss'] = loss
df.sort_values(by='loss_perc', inplace=True, ascending=False)
df
Out[33]:
Excluding solo matches, there are 5 outliers.
5. longestkill¶
In [34]:
%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="longestKill", data=tr, ratio=3, color="indigo")
plt.title("longestKill and Target")
plt.show()
In [35]:
%%time
print("="*60+" {} ".format('longestKill').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)
for i, mt in enumerate(mt_ls):
    # i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['longestKill'], marker='+', c='indigo')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29])
    plt.title(mt.upper(), fontsize=9)
plt.tight_layout()
plt.show();
A 1 km kill is possible.
So I can't say that these are outliers.
I need more evidence before culling them.
6. rankpoints¶
In [36]:
%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="rankPoints", data=tr, ratio=3, color="rosybrown")
plt.title("rankPoints and Target")
plt.show()
In [37]:
%%time
print("="*60+" {} ".format('rankPoints').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)
for i, mt in enumerate(mt_ls):
    # i += 1
    plt.subplot(4,4,i+1)
    # Plot rankPoints here (the original cell reused longestKill from the previous section)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['rankPoints'], marker='+', c='rosybrown')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29])
    plt.title(mt.upper(), fontsize=9)
plt.tight_layout()
plt.show();
Hard to tell
7. revives¶
In [38]:
%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="revives", data=tr, ratio=3, color="dimgrey")
plt.title("revives and Target")
plt.show()
In [39]:
%%time
print("="*60+" {} ".format('revives').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)
for i, mt in enumerate(mt_ls):
    # i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['revives'], marker='+', c='dimgrey')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29])
    plt.title(mt.upper(), fontsize=9)
plt.tight_layout()
plt.show();
Hard to tell
8. roadkills¶
In [40]:
%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="roadKills", data=tr, ratio=3, color="navy")
plt.title("roadKills and Target")
plt.show()
In [41]:
%%time
print("="*60+" {} ".format('roadKills').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)
for i, mt in enumerate(mt_ls):
    # i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['roadKills'], marker='+', c='navy')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29])
    plt.title(mt.upper(), fontsize=9)
plt.tight_layout()
plt.show();
Let's compute a roadkill rate (roadKills / kills).
In [42]:
tr['RK_rate'] = tr['roadKills']/tr['kills']
In [43]:
%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="RK_rate", data=tr, ratio=3, color="navy")
plt.title("RK_rate and Target")
plt.show()
In [44]:
%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="kills", y="RK_rate", data=tr, ratio=3, color="navy")
plt.title("RK_rate and kills")
plt.show()
Hard to tell
9. teamKills¶
In [45]:
%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="teamKills", data=tr, ratio=3, color="violet")
plt.title("teamKills and Target")
plt.show()
In [46]:
%%time
print("="*60+" {} ".format('teamKills').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)
for i, mt in enumerate(mt_ls):
    # i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['teamKills'], marker='+', c='violet')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29])
    plt.title(mt.upper(), fontsize=9)
plt.tight_layout()
plt.show();
As you can see, solo matches also have teamkills, which suggests that self-kills count as teamkills. Squads can therefore have multiple teamkills.
But how can I tell whether these teamkills happened accidentally or intentionally?
10. swimdistance¶
In [47]:
%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="swimDistance", data=tr, ratio=3, color="steelblue")
plt.title("swimDistance and Target")
plt.show()
In [48]:
%%time
print("="*60+" {} ".format('swimDistance').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)
for i, mt in enumerate(mt_ls):
    # i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['swimDistance'], marker='+', c='steelblue')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29])
    plt.title(mt.upper(), fontsize=9)
plt.tight_layout()
plt.show();
Hard to tell
11. walkdistance¶
In [49]:
%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="walkDistance", data=tr, ratio=3, color="darkcyan")
plt.title("walkDistance and Target")
plt.show()
In [50]:
%%time
print("="*60+" {} ".format('walkDistance').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)
for i, mt in enumerate(mt_ls):
    # i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['walkDistance'], marker='+', c='darkcyan')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29])
    plt.title(mt.upper(), fontsize=9)
plt.tight_layout()
plt.show();
Take the Erangel map as an example.
In the left image, the yellow diagonal line is about 7.5 km long,
and this is one of the biggest maps in PUBG.
In the right image, each yellow grid cell is 1 km wide and 1 km high.
Therefore, a player who walked more than 7.5 km is a possible outlier.
In [51]:
tr['WD_clean'] = np.where(tr['walkDistance']>7500, 0, tr['walkDistance'])
tr['WD_outnum'] = np.where(tr['walkDistance']>7500, 1, 0)
print("# of outliers in walkDistance : {} & {:0.4f}".format(tr['WD_outnum'].sum(),tr['WD_outnum'].sum()/tr.shape[0]))
In [52]:
%%time
print("="*60+" {} ".format('walkDistance_CLEAN').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)
for i, mt in enumerate(mt_ls):
    # i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['WD_clean'], marker='+', c='darkcyan')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29])
    plt.title(mt.upper(), fontsize=9)
plt.tight_layout()
plt.show();
In [53]:
loss_pc = []
loss = []
for t in tr['matchType'].unique():
    # print("{} has loss of {:0.4f}".format(t.upper(), tr[(tr['matchType']==t) & (tr['HSR_outnum']==1)].shape[0]/tr.shape[0]))
    loss_pc.append(tr[(tr['matchType']==t) & (tr['WD_outnum']==1)].shape[0]/tr.shape[0])
    loss.append(tr[(tr['matchType']==t) & (tr['WD_outnum']==1)].shape[0])
df = pd.DataFrame(data=loss_pc)
df.index = tr['matchType'].unique().tolist()
df.columns = ['loss_perc']
df['loss'] = loss
df.sort_values(by='loss_perc', inplace=True, ascending=False)
df
Out[53]:
After removing the outliers, the plots are cut off at the top.
12. ridedistance¶
In [54]:
%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="rideDistance", data=tr, ratio=3, color="palevioletred")
plt.title("rideDistance and Target")
plt.show()
In [55]:
%%time
print("="*60+" {} ".format('rideDistance').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)
for i, mt in enumerate(mt_ls):
    # i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['rideDistance'], marker='+', c='palevioletred')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29])
    plt.title(mt.upper(), fontsize=9)
plt.tight_layout()
plt.show();
Hard to tell
13. weaponsacquired¶
In [56]:
%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="weaponsAcquired", data=tr, ratio=3, color="mediumpurple")
plt.title("weaponsAcquired and Target")
plt.show()
In [57]:
%%time
print("="*60+" {} ".format('weaponsAcquired').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)
for i, mt in enumerate(mt_ls):
    # i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['weaponsAcquired'], marker='+', c='mediumpurple')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29])
    plt.title(mt.upper(), fontsize=9)
plt.tight_layout()
plt.show();
In [58]:
plt.figure(figsize=(20,5))
sns.distplot(tr['weaponsAcquired'], bins=100)
plt.xticks(np.arange(0, 300, step=10))
plt.show()
According to a game-stats website I checked, top-ranked players collect around 5 to 6 weapons.
Here, however, many match types show far more acquired weapons.
I'd like to treat more than 20 acquired weapons as an outlier.
In [59]:
tr['WP_clean'] = np.where(tr['weaponsAcquired']>20, 0, tr['weaponsAcquired'])
tr['WP_outnum'] = np.where(tr['weaponsAcquired']>20, 1, 0)
print("# of outliers in weaponsAcquired : {} & {:0.4f}".format(tr['WP_outnum'].sum(),tr['WP_outnum'].sum()/tr.shape[0]))
In [60]:
%%time
print("="*60+" {} ".format('weaponsAcquired_CLEAN').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)
for i, mt in enumerate(mt_ls):
    # i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['WP_clean'], marker='+', c='mediumpurple')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29])
    plt.title(mt.upper(), fontsize=9)
plt.tight_layout()
plt.show();
14. heals¶
In [61]:
%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="heals", data=tr, ratio=3, color="maroon")
plt.title("heals and Target")
plt.show()
In [62]:
%%time
print("="*60+" {} ".format('heals').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)
for i, mt in enumerate(mt_ls):
    # i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['heals'], marker='+', c='maroon')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29])
    plt.title(mt.upper(), fontsize=9)
plt.tight_layout()
plt.show();
I usually play this game like a camper, i.e. avoiding confrontation and focusing on survival, and I generally use around 20 heals when I survive into the top 10.
Around 50 heals is quite suspicious.
I'd like to remove players who heal more than 40 times.
In [63]:
tr['HL_clean'] = np.where(tr['heals']>40, 0, tr['heals'])
tr['HL_outnum'] = np.where(tr['heals']>40, 1, 0)
print("# of outliers in HEALS : {} & {:0.4f}".format(tr['HL_outnum'].sum(),tr['HL_outnum'].sum()/tr.shape[0]))
In [64]:
%%time
print("="*60+" {} ".format('heals_CLEAN').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)
for i, mt in enumerate(mt_ls):
    # i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['HL_clean'], marker='+', c='maroon')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29])
    plt.title(mt.upper(), fontsize=9)
plt.tight_layout()
plt.show();
15. boosts¶
In [65]:
%%time
plt.figure(figsize=(5,5))
sns.jointplot(x="winPlacePerc", y="boosts", data=tr, ratio=3, color="lightseagreen")
plt.title("boosts and Target")
plt.show()
In [66]:
%%time
print("="*60+" {} ".format('boosts').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)
for i, mt in enumerate(mt_ls):
    # i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['boosts'], marker='+', c='lightseagreen')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29])
    plt.title(mt.upper(), fontsize=9)
plt.tight_layout()
plt.show();
More than 20 boosts seems unusual.
In [67]:
tr['BS_clean'] = np.where(tr['boosts']>20, 0, tr['boosts'])
tr['BS_outnum'] = np.where(tr['boosts']>20, 1, 0)
print("# of outliers in BOOSTS : {} & {:0.4f}".format(tr['BS_outnum'].sum(),tr['BS_outnum'].sum()/tr.shape[0]))
In [68]:
%%time
print("="*60+" {} ".format('BS_clean').upper()+"="*60+"\n")
plt.figure()
fig, ax = plt.subplots(4,4,figsize=(20,8)) #(nrow, ncol)
for i, mt in enumerate(mt_ls):
    # i += 1
    plt.subplot(4,4,i+1)
    plt.scatter(tr[tr['matchType']==mt]['winPlacePerc'], tr[tr['matchType']==mt]['BS_clean'], marker='+', c='lightseagreen')
    #plt.xlabel(feat, fontsize=9)
    #plt.xticks([1,8,15,22,29])
    plt.title(mt.upper(), fontsize=9)
plt.tight_layout()
plt.show();
Detecting Outlier w/ Condition¶
- zero walk distance but kills
- winPlacePerc == 1?
- road kills vs. rideDistance (low ride distance but many road kills)
- weaponsAcquired == 0 and a high winPlacePerc
- heals == 0 and a high winPlacePerc
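The conditions in this list can each be written as a boolean mask. The sketch below does that on a few made-up rows; the flag names and the `high_win` cutoff are hypothetical and only there for illustration, while the column names follow the competition data.

```python
import pandas as pd

# Toy rows (made up for illustration).
toy = pd.DataFrame({
    "walkDistance": [0.0, 120.0, 0.0, 900.0],
    "rideDistance": [0.0, 0.0, 0.0, 2500.0],
    "kills": [3, 0, 0, 2],
    "roadKills": [0, 1, 0, 1],
    "weaponsAcquired": [2, 0, 0, 4],
    "heals": [0, 0, 0, 5],
    "winPlacePerc": [0.9, 1.0, 0.0, 0.7],
})

high_win = 0.8  # hypothetical cutoff for "high winPlacePerc"

# One boolean column per condition; every comparison gets its own parentheses.
flags = pd.DataFrame({
    "kill_without_walking": (toy["walkDistance"] == 0) & (toy["kills"] > 0),
    "roadkill_without_riding": (toy["rideDistance"] == 0) & (toy["roadKills"] > 0),
    "no_weapon_high_place": (toy["weaponsAcquired"] == 0) & (toy["winPlacePerc"] > high_win),
    "no_heal_high_place": (toy["heals"] == 0) & (toy["winPlacePerc"] > high_win),
})
print(flags.sum())
```

Keeping the flags in separate columns makes it easy to count each condition on its own or to combine them later with `flags.any(axis=1)`.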
1. Not moving and Kill¶
Compute the total distance moved.
In [69]:
tr['totalDistance'] = tr[tr.filter(like='Distance',axis=1).columns].sum(axis=1)
tr.filter(like='Distance',axis=1).head()
Out[69]:
Not moving suggests that the player joined the match and left right after it actually started.
In [70]:
tr[tr['totalDistance']==0].shape[0]/tr.shape[0], tr[tr['totalDistance']==0].shape[0]
Out[70]:
In [71]:
tr[tr['walkDistance']==0].shape[0]/tr.shape[0]
Out[71]:
In [72]:
tr['notmovingkill'] = np.where((tr['totalDistance']==0)&(tr['kills']!=0), 1, 0)
tr[tr['notmovingkill']==1].shape
Out[72]:
In [73]:
tr['notwalkingkill'] = np.where((tr['walkDistance']==0)&(tr['kills']!=0), 1, 0)
tr[tr['notwalkingkill']==1].shape
Out[73]:
In [74]:
tr[tr['notmovingkill']==1][tr.columns[3:-13]]
Out[74]:
2. Plot Target¶
In [75]:
tr[tr['winPlacePerc'].isnull()]
Out[75]:
In [76]:
tr.drop(2744604, inplace=True)
In [77]:
plt.figure(figsize=(20,5))
sns.distplot(tr['winPlacePerc'], bins=100)
# plt.xticks(np.arange(0, 300, step=10))
plt.show()
How many players have a winPlacePerc of 100%?
In [78]:
tr[tr['winPlacePerc']==1].shape[0],tr[tr['winPlacePerc']==1].shape[0]/tr.shape[0]
Out[78]:
3. Road kill and RideDistance¶
In [79]:
tr[(tr['rideDistance']==0)&(tr['roadKills']!=0)].shape  # parenthesize each comparison: & binds tighter than !=
Out[79]:
In [80]:
tr[(tr['rideDistance']==0)&(tr['roadKills']!=0)]['roadKills'].describe()  # parenthesize each comparison
Out[80]:
Definitely outliers.
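One thing to watch when combining conditions like these: in Python, `&` is applied before `==` and `!=`, so each comparison needs its own parentheses, otherwise the mask is not the one you meant. The sketch below shows the rule with plain Python values first and then builds the intended mask on made-up rows.

```python
import pandas as pd

# Plain-Python illustration of the precedence rule.
print(True & 2 != 0)    # parsed as (True & 2) != 0, i.e. 0 != 0, so this prints False
print(True & (2 != 0))  # the comparison that was intended; prints True

# Toy rows (made up); column names follow the competition data.
toy = pd.DataFrame({
    "rideDistance": [0.0, 0.0, 0.0, 500.0],
    "roadKills": [0, 1, 2, 3],
})

# Intended mask: no ride distance at all, but at least one road kill.
mask = (toy["rideDistance"] == 0) & (toy["roadKills"] != 0)
print(toy[mask])
```

The same applies to the zero-weapon checks on `winPlacePerc` in the next section.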
4. Weapon Acquired and win¶
In [81]:
tr[(tr['weaponsAcquired']==0)&(tr['winPlacePerc']==0)].shape[0],tr[(tr['weaponsAcquired']==0)&(tr['winPlacePerc']==0)].shape[0]/tr.shape[0]
Out[81]:
In [82]:
tr[(tr['weaponsAcquired']==0)&(tr['winPlacePerc']!=0)].shape[0],tr[(tr['weaponsAcquired']==0)&(tr['winPlacePerc']!=0)].shape[0]/tr.shape[0]
Out[82]:
In [83]:
tr[(tr['weaponsAcquired']==0)&(tr['winPlacePerc']!=0)]['winPlacePerc'].describe()
Out[83]:
In [84]:
tr[(tr['weaponsAcquired']==0)&(tr['winPlacePerc']>0.2)].shape[0]
Out[84]:
In [85]:
tr[(tr['weaponsAcquired']==0)&(tr['winPlacePerc']>0.5)].shape[0]
Out[85]:
Acquiring no weapons means you have to survive with your bare hands.
This indicates two things. First, a low winPlacePerc with zero weapons means the player is not actually playing.
Second, a high winPlacePerc with zero weapons suggests a cheater.
I have to choose a winPlacePerc threshold for zero-weapon players in order to cull the cheaters.
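To help pick that threshold, it may be useful to see how the zero-weapon players are spread over winPlacePerc ranges. A minimal sketch with `pd.cut` on made-up rows follows; the bin edges are my own illustrative choice, not something taken from the data.

```python
import pandas as pd

# Toy rows (made up); column names follow the competition data.
toy = pd.DataFrame({
    "weaponsAcquired": [0, 0, 0, 0, 3, 0],
    "winPlacePerc": [0.0, 0.3, 0.85, 1.0, 0.9, 0.55],
})

# Keep only the players who never picked up a weapon.
no_weapon = toy.loc[toy["weaponsAcquired"] == 0, "winPlacePerc"]

# Illustrative bin edges; include_lowest=True keeps winPlacePerc == 0 in the first bin.
bins = [0.0, 0.2, 0.5, 0.8, 1.0]
counts = pd.cut(no_weapon, bins=bins, include_lowest=True).value_counts().sort_index()
print(counts)
```

On the real data, a sharp drop in the counts between two neighbouring bins would be one hint for where to place the cutoff.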
In [86]:
tr[(tr['weaponsAcquired']==0)&(tr['winPlacePerc']==1)].shape[0]
Out[86]:
These 201 players are definitely cheaters.
In [87]:
tr[(tr['weaponsAcquired']==0)&(tr['winPlacePerc']>0.5)]['winPlacePerc'].describe()
Out[87]:
In [88]:
tr[(tr['weaponsAcquired']==0)&(tr['winPlacePerc']>0.8)].shape[0]
Out[88]:
I assume these 2022 players are very likely cheaters.
5. Heals and Boost and win¶
In [89]:
tr[tr['heals']==0]['winPlacePerc'].describe()
Out[89]:
Some players win the game with zero heals.
In [90]:
tr[(tr['heals']==0)&(tr['winPlacePerc']!=0)]['winPlacePerc'].describe()
Out[90]:
In [91]:
tr[(tr['heals']==0)&(tr['winPlacePerc']>0.5)]['winPlacePerc'].describe()
Out[91]:
In [92]:
tr[(tr['heals']==0)&(tr['winPlacePerc']==1)].shape[0]
Out[92]:
These 20889 players won without using a single heal.
This can happen with luck, or in squad mode when teammates carry the win.
Even so, 20889 seems like a large number to me.
In [93]:
tr[(tr['heals']==0)&(tr['winPlacePerc']>0.8)].shape[0], tr[(tr['heals']==0)&(tr['winPlacePerc']>0.8)].shape[0]/tr.shape[0]
Out[93]:
Heals n Boosts¶
In [94]:
tr['HnB'] = tr[['heals','boosts']].sum(axis=1)
In [95]:
tr[(tr['HnB']==0)&(tr['winPlacePerc']==1)].shape[0]
Out[95]:
These 7075 players are definitely cheaters.
No heals. No boosts. How did those players win the game?
In [96]:
tr[(tr['HnB']==0)&(tr['winPlacePerc']>0.5)]['winPlacePerc'].describe()
Out[96]:
In [97]:
tr[(tr['HnB']==0)&(tr['winPlacePerc']>0.8)]['winPlacePerc'].shape[0]
Out[97]:
I can say they are cheaters.