My Exploratory Data Analysis for Microsoft Malware Prediction
Competition Information This is the second Microsoft Malware Hosted competition on Kaggle. For more detials, please see here. The malware industry continues to be a well-organized, well-funded market dedicated to evading traditional security measures. Once a computer is infected by malware, criminals can hurt consumers and enterprises in many ways.With more than one billion enterprise and consumer customers, Microsoft takes this problem very seriously and is deeply invested in improving security. As one part of their overall strategy for doing so, Microsoft is challenging the data science community to develop techniques to predict if a machine will soon be hit with malware. As with their previous, Malware Challenge (2015), Microsoft is providing Kagglers with an unprecedented malware dataset to encourage open-source progress on effective techniques for predicting malware occurrences.
Goals The goal of this competition is to predict a Windows machine’s probability of getting infected by various families of malware, based on different properties of that machine. The telemetry data containing these properties and the machine infections was generated by combining heartbeat and threat reports collected by Microsoft’s endpoint protection solution, Windows Defender. Each row in this dataset corresponds to a machine, uniquely identified by a MachineIdentifier. HasDetections is the ground truth and indicates that Malware was detected on the machine. Using the information and labels in train.csv, you must predict the value for HasDetections for each machine in test.csv.
About This Kernel In this kernal, I would like to recored my personal EDA for Microsoft Malware competition.
Imported Packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
import time
import sys
import datetime
plt.style.use('seaborn')
pd.set_option('display.max_columns', 500)
Data Types
This part was taken from the helpful kernel.
dtypes = {
'MachineIdentifier': 'category',
'ProductName': 'category',
'EngineVersion': 'category',
'AppVersion': 'category',
'AvSigVersion': 'category',
'IsBeta': 'int8',
'RtpStateBitfield': 'float16',
'IsSxsPassiveMode': 'int8',
'DefaultBrowsersIdentifier': 'float16',
'AVProductStatesIdentifier': 'float32',
'AVProductsInstalled': 'float16',
'AVProductsEnabled': 'float16',
'HasTpm': 'int8',
'CountryIdentifier': 'int16',
'CityIdentifier': 'float32',
'OrganizationIdentifier': 'float16',
'GeoNameIdentifier': 'float16',
'LocaleEnglishNameIdentifier': 'int8',
'Platform': 'category',
'Processor': 'category',
'OsVer': 'category',
'OsBuild': 'int16',
'OsSuite': 'int16',
'OsPlatformSubRelease': 'category',
'OsBuildLab': 'category',
'SkuEdition': 'category',
'IsProtected': 'float16',
'AutoSampleOptIn': 'int8',
'PuaMode': 'category',
'SMode': 'float16',
'IeVerIdentifier': 'float16',
'SmartScreen': 'category',
'Firewall': 'float16',
'UacLuaenable': 'float32',
'Census_MDC2FormFactor': 'category',
'Census_DeviceFamily': 'category',
'Census_OEMNameIdentifier': 'float16',
'Census_OEMModelIdentifier': 'float32',
'Census_ProcessorCoreCount': 'float16',
'Census_ProcessorManufacturerIdentifier': 'float16',
'Census_ProcessorModelIdentifier': 'float16',
'Census_ProcessorClass': 'category',
'Census_PrimaryDiskTotalCapacity': 'float32',
'Census_PrimaryDiskTypeName': 'category',
'Census_SystemVolumeTotalCapacity': 'float32',
'Census_HasOpticalDiskDrive': 'int8',
'Census_TotalPhysicalRAM': 'float32',
'Census_ChassisTypeName': 'category',
'Census_InternalPrimaryDiagonalDisplaySizeInInches': 'float16',
'Census_InternalPrimaryDisplayResolutionHorizontal': 'float16',
'Census_InternalPrimaryDisplayResolutionVertical': 'float16',
'Census_PowerPlatformRoleName': 'category',
'Census_InternalBatteryType': 'category',
'Census_InternalBatteryNumberOfCharges': 'float32',
'Census_OSVersion': 'category',
'Census_OSArchitecture': 'category',
'Census_OSBranch': 'category',
'Census_OSBuildNumber': 'int16',
'Census_OSBuildRevision': 'int32',
'Census_OSEdition': 'category',
'Census_OSSkuName': 'category',
'Census_OSInstallTypeName': 'category',
'Census_OSInstallLanguageIdentifier': 'float16',
'Census_OSUILocaleIdentifier': 'int16',
'Census_OSWUAutoUpdateOptionsName': 'category',
'Census_IsPortableOperatingSystem': 'int8',
'Census_GenuineStateName': 'category',
'Census_ActivationChannel': 'category',
'Census_IsFlightingInternal': 'float16',
'Census_IsFlightsDisabled': 'float16',
'Census_FlightRing': 'category',
'Census_ThresholdOptIn': 'float16',
'Census_FirmwareManufacturerIdentifier': 'float16',
'Census_FirmwareVersionIdentifier': 'float32',
'Census_IsSecureBootEnabled': 'int8',
'Census_IsWIMBootEnabled': 'float16',
'Census_IsVirtualDevice': 'float16',
'Census_IsTouchEnabled': 'int8',
'Census_IsPenCapable': 'int8',
'Census_IsAlwaysOnAlwaysConnectedCapable': 'float16',
'Wdft_IsGamer': 'float16',
'Wdft_RegionIdentifier': 'float16',
'HasDetections': 'int8'
}
%time train_data = pd.read_csv("data/train.csv", dtype=dtypes)
%time test_data = pd.read_csv("data/test.csv", dtype=dtypes)
Wall time: 2min 7s
Wall time: 1min 51s
Exploratory Data Analysis
train_data.head(20)
’HasDetections’ Distribustion
train_data['HasDetections'].value_counts()
0 4462591
1 4458892
Name: HasDetections, dtype: int64
Missing Vaules
Missing Values in Training Data
total = train_data.isnull().sum().sort_values(ascending = False)
percent = (train_data.isnull().sum()/train_data.isnull().count()*100).sort_values(ascending = False)
missing_train_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_train_data.head(50)
{% raw %}
| Total | Percent | |
|---|---|---|
| PuaMode | 8919174 | 99.974119 |
| Census_ProcessorClass | 8884852 | 99.589407 |
| DefaultBrowsersIdentifier | 8488045 | 95.141637 |
| Census_IsFlightingInternal | 7408759 | 83.044030 |
| Census_InternalBatteryType | 6338429 | 71.046809 |
| Census_ThresholdOptIn | 5667325 | 63.524472 |
| Census_IsWIMBootEnabled | 5659703 | 63.439038 |
| SmartScreen | 3177011 | 35.610795 |
| OrganizationIdentifier | 2751518 | 30.841487 |
| SMode | 537759 | 6.027686 |
| CityIdentifier | 325409 | 3.647477 |
| Wdft_IsGamer | 303451 | 3.401352 |
| Wdft_RegionIdentifier | 303451 | 3.401352 |
| Census_InternalBatteryNumberOfCharges | 268755 | 3.012448 |
| Census_FirmwareManufacturerIdentifier | 183257 | 2.054109 |
| Census_IsFlightsDisabled | 160523 | 1.799286 |
| Census_FirmwareVersionIdentifier | 160133 | 1.794915 |
| Census_OEMModelIdentifier | 102233 | 1.145919 |
| Census_OEMNameIdentifier | 95478 | 1.070203 |
| Firewall | 91350 | 1.023933 |
| Census_TotalPhysicalRAM | 80533 | 0.902686 |
| Census_IsAlwaysOnAlwaysConnectedCapable | 71343 | 0.799676 |
| Census_OSInstallLanguageIdentifier | 60084 | 0.673475 |
| IeVerIdentifier | 58894 | 0.660137 |
| Census_PrimaryDiskTotalCapacity | 53016 | 0.594251 |
| Census_SystemVolumeTotalCapacity | 53002 | 0.594094 |
| Census_InternalPrimaryDiagonalDisplaySizeInInches | 47134 | 0.528320 |
| Census_InternalPrimaryDisplayResolutionHorizontal | 46986 | 0.526661 |
| Census_InternalPrimaryDisplayResolutionVertical | 46986 | 0.526661 |
| Census_ProcessorModelIdentifier | 41343 | 0.463410 |
| Census_ProcessorManufacturerIdentifier | 41313 | 0.463073 |
| Census_ProcessorCoreCount | 41306 | 0.462995 |
| AVProductsEnabled | 36221 | 0.405998 |
| AVProductsInstalled | 36221 | 0.405998 |
| AVProductStatesIdentifier | 36221 | 0.405998 |
| IsProtected | 36044 | 0.404014 |
| RtpStateBitfield | 32318 | 0.362249 |
| Census_IsVirtualDevice | 15953 | 0.178816 |
| Census_PrimaryDiskTypeName | 12844 | 0.143967 |
| UacLuaenable | 10838 | 0.121482 |
| Census_ChassisTypeName | 623 | 0.006983 |
| GeoNameIdentifier | 213 | 0.002387 |
| Census_PowerPlatformRoleName | 55 | 0.000616 |
| OsBuildLab | 21 | 0.000235 |
| LocaleEnglishNameIdentifier | 0 | 0.000000 |
| AvSigVersion | 0 | 0.000000 |
| OsPlatformSubRelease | 0 | 0.000000 |
| Processor | 0 | 0.000000 |
| OsVer | 0 | 0.000000 |
| AppVersion | 0 | 0.000000 |
Missing Values in Testing Data
total = test_data.isnull().sum().sort_values(ascending = False)
percent = (test_data.isnull().sum()/test_data.isnull().count()*100).sort_values(ascending = False)
missing_test_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_test_data.head(50)
{% raw %}
| Total | Percent | |
|---|---|---|
| PuaMode | 7851065 | 99.972139 |
| Census_ProcessorClass | 7835022 | 99.767854 |
| DefaultBrowsersIdentifier | 7546134 | 96.089277 |
| Census_IsFlightingInternal | 6673962 | 84.983408 |
| Census_InternalBatteryType | 5979844 | 76.144803 |
| SMode | 5831272 | 74.252950 |
| Census_ThresholdOptIn | 5529515 | 70.410504 |
| Census_IsWIMBootEnabled | 5522707 | 70.323814 |
| SmartScreen | 3498402 | 44.547170 |
| OrganizationIdentifier | 2482129 | 31.606380 |
| Wdft_IsGamer | 302857 | 3.856453 |
| Wdft_RegionIdentifier | 302857 | 3.856453 |
| Census_InternalBatteryNumberOfCharges | 239233 | 3.046292 |
| Census_FirmwareManufacturerIdentifier | 198634 | 2.529321 |
| CityIdentifier | 191962 | 2.444363 |
| Census_FirmwareVersionIdentifier | 177624 | 2.261789 |
| Census_IsFlightsDisabled | 125801 | 1.601897 |
| Census_OEMModelIdentifier | 95935 | 1.221596 |
| Census_TotalPhysicalRAM | 95051 | 1.210339 |
| Census_IsAlwaysOnAlwaysConnectedCapable | 90152 | 1.147957 |
| Census_OEMNameIdentifier | 89546 | 1.140241 |
| Census_PrimaryDiskTotalCapacity | 74701 | 0.951211 |
| Census_SystemVolumeTotalCapacity | 74690 | 0.951071 |
| Census_ProcessorModelIdentifier | 61314 | 0.780747 |
| Census_ProcessorManufacturerIdentifier | 61281 | 0.780326 |
| Census_ProcessorCoreCount | 61277 | 0.780275 |
| Census_OSInstallLanguageIdentifier | 58743 | 0.748009 |
| Firewall | 58472 | 0.744558 |
| IeVerIdentifier | 49796 | 0.634081 |
| Census_InternalPrimaryDiagonalDisplaySizeInInches | 42610 | 0.542578 |
| Census_InternalPrimaryDisplayResolutionVertical | 42511 | 0.541317 |
| Census_InternalPrimaryDisplayResolutionHorizontal | 42511 | 0.541317 |
| RtpStateBitfield | 32222 | 0.410301 |
| AVProductStatesIdentifier | 23767 | 0.302639 |
| AVProductsEnabled | 23767 | 0.302639 |
| AVProductsInstalled | 23767 | 0.302639 |
| IsProtected | 23649 | 0.301136 |
| Census_IsVirtualDevice | 17291 | 0.220176 |
| UacLuaenable | 7865 | 0.100150 |
| Census_PrimaryDiskTypeName | 6807 | 0.086677 |
| Census_ChassisTypeName | 341 | 0.004342 |
| GeoNameIdentifier | 147 | 0.001872 |
| Census_PowerPlatformRoleName | 26 | 0.000331 |
| OsBuildLab | 17 | 0.000216 |
| Census_OSEdition | 5 | 0.000064 |
| Census_GenuineStateName | 1 | 0.000013 |
| Platform | 0 | 0.000000 |
| CountryIdentifier | 0 | 0.000000 |
| Census_IsTouchEnabled | 0 | 0.000000 |
| LocaleEnglishNameIdentifier | 0 | 0.000000 |
Unique Column Values
In this section, I would like to know the values of each column. Specifically, we would like to know the unique values and if there is any value that only exists on the training set.
training_data_columns = list(train_data.columns)
testing_data_columns = list(test_data.columns)
unique_val = pd.DataFrame(index = [], columns = ['TotalUniqueCount','SharedUniqueCount','TrainUniqueCount','OnlyInTrain','TestUniqueCount', 'OnlyInTest'])
for column in testing_data_columns[1:]:
unique_val_train = train_data[column].unique()
unique_val_test = test_data[column].unique()
only_in_training = list(set(unique_val_train) - set(unique_val_test))
only_in_testing = list(set(unique_val_test) - set(unique_val_train))
total_unique_val_count = len(set(unique_val_train) | set(unique_val_test))
shared_unique_val_count = len(set(unique_val_train) & set(unique_val_test))
train_unique_val_count = len(unique_val_train)
train_only_unique_val_count = len(only_in_training)
test_unique_val_count = len(unique_val_test)
test_only_unique_val_count = len(only_in_testing)
unique_val.loc[column] = [total_unique_val_count, shared_unique_val_count,
train_unique_val_count, train_only_unique_val_count,
test_unique_val_count, test_only_unique_val_count]
# print(f'{column}:')
# print(f'----------------')
# print(f'Total unique count: {total_unique_count}')
# print(f'Train unique count: {train_unique_val_count}')
# print(f'Test unique count: {test_unique_val_count}')
# print(f'Only in train: {only_in_training[0:5]}{"..." if len(only_in_training) > 5 else ""}')
# print(f'Only in test: {only_in_testing[0:5]}{"..." if len(only_in_testing) > 5 else ""}')
# print('=================================================================')
# print()
unique_val
{% raw %}
| TotalUniqueCount | SharedUniqueCount | TrainUniqueCount | OnlyInTrain | TestUniqueCount | OnlyInTest | |
|---|---|---|---|---|---|---|
| ProductName | 6 | 6 | 6 | 0 | 6 | 0 |
| EngineVersion | 74 | 66 | 70 | 4 | 70 | 4 |
| AppVersion | 124 | 106 | 110 | 4 | 120 | 14 |
| AvSigVersion | 9623 | 8265 | 8531 | 266 | 9357 | 1092 |
| IsBeta | 2 | 2 | 2 | 0 | 2 | 0 |
| RtpStateBitfield | 10 | 7 | 8 | 1 | 9 | 2 |
| IsSxsPassiveMode | 2 | 2 | 2 | 0 | 2 | 0 |
| DefaultBrowsersIdentifier | 2179 | 1101 | 1731 | 630 | 1549 | 448 |
| AVProductStatesIdentifier | 39833 | 12631 | 28971 | 16340 | 23493 | 10862 |
| AVProductsInstalled | 10 | 6 | 9 | 3 | 7 | 1 |
| AVProductsEnabled | 8 | 6 | 7 | 1 | 7 | 1 |
| HasTpm | 2 | 2 | 2 | 0 | 2 | 0 |
| CountryIdentifier | 222 | 222 | 222 | 0 | 222 | 0 |
| CityIdentifier | 130525 | 82660 | 107367 | 24707 | 105818 | 23158 |
| OrganizationIdentifier | 53 | 48 | 50 | 2 | 51 | 3 |
| GeoNameIdentifier | 297 | 286 | 293 | 7 | 290 | 4 |
| LocaleEnglishNameIdentifier | 254 | 251 | 252 | 1 | 253 | 2 |
| Platform | 4 | 4 | 4 | 0 | 4 | 0 |
| Processor | 3 | 3 | 3 | 0 | 3 | 0 |
| OsVer | 69 | 33 | 58 | 25 | 44 | 11 |
| OsBuild | 101 | 53 | 76 | 23 | 78 | 25 |
| OsSuite | 14 | 13 | 14 | 1 | 13 | 0 |
| OsPlatformSubRelease | 9 | 9 | 9 | 0 | 9 | 0 |
| OsBuildLab | 776 | 562 | 664 | 102 | 674 | 112 |
| SkuEdition | 8 | 8 | 8 | 0 | 8 | 0 |
| IsProtected | 4 | 2 | 3 | 1 | 3 | 1 |
| AutoSampleOptIn | 2 | 2 | 2 | 0 | 2 | 0 |
| PuaMode | 3 | 3 | 3 | 0 | 3 | 0 |
| SMode | 4 | 2 | 3 | 1 | 3 | 1 |
| IeVerIdentifier | 352 | 247 | 304 | 57 | 295 | 48 |
| ... | ... | ... | ... | ... | ... | ... |
| Census_InternalBatteryType | 93 | 50 | 79 | 29 | 64 | 14 |
| Census_InternalBatteryNumberOfCharges | 52838 | 24610 | 41088 | 16478 | 36360 | 11750 |
| Census_OSVersion | 579 | 365 | 469 | 104 | 475 | 110 |
| Census_OSArchitecture | 3 | 3 | 3 | 0 | 3 | 0 |
| Census_OSBranch | 40 | 21 | 32 | 11 | 29 | 8 |
| Census_OSBuildNumber | 216 | 105 | 165 | 60 | 156 | 51 |
| Census_OSBuildRevision | 325 | 254 | 285 | 31 | 294 | 40 |
| Census_OSEdition | 40 | 30 | 33 | 3 | 37 | 7 |
| Census_OSSkuName | 32 | 29 | 30 | 1 | 31 | 2 |
| Census_OSInstallTypeName | 9 | 9 | 9 | 0 | 9 | 0 |
| Census_OSInstallLanguageIdentifier | 41 | 39 | 40 | 1 | 40 | 1 |
| Census_OSUILocaleIdentifier | 151 | 135 | 147 | 12 | 139 | 4 |
| Census_OSWUAutoUpdateOptionsName | 6 | 6 | 6 | 0 | 6 | 0 |
| Census_IsPortableOperatingSystem | 2 | 2 | 2 | 0 | 2 | 0 |
| Census_GenuineStateName | 6 | 5 | 5 | 0 | 6 | 1 |
| Census_ActivationChannel | 6 | 6 | 6 | 0 | 6 | 0 |
| Census_IsFlightingInternal | 4 | 2 | 3 | 1 | 3 | 1 |
| Census_IsFlightsDisabled | 4 | 2 | 3 | 1 | 3 | 1 |
| Census_FlightRing | 11 | 10 | 10 | 0 | 11 | 1 |
| Census_ThresholdOptIn | 4 | 2 | 3 | 1 | 3 | 1 |
| Census_FirmwareManufacturerIdentifier | 861 | 575 | 713 | 138 | 723 | 148 |
| Census_FirmwareVersionIdentifier | 58959 | 41348 | 50495 | 9147 | 49812 | 8464 |
| Census_IsSecureBootEnabled | 2 | 2 | 2 | 0 | 2 | 0 |
| Census_IsWIMBootEnabled | 4 | 2 | 3 | 1 | 3 | 1 |
| Census_IsVirtualDevice | 4 | 2 | 3 | 1 | 3 | 1 |
| Census_IsTouchEnabled | 2 | 2 | 2 | 0 | 2 | 0 |
| Census_IsPenCapable | 2 | 2 | 2 | 0 | 2 | 0 |
| Census_IsAlwaysOnAlwaysConnectedCapable | 4 | 2 | 3 | 1 | 3 | 1 |
| Wdft_IsGamer | 4 | 2 | 3 | 1 | 3 | 1 |
| Wdft_RegionIdentifier | 17 | 15 | 16 | 1 | 16 | 1 |
81 rows × 6 columns