There are 7 families of malware used for training and prediction in this machine learning problem. Each is assigned a numerical value from 0 to 6 as follows:
(BHO, CeeInject, FakeRean, OnLineGames, Renos, Vobfus, Winwebsec) = (0, 1, 2, 3, 4, 5, 6)
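The mapping above can be written as a lookup table. This is a minimal Python sketch; the names `FAMILIES`, `LABEL_OF`, and `FAMILY_OF` are illustrative and are not taken from the project code.

```python
# Family names in label order, as listed above.
FAMILIES = ["BHO", "CeeInject", "FakeRean", "OnLineGames", "Renos", "Vobfus", "Winwebsec"]

# Name -> numeric label (0..6), and the reverse lookup (hypothetical helper names).
LABEL_OF = {name: label for label, name in enumerate(FAMILIES)}
FAMILY_OF = dict(enumerate(FAMILIES))

print(LABEL_OF["Renos"])  # 4
print(FAMILY_OF[6])       # Winwebsec
```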
The feature vectors were built by extracting a 328-byte vector from each malware sample file. The first 64 bytes of the file are extracted, and then 264 bytes are appended, starting at the offset stored in the 60th byte. This gives a 328-byte feature vector for each sample. There are 6300 such labelled samples.
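The extraction step can be sketched in Python as follows. Several details are assumptions, since the text does not fix them: the offset is read as a single byte at zero-based index 60, and files that are too short are zero-padded. If the samples are Windows PE files, the header field at offset 0x3C is four bytes wide, so the original extraction may read more than one byte. The function name `extract_feature_vector` is hypothetical.

```python
HEADER_LEN = 64     # bytes taken from the start of the file
BODY_LEN = 264      # bytes taken from the stored offset
OFFSET_INDEX = 60   # assumption: zero-based position of the offset byte


def extract_feature_vector(data: bytes) -> bytes:
    """Return 328 bytes: 64 header bytes followed by 264 bytes at the stored offset."""
    # Assumption: short inputs are zero-padded so the result is always 328 bytes.
    header = data[:HEADER_LEN].ljust(HEADER_LEN, b"\x00")
    # Assumption: the offset is the single byte at OFFSET_INDEX (0 if the file is too short).
    offset = data[OFFSET_INDEX] if len(data) > OFFSET_INDEX else 0
    body = data[offset:offset + BODY_LEN].ljust(BODY_LEN, b"\x00")
    return header + body


# Synthetic 512-byte input (not real malware), with the offset byte set to 128.
sample = bytearray(range(256)) * 2
sample[OFFSET_INDEX] = 128
features = extract_feature_vector(bytes(sample))
print(len(features))  # 328
```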
For the word2vec data, the byte feature vectors are used to compute embedding vectors of length N = 2 with window size W = 7. Each word2vec feature vector has length 512.
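The stated length of 512 is consistent with one 2-dimensional embedding per possible byte value (256 × 2 = 512), although the text does not spell this out. Under that assumption, and leaving the word2vec training itself aside, the flattening step could look like the sketch below. The function `flatten_embeddings` and the zero vector for byte values that never occur are hypothetical choices.

```python
from typing import Dict, List, Sequence

EMBED_DIM = 2       # N, the embedding length given in the text
VOCAB_SIZE = 256    # assumption: one "word" per possible byte value


def flatten_embeddings(table: Dict[int, Sequence[float]]) -> List[float]:
    """Concatenate per-byte embeddings, in byte-value order, into one flat vector."""
    flat: List[float] = []
    for byte_value in range(VOCAB_SIZE):
        # Assumption: byte values missing from the table get a zero embedding.
        flat.extend(table.get(byte_value, (0.0,) * EMBED_DIM))
    return flat


# Toy embedding table with made-up values, only to show the resulting shape.
toy = {0: (0.1, -0.2), 255: (0.3, 0.4)}
print(len(flatten_embeddings(toy)))  # 512
```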
The file Model_Predictions_and_Actual.xlsx compares these models by setting their predictions for the 700 unlabelled samples against the actual labels of those samples. The Random Forest Classifier appears to work best for this multi-class problem, with an accuracy of 97.14%.
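For reference, accuracy here is the fraction of samples whose predicted label matches the actual label; with 700 samples, 97.14% corresponds to 680 correct predictions (680 / 700). The sketch below shows the calculation on synthetic labels, not the project's data, and the function name `accuracy` is illustrative.

```python
from typing import Sequence


def accuracy(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Fraction of positions where the predicted label equals the actual label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("no samples")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Synthetic labels: 700 samples over 7 classes, 20 of them deliberately wrong.
actual = [i % 7 for i in range(700)]
predicted = list(actual)
for i in range(20):
    predicted[i] = (actual[i] + 1) % 7
print(f"{accuracy(predicted, actual):.2%}")  # 97.14%
```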