Part of CS 271 at SJSU. 7 families of malware used for training and testing various machine learning algorithms for this problem.

botdotcom/MalwareClassificationML

MalwareClassificationML

There are 7 families of malware used for training and prediction in this machine learning problem. Each is assigned a numerical value from 0 to 6 as follows:

(BHO, CeeInject, FakeRean, OnLineGames, Renos, Vobfus, Winwebsec) = (0, 1, 2, 3, 4, 5, 6)
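The mapping above can be written down as a small lookup table. This is only a sketch: the names `FAMILIES` and `LABELS` are illustrative and do not come from the repository.

```python
# Family names in label order, as listed above (index == numerical label).
FAMILIES = ["BHO", "CeeInject", "FakeRean", "OnLineGames",
            "Renos", "Vobfus", "Winwebsec"]

# Hypothetical helper: family name -> numerical label.
LABELS = {name: label for label, name in enumerate(FAMILIES)}
```

`LABELS["Renos"]` gives the numerical label for a family, and `FAMILIES[label]` converts a predicted label back to its family name.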

Each feature vector is a 328-byte vector extracted from a malware sample file. The first 64 bytes of the file are taken, and then 264 bytes are appended, starting at the offset stored in the 60th byte. This gives a 328-byte feature vector for each sample. There are 6300 such labelled samples.
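The extraction described above might look roughly like the sketch below. It makes several assumptions that are not stated in this README: the offset is read as a 4-byte little-endian integer at index 60 (0x3C), which is where the PE format stores the offset of its PE header; files shorter than expected are padded with zero bytes; and the function name `extract_features` is hypothetical.

```python
import struct

HEAD_LEN = 64     # bytes taken from the start of the file
TAIL_LEN = 264    # bytes taken from the offset found in the header
OFFSET_POS = 60   # assumption: 4-byte little-endian offset stored at index 60


def extract_features(data: bytes) -> bytes:
    """Return a 328-byte feature vector for one sample (illustrative sketch)."""
    # First 64 bytes, zero-padded if the file is shorter (assumption).
    head = data[:HEAD_LEN].ljust(HEAD_LEN, b"\x00")
    # Read the offset stored at position 60 of the header.
    (offset,) = struct.unpack_from("<I", head, OFFSET_POS)
    # 264 bytes starting at that offset, zero-padded if they run past the end.
    tail = data[offset:offset + TAIL_LEN].ljust(TAIL_LEN, b"\x00")
    return head + tail
```

Calling `extract_features(open(path, "rb").read())` on a sample file would then always return exactly 328 bytes, regardless of the file's size.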

For the word2vec data, the byte feature vectors are used to train word2vec embeddings with embedding length N = 2 and window size W = 7. Each resulting word2vec feature vector has length 512.
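The README does not say how the length of 512 arises. One reading that fits the numbers is that each sample contributes one length-2 embedding for each of the 256 possible byte values, concatenated in byte order (256 × 2 = 512). The sketch below shows only that flattening step under this assumption; the function name `flatten_embeddings`, the zero-fill for byte values that never occur in a sample, and the input format are all hypothetical, and the word2vec training itself is not shown.

```python
EMBED_DIM = 2           # N = 2 from the text above
NUM_BYTE_VALUES = 256   # possible values of a single byte


def flatten_embeddings(embeddings):
    """Concatenate per-byte embeddings into one flat feature vector.

    `embeddings` maps a byte value (0..255) to its length-2 embedding.
    Byte values missing from the mapping are filled with zeros (assumption).
    """
    flat = []
    for value in range(NUM_BYTE_VALUES):
        vec = embeddings.get(value, [0.0] * EMBED_DIM)
        if len(vec) != EMBED_DIM:
            raise ValueError(f"embedding for byte {value} has wrong length")
        flat.extend(vec)
    return flat
```

Under this reading, the output always has 256 × 2 = 512 entries, matching the stated feature vector length.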

Model_Predictions_and_Actual.xlsx compares the models by setting the labels each one predicts for the 700 unlabelled samples against the actual labels of that data. The Random Forest classifier seems to work best for this multi-class problem, with an accuracy of 97.14%.
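Accuracy here is the fraction of samples whose predicted label equals the actual label. On 700 samples, 97.14% corresponds to 680 correct predictions (680 / 700 ≈ 0.9714). The sketch below computes it from two label lists; the data is synthetic, built only to reproduce that ratio, and is not taken from the spreadsheet.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label matches the actual one."""
    if len(predicted) != len(actual):
        raise ValueError("label lists must have the same length")
    if not actual:
        raise ValueError("label lists must not be empty")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Synthetic illustration: 700 labels in 0..6, with 20 deliberately wrong.
actual = [i % 7 for i in range(700)]
predicted = list(actual)
for i in range(20):
    predicted[i] = (actual[i] + 1) % 7

print(f"{accuracy(predicted, actual):.2%}")  # prints 97.14%
```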
