EMBER – An Elastic Malware Benchmark For Empowering Researchers
A dataset covering 1.1 million PE file samples from 2017 and 2018 is now available for researchers. The dataset is open source and contains metadata, features derived from the PE files, and a benchmark model trained on those features, so researchers can use it to evaluate machine-learning techniques. EMBER is available from the EMBER repository.
Researchers can use EMBER to test malware detection models and compare their results with the current industry benchmark. A recent study published in Nature Communications found that the EMBER dataset can be used to measure detection rates in the cyber-security space. In a benchmark study, the authors compared a featureless model with a model trained on parsed features and reported a detection rate of 86.8% at a 1% false positive rate (FPR). The results also showed that the pretrained model produced many false positives. The EMBER dataset contained 16K benign files.
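A figure such as "detection rate at a 1% FPR" can be reproduced from any model's scores. The sketch below is one way to compute it, not the method used by the study's authors. It assumes labels of 1 for malicious and 0 for benign, and scores where higher means more suspicious. The function name and the sample data are hypothetical.

```python
def detection_rate_at_fpr(labels, scores, max_fpr=0.01):
    """Return (threshold, detection_rate) for the lowest threshold whose
    false positive rate on benign samples does not exceed max_fpr.

    Assumption: label 1 = malicious, label 0 = benign, and a sample is
    flagged when its score is strictly greater than the threshold.
    """
    benign = sorted((s for y, s in zip(labels, scores) if y == 0), reverse=True)
    malicious = [s for y, s in zip(labels, scores) if y == 1]
    if not benign or not malicious:
        raise ValueError("need at least one benign and one malicious sample")

    # Number of benign samples that may be flagged without exceeding max_fpr.
    allowed = int(max_fpr * len(benign))

    # Using the (allowed + 1)-th highest benign score as the threshold means
    # at most `allowed` benign samples score strictly above it.
    threshold = benign[allowed] if allowed < len(benign) else float("-inf")

    detected = sum(1 for s in malicious if s > threshold)
    return threshold, detected / len(malicious)


# Hypothetical scores for 10 benign and 5 malicious samples.
demo_labels = [0] * 10 + [1] * 5
demo_scores = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.6, 0.9,
               0.95, 0.8, 0.7, 0.5, 0.3]

threshold, rate = detection_rate_at_fpr(demo_labels, demo_scores, max_fpr=0.1)
print(f"threshold={threshold}, detection rate={rate:.1%}")
```

Fixing the false positive budget first and then reading off the detection rate keeps comparisons between models fair, because every model is held to the same cost in benign files wrongly flagged.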
The pipeline labels samples as malicious or benign. It also provides a detailed classification by malware family, threat type, and behavior. Although the classification has known limitations, the results are comparable to the current state of the art and give researchers more comprehensive information.
The EMBER dataset is not complete: malware behaviors and family memberships overlap, so a classification system will still fail to classify some malware families accurately. Nevertheless, the dataset has improved over time, and the accuracy of malware detection has climbed to near-perfect levels.
The EMBER benchmark model gives researchers a common baseline for detection performance, so results from different methods can be compared directly. Researchers can use these comparisons to evaluate the benefits of techniques such as feature selection, model parameter optimization, and feature engineering, and to set a benchmark for their future research.
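The comparison described above can be sketched as ranking several methods by a shared metric on the same labeled test set. The example below uses ROC AUC, which is an assumption chosen for illustration and not a metric named in this article. The method names and scores are hypothetical, and the pairwise computation is quadratic, so it suits small samples only.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a randomly chosen malicious sample
    scores higher than a randomly chosen benign one (ties count as half).

    Assumption: label 1 = malicious, label 0 = benign.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one malicious and one benign sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def compare_methods(labels, scores_by_method):
    """Return (method name, AUC) pairs sorted from best to worst."""
    results = [(name, roc_auc(labels, s)) for name, s in scores_by_method.items()]
    return sorted(results, key=lambda item: item[1], reverse=True)


# Hypothetical scores from two methods on the same four test samples.
test_labels = [1, 1, 0, 0]
ranking = compare_methods(test_labels, {
    "baseline": [0.9, 0.4, 0.5, 0.1],
    "tuned": [0.9, 0.6, 0.5, 0.1],
})
for name, auc in ranking:
    print(f"{name}: AUC={auc:.2f}")
```

Scoring every method against the same labels is what makes the ranking meaningful. A change such as a new feature set or tuned parameters is then judged by how it moves the metric relative to the baseline.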