This is a non-IMPACT record, meaning that access to the data is not controlled by IMPACT. For access, see the directions below.

This Resource is offered and provided outside of the IMPACT mediation framework. IMPACT and the IMPACT Coordination Council/Blackfire Technology, Inc. expressly disclaim all conditions, representations and warranties including but not limited to Resource availability, quality, accuracy, non-infringement, and non-interference. All Resource information and access is controlled by entities and under terms that are external to the IMPACT legal framework.


Ember: Endgame Malware BEnchmark for Research
External Dataset
External Data Source
InferLink Corporation
56 (lowest rank is 56)

Category & Restrictions



A labeled benchmark dataset for training machine learning models to statically detect malicious Windows portable executable files

The EMBER dataset is a collection of 1.1 million SHA-256 hashes from PE files that were scanned in 2017. The accompanying code repository makes it easy to reproducibly train the benchmark model, extend the provided feature set, or classify new PE files with the benchmark model.
The dataset includes features extracted from 1.1M binary files: 900K training samples (300K malicious, 300K benign, 300K unlabeled) and 200K test samples (100K malicious, 100K benign). It is accompanied by open-source code for extracting features from additional binaries, so that features for new samples can be appended to the dataset. The dataset fills a void in the information security machine learning community: a benign/malicious dataset that is large, open, and general enough to cover several interesting use cases.
Hyrum Anderson
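The train/test split described above can be sanity-checked by tallying labels in the feature files. The following is a minimal sketch only, not the project's own tooling. It assumes (this record does not say so) that features are stored as JSON lines with a `label` field encoded as 1 = malicious, 0 = benign, -1 = unlabeled. The helper name `count_labels` and the sample records are hypothetical.

```python
import json
from collections import Counter

# ASSUMPTION (not stated in this record): labels are encoded as
# 1 = malicious, 0 = benign, -1 = unlabeled.
LABEL_NAMES = {1: "malicious", 0: "benign", -1: "unlabeled"}


def count_labels(lines):
    """Tally samples per label from an iterable of JSON-lines strings."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Any label outside the assumed encoding is counted as "other".
        counts[LABEL_NAMES.get(record["label"], "other")] += 1
    return counts


# Made-up records for illustration only; real records would carry
# many more fields than a hash and a label.
sample_lines = [
    '{"sha256": "aa", "label": 1}',
    '{"sha256": "bb", "label": 0}',
    '{"sha256": "cc", "label": -1}',
    '{"sha256": "dd", "label": 0}',
]

counts = count_labels(sample_lines)
print(dict(counts))
```

Because the function accepts any iterable of strings, an open file handle can be passed in place of the sample list. If the assumptions hold, running it over the training files should give 300K per class.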
