Dataset Category Codes

Each dataset record in IMPACT is assigned a Category and Sub-Category code. These groupings serve two purposes. Categories group datasets into functional topics, which makes searching easier. Sub-Categories further refine these topics and also allow administrators to manage data of similar sensitivity together.

Category	Description
Address Space Status Data	These datasets characterize properties of Internet addresses (for example, addresses that respond with different codes, or addresses that appear to be dynamic, reserved, or assigned). They may be raw measurements or inferences derived from such measurements. This data can be used to better understand the utilization, number of users served, or other properties of IPv4 address space across different aggregations (such as address blocks or, when combined with external information, geographic regions or autonomous systems).
Adverse Events	These datasets contain information on events that harm digital assets where malicious intent has not been established.
Application Layer Security Data	These datasets contain information regarding various aspects of cyber security that can be implemented at the application layer. These mechanisms are generally built at a higher level than the transport layer. The datasets can be in widely varying formats depending on the specific security mechanism they represent. In general, these datasets do not contain IP addresses. These datasets are useful for research activities such as understanding SSL, certificate usage, and large-scale digital certificate deployment and use.
Attacks	These datasets contain information on attempts to harm digital assets perpetrated intentionally by malicious actors.
BGP Routing Data	These datasets capture snapshots of the topological state of the Internet by archiving Border Gateway Protocol (BGP) routing tables as seen from various Internet routers. BGP routing table data enables study of overall growth patterns of the Internet or individual carriers and regions. Since BGP data reflects historical trends in the utilization of the two principal Internet resources, IP addresses and Autonomous System Numbers (ASN), it represents the basic backdrop against which many other trends are tracked.
Blackhole Address Space Data	Blackhole address space data is collected by monitoring routed but unused IP address space that does not host any networked devices, e.g., hosts or routers. Systems that monitor such unoccupied address space have a variety of names, including darkspace, darknets, network telescopes, blackhole monitors, sinkholes, and background radiation monitors. Packets observed in the darkspace can originate from a wide range of security-related events, such as scanning in search of vulnerable targets, backscatter from spoofed denial-of-service attacks, and automated spread of Internet worms or viruses. While there are no legitimate devices in the darkspace, the unsolicited traffic may incidentally contain information about sending hosts on the Internet that are compromised or misconfigured (for non-spoofed source addresses). Thus, datasets in this category may be subject to specific disclosure control requirements that are passed from the provider to the researcher by legal agreement. Blackhole address space data may be useful for studying the origin and characteristics of Internet pollution, evaluating various collection technologies, and developing efficient mitigation strategies.
Cybercrime Infrastructure	These datasets include information on cybercriminal activities distinct from attacks on digital assets, as well as information on the infrastructure and operations used by malicious actors to perpetrate attacks.
Cybersecurity Controls Data	Organizations use firewalls to protect themselves from malicious activity directed at their networks. Firewall policies (configurations) identify what traffic is allowed and what is denied, and a software or hardware component implements these policies. An intrusion detection system (IDS) is responsible for scanning traffic on a network in order to detect unauthorized or malicious activity. While a firewall essentially blocks unwanted or suspicious network traffic, an IDS is essentially a sensor that watches a network for signatures that indicate malicious activity; when it detects an attack, it can trigger protective actions or send alerts. Both firewall and IDS logs are important to study because they enable research on the evolution, rise, and decay of such attack traffic.
DNS Data	DNS datasets may include DNS traffic data (queries and/or responses), DNS server logs, and other DNS-related metadata. These datasets may be collected at or near clients, from DNS recursive resolvers, or from DNS servers for an enterprise, top-level, or root zone. Data may be anonymized. Possible uses of data in this category include studying DNS performance, detecting malicious behavior, inferring network behavior, and inferring relative popularity of applications.
Exploits	These datasets contain information on how attacks may be perpetrated in general, but not when a particular system has been targeted by a malicious actor.
Generic Network/Behavior Data	These datasets describe regular online activities (e.g., network traffic, mobile device geolocation traces), most of which are benign, though some may be malicious.
Geolocation Data	These datasets map Internet resources (IP addresses, routers, Autonomous Systems, etc.) to (inferred or actual, as noted in dataset description) geographic locations at various granularities, such as countries, cities, actual addresses, latitude and longitude. These datasets are useful for attributing Internet structure and behavior to geophysical locations.
IP Packet Headers	These datasets consist of traffic headers, containing information such as anonymized source and destination IP addresses and other IP and transport header fields. No packet contents are included. Depending on the specific dataset, this category of data can be used to characterize typical Internet traffic or traffic anomalies such as distributed denial of service attacks, port scans, or worm outbreaks.
Infrastructure Data	Infrastructure data is information and metadata about the Internet's component physical systems. Infrastructure data can be used to analyze the growth properties of the network, interpret observed changes in the network topology, and to correlate real-world organization names, geography, and history with network features as measured and observed from within.
Internet Population Data	These datasets include information on online services and providers (e.g., lists of websites, mobile apps).
Internet Topology Data	These datasets include both raw and curated forms of topology data gathered from across the global Internet. Typically, this data is obtained by carrying out traceroute-like probes from monitoring points around the network. Raw IP topology data may include IP addresses of machines that a packet traverses along the forward path to a target destination, allowing heuristic-based inference of Internet topology and routing. IP addresses are typically not anonymized since they represent measurement traffic rather than end-user communications. Some topology datasets are curated into router-level or Autonomous System-level topologies for ease of researcher use. Datasets in this category can be used to support modeling and simulation of malware outbreak, spread, distribution, containment, and countermeasures, as well as macroscopic vulnerability assessments, longitudinal analysis, and modeling the evolution of the Internet.
Offline Characteristics	These datasets include information on offline attributes that may be of interest to researchers (e.g., demographic data, economic indicators, industry categorizations).
Other	These datasets are unique or fall outside of the standard IMPACT categorization system.
Performance and Quality Measurements	This category contains datasets that characterize the performance or quality of networks and network services, including response times, throughput, goodput, reliability and resilience, mean-times-between-failure, jitter, diurnal variations, and other measurements and indicators of Internet quality. Examples of appropriate uses of data in this category include analyses and comparisons of network performance, reliability, value, and suitability to different tasks.
Synthetically Generated Data	These datasets are generated by capturing information from a synthetic environment, where benign user activity and malicious attacks are emulated by computer programs. In these environments, it is possible to capture and distribute full network packets, firewall logs, application logs, and malicious attacks without any risk of compromising the privacy of real people. Additionally, in these synthetic datasets, one can know and document complete "ground truth", i.e., which traffic is benign and which traffic is malicious.
Traffic Flow Data	Network traffic can be represented as flows between two endpoints. Datasets in this category contain traffic flow information, which includes a variety of attributes such as source and destination IP address, source and destination port, protocol type, and packet and byte counts. This data can be in different formats generated by a range of different collection tools. IP addresses in these files are anonymized. These datasets are useful for research in areas such as network economics and accounting, network planning, analysis, security, denial of service attacks, network monitoring, and traffic visualization.
Unsolicited Bulk Email Data	Unsolicited bulk e-mail, known as spam, constitutes a significant fraction of all e-mail connection attempts and routinely frustrates users, consumes resources, and serves as an infection vector for malicious software. The collection and analysis of datasets in this category enable a wide range of research, including characterizing spam trends, detecting bots, and developing spam mitigation algorithms. These datasets may include spam logs collected at individual organizations, reputation list data (such as those provided by Spamhaus, SORBS, and others), and e-mails, including both headers and contents, captured at spam traps or otherwise specifically identified as spam. They may include IP addresses or e-mail addresses of suspected spammers and potentially known spam e-mail message contents. Datasets in this category may be anonymized. The recipient IP address of the unsolicited bulk e-mail is anonymized unless such spam has already been openly disclosed.
Vulnerabilities	These datasets contain information on weaknesses in digital assets that can be exploited by an attacker.
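
As an illustration of the flow representation described under Traffic Flow Data, the following minimal Python sketch groups packets into flows keyed by source and destination IP address, source and destination port, and protocol type, and sums packet and byte counts per flow. The names `FlowKey` and `aggregate_flows`, the input tuple layout, and the sample packets are hypothetical and are not part of any IMPACT dataset format. Real collection tools use their own formats and may treat flow direction and timeouts differently. This sketch assumes unidirectional flows with no timeout.

```python
from collections import defaultdict
from typing import NamedTuple


class FlowKey(NamedTuple):
    """Hypothetical flow identifier built from the attributes listed above."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str


def aggregate_flows(packets):
    """Sum packet and byte counts for each unidirectional flow.

    Each packet is assumed to be a tuple:
    (src_ip, dst_ip, src_port, dst_port, protocol, size_in_bytes).
    """
    totals = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for src_ip, dst_ip, src_port, dst_port, protocol, size in packets:
        key = FlowKey(src_ip, dst_ip, src_port, dst_port, protocol)
        totals[key]["packets"] += 1
        totals[key]["bytes"] += size
    return dict(totals)


# Made-up packets using reserved documentation addresses, not real data.
sample_packets = [
    ("192.0.2.1", "198.51.100.7", 51000, 443, "TCP", 60),
    ("192.0.2.1", "198.51.100.7", 51000, 443, "TCP", 1500),
    ("192.0.2.9", "198.51.100.7", 40000, 53, "UDP", 80),
]

flows = aggregate_flows(sample_packets)
for key, counts in flows.items():
    print(key, counts)
```

Here the two TCP packets share all five key fields and so collapse into a single flow record, while the UDP packet forms its own flow.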