Description: This project focuses on developing an Intrusion Detection System (IDS) to monitor and analyze network traffic for signs of suspicious activity or potential threats. By using predefined rules and anomaly-based detection techniques, the system identifies unauthorized access, malware, or policy violations in real-time. The IDS enhances the overall security posture of a network by alerting administrators to potential intrusions, enabling faster incident response and reducing the risk of data breaches.
The DARPA98 dataset is one of the oldest and best-known datasets. It was created by the Defense Advanced Research Projects Agency (DARPA) in 1998 at MIT Lincoln Laboratory using an emulated environment. It is widely used because it is publicly available. The traffic in DARPA98 is packet-based and contains the payload of TCP, ICMP, or UDP packets. It contains 4 GB of data: 7 weeks of training data with 7M connection records and labelled attacks, and 2 weeks of unlabelled test data with 2M connection records.
KDD CUP 99 was created in 1998 and is freely available for download. Raw TCP dump data was collected from a simulated local area network over a period of nine weeks. Seven weeks of network traffic yielded approximately 5 million connection records, while 2 weeks of testing data yielded approximately 2 million connection records. The testing data contains 39 different attacks, as opposed to 22 in the training data.
This widely used dataset was created in 1998. Tavallaee et al. analysed KDD CUP 99 and pointed out problems such as redundant records, so an improved version known as NSL-KDD was created. The creators of NSL-KDD eliminated duplicates from KDD CUP 99 and created more sophisticated subsets. The new dataset has approximately 150,000 data points and is split into training and testing sets for intrusion detection (ID) methods. NSL-KDD belongs to a category other than packet-based and flow-based data and shares the same attributes as KDD CUP 99. The dataset is open to the public. There are approximately 4,900,000 single connection vectors in the KDD training set, each having 41 features and labelled as either normal or attack, with exactly 1 specific attack type.
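The duplicate elimination described for NSL-KDD can be illustrated with a minimal sketch. It assumes that each connection record is a tuple of field values and that only exact duplicates are removed; the function name and the sample records are hypothetical and are not taken from the NSL-KDD tooling.

```python
from typing import Iterable, List, Tuple

# Hypothetical representation: one connection record as a tuple of field values.
Record = Tuple[str, ...]


def drop_duplicate_records(records: Iterable[Record]) -> List[Record]:
    """Remove exact duplicates, keeping the first occurrence of each record in order."""
    seen = set()
    unique = []
    for rec in records:
        if rec not in seen:
            seen.add(rec)
            unique.append(rec)
    return unique


# Illustrative records only; real records have 41 features plus a label.
sample = [
    ("tcp", "http", "SF", "normal"),
    ("tcp", "http", "SF", "normal"),  # exact duplicate of the first record
    ("icmp", "ecr_i", "SF", "smurf"),
]
print(drop_duplicate_records(sample))
```

Keeping the first occurrence preserves the original record order, which matters if a later train/test split depends on it.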
PU-IDS (Panjab University Intrusion Dataset) is derived from the NSL-KDD dataset. NSL-KDD is widely used but is very old and has a single level of labelling, so PU-IDS was generated with two levels of labelling by taking the basic characteristics of the NSL-KDD dataset. PU-IDS contains approximately 200,000 data points that have the same format and attributes as NSL-KDD.
The DEFCON dataset is publicly available. There are two versions, DEFCON-8 and DEFCON-10. DEFCON-8 was generated in 2000 and includes buffer overflow and port scanning attacks.
The Measurement and Analysis on the WIDE Internet (MAWI) dataset contains real-life traffic between Japan and the USA and is freely available to the public. It contains both attack and normal flows. Each trace in the MAWI dataset contains the traffic captured during 15 minutes of a particular day.
The Lawrence Berkeley National Laboratory and ICSI (LBNL) dataset was created in 2004/2005. It contains full-header network traffic without payload, recorded at a medium-sized site. The main purpose of the LBNL dataset was to analyse the characteristics of network traffic within organizational networks rather than to publish data for intrusion detection.
The Kyoto 2006+ dataset was provided by Kyoto University and is publicly available. It was generated using 4 kinds of tools: darknet sensors, honeypots, a web crawler, and an e-mail server. The dataset contains real network traffic in which only a limited range of realistic user behaviour can be observed. The data was captured over 3 years; because of this long recording period, there are about 93 million sessions. Each session has 24 features: 14 conventional features derived from the KDD CUP 99 dataset, which give statistical information, and 10 additional flow-based features such as anonymized IP addresses, durations, and ports.
The Center for Applied Internet Data Analysis (CAIDA) proposed the CAIDA dataset, collected in 2007, which consists of several datasets such as CAIDA DDoS, CAIDA Internet Traces 2016, and RSDoS Attack Metadata. CAIDA DDoS contains 1 hour of DDoS attack traffic in 5-minute pcap files, with passive traffic traces collected from Equinix-Chicago. A DDoS attack interrupts the normal traffic of a computer or network with a flood of packets and prevents regular traffic from reaching its legitimate destination.
In 2008, Sperotto et al. at the University of Twente released one of the first flow-based intrusion detection datasets, which is publicly available. The Twente dataset contains 6 days of traffic from a honeypot server that offered FTP, web, and SSH services. Because the dataset contains only honeypot traffic, nearly all flows are malicious and no normal user behaviour is present.
The Cyber Defence Exercise (CDX) dataset was collected in 2009. Sangster et al. proposed generating network-based datasets from network warfare competitions and thoroughly discussed the benefits and drawbacks of this methodology. The CDX dataset contains traffic from a 4-day network warfare competition held in 2009. The traffic is recorded in packet-based format and is open to the public. CDX contains both normal user behaviour and various kinds of attacks. A plan describing metadata about the network structure and IP addresses is provided, but individual packets are not labelled. Host-based log files and IDS warnings are also available. The dataset has several shortcomings: it lacks the volume and diversity of an enterprise network, and it does not contain real attack traffic.
The DARPA 2009 intrusion detection dataset was developed at MIT Lincoln Lab using synthetic traffic that emulates communication between a /16 subnet (172.28.0.0/16) and the Internet. The dataset covers a 10-day period from November 3rd to November 12th, 2009. It includes simulated SMTP, DNS, and HTTP background traffic together with a variety of security events and attack types, including denial-of-service attacks and worms parameterized to exhibit different propagation characteristics. The dataset consists of approximately 7000 pcap files totalling approximately 6.5 TB. Each pcap file is just under 1000 MB in size and, depending on the traffic rate, typically covers one to two minutes. The dataset was analysed mainly with tcpdump and Argus: the data in the pcap files was aggregated into per-day flows with Argus, and a number of Argus tools were used to analyse the dataset and generate statistics about it. The DARPA dataset includes four kinds of attacks: probing, user-to-root (U2R), denial of service (DoS), and remote-to-local (R2L). It contains both labelled and unlabelled records.
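When working with flows derived from these pcap files, the 172.28.0.0/16 boundary can be used to tag the direction of each flow. The following is a minimal sketch under that assumption; only the subnet comes from the dataset description, while the function name and the direction labels are illustrative.

```python
import ipaddress

# Subnet quoted in the dataset description; everything else here is illustrative.
INTERNAL_NET = ipaddress.ip_network("172.28.0.0/16")


def flow_direction(src_ip: str, dst_ip: str) -> str:
    """Classify a flow relative to the internal /16 subnet."""
    src_inside = ipaddress.ip_address(src_ip) in INTERNAL_NET
    dst_inside = ipaddress.ip_address(dst_ip) in INTERNAL_NET
    if src_inside and dst_inside:
        return "internal"
    if src_inside:
        return "outbound"
    if dst_inside:
        return "inbound"
    return "external"


print(flow_direction("172.28.4.7", "8.8.8.8"))     # source inside, destination outside
print(flow_direction("8.8.8.8", "172.28.255.1"))   # source outside, destination inside
```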
The UNIBS 2009 dataset, like LBNL, was not designed for intrusion detection. It is included in this list because it is mentioned in other works. Gringoli et al. used the dataset to identify the applications (such as Skype, mail clients, or web browsers) behind network traffic based on their flows. UNIBS 2009 includes approximately 79,000 flows that do not exhibit malicious behaviour. Network traffic is not classified as attack or normal because the labels only define the application protocol of each flow. As a result, the label property is set to no in the categorization scheme. The dataset is freely accessible to the public. Table 9 lists the works that used the UNIBS dataset.
In 2010, the ISOT cloud intrusion dataset was generated by merging malicious traffic from the French chapter of the honeypot project with normal traffic from the Ericsson Research lab in Hungary and the Lawrence Berkeley National Laboratory (LBNL). It is the first publicly available cloud-specific dataset. Data was collected in the cloud for 1-2 hours per day over several days using special collectors at different layers of an OpenStack-based production environment (network layer, guest host layer, and hypervisor layer). The collected data (system logs, network traffic, memory dumps, and CPU performance measures) was forwarded to the ISOT lab log server for storage and analysis. The network traffic was stored in packet format. Both inside and outside attacks are present. The dataset is 8 TB in size, of which 55.2 GB is network traffic. In phase 1, 22,372,418 packets were captured, of which 15,649 were malicious; in phase 2, 11,509,254 packets were captured, of which 2,006,382 were malicious.
The IEEE Test Systems Task Force, led by Mike Adibi, created the IEEE 300-bus test case in 1993. The IEEE 300-bus system has 60 LTCs, 69 generators, 195 loads, and 304 transmission lines. The dataset describes the electrical and topological structure of the power grid, which is particularly useful for detecting false data injection attacks in the smart grid. The system has 411 branches and an average node degree ⟨k⟩ of 2.74.
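The quoted degree of 2.74 is consistent with the bus and branch counts if the grid is treated as an undirected graph in which each of the 300 buses is a node and each of the 411 branches is an edge, since the average degree of such a graph is 2E/N. A minimal check, using a hypothetical helper name:

```python
def average_degree(num_nodes: int, num_edges: int) -> float:
    """Average degree of an undirected graph: each edge contributes to two nodes."""
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    return 2 * num_edges / num_nodes


# 300 buses (nodes) and 411 branches (edges), as stated in the text.
print(round(average_degree(300, 411), 2))  # 2.74
```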
Beigi et al. proposed the Botnet dataset. It combines several publicly available malicious and non-malicious datasets (ISOT, ISCX 2012, CTU-13), so the resulting dataset captures the behavioural patterns of both users and botnets. The Botnet dataset is divided into 8.5 GB of test data with 16 types of botnets and 5.3 GB of training data with 7 types of botnets, both in packet-based format. The botnet types in the training set are Virut, Zeus, Zeus control (C&C), SMTP Spam, NSIS, Rbot, and Neris. The botnet types in the test set include Sogou, Neris, Menti, and Rbot, among others. The botnet topologies can be randomized, centralized, or distributed (e.g., P2P). Table 14 lists the works that used the Botnet dataset.
The CSIC HTTP 2010 dataset was generated at the Information Security Institute of CSIC (Spanish National Research Council) and is publicly available. It was captured from a live network environment in 2010. The traffic was generated from an e-commerce web application in which users can register with their personal information and buy items using a shopping cart. The dataset comprises 36k normal requests and 25k anomalous requests, and each HTTP request is labelled as normal or anomalous. The attacks include buffer overflow, SQL injection, information gathering, CRLF injection, file disclosure, XSS, server-side include, and parameter tampering. There are 3 types of anomalous requests: static attacks (which request hidden resources), dynamic attacks (which modify valid request arguments), and unintentional illegal requests (which have no malicious intent but do not follow the normal behaviour of the web application). The Paros and W3AF tools were used to generate the attacks. The dataset is divided into 3 sets: one for the training phase (normal traffic only) and two for the test phase (one with normal traffic and one with malicious traffic).
Suthaharan and colleagues contended that existing synthetic datasets for Wireless Sensor Networks (WSNs) lacked proper labelling. To address this gap, they implemented a sensor network with both multi-hop and single-hop configurations. The data generated from this network is labelled and grouped into various categories of irregularities. The dataset contains readings recorded from actual temperature-humidity sensors using Crossbow TelosB motes. The measurements were gathered at five-second intervals over a span of six hours, and controlled anomalous situations were deliberately induced. Irregularities within sensor networks can appear at different levels: individual readings or network traffic attributes, all the data of neighbouring nodes, or a group of sensor nodes within the network. The labelled Wireless Sensor Network Dataset for Anomaly Detection and Recognition (LWSNDR) is now available.
The SSENet-2011 dataset was generated in 2011 in a controlled environment, over 4 hours in 8 geographically separated areas in Tamil Nadu, India, and is publicly available. The traffic was captured in packet-based (pcap) format and then converted to data points using the Tstat tool. Every data point has 23 attributes: 4 connection-based attributes and 19 network attributes. Normal user behaviour was generated by the participants' browsing activities. The dataset contains 3 types of attacks: flooding, probing, and privilege escalation.
The TUIDS dataset was generated at the Tezpur University campus in 2011 and consists of 3 parts: the TUIDS intrusion dataset, the TUIDS coordinated scan dataset, and the TUIDS DDoS dataset. The data was produced in a virtual environment with about 250 clients. Traffic was collected in both packet-based and bidirectional flow-based format. Each subset covers a 7-day period, and all 3 subsets comprise 250,000 flows. There are 24 features, classified into 3 groups: basic, connection-based, and time-window-based features. The dataset contains normal user behaviour and attacks such as denial of service using Targa, probing using nmap, coordinated scan using rnmap, user-to-root using brute-force SSH, distributed denial of service using an agent-handler network, and distributed denial of service using an IRC botnet.
Shiravi et al. generated the ISCX dataset in 2012. It covers 7 days of network activity (malicious and normal) and is publicly available. The malicious activity comprises (I) brute-force SSH, (II) distributed denial of service, (III) HTTP denial of service, and (IV) internal network infiltration. The dataset was generated with a dynamic approach based on two kinds of profiles: alpha and beta. Beta profiles describe normal user behaviour such as writing emails or surfing the web, whereas alpha profiles describe attack scenarios. The dataset is generated from these profiles in packet-based and bidirectional flow-based format. Real traces were analysed to generate profiles for the IMAP, HTTP, SMTP, SSH, FTP, and POP3 protocols. A limitation is that the dataset does not include HTTPS traces, although HTTPS accounts for about 70% of today's network traffic. Furthermore, the distribution of the simulated attacks is not based on real-world statistics.
The Malicia dataset is publicly available and contains data from 7 March 2012 to 25 March 2013. There are 5 files. The main file is a MySQL database that holds all the information about the collected malware: collection time, source of the malware, malware classification, and details about the exploit servers. The other files are a figure that captures the database schema, a tarball with the malware binaries, another tarball with icons extracted from those binaries, and a signature file for the Snort IDS produced by the FIRMA tool. The database consists of eight tables. The MILK table is the most important one, as it contains a row for each instance in which malware was collected from an exploit server. Each row includes the timestamp of the collection, the landing URL, and identifiers that link to other tables in the database. The FILES and LANDING_IP tables are also significant. The FILES table has a row for every unique malware binary, identified by its SHA1 hash, along with classification information, including its network traffic, icon, and screenshot labels and the final family label. The LANDING_IP table contains a row for each exploit server, identified by its landing IP, including details such as the installed exploit kit, the server's autonomous system number, and its country code. The malware tarball consists of 11,363 samples in the form of .exe and .dll files. The icons tarball contains 5,777 icons extracted from the executable files; these are provided for convenience, as they can also be extracted from the malware binaries themselves.
The Booters dataset was published by Santanna et al. in 2013 and includes traces of 9 different booter attacks executed against a null-routed IP address within the network. The dataset is packet-based and contains 250 GB of network traffic. Individual packets are not labelled, but the booter attacks are stored in separate files. To locate Booters, a crawler is run weekly to prepare a list of candidate URLs. Google Custom Search is queried with keywords including "Booter", "Stresser", "DDoS-for-hire", and "DDoS-as-a-Service" to collect these candidates. A candidate URL can be a Booter, a web page that merely refers to Booters (e.g., video services, blogs, reports), or a normal web page unrelated to Booters. Every candidate URL on the weekly list is then manually inspected to determine whether it is a Booter or only refers to one, which yields a second list containing the Booters. The crawling and manual URL investigation have been conducted since July 2013. The Booters dataset is accessible upon request.
The CTU-13 dataset was captured in 2013 in the network of CTU University in the Czech Republic and contains normal, background, and botnet traffic. The traffic was labelled in 3 stages: all traffic to and from infected hosts is labelled as botnet, traffic that matches particular filters is labelled as normal, and the remaining traffic is labelled as background, which may be either malicious or normal. The dataset is divided into training and test subsets. The malicious traffic was generated using 13 botnet scenarios. The dataset is available in 3 formats: unidirectional flow, bidirectional flow, and packet.
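The staged labelling of CTU-13 can be read as a simple priority rule. The sketch below is one interpretation of that description, not the authors' labelling code: the `Flow` fields, the example filter, and the IP addresses are hypothetical, and traffic that is neither tied to an infected host nor matched by a filter is assumed to fall into the background class.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Set


@dataclass(frozen=True)
class Flow:
    """Hypothetical minimal flow record; real flows carry many more fields."""
    src_ip: str
    dst_ip: str
    dst_port: int


def label_flow(
    flow: Flow,
    infected_hosts: Set[str],
    normal_filters: Iterable[Callable[[Flow], bool]],
) -> str:
    """Apply the labelling priority: botnet first, then normal, else background."""
    if flow.src_ip in infected_hosts or flow.dst_ip in infected_hosts:
        return "botnet"
    if any(matches(flow) for matches in normal_filters):
        return "normal"
    return "background"


# Illustrative values only (documentation-range IP addresses, made-up filter).
infected = {"192.0.2.10"}
filters = [lambda f: f.src_ip == "192.0.2.50"]  # e.g. a known-clean workstation

print(label_flow(Flow("192.0.2.10", "198.51.100.1", 80), infected, filters))
print(label_flow(Flow("192.0.2.50", "198.51.100.1", 443), infected, filters))
print(label_flow(Flow("192.0.2.77", "198.51.100.1", 53), infected, filters))
```

Checking the infected-host rule first means that a flow touching an infected host is never labelled normal, even if it also matches a filter.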
Creech and Hu of the University of New South Wales proposed the ADFA2013 dataset in 2013. It uses payloads and vectors to attack the Ubuntu operating system. The payloads include “add new superuser”, “C100 Webshell”, “java based meterpreter”, “linux meterpreter payload”, and “password brute-force”. The dataset includes 3 data types: attack data, training data, and validation data. The training data contains 4373 traces, and the normal validation data contains 833 traces. Each vector in the attack data has ten attacks. The disadvantages of ADFA are its lack of attack diversity and the fact that attack behaviour is not well separated from normal behaviour. However, ADFA2013 is a publicly available dataset.
Hofstede et al. proposed the SSHCure tool for detecting SSH attacks. To assess their work, two datasets (each for one month) were collected from the campus network of the University of Twente, in November and December 2013 and in January and February 2014, respectively. The network has a routable /16 IPv4 address block, of which 25k addresses are actively used. The resulting datasets are publicly accessible and contain only SSH traffic. Instead of labelling the flow-based network traffic directly, the authors provided additional host-based log files that can be used to determine whether SSH login attempts were successful. Dataset 1 contains 13 honeypots, 0 servers, 0 workstations, and 632 attacks; dataset 2 contains 0 honeypots, 76 servers, 4 workstations, and 10,716 attacks.
The SANTA (Scenarios for Analysis, Network Traffic, and Attacks) dataset was created by researchers at the University of California, Santa Barbara in an ISP environment in 2014. It contains real network traffic of varying types and is available to the public. The SANTA dataset is intended for research on external network attacks. The network traffic was labelled through an exhaustive manual procedure and is stored in a session format that is similar to NetFlow but includes additional features calculated from packet-based data. These additional attributes could enhance intrusion detection methods.
The SSENet-2014 dataset was created by extracting attributes from the packet-based files of SSENet-2011. The dataset contains over 1.2 million network packets captured over a period of 10 hours. The traffic was captured in a simulated environment that includes 10 different virtual machines running various operating systems, applications, and services. It includes both benign traffic, such as web browsing and email, and various types of attacks, such as port scans, buffer overflow attacks, and denial-of-service attacks. The dataset includes a total of 45 features for each packet, such as source and destination IP addresses, protocol type, and various flags and options. From every data point, the authors extracted 28 attributes that describe network-based and host-based properties. The attributes were created in accordance with KDD CUP 99. The dataset is balanced, includes 200,000 labelled data points, and is split into training and test sets.
Gonzalez et al. proposed the Android validation dataset in 2014. To discover relationships between Android apps, two kinds of features were extracted: (1) n-grams (characterizing the .dex file) and (2) meta-information (which accompanies each .apk file). The dataset introduced the following definitions of application relationships: stepsiblings, twins, false stepsiblings, cousins, siblings, and false siblings. The dataset includes 72 original applications, to which transformations such as inserting junk code, replacing strings, inserting junk files, changing alignment, replacing icons, and replacing files were applied; the complete set including the transformed applications contains 792 applications.
The Gas Pipeline Dataset, created by Mississippi State University’s Critical Infrastructure Protection Center, features labeled RTU telemetry streams representing normal operations, command injection attacks, and data injection attacks in a gas pipeline system. The dataset includes telemetry data such as PLC setpoint commands, pressure readings, and responses, with features parsed for machine learning applications. It encompasses various attack scenarios, including data injection and command injection attacks, making it a valuable resource for research in pipeline system security and anomaly detection.
The Power System Datasets, created in 2014 by researchers at Oak Ridge National Laboratories, provide detailed measurements and logs of electric transmission system behaviors under normal, disturbance, control, and cyberattack conditions. Spanning 37 scenarios, the datasets support Binary, Three-Class, and Multiclass classifications, with data collected from four phasor measurement units (PMUs) and enriched by logs from Snort alerts, control panels, and relays. This dataset is a valuable resource for research in power system monitoring, cybersecurity, and fault classification.
The Gas Pipeline and Water Storage Tank dataset, created by Wei Gao and Tommy Morris, contains cyber-attack data from two lab-scale industrial control systems. Captured over the Modbus protocol (Serial Line and TCP), it includes 214,580 packets, with 60,048 linked to attacks such as response injection, command injection, reconnaissance, and DoS. The dataset, structured in ASCII and ARFF formats,
The Aegean WiFi Intrusion Dataset (AWID) was created by Kolias et al. in 2015 and made freely accessible by the information security lab of the University of the Aegean. The dataset is intended specifically for wireless IDS. Its developers captured WLAN traffic in packet-based format in a small network environment with 11 users. Thirty-seven million packets were captured in one hour, and 156 attributes are extracted from each packet. The malicious traffic was generated by 16 targeted attacks against the 802.11 network. The AWID2 dataset is labelled and split into training and testing subsets.
The IRSC dataset was collected at Indian River State College, Florida, USA in 2015 and reflects network traffic from a real-world environment. Two types of data were collected: full packet capture (FPC), which contains all data from the network, and network flows, which are aggregated unidirectional summaries of the traffic between two networked devices. The attack traffic includes uncontrolled attacks (real attacks on the IRSC network that were not created by the team) and controlled attacks (intentional attacks created by the team). Attacks from the Internet and normal user behaviour were captured from real network traffic, and manual attacks were carried out in addition. Labelling was done by manual inspection and with the Snort IDS. Data is captured on a 24-hour basis.
The UNSW-NB15 dataset was generated in 2015 with the IXIA Perfect Storm tool in a small-scale virtual environment in the Cyber Range Lab of the ACCS. It contains 31 hours of malicious and normal network traffic in packet-based format. It includes 9 distinct types of attacks, such as exploits, backdoors, fuzzers, DoS, and worms. A flow-based format with additional attributes is also available. UNSW-NB15 is divided into training and testing splits. The dataset contains 45 distinct IP addresses and is freely accessible to the public.
The KENT dataset was compiled over 58 days on the Los Alamos National Laboratory network in 2016. It includes approximately 130 million unidirectional network flows and several host-based log files. For privacy reasons, the network traffic is heavily anonymized. The dataset is unlabelled and is available for download from the website. It comprises 5 data elements: authentication events gathered from individual computers and Active Directory domain controller servers, process start and stop events collected from individual Windows computers, DNS lookups gathered from internal DNS servers, network flow data gathered at key routers, and a series of well-defined red-team events conducted during the 58 days that represent bad behaviour. Across these 5 data elements, the dataset is 12 GB in size and has 1,648,275,307 events associated with 62,974 processes, 17,684 computers, and 12,425 users.
The NDSec-1 dataset is noteworthy because it was designed as a composition of network security attacks. Its authors claim that existing network traffic can be salted with attack traces such as those of NDSec-1 using overlay methodologies. The dataset was collected in 2016 in packet-based format and is freely available to the public. It additionally includes syslog and Windows event log data. The attacks in NDSec-1 include botnets, denial of service (SYN flooding, HTTP flooding, UDP flooding), brute-force attacks (against SSH, HTTP, and FTP), exploits, spoofing, XSS/SQL injection, and port scans.
The NGIDS-DS dataset includes host-based log files as well as packet-based network traffic. It was created in a virtual environment in which the IXIA Perfect Storm tool generated normal user behaviour and attacks (such as DoS attacks or worms) from 7 different attack families. As a result, the quality of the data depends primarily on the IXIA Perfect Storm hardware. The labelled dataset includes about 1 million packets and is freely available to the public. NGIDS-DS is made up of 5 kinds of files, all of which can be downloaded (NGIDS-DS download, 2016): (i) NGIDS.pcap network packets; (ii) 99 csv host log files; (iii) ground-truth.csv; (iv) readme.txt; and (v) feature-descr.csv.
UGR16 is a dataset of unidirectional flows. Its primary goal is to capture temporal effects in an ISP environment. It spans 4 months and includes 16,900 million unidirectional flows. IP addresses are anonymized, and flows are labelled as background, attack, or normal. The creators explicitly carried out several attacks (DoS, port scans, and botnet) within the dataset, and the corresponding flows are labelled as attack. Some other attacks were found and labelled as attack manually. Injected normal user behaviour and traffic that matches fixed patterns are labelled as normal. However, the majority of the traffic is labelled as background, which can be either attack or normal. The dataset is freely accessible to the public.
The Tor-nonTor dataset was created by Arash Habibi Lashkari et al. and is split into 2 categories: binary Tor/non-Tor classification and multi-class Tor traffic classification. For the binary classification, samples of Tor and non-Tor traffic are provided. For the multi-class classification, Tor traffic from 8 different application types (chat, audio streaming, browsing, mail, file transfer, video streaming, VoIP, and peer-to-peer) was collected using Wireshark and tcpdump. ISCXFlowMeter was used to extract the relevant information from the traffic data.
Mamun et al. proposed the URL dataset, which includes 5 types of URLs: (I) phishing, (II) spam, (III) benign, (IV) defacement, and (V) malware. The benign category contains 35,300 URLs collected from Alexa top websites. The spam category contains 12,000 URLs gathered from the WEBSPAM-UK2007 dataset. The phishing category contains 10,000 URLs gathered from OpenPhish, a repository of active phishing sites. The malware category contains 11,500 URLs gathered from DNS_BH, a maintained list of malware sites. The defacement category contains 45,450 URLs gathered from Alexa-ranked trusted websites that host hidden or fraudulent links.
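The per-class counts above make it easy to check how balanced the URL dataset is before training a classifier. The sketch below only restates those counts; the dictionary and function names are hypothetical.

```python
from typing import Dict

# URL counts per class, as quoted in the dataset description.
URL_COUNTS: Dict[str, int] = {
    "benign": 35300,
    "spam": 12000,
    "phishing": 10000,
    "malware": 11500,
    "defacement": 45450,
}


def class_shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Return each class's fraction of the total number of samples."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


for label, share in class_shares(URL_COUNTS).items():
    print(f"{label:<12}{share:.1%}")
```

Shares like these are a common starting point for deciding whether class weighting or resampling is needed.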
Draper-Gil proposed the VPN-nonVPN dataset, which captures both regular sessions and sessions over a virtual private network (VPN). The dataset is publicly available from the Canadian Institute for Cybersecurity. It covers 15 commonly used applications, such as YouTube, Facebook, Netflix, and Vimeo, whose traffic is encrypted using various encryption protocols. The dataset is composed of labelled network traffic for email (SMTPS), web browsing (Firefox), chat (Skype), file transfer (SFTP), streaming (e.g., YouTube), peer-to-peer (uTorrent), and VoIP (Hangouts voice calls).
The DDoS 2016 dataset was collected because no existing dataset covered modern DDoS attacks such as SIDDOS and HTTP flood. It was published by Alkasassbeh et al. in 2016, was generated with the network simulator NS2, and is packet-based. The dataset comprises four types of DDoS attacks: Smurf, UDP flood, SIDDOS, and HTTP flood. It is labelled and contains 2.1 million packets with 27 features.
Almomani and colleagues developed a dataset containing both attack and normal network traffic to assess the effectiveness of an Intrusion Detection System (IDS) designed for Wireless Sensor Networks (WSNs). They used the network simulator NS-2 to model a network of numerous wireless sensor devices running LEACH (Low-Energy Adaptive Clustering Hierarchy), a hierarchical MAC (Medium Access Control) protocol designed specifically for WSNs. During the data collection phase, each node observed the communication of five adjacent nodes and sent a report to the central sink. To simulate malicious situations, four categories of Distributed Denial of Service (DDoS) attacks were introduced: grey-hole, black-hole, flooding, and scheduling attacks. In a grey-hole attack, a malicious node selectively drops traffic from its neighbours; in a black-hole attack, a malicious node collects and discards all traffic from its neighbours; in a scheduling attack, the attacker configures the network to prompt simultaneous transmissions from all devices; and a flooding attack sends a large volume of routing messages to disturb the network. The entire dataset consists of 374,661 labelled records, from which 23 features were derived. These features include the number of received and sent packets, the node ID, and topological details of the WSN.
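The difference between the grey-hole and black-hole behaviours can be made concrete with a toy forwarding model. This is a sketch under stated assumptions, not the NS-2 simulation used for the dataset: selective dropping is modelled as an independent random drop per packet, and the function name, `drop_prob` parameter, and packet representation are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def forward_packets(
    packets: Sequence[int],
    behaviour: str,
    drop_prob: float = 0.5,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Return the packets a relay node forwards under a given behaviour.

    behaviour: "normal" forwards everything, "black_hole" discards everything,
    "grey_hole" drops each packet with probability drop_prob (an assumed model
    of selective dropping).
    """
    rng = rng or random.Random(0)  # fixed seed keeps the illustration repeatable
    if behaviour == "normal":
        return list(packets)
    if behaviour == "black_hole":
        return []
    if behaviour == "grey_hole":
        return [p for p in packets if rng.random() >= drop_prob]
    raise ValueError(f"unknown behaviour: {behaviour}")


packets = list(range(10))  # packet IDs received from neighbouring nodes
for mode in ("normal", "grey_hole", "black_hole"):
    print(mode, len(forward_packets(packets, mode)))
```

In a model like this, a feature such as the ratio of forwarded to received packets separates the three behaviours, which is in line with the packet-count features the dataset provides.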
Hekmati and colleagues emphasized that effectively detecting Distributed Denial of Service (DDoS) attacks is a crucial step in preventing them. Building a neural network (NN) model for detection is expected to be challenging, given the widespread infiltration of IoT networks by malicious entities and the complex nature of botnet attacks. The availability of substantial and pertinent datasets is pivotal, yet many existing datasets do not specifically address IoT environments. To address this deficiency, the researchers introduced an urban IoT DDoS dataset obtained from a large real-world experiment conducted in an urban Internet of Things (IoT) system within a major city. This dataset comprises 4060 spatially distributed, event-driven sensors. The recorded data contains the binary activity status of each node, documented at 30-second intervals over the course of a month. Three metadata fields are also included: the timestamp of the node's activity status, the geolocation (longitude and latitude) of the node, and the node ID. The researchers also provided a script for generating attacks within this dataset. Note that node activity status is recorded as "zero" for normal activity and "one" for occurrences of attacks. Both the attack-generation script and the dataset are available, with the data in CSV format.
To maintain network security, Miettinen and colleagues created a security system called IoT SENTINEL, which aims to recognize the devices within a network and then analyse the traffic originating from susceptible devices. To evaluate the effectiveness of their system, the researchers set up a testing environment that included diverse IoT devices. The network traffic was recorded to create a dataset, which they used to develop their device classification models. The researchers made this dataset publicly available as unprocessed packet captures. The experimental setup included 31 consumer IoT devices representing various categories such as lighting, health, cameras, appliances, and home automation. Traffic was recorded during the initial setup of each device, and this setup process was repeated twenty times. It is crucial to emphasize that the dataset does not contain any attacks. The recorded traffic predominantly comprises IP traffic, originating directly from the devices in most instances or through a gateway for devices using the Z-Wave or ZigBee communication protocols.
The CIC-IDS 2017 dataset was proposed by Sharafaldin et al. and includes packet-based and flow-based network traffic generated over a period of 5 days, from Monday, 3 July 2017 to Friday, 7 July 2017. It is publicly available. It provides 80 network flow features extracted from the traffic using the CICFlowMeter tool. The attacks present are SSH brute force, FTP brute force, Heartbleed, DoS, DDoS, web attacks, botnet, and infiltration. Furthermore, the dataset abstracts the behaviour of 25 users based on protocols such as FTP and HTTPS.
The CIDDS-001 dataset was captured in 2017 in a simulated small business environment and includes four weeks of unidirectional flow-based network traffic. Both normal and malicious user behaviour was executed. To generate malicious traffic on the network, port scans, brute-force attacks, and Denial of Service (DoS) attacks were used. Labelling the recorded NetFlow data was simple because the targets, origins, and timestamps of the executed attacks were all known. Network traffic from outside the OpenStack environment was added by deploying an external server, which provides a file synchronization service (Seafile) and an HTTP web server to clients. As opposed to a honeypot, this server was regularly accessed by clients, and because it had a publicly accessible IP address it was exposed to real, current attacks from the internet. The dataset is available to the public.
CIDDS-002 is a port scan dataset based on the scripts of CIDDS-001. It includes two weeks of unidirectional flow-based network traffic from a simulated small business environment and contains both a variety of port scan attacks and normal user behaviour. Additional meta-information about the dataset, including the anonymized external IP addresses, is included in a technical report. The dataset is freely accessible to the public.
The TRAbID dataset was proposed by Viegas et al. in 2017. It includes 16 scenarios for evaluating intrusion detection systems. Every scenario was recorded in a virtual environment (100 clients and 1 honeypot server); traffic was captured for 30 minutes, and a few attacks were carried out in each scenario. The authors labelled the network traffic using the clients' IP addresses. Every client runs Linux. The majority of clients only sent normal user requests to the honeypot server, while some clients only performed attacks. HTTP, SSH, SMTP, and SNMP traffic are examples of normal user behaviour, whereas malicious network traffic includes DoS attacks and port scans. TRAbID is publicly available.
The Unified Host and Network Dataset contains host-based and network-based data collected in a real-world setting, the Los Alamos National Laboratory (LANL) enterprise network. The network data are bidirectional flow-based traffic files in which timestamps and IP addresses were anonymized for privacy reasons. The network traffic was collected for 90 days and is unlabelled. This dataset is freely available to the public.
The Android Adware dataset comprises 1900 apps of three types: adware (250 applications), benign (1500 applications), and general malware (150 applications). Lashkari et al. describe the specifics of the dataset. The adware category includes the following well-known families: Shuanet, Airpush, Mobidash, Kemoge, and Dowgin. DroidKin, a lightweight detector for Android apps, is used to verify the relationships between the apps in each category (adware, benign, and general malware). The apps are run on NEXUS 5 Android smartphones, and a gateway captures the generated traffic, which is labelled with the same three types (adware, general malware, and benign). This dataset is publicly available.
CICAndMal2017 contains both benign and malware applications. Shiravi et al. proposed the CICAndMal2017 dataset. It draws on applications published on Google Play in 2015, 2016, and 2017. In the CICAndMal2017 dataset the malware is divided into 4 types: (1) SMS malware (e.g., Zsone, Mazarbot, FakeInst, BeanBot), (2) scareware (e.g., AndroidSpy, Penetho, VirusShield, AndroidDefender), (3) ransomware (e.g., LockerPin, Jisut, WannaLocker, Pletor, Charge), and (4) adware (e.g., Shuanet, Ewind, Dowgin, Selfmite). Furthermore, CICAndMal2017 includes more than 80 network traffic features, with the traffic provided in packet format (.pcap files).
Using only DNS connections, Sharma et al. collected the flow-based PUF dataset on a campus network over three days. Of its 298,463 unidirectional flows, 38,120 are malicious, whereas the rest are normal user activity. An intrusion prevention system log is used to label all flows. IP addresses have been removed from the dataset for privacy reasons. The creators made PUF available to the public.
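The class balance implied by these counts can be worked out directly. The short sketch below uses only the two flow counts stated above; the variable names are illustrative.

```python
# Flow counts as stated in the PUF description above.
total_flows = 298_463
malicious_flows = 38_120

# Everything that is not malicious is normal user activity.
normal_flows = total_flows - malicious_flows
malicious_share = malicious_flows / total_flows

print(f"normal flows: {normal_flows:,}")
print(f"malicious share: {malicious_share:.1%}")
```

Roughly one flow in eight is malicious, so a detector trained on PUF faces a moderate class imbalance rather than an extreme one.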
The Communications Security Establishment (CSE) and the Canadian Institute for Cybersecurity (CIC) proposed the CSE-CIC-IDS2018 dataset. It covers 7 types of attack: Heartbleed, DDoS, brute force, DoS, botnet, web attacks, and network infiltration from inside. The attacking infrastructure includes 50 machines, while the victim organisation comprises 5 departments with 420 machines and 30 servers. The dataset provides 80 network flow features extracted from the traffic using the CICFlowMeter tool.
Pahl et al. published the DS2OS dataset in 2018. It is an IIoT dataset containing attacks on applications and sensors, with details of several anomalies and attacks in IIoT applications such as smart buildings, homes, and factories. DS2OS includes 13 features and 357,952 samples, of which 10,017 are anomalous and 347,935 are normal, spread over eight classes. The attacks present are data type probing, DoS, malicious control, wrong setup, malicious operation, scan, and spying.
The traffic for the N-BaIoT dataset was collected from IoT devices connected via Wi-Fi to access points, which in turn were connected to the router through a wired central switch. Port mirroring on the switch was used to sniff the network traffic, and the data were recorded with Wireshark in pcap format. Twenty-three features were extracted over five time windows covering the most recent 100 ms, 500 ms, 1.5 s, 10 s, and 1 min. The dataset contains attacks from 2 botnet families, BASHLITE and Mirai. The BASHLITE attacks are scan, junk, UDP, TCP, and COMBO. The Mirai attacks are scan, ACK, SYN, UDP, and UDPplain.
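To make the windowing concrete, the sketch below builds the layout of a feature vector when the same statistics are recomputed over each of the five time windows. It assumes that the 23 features are computed once per window, which gives 23 x 5 = 115 values per sample; the feature names are placeholders, not the dataset's actual column names.

```python
# Window labels taken from the description above.
WINDOWS = ["100ms", "500ms", "1.5s", "10s", "1min"]

# Assumption: the 23 features are recomputed for every window.
FEATURES_PER_WINDOW = 23


def feature_names(windows, n_features):
    """Return one placeholder column name per (window, feature) pair."""
    return [f"{w}_f{i:02d}" for w in windows for i in range(n_features)]


names = feature_names(WINDOWS, FEATURES_PER_WINDOW)
print(len(names))      # total number of columns per sample
print(names[:3])       # first few placeholder names
```

Short windows react quickly to bursts such as floods, while long windows capture slower behaviour such as scanning, which is one reason to keep all five side by side.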
The CAN-OTIDS dataset, introduced in 2018, focuses on in-vehicle network security, particularly the CAN bus system. It includes 2.37 million normal CAN messages and various attack types, such as DoS, spoofing, fuzzy, and impersonation, offering a comprehensive resource for studying and enhancing security mechanisms in vehicular communication networks.
The Survival Analysis Dataset for Automobile IDS 2018 focuses on three attack scenarios—Flooding, Fuzzy, and Malfunction—that disrupt in-vehicle communication systems. It includes normal driving data and attack data, with detailed attributes like Timestamp, CAN ID, DLC, and DATA fields, making it valuable for studying and mitigating vehicular network vulnerabilities.
The car-hacking datasets encompass a variety of intrusions, including fuzzy attacks, Denial-of-Service (DoS) attacks, drive gear spoofing, and RPM gauge spoofing. These datasets were created by recording Controller Area Network (CAN) traffic from a real vehicle's OBD-II port during message injection attacks. Each dataset includes 300 intrusion instances, each lasting between 3 and 5 seconds, resulting in approximately 30 to 40 minutes of CAN traffic. The DoS attack involves injecting '0000' CAN ID messages every 0.3 milliseconds, resulting in 3,665,771 messages, with 3,078,250 normal and 587,521 injected messages. In the fuzzy attack, random CAN ID and DATA values are injected every 0.5 milliseconds, yielding 3,838,860 messages, with 3,347,013 normal and 491,847 injected messages. Spoofing attacks on drive gear and RPM information inject one message per millisecond, resulting in 4,443,142 and 4,621,702 messages, respectively, with normal message counts of 3,845,890 and 3,966,805, and injected message counts of 597,252 and 654,897. Lastly, in the GIDS attack-free dataset, there are 988,987 messages, 988,872 of which are normal, and none are injected. Each dataset includes the attributes timestamp, CAN ID, DLC, DATA, and Flag, where the flag distinguishes normal ('R') from injected ('T') messages.
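As an illustration of how the Flag attribute separates normal from injected traffic, the following sketch parses records and counts each class. It assumes a comma-separated layout of timestamp, CAN ID, DLC, then DLC data bytes, then the flag; the sample rows are made up for the example and are not taken from the dataset.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class CanRecord:
    timestamp: float
    can_id: str
    dlc: int          # number of data bytes in the frame
    data: List[str]   # data bytes as hex strings
    flag: str         # 'R' = normal, 'T' = injected


def parse_record(line: str) -> CanRecord:
    """Parse one record under the assumed comma-separated layout."""
    fields = [f.strip() for f in line.strip().split(",")]
    timestamp, can_id, dlc = float(fields[0]), fields[1], int(fields[2])
    data = fields[3:3 + dlc]
    flag = fields[-1]
    if flag not in ("R", "T"):
        raise ValueError(f"unexpected flag {flag!r}")
    return CanRecord(timestamp, can_id, dlc, data, flag)


def count_flags(lines) -> Counter:
    """Count normal ('R') and injected ('T') messages."""
    return Counter(parse_record(l).flag for l in lines if l.strip())


# Illustrative rows only (hypothetical values).
sample = [
    "1478198376.389427,0316,8,05,21,68,09,21,21,00,6f,R",
    "1478198376.389636,0000,8,00,00,00,00,00,00,00,00,T",
    "1478198376.389864,018f,2,fe,5b,R",
]
counts = count_flags(sample)
print(counts["R"], counts["T"])
```

Reading the flag from the last field rather than a fixed column keeps the parser correct for frames whose DLC is smaller than 8.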
Sikeridis and collaborators introduced the publicly available BLEBeacon dataset as part of their investigation into Bluetooth Low Energy (BLE) beacons, which are compact Bluetooth devices with limited power capacity that intermittently transmit information. In this investigation, 46 participants were each equipped with a Gimbal Series Ten beacon that emitted a BLE beacon every second. Thirty-two Raspberry Pi based BLE gateways, strategically positioned throughout the test environment (a three-story campus building), detected these beacons. The gateways transmitted the received messages to a central server. The study took place between 15 September and 17 October 2016. Each message received by a gateway is accompanied in the dataset by its Received Signal Strength Indication (RSSI), along with timestamps that denote the moments when a particular beacon entered and exited the coverage zone of the gateway.
61; Kitsune Network; 2018; Real; Yes; 23; 1359; Recon, man-in-the-middle, denial of service, and botnet malware.
Mirsky et al. collected the Kitsune dataset. It presents real attacks captured in an IP camera surveillance network with two deployments, each housing four HD cameras powered by PoE. The attacks target the availability and integrity of the video uplink and include man-in-the-middle attacks, such as video injection, and SYN floods. Various attack types such as Recon, DoS, and botnet malware (Mirai) are included, all recorded as packet captures. Additionally, an IoT-populated Wi-Fi network with Mirai-infected security cameras is used to assess Kitsune's performance in a noisy environment.
Stiawan et al. presented the TCP FIN flood dataset. It originates from a testbed network containing diverse hardware elements, such as MQ2, DHT22, soil moisture, and water level sensors, in addition to a WeMos D1 microcontroller outfitted with an ESP8266 WiFi module. The accompanying software comprises a MySQL database, denial-of-service (DoS) utilities such as Hping3, Apache Web Server, and Snort serving as an intrusion detection system (IDS). Hping3 carries out the TCP FIN flood attacks on the network. The testbed adopts a star topology, with two laptops, four sensor nodes, and one server for sniffing and attacking purposes. Each sensor node and the server connect to the network via a wireless router using DHCP for IP address configuration. The dataset, accessible to the public, is produced by executing three scenarios: regular traffic, TCP FIN flood attack traffic, and a combination of normal traffic with TCP FIN flood attack traffic. Each scenario is executed for five minutes at the sensor nodes and the server, with sniffer modules capturing and saving the traffic packets in raw format (pcap).
Pinto presented the OPCUA dataset. It captures OPC UA traffic generated within a laboratory CPPS testbed, where the OPC UA standard facilitates both horizontal and vertical communications. The testbed comprises seven nodes, each housing a Raspberry Pi device running the Python FreeOpcUa implementation. Two production units, each with three devices, and a Manufacturing Execution System (MES) node form the network. Devices function both as OPC UA servers, publishing sensor data updates, and as clients, subscribing to updates from other devices. The MES serves solely as an OPC UA client, subscribing to all variables from all devices. An additional attack node, representing a potential threat, is integrated into the network. Tshark captures the OPC UA packets and exports the traffic to a CSV dataset that covers normal and anomalous behaviour. The anomalous behaviour, introduced by the malicious node, includes Denial of Service (DoS), Eavesdropping (MITM), and Impersonation (Spoofing) attacks targeting the device nodes and the MES.
Sivanathan et al. proposed the IEEE TMC 2018 UNSW dataset. It records network activity within an actual "smart environment" containing a wide range of IoT and non-IoT devices that communicate with Internet servers through a gateway. In the laboratory configuration, the WAN interface of a TP-Link access point connects to the public Internet via the university network, while the IoT devices are linked to its LAN and WLAN interfaces. The smart environment accommodates 28 unique IoT devices across various categories, alongside several non-IoT devices. Traffic on the LAN side was captured using the tcpdump tool from 1 October 2016 to 13 April 2017, totaling 26 weeks. An Apache server on a virtual machine in the university data center hosts the trace data, with a script transferring the daily logs to the server. Two weeks of trace data are publicly accessible for download, ranging in size from 61 MB to 2 GB and averaging 365 MB per day.
Nguyen et al. presented the DIOT dataset. A large amount of data was collected to analyze the communication behaviour of IoT devices, covering various settings including laboratory setups and real-world deployments. The observed devices comprise 33 common consumer IoT devices, such as smart power plugs, IP cameras, sensors, and light bulbs, which were classified into 23 unique device types through device-type identification methods. To collect the dataset, a laboratory network was established by configuring a laptop running Kali Linux with hostapd. This laptop was set up as a gateway and access point, equipped with both WiFi and Ethernet interfaces to connect the IoT devices. Network packets generated by the monitored devices were captured on the gateway using tcpdump.
Liu, Hyunjae Kang et al. presented the IOTID19 dataset. It is designed for academic investigation into network attacks in IoT settings. It involves two prevalent smart home devices, the SKT NUGU (NU 100) and the EZVIZ Wi-Fi Camera (C2C Mini O Plus 1080P), alongside smartphones and laptops, all connected to the same wireless network. The dataset includes 42 unprocessed network packet files (pcap) that were captured at different times using the monitor mode of a wireless network adapter. The wireless headers were removed using Aircrack-ng. All attacks except those in the Mirai botnet category were captured while being simulated with tools such as Nmap. For the Mirai botnet attacks, packets were generated on a laptop and then manipulated so that they appear to originate from an IoT device. This dataset provides a diverse range of network attack scenarios.
The Bot-IoT dataset was proposed by Koroniotis et al. and differs from older datasets in its focus on the IoT environment. It contains more than 72,000,000 records, including DoS, DDoS, keylogging, OS and service scan, and data exfiltration attacks. The Node-RED tool was used to reproduce the network behaviour of IoT devices. For machine-to-machine (M2M) communication the testbed used MQTT, a lightweight communication protocol. The testbed employs 5 IoT scenarios: a weather station, a remotely activated garage door, motion-activated lights, a smart fridge, and a smart thermostat.
CUPID is a publicly available dataset. An important feature of CUPID is that it incorporates human-guided traffic: it was developed with the help of 10 ethical penetration testers (pentesters). Each pentester first performed the same activities as the scripted users for one hour, and malicious traffic was then captured for a subsequent hour or until the pentester stopped operating (generally when the timer had expired or the server had been successfully exploited). On the basis of the Kali instance's IP address, benign traffic was labelled with a '0', while malicious traffic was labelled with a '1'. One of the 24-hour baseline data samples was collected during the first sampling day in April 2019. The raw 042219 1000.pcapng data file contains 4,346,077 packets and takes up about 3.3 GB of storage. The size of the whole CUPID dataset is around 50 GB. The sample contains 179 distinct hosts outside the local address space. It consists primarily of TCP packets (75%) using protocols such as Distributed Computing Environment (DCE) / Remote Procedure Call (RPC), Internet Control Message Protocol (ICMP), Hypertext Transfer Protocol (HTTP), Kerberos, Simple Mail Transfer Protocol (SMTP), Network Basic Input/Output System (NetBIOS) / Server Message Block (SMB), and Lightweight Directory Access Protocol (LDAP). The pcap also includes UDP-based data (7.5%), such as Domain Name System (DNS) and Network Time Protocol (NTP) traffic. The remaining traffic is made up of addressing protocols such as ARP and 802.1Q Virtual LAN. CUPID contains a large variety of protocols due to enterprise-specific services such as DNS lookups, email, and Active Directory access. SSL/TLS traffic accounts for 25,341 (0.6%) of the total packets.
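The IP-based labelling rule can be sketched in a few lines. The attacker address below is hypothetical, and treating a packet as malicious when the Kali address appears as either source or destination is an assumption; the text only states that the label depends on the Kali instance's IP address. The last lines re-derive the SSL/TLS share from the packet counts given above.

```python
# Hypothetical address of the Kali instance (not taken from the dataset).
KALI_IP = "10.0.0.66"


def label_packet(src_ip: str, dst_ip: str, attacker_ip: str = KALI_IP) -> int:
    """Return 1 (malicious) if the attacker IP is an endpoint, else 0 (benign).

    Assumption: traffic both to and from the Kali instance counts as malicious.
    """
    return 1 if attacker_ip in (src_ip, dst_ip) else 0


packets = [
    ("10.0.0.66", "10.0.0.5"),   # attacker -> server
    ("10.0.0.5", "10.0.0.66"),   # server -> attacker
    ("10.0.0.7", "10.0.0.5"),    # scripted user -> server
]
labels = [label_packet(src, dst) for src, dst in packets]
print(labels)

# Share of SSL/TLS packets in the baseline sample, from the counts above.
tls_share = 25_341 / 4_346_077
print(f"{tls_share:.1%}")
```

Labelling by address is cheap and reproducible, but it marks every packet of an attacker session as malicious, including any benign-looking traffic the pentester generated.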
CIRA-CIC-DoHBrw-2020 is freely available to the public. It is described by Mohammadreza Montazeri Shatoori et al. in their paper "Detection of DoH Tunnels using Time-series Classification of Encrypted Traffic". DNS over HTTPS (DoH) is a protocol proposed by the Internet Engineering Task Force (IETF). It encrypts DNS queries and sends them through a covert tunnel, which enhances privacy and protects against man-in-the-middle attacks so that the data is not compromised. To build the dataset, the DoH protocol was implemented within an application, and malicious-DoH, benign-DoH, and non-DoH traffic was captured using 4 servers and 5 different browsers and tools. The proposed two-layer method uses a statistical-features classifier in layer 1 to separate non-DoH traffic from DoH traffic, and a time-series classifier in layer 2 to distinguish malicious from benign DoH traffic. Mozilla Firefox, Google Chrome, dns2tcp, Iodine, and DNSCat2 are the browsers and tools used to generate traffic, while Cloudflare, AdGuard, Quad9, and Google DNS are the servers that respond to the DoH requests.
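The control flow of such a two-layer scheme can be sketched as follows. This is only the cascade logic: the two predicate functions stand in for the statistical-features classifier and the time-series classifier, and their field names and thresholds are hypothetical placeholders rather than anything taken from the paper.

```python
from typing import Callable, Dict

Flow = Dict[str, float]


def classify_flow(
    flow: Flow,
    is_doh: Callable[[Flow], bool],
    is_malicious: Callable[[Flow], bool],
) -> str:
    """Layer 1 separates DoH from non-DoH; layer 2 runs only on DoH flows."""
    if not is_doh(flow):
        return "non-DoH"
    return "malicious-DoH" if is_malicious(flow) else "benign-DoH"


# Hypothetical stand-ins for trained models.
def toy_is_doh(flow: Flow) -> bool:
    return flow.get("to_doh_resolver", 0) == 1


def toy_is_malicious(flow: Flow) -> bool:
    return flow.get("mean_packet_len", 0.0) > 500.0


flows = [
    {"to_doh_resolver": 0, "mean_packet_len": 900.0},
    {"to_doh_resolver": 1, "mean_packet_len": 120.0},
    {"to_doh_resolver": 1, "mean_packet_len": 800.0},
]
results = [classify_flow(f, toy_is_doh, toy_is_malicious) for f in flows]
print(results)
```

Because layer 2 never sees non-DoH flows, the more expensive time-series model only has to run on the fraction of traffic that layer 1 lets through.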
InSDN is a dataset of attacks specific to software-defined networking (SDN) that is publicly available to researchers. It is divided into three groups based on traffic type and target machine: the first group contains normal traffic, the second contains attack traffic directed at the Metasploitable-2 server, and the third contains attacks on the OVS machine. For each group, the tcpdump tool is used to record traffic traces at the target machine and at the SDN controller interface. The CICFlowMeter tool is then used to extract flow features. Feature information such as the source IP and destination IP was used for labelling. The dataset contains 343,939 instances in total, of which 68,424 are normal traffic and 275,515 are attack traffic. More than 80 features in 56 categories were gathered. To make things easier, the features are divided into 8 groups: packet-based attributes, network identifier attributes, bytes-based attributes, interarrival time attributes, flow timer attributes, flag attributes, flow descriptor attributes, and subflow descriptor attributes.
In January 2020, the Stratosphere Laboratory in Czechia first released the IoT-23 dataset, which was captured in a real environment. It contains network traffic from 20 malware captures and 3 benign captures. The benign captures contain 30,858,735 flows. The malicious captures contain the following numbers of flows per label: Part-Of-A-Horizontal-PortScan (213,852,924 flows), Okiru (47,381,241 flows), Okiru-Attack (13,609,479 flows), DDoS (19,538,713 flows), C&C-HeartBeat (33,673 flows), C&C (21,995 flows), Attack (9,398 flows), C&C (888 flows), C&C-HeartBeat-Attack (883 flows), C&C-FileDownload (53 flows), C&C-Torii (30 flows), FileDownload (18 flows), C&C-HeartBeat-FileDownload (11 flows), Part-Of-A-Horizontal-PortScan-Attack (5 flows), and C&C-Mirai (2 flows). The IoT-23 dataset has 21 features. The attributes are of mixed type: some are nominal, some are numeric, and some are timestamps.
LITNET-2020 was collected in a real network over 10 months. The traffic is captured in binary nfcapd files, with a single nfcapd file collected per week for 2 capture periods. The size of the files is 1.3 GB. There are 19 attack types. Network IP addresses are anonymized. There are 85 attributes in total, including 49 attributes specific to the NetFlow v9 protocol, 15 attributes added in the extended dataset, and 19 attributes added by the generator for recognizing the attack type.
The NetML dataset was released in 2020 for the open challenge of the network traffic analytics using machine learning workshop, sponsored by Intel Corporation. It was created from 30 traffic captures obtained from Stratosphere IPS. Flow features are extracted in JSON format and written to the output file line by line. Metadata features are extracted for every flow, and DNS, HTTP, and TLS features are extracted if the flow sample contains packets for the corresponding protocol. To build the NetML dataset, every flow is assigned a unique identifying number together with the label information from the raw traffic packet capture file, and both are appended to the output JSON file. There are 484,056 flows and 48 feature attributes. Source and destination IP addresses are replaced by a masked string to hide the real addresses. The dataset is divided into training, test-std, and test-challenge sets.
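The annotation step described above (one JSON object per line, a unique flow number, an attached label, and masked addresses) might look like the following sketch. The field names `sa`, `da`, `id`, and `label`, the salt, and the use of a truncated hash as the mask are all assumptions made for illustration; the text does not specify how NetML performs the masking.

```python
import hashlib
import json


def mask_ip(ip: str, salt: str = "demo-salt") -> str:
    """Replace an IP address with a stable, non-reversible placeholder string.

    Assumption: a salted hash is used; the dataset's real scheme is not stated.
    """
    digest = hashlib.sha256((salt + ip).encode("utf-8")).hexdigest()
    return "IP_MASKED_" + digest[:8]


def annotate_flows(flows, labels):
    """Return JSON lines with a unique id, a label, and masked addresses."""
    lines = []
    for flow_id, (flow, label) in enumerate(zip(flows, labels)):
        record = dict(flow)            # copy so the input is not modified
        record["id"] = flow_id
        record["label"] = label
        record["sa"] = mask_ip(record["sa"])
        record["da"] = mask_ip(record["da"])
        lines.append(json.dumps(record))
    return lines


# Hypothetical flows with made-up addresses and fields.
flows = [
    {"sa": "192.0.2.10", "da": "198.51.100.7", "num_pkts_out": 12},
    {"sa": "192.0.2.10", "da": "203.0.113.9", "num_pkts_out": 3},
]
lines = annotate_flows(flows, ["benign", "malware"])
for line in lines:
    print(line)
```

A deterministic mask keeps flows from the same host linkable for analysis while hiding the original address.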
The dataset LoED (LoRaWAN at the Edge), developed by Bhatia and colleagues, is a publicly available dataset containing network data sourced from a LoRaWAN sensor network located in the urban area of London, U.K. The authors collected LoRaWAN packets over a period of four months in 2019 and 2020. This involved capturing data from nine strategically positioned gateways in different urban environments, including 5 outdoor and 4 indoor locations. Each packet recorded in the dataset contained diverse information, including signal modulation characteristics, sender's address, and raw payload. The researchers also performed an initial analysis to glean insights from the recorded traffic. This analysis involved examining metrics such as daily or per-node packet count and the distribution of specific traffic attributes like frequency, packet type, or LoRa spreading factor.
To distinguish seemingly identical LoRa devices, particularly those of the same model, Al-Shawabka and collaborators compiled a dataset that includes a significant amount of radio data from LoRa devices. This data was collected over multiple days in two different environments, and interested parties can request access to the dataset. The setup included 100 sensor devices, each composed of a Pysense sensor connected to a FiPy radio board. To collect data, every device was configured to send ten consecutive bursts. Each burst includes 100 successive measurements of device voltage, temperature, and humidity, with a gap of 10 ms between measurements. A one-second interval separated each burst from the previous one. Data collection occurred in both indoor and outdoor testbed environments on two occasions. As a result, 2000 packets were collected per device, totaling 200,000 packets. To label the datasets, an additional SigMF metafile, a JSON-based file format for describing radio data, was employed and extended to cover LoRa-specific characteristics. Elmaghbub and Hamdaoui undertook a comparable investigation, in which the testbed employed twenty-five identical Pycom LoRa devices for radio fingerprinting. Their dataset was recorded in various experimental scenarios, introducing variations in sender and receiver location, recording time, and distance. The dataset contains radio information stored in SigMF files, accompanied by descriptive metadata.
Within the GHOST EU project, Anagnostopoulos and colleagues created the openly accessible GHOST-IoT-data-set. This dataset includes network traffic from an actual Smart Home setup implemented in a dwelling occupied by two individuals. The Smart Home environment includes nine devices: two door sensors, four motion sensors, an emergency button, a blood pressure meter, and a weight scale. These devices employ various network communication protocols, including IP, Z-Wave, ZigBee, RF869, and Bluetooth. The emergency button employs RF869, a proprietary protocol designed to operate within the ISM (Industrial, Scientific, and Medical) band, specifically around 869 MHz. To enable communication between the ZigBee and Bluetooth devices and the Internet, a gateway is included in the configuration. This gateway uses PPP to establish a direct connection to a router, enabling Internet access. The network traffic of these devices was recorded in October 2019 over a period of ten days. The normal behaviour of the devices was captured during the first nine days, and on the tenth day seven attack scenarios were executed. These scenarios included connecting unknown devices, modifying the devices' locations, simulating physical battery drain or firmware modification by removing the devices' batteries, and flooding the network with a large number of consecutive measurements. The entire dataset includes 3,811,419 raw packets. Supplementary files included in the dataset consolidate the traffic into flows and provide descriptive statistical features. It is important to note that the attacks within the dataset are not explicitly labeled.
Vigoya and colleagues generated the publicly available DAD (Dataset for Anomaly Detection), which comprises annotated IoT network traffic. The objective was to create an extensive dataset with diverse scenarios and annotations that machine learning algorithms could use to identify anomalies in IoT sensor networks. DAD was obtained from a simulated virtual environment mimicking the behaviour of sixteen IoT temperature sensors deployed in a data center. These sensors sent measurement samples every five minutes using MQTT over IP. Network traffic was captured at the MQTT broker over a period of seven days. To introduce attack scenarios, the sensors were altered on five of the recorded days using three different methods: interception (withholding certain temperature measurements), modification (altering temperature measurements), and duplication (sending more packets than usual). The complete dataset contains 101,583 packets, each labeled as normal or anomalous. MQTT packets constitute 63.3% of the dataset, and 16% of them are identified as malicious.
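The three alteration methods can be illustrated on a plain list of temperature readings. This is a minimal sketch: which measurements are withheld, how values are changed, and how many duplicates are sent are hypothetical parameters chosen for the example, not the settings used to build DAD. The final lines compute how many measurement messages sixteen sensors would send in seven days at one sample every five minutes.

```python
def interception(samples, drop_every=3):
    """Withhold every `drop_every`-th measurement."""
    return [s for i, s in enumerate(samples, start=1) if i % drop_every != 0]


def modification(samples, offset=10.0):
    """Alter each measurement by a fixed offset."""
    return [s + offset for s in samples]


def duplication(samples, copies=2):
    """Send each measurement `copies` times instead of once."""
    return [s for s in samples for _ in range(copies)]


readings = [20.0, 20.5, 21.0, 21.5, 22.0, 22.5]
print(interception(readings))
print(modification(readings))
print(duplication(readings))

# Expected number of measurement messages under normal operation.
sensors = 16
samples_per_sensor = 7 * 24 * 60 // 5   # seven days, one sample per 5 minutes
expected_messages = sensors * samples_per_sensor
print(expected_messages)
```

Interception and duplication change the message rate, so they can be visible in traffic volume alone, whereas modification leaves the rate intact and can only be detected from the payload values.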
Guerra-Manzanares and co-authors contended that the existing datasets intended for machine-learning-based IDSs in IoT environments were limited in both quantity and size. The MedBIoT dataset was publicly released to address this shortcoming. The annotated dataset combines emulated and real IoT devices within a network containing 83 IoT devices. Three of them are physical devices (one smart bulb and two smart switches), while eighty devices are emulated in Docker containers. To generate attack traffic, malicious software, including BASHLITE, Mirai, and Torii, was deployed, and data was gathered from all endpoints and servers while the botnet propagated. The researchers concentrated on the early phases of botnet deployment, specifically propagation and Command and Control (C&C) communication. The network was segmented into three parts: the monitoring network (for storing and processing the data collected from the network switch), the Internet network (for connectivity), and the IoT LAN network (to allow controlled malware spread). The Kitsune auto-encoder tool was used to extract machine learning features from the raw data (PCAP files), yielding 100 statistical features computed across various time windows.
Vaccari and colleagues created MQTTset, a publicly available dataset specifically designed for the MQTT protocol. The researchers employed a network comprising eight Smart Home sensors that measured variables such as humidity, temperature, and smoke. These sensors were emulated using the IoT traffic generator IoT-Flock and were connected to an MQTT broker running Eclipse Mosquitto. The complete traffic includes both legitimate activity and attacks targeting the MQTT protocol. The traffic was captured at the broker over the course of a week. The dataset contains 34 features and is available either in CSV format or as raw packet captures. These features include a label that indicates whether the traffic is classified as malicious or legitimate.
Al-Hawawreh et al. presented the X-IIoTID dataset. X-IIoTID is an intrusion dataset tailored for Industrial Internet of Things (IIoT) environments and designed to accommodate system heterogeneity. It covers a wide array of attack types, protocols, and multiview features, and was evaluated using machine learning algorithms to advance security solutions. Data collection spans end-to-end network traffic, from physical field devices to edge gateways and from edge gateways to cloud and enterprise devices. To ensure accurate and comprehensive data capture, the dumpcap tool is installed on the edge gateway (a Raspberry Pi B+ with a 64-GB memory card), periodically capturing network traffic in pcap format for up to 2 hours. System activity reporter (SAR) tools are employed to gather edge gateway resource data, while OSSEC logs track online mode alerts and edge gateway activities. The dataset captures normal and background traffic alongside attack scenarios, spanning a four-month period from 5 December 2019 to 23 March 2020. Various attack experiments, including Ransom Denial-of-Service (RDoS) and brute force, were conducted repeatedly from 7 January 2020 to 27 March 2020, covering distinct attack stages such as weaponization, reconnaissance, exploitation, command and control (C&C), lateral movement, crypto ransomware, tampering, RDoS, and exfiltration.
Ullah and Mahmoud presented dataset IOTID20. The IoTID20 dataset is derived from a test environment that integrates IoT devices and interconnected systems, mimicking the setup of a standard smart home. The dataset is created using Wi-Fi cameras from EZVIZ and SKT NGU, which act as IoT devices vulnerable to attacks, and are linked to a Wi-Fi router of smart home. Router is connected to other devices like tablets, smartphones, and laptops and they act as attacking device. Attacks including Mirai (http flooding, Brute force, udp flooding), DoS (syn flooding), Scan (OS, host port) and MITM (ARP spoofing) are simulated. Features are extracted from pcap files using CICflowmeter application, this leads to the formation of a CSV dataset containing 80 network characteristics and three labeling attributes (binary, category, sub-category). The dataset mirrors modern IoT network communication trends and It stands out as one of the scarce publicly accessible datasets for IoT intrusion detection.
This dataset pioneers the simulation of an MQTT-based network, offering a comprehensive exploration of network behaviors and associated attacks. Generated through a simulated MQTT network architecture, the dataset comprises an attacker, a broker, twelve sensors, and a simulated camera. It records five scenarios: aggressive scan, normal operation, UDP scan, MQTT brute-force attack, and Sparta SSH brute-force. Three levels of abstracted features are extracted from the raw pcap files: packet characteristics, one-way flow attributes, and two-way flow attributes. The dataset includes csv feature files tailored for Machine Learning (ML) applications, facilitating predictive analysis.
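The three feature levels mentioned above (packet, one-way flow, two-way flow) differ mainly in how packets are grouped. The following is a minimal sketch of that grouping, not the dataset authors' extraction code; the packet tuples, field order, and function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical packet records:
# (timestamp_s, src_ip, src_port, dst_ip, dst_port, proto, size_bytes)
packets = [
    (0.00, "10.0.0.5", 51000, "10.0.0.1", 1883, "TCP", 90),
    (0.02, "10.0.0.1", 1883, "10.0.0.5", 51000, "TCP", 60),
    (0.50, "10.0.0.5", 51000, "10.0.0.1", 1883, "TCP", 120),
]

def uniflow_key(pkt):
    """One-way flow: direction matters, so the 5-tuple is used as-is."""
    _, src, sport, dst, dport, proto, _ = pkt
    return (src, sport, dst, dport, proto)

def biflow_key(pkt):
    """Two-way flow: both directions map to the same key by sorting endpoints."""
    _, src, sport, dst, dport, proto, _ = pkt
    a, b = sorted([(src, sport), (dst, dport)])
    return (a, b, proto)

def aggregate(pkts, key_fn):
    """Group packets by key and compute simple per-flow statistics."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0, "first": None, "last": None})
    for pkt in pkts:
        flow = flows[key_fn(pkt)]
        flow["packets"] += 1
        flow["bytes"] += pkt[6]
        flow["first"] = pkt[0] if flow["first"] is None else min(flow["first"], pkt[0])
        flow["last"] = pkt[0] if flow["last"] is None else max(flow["last"], pkt[0])
    return dict(flows)

uni_flows = aggregate(packets, uniflow_key)  # request and reply are separate flows
bi_flows = aggregate(packets, biflow_key)    # request and reply share one flow
```

In this toy input, the client-to-broker and broker-to-client packets form two one-way flows but a single two-way flow, which is why two-way features can capture request/response behavior that one-way features cannot.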
Sarhan et al. presented the NF-UQ-NIDS dataset. It merges four prominent network intrusion detection datasets (BoT-IoT, UNSW-NB15, CSE-CIC-IDS2018, and ToN-IoT), consolidated within the NF collection by the University of Queensland. This unified dataset aims to standardize network-security datasets, enabling interoperability and facilitating larger-scale analyses. It incorporates an extra label attribute that identifies the source dataset of each flow, allowing comparison across different testbed networks. Attack categories have been consolidated for clarity, with specific attacks grouped under parent categories such as DoS, DDoS, brute-force, and injection attacks. The resulting class labels are Analysis, Backdoor, Benign, Brute Force, Bot, DoS, Fuzzers, Exploits, Generic, Infiltration, Shellcode, Reconnaissance, Worms, Theft, DDoS, Injection, Password, MITM, Ransomware, XSS, and Scanning.
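The merging idea described above (tag every flow with its source dataset and map specific attack names onto parent categories) can be sketched as follows. This is an illustrative sketch only: the column names, the sample rows, and the entries in `PARENT_CATEGORY` are assumptions, not the mapping actually used by the NF-UQ-NIDS authors.

```python
# Hypothetical mapping from dataset-specific attack names to parent categories.
PARENT_CATEGORY = {
    "DoS-Hulk": "DoS",
    "DDoS-LOIC-HTTP": "DDoS",
    "SSH-Bruteforce": "Brute Force",
    "SQL-Injection": "Injection",
}

def merge_flows(sources):
    """Merge per-dataset flow lists into one list with a source-dataset label.

    `sources` maps a dataset name to a list of flow dicts that each carry an
    "Attack" field. Unknown attack names are kept unchanged.
    """
    merged = []
    for dataset_name, flows in sources.items():
        for flow in flows:
            row = dict(flow)                 # copy so the input is not mutated
            row["Dataset"] = dataset_name    # extra label: where the flow came from
            row["Attack"] = PARENT_CATEGORY.get(flow["Attack"], flow["Attack"])
            merged.append(row)
    return merged

# Toy input with made-up values.
sources = {
    "NF-CSE-CIC-IDS2018": [{"IN_BYTES": 1200, "Attack": "SSH-Bruteforce"}],
    "NF-UNSW-NB15": [{"IN_BYTES": 300, "Attack": "Benign"}],
}
merged = merge_flows(sources)
```

Keeping the source label as a separate column lets a study train on the merged data while still reporting results per testbed.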
Hafeez Ibbad et.al presented dataset IOTKeeper. The test environment utilizes IOT-KEEPER as its network gateway, employing tcpdump to record all incoming and outgoing traffic through both wired and wireless interfaces on the gateway. When an IoT device communicates with the Internet through an IoT hub using protocols such as Weave or ZigBee, their device-to-internet (D2I) communications are observed by capturing network traffic at the IoT hub. The dataset includes traffic data from both harmless and harmful network actions; harmful activities are mimicked using Raspberry Pis and IoT devices infected with Mirai. To minimize potential hazards, the gateway is set up to block all unfiltered outgoing traffic, thus averting the propagation of malicious behavior onto the public Internet.
Meidan et.al presented dataset IOT-DeNAT. The dataset records genuine network traffic from different IoT and non-IoT devices situated behind a NAT configuration. A Cisco Catalyst 2960-X switch was divided into VLANin and VLANout, symbolizing the home network and the telco side, respectively. IoT devices, laptops, and smartphones were linked to VLANin via a wireless access point. VLANin was connected to a NAT router (Cisco 3825) with NetFlow installed, which bridged to VLANout connected to the Internet. NetFlow data was gathered using nProbe on a server and Raspberry Pi for centralized training and local deployment simulations. Additionally, port mirroring from VLANin and VLANout facilitated pcap file capture with Wireshark for deNAT-related studies. Commercial IoT devices from seven manufacturers, including Amazon and Samsung, were deployed alongside non-IoT devices, representing popular home IoT types like Media/TV, Surveillance, and Home Automation for empirical evaluation.
Perdisci et.al. presented dataset IOTFinder. The IOTFinder has many datasets. One among them is IOT DNS dataset which comprises various IoT devices, including cameras, voice assistants, smart speakers, TVs, IoT hubs, smart appliances, lights, game consoles. In a laboratory setting with human presence, these devices are deployed, where cameras monitor movements, voice-based assistants record conversations, and devices are occasionally activated or utilized by occupants of the lab. Additionally, thermostats monitor room temperature changes, and appliances like Roomba vacuum cleaners are occasionally operated. The dataset records all DNS traffic generated by the IoT devices over a period of approximately 1.5 months. It encompasses 53 active IoT devices sourced from diverse vendors.
This dataset was created to support research in CAN analysis, specifically signal extraction and translation. The dataset consists of 40 logs of CAN traffic collected by sending OBD queries at regular intervals during controlled driving conditions. Each log file is named according to the corresponding PID used for the query. Certain vehicle parameters, particularly those related to the powertrain, have significant values only when the vehicle is in motion. For instance, vehicle speed returns '00'h when queried while the vehicle is not moving. The CAN communication system's lack of authentication for connected nodes allows straightforward access via the OBD-II port, using the widely adopted Kvaser CAN interface device or other commercial CAN interfaces like Raspberry-Pi and Arduino. To enable cross-analysis, diagnostic responses and normal CAN messages were collected concurrently, using a 200 ms transmission interval to avoid interfering with normal CAN traffic, even though shorter request intervals are ideal. The dataset attributes include Timestamp (recording time in seconds), CAN ID (in hexadecimal), DLC (number of data bytes from 0 to 8), and DATA[0~7] (data values in bytes). In total, 40 dump files corresponding to distinct PIDs were generated, each containing around 127–128k CAN messages, including 300 diagnostic response messages.
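As a rough illustration of the record layout described above (Timestamp, CAN ID, DLC, DATA[0~7]), the sketch below parses one log line into a typed record. The whitespace-separated line format and the names `CanFrame` and `parse_can_line` are assumptions made for this example; the actual dump files may use a different layout.

```python
from dataclasses import dataclass

@dataclass
class CanFrame:
    timestamp: float  # recording time in seconds
    can_id: int       # arbitration ID (hexadecimal in the log)
    dlc: int          # number of data bytes, 0 to 8
    data: bytes       # DATA[0..dlc-1]

def parse_can_line(line):
    """Parse an assumed 'timestamp id dlc b0 b1 ...' line into a CanFrame."""
    fields = line.split()
    timestamp = float(fields[0])
    can_id = int(fields[1], 16)
    dlc = int(fields[2])
    if not 0 <= dlc <= 8:
        raise ValueError(f"DLC out of range for classic CAN: {dlc}")
    data = bytes(int(byte, 16) for byte in fields[3:3 + dlc])
    if len(data) != dlc:
        raise ValueError("fewer data bytes than the DLC announces")
    return CanFrame(timestamp, can_id, dlc, data)

# Illustrative diagnostic response: service 0x41, PID 0x0D (vehicle speed),
# value 0x00, matching the stationary-vehicle case mentioned above.
frame = parse_can_line("12.345678 7E8 8 03 41 0D 00 00 00 00 00")
speed_kmh = frame.data[3]
```

Validating the DLC against the number of bytes actually present is a cheap way to catch truncated or malformed log lines before feature extraction.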
The ROAD dataset includes 30 minutes of CAN-bus attack data from 33 unique scenarios and 3 hours of normal data, collected from a vehicle’s OBD-II port. Featuring diverse attack types—like Fuzzing, Fabrication, Masquerade, and Accelerator Attacks—it provides signal-translated and anonymized data for realistic IDS training and testing in vehicular networks.
The CAN-FD Dataset provides real-world in-vehicle CAN-FD traffic data, including one hour of normal driving and injected attack scenarios: Flooding, Fuzzing, and Malfunction. Collected in 2021 vehicles, it includes attributes like Timestamp, Arbitration ID, DLC, Data[0-64], and Labels, offering a valuable resource for advancing intrusion detection systems in CAN-FD networks.
The M-CAN Intrusion Dataset 2021 features normal and abnormal CAN traffic data, including DoS and fuzzing attacks, collected from a Genesis G80 during a 36-minute drive. With attributes like Timestamp, CAN ID, DLC, Payload, and Labels, it supports research on in-vehicle communication network security.
The B-CAN Intrusion Dataset 2021 includes normal and attack traffic data from the Genesis G80, targeting low-speed systems like BCM lights and smart key modules. With attributes like Timestamp, CAN ID, DLC, Payload, and Labels, it features DoS and fuzzing attacks injected into normal city-driving data, supporting intrusion detection research.
HIKARI-2021 is a recent IDS dataset captured between 28 March and 4 May 2021, with each capture session lasting 3 to 5 hours. Because network traffic evolves over time, it is important to use an up-to-date dataset. HIKARI-2021 was generated with a mixture of ground-truth data and contains network traffic with encrypted traces. It incorporates 80 features from the CICIDS-2017 dataset plus 4 additional features: source IP address, destination IP address, source port, and destination port. The dataset contains 555,278 entries covering synthetic benign traffic and malicious traffic. Packet traces are processed to anonymize the background traffic (payload and IP addresses) before features are extracted. The dataset comprises packet traces, documentation, and extracted features.
TON-IoT is a new generation of network traffic, operating system, and IoT/IIoT datasets. Its name reflects the heterogeneity of its data sources, which were collected in 2021 from network traffic telemetry, operating system datasets of Windows 7 and 10 and Ubuntu 14.04 and 18.04 LTS, and IoT and IIoT sensors. It was developed at the IoT Lab of UNSW Canberra Cyber, School of Engineering and Information Technology (SEIT), UNSW Canberra, and was simulated on a large, realistic testbed network that uses NFV, SDN, and service orchestration (SO) to allow communication between the fog, edge, and cloud layers. Attack and normal events are labelled. The TON-IoT network dataset includes 22,339,021 records in 23 CSV files, which were extracted from the pcap files using the Zeek tool; each file has 43 attributes plus 2 class-label attributes. The dataset is divided into train and test sets. There are nine attack categories: DoS, scanning, DDoS, backdoor, ransomware, cross-site scripting, injection, man-in-the-middle, and password cracking.
The USB-IDS dataset was developed at the University of Sannio at Benevento (USB), Italy, in 2021. It is a multilayer dataset that covers both network traffic and application-level facets such as server-side performance, configuration, and defences. Its attacks are DoS-based. The data is provided as comma-separated values (CSV) files containing labelled bidirectional network flows, with 84 values per record including the label. Flows are obtained using CICFlowMeter. In USB-IDS-1 only attacks are present; it does not record benign network traffic profiles over a large network. The attacks present in this dataset are Hulk, TCPFlood, Slowloris, and Slowhttptest.
AWID3 extends the applicability of the widely known AWID2 by capturing traces of attacks against IEEE 802.1X Extensible Authentication Protocol (EAP) systems, and it is expected to contribute significantly to intrusion detection research. The raw cleartext pcap files for AWID3 are publicly available. In addition, 254 features (253 generic features plus 1 label) were manually extracted in CSV format. This "AWID3-CSV" dataset complements the original one given in pcap format. The extracted features span both the MAC and application layers of the recorded pcap files and are separated by layer.
The VHS-22 dataset contains over 27 million flows, of which approximately 20 million are regular traffic and approximately 7 million are network attacks. Flows consisting of a single packet were identified in both groups and are referred to as zero-duration flows. They account for 45% of total traffic and 83% of attacks (mainly DoS-related). UDP was used in 62% of flows, whereas the others use TCP. Among zero-duration flows, 66% are reported as UDP and 38% as TCP. Flows longer than 160 seconds are also included in the dataset. The majority of flows are shorter than 200k packets, but the longest flow has nearly 11.7M packets. In terms of attack distribution, DoS-related attacks (93.9%) and botnet traffic (5.8%) account for the greatest number of flows labelled as attacks. The remaining attack-related traffic consists of malware (0.15%), web attacks (0.05%), and brute-force attacks (0.8%).
This dataset includes benign Audio Video Transport Protocol (AVTP) packet captures originating from our physical automotive Ethernet testbed. It also incorporates a demonstration of a replay attack conducted on the automotive Ethernet to construct an intrusion dataset. In this scenario, we simulate a hypothetical situation in which an attacker inserts arbitrary AVTP data units (AVTPDUs) into the In-Vehicle Network (IVN) to create a single video frame on a terminal application connected to the AVB listener. This is achieved by injecting previously generated AVTPDUs within a specific timeframe. To illustrate this attack, we extract 36 continuous AVTPDUs from one of our AVB datasets (contained in single-MPEG-frame pcap), which collectively form a single video frame. The attacker then carries out a replay attack by repeatedly transmitting these 36 stream AVTPDUs. The results of this replay attack are in the *_injected.pcap files. Additionally, the dataset includes four benign (attack-free) packet captures: driving_01_original.pcap (approximately 10 minutes), driving_02_original.pcap (approximately 16 minutes), indoors_01_original.pcap (around 24 minutes), and indoors_02_original.pcap (approximately 21 minutes). Table 68 lists the works that used the Automotive Ethernet Intrusion dataset.
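The notion of a zero-duration flow (a flow with exactly one packet) and its share within a traffic class can be sketched as below. The flow records and helper names are hypothetical, and the toy numbers are not taken from VHS-22; the sketch only shows how such shares could be computed from labelled flow records.

```python
# Hypothetical flow records; a real flow table would carry many more fields.
flows = [
    {"proto": "UDP", "packets": 1, "label": "attack"},
    {"proto": "TCP", "packets": 12, "label": "benign"},
    {"proto": "UDP", "packets": 1, "label": "benign"},
    {"proto": "TCP", "packets": 1, "label": "attack"},
]

def is_zero_duration(flow):
    """A single-packet flow has identical first and last timestamps."""
    return flow["packets"] == 1

def zero_duration_share(flow_records, label=None):
    """Fraction of flows (optionally restricted to one label) that are zero-duration."""
    selected = [f for f in flow_records if label is None or f["label"] == label]
    if not selected:
        return 0.0
    zero = sum(1 for f in selected if is_zero_duration(f))
    return zero / len(selected)

share_all = zero_duration_share(flows)               # over all flows
share_attack = zero_duration_share(flows, "attack")  # over attack flows only
```

Reporting the share separately per label matters because single-packet flows carry no timing information, so a detector trained on duration-based features may treat them very differently from longer flows.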
Zolanvari and colleagues identified the absence of datasets specifically designed for Industrial Internet of Things network security. To address this deficiency, they suggested a compact test environment created to mimic a standard Industrial Control System (ICS), focusing on a water level monitoring system in particular. The experimental setup includes a Programmable Logic Controller (PLC), four actuators, three sensors, and four additional versatile devices, with one of them operating as the attacker. Using the Modbus protocol, communication between actuators and sensors was enabled. The researchers equipped the testbed with measuring instruments and monitored the networks activity under normal conditions and during a simulated attack, capturing data over a period of 53 hours. To model potential attack scenarios, four categories of attacks were executed: reconnaissance, command injection, Denial of Service (DoS), and the installation of a backdoor. A portion of their recorded data was released to the public as the Dataset WUSTL-IIOT-2021, presented in a CSV format. There are 41 features and one feature is assigned as attack label. These characteristics offer a synopsis of the network traffic documented within the testbed. Researchers in their study, focusing on the application of Explainable Artificial Intelligence (AI) to enhance the security of the Industrial Internet of Things (IIoT), utilize this dataset.
Liu and team showed that existing intrusion detection technologies face challenges such as limited applicability in scalable and dynamic environments like contemporary IoT, where there is a trade-off between the limited resources of IoT devices and the increasing volume of data traffic. To tackle these challenges, they introduced the CCD-INID-V1 dataset and used it alongside two additional datasets. CCD-INID-V1 contains real network traffic from smart labs and smart homes, produced in an experimental setup. The setup involves four Raspberry Pis equipped with sensors, functioning as IoT sensing devices. In the smart home scenario the IoT devices gather temperature readings, whereas in the smart lab scenario they gather both temperature and pressure measurements. The recorded data was sent to a cloud server over a Wi-Fi IP connection using HTTPS. The authors mimicked five attacks: UDP Flood, ARP Poisoning, Hydra Bruteforce with Asterisk protocol, ARP DoS, and SlowLoris. Dataset entries are categorized as either normal or malicious. Additionally, NFStream was used to extract 83 features, including various timestamps, packet and byte counts, MAC addresses, and source and destination IPs. The dataset is not publicly accessible, but interested individuals can acquire it by contacting the authors directly.
Sousa et.al presented dataset DOS and MITM. This dataset encompasses a wide array of Denial of Service (DoS) and Man-in-the-Middle (MiTM) attacks and is available publicly. The dataset originates from the base architecture deployed in Fed4Fire+ testbeds, specifically virtual Wall2 and Grid5000, where data collection occurred. Each component within the architecture possesses both data and control interfaces, with pertinent information, such as Modbus TCP data packets exchanged during process control tasks, captured via data interfaces. Capture operations were conducted at the bridge node using the dumpcap tool. The PLC master node initiates queries to other PLC nodes for sensor information stored in specific registers, while PLC slaves hold sensor data in holding registers based on PLC internal mechanisms. Multiple PLCs, all based on the OpenPLC v3 version, exist within the testbed. Notably, file sizes range between 10 and 20MB.
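To make the master/slave exchange above more concrete, the sketch below builds the kind of Modbus TCP request a master would send to read holding registers (function code 0x03). It is a generic illustration of the standard frame layout, not code from the testbed; the function name and the example transaction ID, unit ID, and register range are assumptions.

```python
import struct

READ_HOLDING_REGISTERS = 0x03

def build_read_holding_registers(transaction_id, unit_id, start_address, quantity):
    """Build a Modbus TCP request: MBAP header followed by the PDU.

    MBAP header: transaction id (2 bytes), protocol id (2 bytes, always 0),
    length (2 bytes, counts the unit id plus the PDU), unit id (1 byte).
    PDU: function code (1 byte), start address (2 bytes), quantity (2 bytes).
    All fields are big-endian.
    """
    if not 1 <= quantity <= 125:
        raise ValueError("quantity must be between 1 and 125 registers")
    pdu = struct.pack(">BHH", READ_HOLDING_REGISTERS, start_address, quantity)
    mbap = struct.pack(">HHHB", transaction_id, 0, len(pdu) + 1, unit_id)
    return mbap + pdu

# Example values (hypothetical): read 4 registers starting at address 0.
request = build_read_holding_registers(
    transaction_id=1, unit_id=1, start_address=0, quantity=4
)
```

Because these requests are sent in cleartext with no authentication, a man-in-the-middle on the bridge node can read or alter them, which is what makes Modbus TCP captures useful for DoS and MiTM studies.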
Kalupahana Liyagage et al. presented the NSS MIRAI dataset. The dataset originates from an IoT network simulation featuring a Mirai-style attack scenario, managed through OpenStack. It encompasses virtual machines (VMs) including seven gateways, a security manager, 65 IoT devices, one victim of the attack, two external bots, a command and control (C&C) server, a loader, and a port scan generator, all running on Ubuntu 18.04 and custom Linux images. The experiment spans 60 minutes and includes activities such as network scanning, port scanning, and brute-force login attempts. The dataset, called NSS, is publicly available and captures login attempts, port scans, vulnerability scans, malware loading, C&C communication (both successful and failed), scan-out activities (ports and login), DDoS attacks (volumetric and reflective), and noise (false alerts). It enables the study and analysis of IoT botnet behaviors and attack patterns under controlled experimental conditions.
Trajanovski et al. presented the IoT-BDA dataset. It contains the results of a study carried out with the IoT-BDA Framework, in which 4077 distinct IoT botnet samples collected via honeypots were analyzed. Using sandbox execution, the framework conducted static, behavioral, and network analyses to detect signs of compromise and attack, as well as tactics such as anti-dynamic-analysis, anti-static-analysis, anti-forensics, and persistence used by IoT botnets. Each sample was scanned via VirusTotal, with the AVClass malware classifier attributing the most probable malware family. The dataset enables the grouping of IoT botnet samples according to their static, behavioral, and network characteristics through clustering techniques. It encompasses the botnet samples (ELF files), recorded system call behaviors, and captured network traffic (.pcap), providing insight into IoT botnet behaviors and characteristics.
Erfani et al. presented the BOTIOTTONIOT dataset, which merges two prominent IoT datasets, BoT-IoT and TonIoT. The merge is justified by shared cybersecurity attack types (DDoS, Scanning, DoS) and five common IoT devices: smart fridge, weather station, remote garage door, motion-activated lights, and smart thermostat. All experiments are conducted within identical environments. BoT-IoT and TonIoT offer 33 and 22 features respectively, with 12 features in common. The dataset uses PCAP files from both sources, supplemented by 40 new features proposed by the authors to enhance attack classification. Visualization and dimensionality reduction techniques such as t-SNE are used to compare the datasets with and without these additional features, giving deeper insight into attack patterns and behaviors.
Malicious Network traffic PCAPS and binary image visualization
Rosa et.al presented malicious network traffic Pcaps and binary image visualization. This dataset offers a carefully selected assortment of PCAP files obtained from actual malware traffic detected in a virtualized smart home setting within the Cyber-Trust testbed. The (SOHO) smart home setup comprises virtualized devices organized into distinct groups, with each group facilitated by a separate Ubuntu VM acting as the gateway. The dataset contains curated PCAP files originating from various authentic attack scenarios, including zero-day exploits, DDoS assaults leveraging Mirai and Black Energy botnets, infections by Zeus malware on Linux and Windows platforms, Java-RMI and distcc exec backdoors, UnrealIRCD backdoors, Web Tomcat exploits, Ruby DRb code execution, Hydra FTP and SSH brute force attacks, SMTP user enumeration, and NetBIOS-SSN incidents. PCAP files were created by executing live demonstrations of each attack scenario and capturing inter-device network communication using tcpdump. Furthermore, captures of regular network traffic were acquired from unaffected devices through routine network tasks, like file transfers, SSH sessions, media streaming, and API interactions, reflecting typical behavior within a smart home network
Ahmed and team developed the ECU-IoFT dataset, which captures the wireless network traffic of drones, specifically the Tello drone. To compile the dataset, the drone was linked to a Wi-Fi access point and controlled by the user through the corresponding smartphone application. At the same time, an attacker launched three distinct types of attacks on the drone: WPA2-PSK Wi-Fi cracking, a Tello API exploit, and Wi-Fi deauthentication. The wireless traffic produced during these actions was captured and stored on the attacker's device, resulting in a dataset spanning approximately 3 minutes and comprising 54,492 packets. The dataset is organized in a CSV file with 10 recorded traffic features. A Label field indicates whether the traffic is normal or linked to an attack.
The latest dataset under consideration is CoAP-DoS by Mathews et al., which is publicly available. Specifically tailored for IDS based on machine learning, this dataset stands out as the only one identified that addresses the CoAP protocol, focusing on DoS attacks against it. The authors installed measuring instruments on four devices: a server, which is a Raspberry Pi, and three computers, two of which acted as malicious entities. These devices generated both normal and Distributed Denial of Service (DDoS) traffic, which was recorded over a span of 16 hours. The dataset can be obtained in two formats: full packet captures totalling 661,304 packets, and a CSV file containing 17 extracted traffic features. JSON files contain the malicious traffic.
Ferrag and colleagues constructed a comprehensive hybrid testbed for generating and capturing network traffic in both IoT and IIoT domains. It extends across various abstraction layers, from orchestrating services down to the lower perception layer involving IoT and IIoT actuators and sensors, along with intermediary mediator layers. The perception layer contains 13 actuators and sensors, including sound, water level, and temperature sensors as well as DC motors and servos, among others. They outfitted their testbed to produce the Edge-IIoTset dataset, an IIoT dataset suitable for federated learning. The dataset comprises both malicious and benign traffic spanning various IP-based protocols, including application-layer protocols such as DNS, MQTT, Modbus, and HTTP. Although the traffic was not continuous, it spanned 50 days. The dataset offers complete packet captures in PCAP format. To generate attack traffic, a total of 14 attacks were implemented during the collection phase. Subsequently, 61 features were extracted from the traffic, including the attack labels.
Graveto et.al. presented dataset KNX. The dataset comprises data collected from a single-family house equipped with a KNX home automation system controlling lighting, blinds, heating, and alarms. Collected in PCAP format, each packet includes a raw message and timestamp. Enriched through CSV files with additional information from the exported ETS project file, it spans from March 7 to April 13, 2020, yielding 379,875 valid packages after processing.
The IDSAI dataset is a recently developed well-balanced dataset designed for assessing the effectiveness of supervised machine learning methods in identifying intrusions through the analysis of traffic captures of network in IoT communications. Formed in a real-world attack setting, the dataset comprises intrusions, totalling 1,000,000 samples. Initially, it featured twenty four (24) attributes, but after initial pre-processing some features like ports and IP addresses susceptible to manipulation by attackers are removed and dataset was refined to two label columns and nineteen (19) variables, resulting in a 1,000,000 × 21 matrix. The dataset is evenly split between non-intrusion (500,000 samples) and intrusion (500,000 samples) categories. Within the intrusion class, there are ten distinct types, each represented by 50,000 data samples. The ten intrusion types encompass, SYN/ACK and RST Flooding, ICMP echo request Flood/Ping Flood, SYN/ACK Flooding, ARP spoofing, SYN Flooding faster, DDoS MAC Flood, IP Fragmentation, Brute Force SSH, TCP Null, and UDP port scan. The dataset has been made publicly available for research purposes.
The authors meticulously curated a dataset to evaluate machine learning methodologies for detecting intrusions through the analysis of network traffic captures in perimeter intrusion detection systems (PIDS). Sourced from strategically placed security cameras capturing instances of intrusions in authentic attack scenarios, this dataset was used to validate an innovative machine learning-driven approach, aiming to enhance both the effectiveness and efficiency of PIDS for heightened security measures. It includes variations in zoom, roll, and yaw to replicate real-world conditions. Over a continuous 15-day recording period, videos covered day, night, rainy day, and night scenarios, with the camera consistently set to autofocus mode. The authors manually examined the entire 10 days' worth of footage, selecting a subset of 30 hours containing instances of intrusion. From this curated set, 17,000 frames were meticulously chosen for further analysis. The authors have openly made this dataset accessible to facilitate future research efforts in the field.
The TII-SSRC-23 dataset is a thorough compilation of network traffic patterns meticulously assembled to facilitate the development and research of Intrusion Detection Systems (IDS). Herzalla et al. introduced this dataset, accessible at kaggle. It adopts a dual structure, where one segment provides a tabular representation of extracted features in CSV format, and the other supplies raw network traffic data for each traffic type in PCAP files. With a total size of 27.5 GB in PCAP format, the dataset encompasses a diverse range of network scenarios, capturing both benign and malicious instances, featuring 32 traffic subtypes and 26 distinct attacks. Each subtype is enhanced with modification in traffic parameters, establishing it as a valuable resource for researchers in the field of machine learning. The traffic is divided into two main categories, distinguishing between benign and malicious, encompassing 8 traffic types (background, audio, text, bruteforce, video, DoS, Mirai botnet and information gathering), and includes 32 subtypes (26 malicious and 6 benign). Despite an observed imbalance favoring malicious samples, the dataset stands as a crucial asset for the advancement of machine learning research.
The ROSIDS23 dataset is derived from network traffic collected from an autonomously operated robotic system and saved in pcap format. Traffic features were extracted from the pcap files using CICFlowMeter. The dataset features 4 attacks: Denial of Service (DoS), unauthorized publish, subscriber flood, and unauthorized subscribe. Three of the attacks are specific to ROS, while one is a general network security attack. Each dataset entry comprises a timestamp, 83 features, and a label field with five possible values: DoS, benign, unauthorized publish, subscriber flood, and unauthorized subscribe. The largest category is benign traffic (62,511 instances), followed by DoS attacks (31,000 instances) and subscriber flood attacks (30,064 instances). To address particular security issues, the dataset also places emphasis on unauthorized subscribe (5,289 records) and unauthorized publish (7,817 records). The dataset includes both raw and processed files, with attack start times logged in the *_attacktimes.txt file during relevant network traffic sessions. The drawbacks of this dataset include its focus on only four types of attacks, specifically targeting the UDP/TCP and HTTP protocols inherent in ROS. This might result in an incomplete representation for systems incorporating custom or additional protocols.
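The per-class record counts quoted above make the class imbalance easy to quantify. The sketch below computes class shares and inverse-frequency class weights, a common heuristic for imbalanced training rather than anything prescribed by the dataset authors. It assumes the five quoted counts cover the whole dataset.

```python
# Record counts as quoted in the text above.
counts = {
    "benign": 62511,
    "DoS": 31000,
    "subscriber flood": 30064,
    "unauthorized publish": 7817,
    "unauthorized subscribe": 5289,
}

total = sum(counts.values())

# Fraction of all records that belong to each class.
shares = {label: n / total for label, n in counts.items()}

# Inverse-frequency weights: total / (number_of_classes * class_count).
# Rare classes get weights above 1, frequent classes below 1.
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

for label in counts:
    print(f"{label:24s} share={shares[label]:.3f} weight={weights[label]:.2f}")
```

Under this heuristic the two unauthorized-access classes receive the largest weights, which partially compensates for their small number of records during training.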
The X-CANIDS Dataset 2017 provides raw CAN messages and deserialized signals from a 2017 Hyundai LF Sonata, enabling research on in-vehicle intrusion detection. It includes benign driving data and balanced intrusion scenarios (fuzzing, masquerade, replay, etc.), supporting both message-based and signal-based analysis for enhanced vehicle network security.
The new dataset, named Linux-APT Dataset 2024, captures Advanced Persistent Threat (APT) attacks and other sophisticated payloads. It consists of several parts, including two combined files for analysis. Because of a constraint of 10,000 records per file, logs are segregated into files according to specific date ranges, totaling 17 files from October 1st, 2023, to January 7th, 2024. The dataset is available on Zenodo and Mendeley repositories. The 'Processed Version' and 'Combined' files consolidate all data, with one in its original raw form and the other compiled. Data is available in raw format, XML configuration as well as in Comma Separated Value (CSV) format. The collected data includes both qualitative and quantitative information, detailing APTs, malware, and associated vectors. Qualitative data involves the selection of APTs, payloads and malware, simulated in a mostly Linux-based environment. Since APTs are time-consuming, a broad timetable was necessary for accurate evaluation. In total, there are 125,898 records containing both generalized and malicious traffic/logs. The data gathering process covers recent intrusions, published CVEs, Linux privilege escalation payloads, and APTs like APT29, APT41, APT28, and Turla. Additionally, threat emulations such as key-loggers, Apache Struts vulnerabilities, and backdoor malwares are considered.
The CAN-MIRGU dataset is a comprehensive resource for IDS development in autonomous electric vehicles, featuring six weeks of benign CAN traffic and diverse attack scenarios, including DoS, fuzzing, replay, spoofing, suspension, and masquerade attacks. Captured at 500 Kbps under realistic and controlled conditions, it highlights timing and frequency impacts, enhancing IDS research for detecting sophisticated CAN-bus intrusions.
The LSPR23 dataset is sourced from the Locked Shields 2023 live-fire cyber defense exercise, featuring network traffic from a virtual Blue Team. It contains approximately 16 million network flows, with 1.6 million labeled as malicious.
The Farm-Flow dataset is a publicly available intrusion detection dataset for smart agriculture (AG-IoT) networks. It includes traffic from various IoT devices and captures eight types of cyberattacks, such as ARP Spoofing, DDoS, and MQTT Flood, alongside normal traffic. With 1,309,887 instances and labeled for both binary and multiclass classification, the dataset provides structured network flow data in CSV format.
The Gotham Dataset is a large-scale IoT network traffic dataset generated using the Gotham testbed, an emulated IoT environment designed for realistic and heterogeneous network security research. It includes traffic from 78 IoT devices operating on multiple protocols such as MQTT, CoAP, and RTSP. Network traffic was captured in PCAP format using tcpdump, covering both benign and malicious activities. The malicious traffic was generated through scripted attacks, including DoS, Telnet Brute Force, Network Scanning, CoAP Amplification, and Command & Control (C&C) communication. Processed in Python using Tshark, the dataset is available in both raw PCAP and labeled CSV formats. Collected in a distributed manner at the interface between IoT gateways and devices, the dataset provides a valuable resource for developing Intrusion Detection Systems and security solutions for complex IoT environments. It is publicly available on Zenodo.
The IoT Environment dataset encompasses traffic generated by various IoT devices such as EZVIZ, NUGU, Hue, TP-Link and Google Home Mini. These devices engage in diverse activities, including smart speakers answering queries or playing music, home cameras streaming images to cell phones, and smart bulbs controlling lighting functions. The dataset includes instances of Port Scan, HTTP Flooding Attack and OS & Service Detection. Following the setup of the IoT device environment, packet capture was performed using Wireshark over a duration of approximately 34 minutes, resulting in a total of 125,182 packets