Everything about the datasets and data sources mentioned in the survey paper "Data-driven cyber security incident prediction and discovery".
- Research articles by areas
- Dataset types
  - Organization reports and datasets
  - Executables
  - Network datasets
  - Synthetic datasets
  - Webpage data
  - Social media data
- [1] Proactively predict an organization’s breach incidents: Cloudy with a Chance of Breach: Forecasting Cyber Security Incidents (Usenix, 2015)
- [2] Predict risk distributions of different data breach incidents: Prioritizing security spending: A quantitative analysis of risk distributions for different business profiles (WEIS, 2015)
- [3] Discover previously unknown malware: The dropper effect: Insights into malware distribution with downloader graph analytics (ACM SIGSAC, 2015)
- [4] Predict whether a file is malicious based on the first 5 seconds of execution: Early Stage Malware Prediction Using Recurrent Neural Networks (Computers & Security, 2018)
- [5] Predict the resilience of different software protection transformations against automated attacks: Predicting the Resilience of Obfuscated Code Against Symbolic Execution Attacks via Machine Learning (Usenix, 2017)
- [6] Discover the correlation between network mismanagement and network maliciousness: On the Mismanagement and Maliciousness of Networks (NDSS, 2014)
- [7] Discover black keywords used by the underground economy: How to Learn Klingon without a Dictionary: Detection and Measurement of Black Keywords Used by the Underground Economy (S&P, 2017)
- [8] Predict whether a currently benign website is at high risk of becoming malicious in the future: Automatically detecting vulnerable websites before they turn malicious (Usenix, 2014)
- [9] Predict previously undetected malicious websites by identifying infection campaigns: Delta: automatic identification of unknown web-based infection campaigns (ACM SIGSAC, 2013)
- [10] Exploit Twitter to predict real-world vulnerability exploits: Vulnerability disclosure in the age of social media: Exploiting twitter for predicting real-world exploits (Usenix, 2015)
- [11] Discover and generate Indicators of Compromise (IOCs): Acing the IOC Game: Toward Automatic Discovery and Analysis of Open-Source Cyber Threat Intelligence (ACM SIGSAC, 2016)
- [12] Discover, identify and encode cyberattack events: Crowdsourcing cybersecurity: Cyber attack detection using social media (ACM CIKM, 2017)
- [13] Predict mobile apps' security-related behaviors: AUTOREB: Automatically Understanding the Review-to-Behavior Fidelity in Android Applications (ACM SIGSAC, 2015)
- [14] Predict future structural changes of a network: Modeling dynamic behavior in large evolving graphs (WSDM, 2013)
- [15] Discover security events in a specific event category: Weakly Supervised Extraction of Computer Security Events from Twitter (WWW, 2015)
- [16] Discover vulnerable code: Cross-Project Transfer Representation Learning for Vulnerable Function Discovery (TII, 2018)
- [17] Discover zero-day applications in a traffic classification system: Robust Network Traffic Classification (TON, 2015)
- [18] Discover hidden sensitive operations: Dark Hazard: Learning-based, Large-scale Discovery of Hidden Sensitive Operations in Android Apps (NDSS, 2017)

Organization reports and datasets

Related paper | Dataset | Introduction |
---|---|---|
[1][2] | VERIS community database | The vocabulary for event recording and incident sharing |
[1] | Hackmageddon | Information security timelines and statistics |
[1] | Web Hacking Incidents Database | Records of web hacking incidents |
[3] | VirusTotal | Analyzing suspicious files and URLs to detect types of malware |
[3] | National Software Reference Library (NSRL) | Providing a reference data set (RDS) of benign software |
[3][10] | Symantec’s Worldwide Intelligence Network Environment (WINE) | Security-related data set covering malware, exploited vulnerabilities, and more |
[17] | KEIO, WIDE-08 and WIDE-09 traces | Public traffic data repository |
[10] | ExploitDB | Offensive Security’s Exploit Database Archive |
[10] | Microsoft’s Exploitability Index | Recording exploitability information |

Executables

Related paper | Dataset | Introduction |
---|---|---|
[14][18] | VirusTotal | Provides executable samples, such as Windows 7 executables and Android apps |
[14] | Softonic | App news and reviews, best software downloads and discovery |
[14] | PortableApps | Offering free, commonly used Windows applications that have been specially packaged for portability |
[14] | SourceForge | Open Source applications and software directory |
[18] | Google Play | Official app store for the Android operating system |

Network datasets

Related paper | Dataset | Introduction |
---|---|---|
[6] | Open recursive projects | Open resolvers pose a significant threat to the global network infrastructure by answering recursive queries for hosts outside their domain. They are used in DNS amplification attacks and pose a threat similar to the Smurf attacks commonly seen in the late 1990s. This project collects a list of 32 million resolvers that respond to queries in some fashion. |
[6] | Verisign, Inc. | Verisign, Inc. is an American company based in Reston, Virginia, United States that operates a diverse array of network infrastructure, including two of the Internet's thirteen root nameservers, the authoritative registry for the .com, .net, and .name generic top-level domains and the .cc and .tv country-code top-level domains, and the back-end systems for the .jobs, .gov, and .edu top-level domains. Verisign also offers a range of security services, including managed DNS, distributed denial-of-service (DDoS) attack mitigation and cyber-threat reporting. |
[6] | Alexa Web Information Service | The Alexa Web Information Service (AWIS) offers a platform for creating innovative Web solutions and services based on Alexa's vast information about web sites, accessible with a web services API. |
[6][14] | University of Oregon Route Views Project | The University's Route Views project was originally conceived as a tool for Internet operators to obtain real-time BGP information about the global routing system from the perspectives of several different backbones and locations around the Internet. |
[6] | Spoofer project | The team develops and supports open-source software tools to assess and report on the deployment of source address validation (SAV), the best-practice defense against spoofing. |
[6] | ZMap | ZMap is a modular, open-source network scanner specifically architected to perform Internet-wide scans and capable of surveying the entire IPv4 address space in under 45 minutes from user space on a single machine, approaching the theoretical maximum speed of gigabit Ethernet. |

Synthetic datasets

Related paper | Dataset | Introduction |
---|---|---|
[5] | Synthetic obfuscated C code | Five obfuscating transformations are applied to each of 4,608 synthetic C programs that contain a security check, yielding 23,040 obfuscated C programs in total. |
[14] | Synthetic network graph | A simple graph represented by four main node patterns: “center of a star”, “edge of a star”, “bridge nodes” (connecting stars/cliques), and “clique nodes”. |
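
The four node patterns of the synthetic network graph can be illustrated with a small generator. This is a minimal sketch, not the construction used in [14]: the function name `build_synthetic_graph`, its parameters, the node naming scheme, and the choice to chain components through one bridge node each are all assumptions made for illustration.

```python
from itertools import combinations


def build_synthetic_graph(num_stars=2, leaves_per_star=4,
                          num_cliques=2, clique_size=3):
    """Return (edges, roles) for a toy graph with the four node patterns.

    Hypothetical helper: sizes and wiring are illustrative assumptions.
    """
    edges = set()   # undirected edges stored as (u, v) tuples
    roles = {}      # node name -> pattern label
    anchors = []    # one representative node per star or clique

    # Stars: one center connected to every leaf.
    for s in range(num_stars):
        center = f"star{s}_center"
        roles[center] = "center of a star"
        anchors.append(center)
        for leaf_idx in range(leaves_per_star):
            leaf = f"star{s}_leaf{leaf_idx}"
            roles[leaf] = "edge of a star"
            edges.add((center, leaf))

    # Cliques: every pair of members is connected.
    for c in range(num_cliques):
        members = [f"clique{c}_n{i}" for i in range(clique_size)]
        for member in members:
            roles[member] = "clique node"
        edges.update(combinations(members, 2))
        anchors.append(members[0])

    # Bridge nodes: link consecutive stars/cliques through one extra node.
    for i in range(len(anchors) - 1):
        bridge = f"bridge{i}"
        roles[bridge] = "bridge node"
        edges.add((anchors[i], bridge))
        edges.add((bridge, anchors[i + 1]))

    return edges, roles


edges, roles = build_synthetic_graph()
print(len(roles), "nodes,", len(edges), "edges")
```

Keeping the role label of every node alongside the edge set makes it easy to check whether a graph model recovers the planted patterns.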

Webpage data

Related paper | Dataset | Introduction |
---|---|---|
[7] | SEO, porn and gambling webpages | Webpages marked as “evil” by Baidu |
[8] | Malicious and benign websites | Malicious websites are collected from PhishTank blacklists and the “search-redirection attacks” list; benign websites are gathered from the entire .com zone file and validated against multiple reputation blacklists, including PhishTank blacklists, the “search-redirection attacks” list, DNS-BH, Google SafeBrowsing, and hpHosts blacklists |
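
The benign-set validation described for [8] amounts to keeping only candidate domains that appear on none of the blacklists. The sketch below shows that filtering step under stated assumptions: `select_benign` is a hypothetical helper, the blacklists are assumed to be already downloaded as plain lists of domains, and every domain shown is a made-up placeholder rather than real blacklist content.

```python
def select_benign(candidates, blacklists):
    """Keep candidate domains that appear on none of the given blacklists.

    Hypothetical helper: `blacklists` maps a list name to its domains.
    """
    flagged = set()
    for entries in blacklists.values():
        # Normalize so that case and stray whitespace do not hide a match.
        flagged.update(entry.strip().lower() for entry in entries)

    normalized = {domain.strip().lower() for domain in candidates}
    return sorted(domain for domain in normalized if domain not in flagged)


# Placeholder data standing in for a zone-file sample and blacklist dumps.
candidates = ["alpha-shop.com", "Beta-News.com ", "gamma-pills.com", "delta-login.com"]
blacklists = {
    "PhishTank": ["delta-login.com"],
    "search-redirection attacks": ["gamma-pills.com"],
    "DNS-BH": [],
    "Google SafeBrowsing": ["delta-login.com"],
    "hpHosts": [],
}

benign = select_benign(candidates, blacklists)
print(benign)
```

Taking the union of all lists before filtering means a domain is dropped as soon as any single list flags it, which matches the conservative reading of "validated by multiple reputation blacklists".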

Social media data

Related paper | Dataset | Introduction |
---|---|---|
[10][15] | Tweets crawled from Twitter | Twitter is a social media platform whose content ranges from breaking news and entertainment to sports and politics. |
[13] | User reviews from Google Play | Each review is manually labeled with one or more security-related behaviors (spamming, financial issues, over-privileged permissions, and data leakage) |
[11] | 71,000 articles from leading technical blogs | Technical blogs: (1) Dancho Danchev (2) Naked Security (3) The Hacker News (4) Webroot (5) Threat Post (6) TaoSecurity (7) Sucuri (8) PaloAlto (9) Malwarebytes (10) Hexacorn |