ThreatKG: A Threat Knowledge Graph for Automated Open-Source Threat Intelligence Gathering and Management

ThreatKG is a system for automated open-source cyber threat intelligence (OSCTI) gathering and management. ThreatKG automatically collects a large number of OSCTI reports from a wide range of sources, uses a combination of ML and NLP techniques to extract high-fidelity threat knowledge, constructs a threat knowledge graph, and keeps the graph up to date by continuously ingesting new knowledge.


Overview

ThreatKG consists of three phases: (1) OSCTI report collection, (2) threat knowledge extraction, and (3) threat knowledge graph construction. Each phase comprises one or more processing steps (e.g., Parser, Extractor).

ThreatKG Architecture
OSCTI Report Parsing and Threat Relevance Checking

OSCTI Report Parsing. Once the crawlers collect the OSCTI reports, the porters group multi-page report files. The parsers are source-dependent; each parser parses the specific layout structure of the corresponding OSCTI source and converts the report files into unified threat knowledge representations (UTKRs).
Threat Relevance Checking. ThreatKG employs a set of checkers that operate on the UTKRs produced by the parsers and filter out reports that are irrelevant to cyber threats. The filtered UTKRs are then passed to the extractors for further enrichment.
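The parsing and checking steps above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `UTKR` fields, the keyword list, and the rule that a report must pass every checker are hypothetical and are not taken from the ThreatKG implementation.

```python
import re
from dataclasses import dataclass, field


@dataclass
class UTKR:
    """Hypothetical unified representation of one parsed report."""
    source: str
    title: str
    text: str
    entities: list = field(default_factory=list)   # filled in later by extractors
    relations: list = field(default_factory=list)  # filled in later by extractors


# Assumed keyword list; a real checker could use a trained classifier instead.
THREAT_KEYWORDS = {"malware", "ransomware", "exploit", "phishing",
                   "vulnerability", "backdoor"}


def keyword_checker(report: UTKR, min_hits: int = 1) -> bool:
    """Return True if the report mentions at least `min_hits` threat keywords."""
    words = set(re.findall(r"[a-z]+", f"{report.title} {report.text}".lower()))
    return len(words & THREAT_KEYWORDS) >= min_hits


def filter_reports(reports, checkers):
    """Keep only reports accepted by every checker (an assumed policy)."""
    return [r for r in reports if all(check(r) for check in checkers)]
```

Because each checker is just a function from a UTKR to a boolean, further checkers can be appended to the list passed to `filter_reports` without changing the rest of the sketch.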

Threat Knowledge Extraction

Threat Knowledge Entity Extraction. We construct a set of regex rules to extract indicators of compromise (IOCs). For other types of entities, ThreatKG employs a DL-based extractor to perform neural named entity recognition (NER).
Threat Knowledge Relation Extraction. Dependency parsing-based relation extraction (RE) and neural RE are used to extract relations that capture both low-level threat behaviors and high-level threat contexts.
Data Programming. We leverage data programming, which programmatically synthesizes annotations via unsupervised modeling of sources of weak supervision.
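The regex-based IOC extraction mentioned above can be sketched as follows. The patterns and the `extract_iocs` helper are illustrative assumptions; they cover only a few IOC types and are not the rule set used by ThreatKG.

```python
import re

# Assumed, deliberately small rule set; a production rule set would cover
# more IOC types (domains, URLs, file paths, registry keys, ...).
_OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
IOC_PATTERNS = {
    "ipv4": re.compile(rf"\b(?:{_OCTET}\.){{3}}{_OCTET}\b"),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "cve": re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE),
}


def extract_iocs(text: str) -> dict:
    """Return a mapping from IOC type to the sorted unique matches in `text`."""
    found = {}
    for name, pattern in IOC_PATTERNS.items():
        matches = sorted(set(pattern.findall(text)))
        if matches:
            found[name] = matches
    return found
```

Word boundaries keep the MD5 rule from matching the first 32 characters of a longer SHA-256 digest, which is one reason to prefer anchored patterns over bare character-class repetitions.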

Scalable and Extensible System Architecture

Threat Knowledge Graph Construction. ThreatKG constructs the threat knowledge graph from the UTKRs and stores it in the backend database for persistence.
Scalability and Extensibility. To make the system scalable, we parallelize the system components for the processing steps (e.g., crawlers, parsers, checkers, extractors). To make the system extensible, we adopt a modular design, allowing multiple system components in the same processing step to work together with the same input/output interface.
Continuous Knowledge Integration. To provide the latest threat knowledge in a timely manner, ThreatKG is fully automated and runs continuously: new reports are collected, and new knowledge is extracted and integrated into the threat knowledge graph.
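The modular, parallel design described above can be sketched as a processing step whose components all share one input/output interface. This is only an illustration under stated assumptions: the `Component` type, the `run_step` helper, the two toy components, and the use of a thread pool are hypothetical choices, not the mechanism ThreatKG actually uses.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Hypothetical shared interface: every component maps a record to a new record.
Component = Callable[[Dict], Dict]


def run_step(components: List[Component], records: List[Dict],
             workers: int = 4) -> List[Dict]:
    """Apply all components of one step to each record, records in parallel."""
    def apply_all(record: Dict) -> Dict:
        for component in components:
            record = component(record)
        return record

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input records.
        return list(pool.map(apply_all, records))


def cve_tagger(record: Dict) -> Dict:
    """Toy extractor: collect whitespace-separated tokens starting with 'CVE-'."""
    out = dict(record)
    hits = [w for w in record["text"].split() if w.startswith("CVE-")]
    out["entities"] = out.get("entities", []) + hits
    return out


def token_counter(record: Dict) -> Dict:
    """Toy enricher: attach a whitespace token count."""
    out = dict(record)
    out["n_tokens"] = len(record["text"].split())
    return out
```

Because both toy components accept and return the same record shape, adding another extractor to a step amounts to appending one more function to the `components` list.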

Frontend Web GUI
ThreatKG UI

To facilitate threat search and knowledge graph exploration, we built a web GUI using React and Elasticsearch. The GUI interacts with the Neo4j database and provides various types of interactivity.

Publications

ThreatKG Paper

A System for Automated Open-Source Threat Intelligence Gathering and Management

Peng Gao, Xiaoyuan Liu, Edward Choi, Bhavna Soman, Chinmaya Mishra, Kate Farris, Dawn Song

SIGMOD 2021 Demo. (Virtual) Xi'an, Shaanxi, China (June 20 - June 25, 2021).

ThreatKG Full Paper

ThreatKG: A Threat Knowledge Graph for Automated Open-Source Cyber Threat Intelligence Gathering and Management

Peng Gao, Xiaoyuan Liu, Edward Choi, Sibo Ma, Xinyu Yang, Zhengjie Ji, Zilin Zhang, Dawn Song

arXiv. 2022.

People

Peng Gao

Virginia Tech

Xiaoyuan Liu

UC Berkeley

Edward Choi

UC Berkeley

Sibo Ma

UC Berkeley

Xinyu Yang

Virginia Tech

Zhengjie Ji

Virginia Tech

Dawn Song

UC Berkeley