ThreatKG consists of three phases: (1) OSCTI report collection, (2) threat knowledge extraction, and (3) threat knowledge graph construction. Each phase consists of one or several processing steps (e.g., Parser, Extractor).
OSCTI Report Parsing.
Once the crawlers collect the OSCTI reports, the porters group multi-page report files. The parsers are source-dependent; each parser parses the specific layout structure of the corresponding OSCTI source and converts the report files into unified threat knowledge representations (UTKRs).
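As a rough illustration of source-dependent parsing into a common representation, the sketch below defines a hypothetical `UTKR` record and two toy parsers. The field names, the two input layouts, and the class names are assumptions made for this example and are not the actual ThreatKG schema.

```python
from dataclasses import dataclass, field


@dataclass
class UTKR:
    """Hypothetical unified representation; field names are illustrative only."""
    source: str
    title: str
    text: str
    entities: list = field(default_factory=list)
    relations: list = field(default_factory=list)


class BlogParser:
    """Assumes a made-up layout: the first line is the title, the rest is the body."""
    source = "example-blog"

    def parse(self, raw: str) -> UTKR:
        title, _, body = raw.partition("\n")
        return UTKR(source=self.source, title=title.strip(), text=body.strip())


class AdvisoryParser:
    """Assumes a made-up 'Title: ...' / 'Body: ...' key-value layout."""
    source = "example-advisory"

    def parse(self, raw: str) -> UTKR:
        fields = {}
        for line in raw.splitlines():
            key, sep, value = line.partition(":")
            if sep:  # keep only lines that actually contain a key-value separator
                fields[key.strip().lower()] = value.strip()
        return UTKR(
            source=self.source,
            title=fields.get("title", ""),
            text=fields.get("body", ""),
        )
```

Both parsers return the same record type, so later steps can consume their output without knowing which source a report came from.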
Threat Relevance Checking.
ThreatKG employs a set of checkers that operate on the UTKRs produced by the parsers and filter out reports that are irrelevant to cyber threats. The filtered UTKRs are then passed to the extractors for further enrichment.
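The text does not specify how the checkers decide relevance. The sketch below assumes the simplest possible heuristic, a keyword count with a threshold, purely to show where such a filter sits in the pipeline. The keyword set, the `min_hits` parameter, and the dictionary-based report format are all hypothetical.

```python
# Illustrative keyword list; a real checker would likely use a richer signal.
THREAT_KEYWORDS = {"malware", "ransomware", "phishing", "exploit", "vulnerability", "trojan"}


def is_threat_relevant(text, min_hits=2):
    """Return True if at least `min_hits` distinct threat keywords occur in the text."""
    words = {w.strip(".,;:!?()\"'").lower() for w in text.split()}
    return len(words & THREAT_KEYWORDS) >= min_hits


def filter_reports(reports, min_hits=2):
    """Keep only the reports whose text passes the relevance check."""
    return [r for r in reports if is_threat_relevant(r["text"], min_hits)]
```

For example, a report stating that "the ransomware spreads via phishing emails" contains two keywords and passes, while a product announcement with no keywords is dropped.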
Threat Knowledge Entity Extraction.
We construct a set of regex rules to extract indicators of compromise (IOCs). For other types of entities, ThreatKG employs a deep learning (DL)-based extractor that performs neural named entity recognition (NER).
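A minimal sketch of regex-based IOC extraction follows. The four patterns (IPv4 address, MD5, SHA-256, CVE identifier) are common examples chosen for illustration; they are assumptions and not the rule set used by ThreatKG.

```python
import re

# Illustrative patterns only; a production rule set would cover many more IOC types.
OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
IOC_PATTERNS = {
    "ipv4": re.compile(r"\b(?:" + OCTET + r"\.){3}" + OCTET + r"\b"),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "cve": re.compile(r"\bCVE-\d{4}-\d{4,}\b"),
}


def extract_iocs(text):
    """Return a dict mapping each IOC type to its sorted, de-duplicated matches."""
    return {kind: sorted(set(p.findall(text))) for kind, p in IOC_PATTERNS.items()}
```

Calling `extract_iocs` on a sentence such as "The sample contacts 192.168.1.10 and exploits CVE-2017-0144." yields the IP address under `"ipv4"` and the CVE identifier under `"cve"`, with empty lists for the hash types.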
Threat Knowledge Relation Extraction.
ThreatKG uses dependency parsing-based RE and neural RE to extract relations that capture both low-level threat behaviors and high-level threat contexts.
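To make the dependency parsing-based idea concrete, the sketch below extracts subject-verb-object triples from a sentence whose dependency tree is already given. The `Token` structure, the hand-written parse, and the use of `nsubj`/`dobj` labels are assumptions for illustration; in practice the tree would come from a dependency parser, and the actual extraction rules are not described in the text.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    head: int  # index of the head token; the root points to itself
    dep: str   # dependency label, e.g. "nsubj", "dobj", "ROOT"


def extract_svo(tokens):
    """Emit (subject, verb, object) triples for every token that governs
    both an 'nsubj' child and a 'dobj' child."""
    triples = []
    for v, verb in enumerate(tokens):
        subjects = [t.text for t in tokens if t.head == v and t.dep == "nsubj"]
        objects = [t.text for t in tokens if t.head == v and t.dep == "dobj"]
        triples.extend((s, verb.text, o) for s in subjects for o in objects)
    return triples


# Hand-written parse of "Emotet drops payload.exe" (not produced by a real parser).
sentence = [
    Token("Emotet", head=1, dep="nsubj"),
    Token("drops", head=1, dep="ROOT"),
    Token("payload.exe", head=1, dep="dobj"),
]
```

Here `extract_svo(sentence)` returns the single triple `("Emotet", "drops", "payload.exe")`, a low-level behavior relation between a malware entity and a file.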
Data Programming.
We leverage data programming, which programmatically synthesizes annotations via unsupervised modeling of sources of weak supervision.
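The sketch below illustrates the labeling-function side of data programming. Note the substitution: data programming combines the noisy votes with a learned, unsupervised label model, whereas this example uses a plain majority vote as a simpler stand-in. The labeling functions, label names, and the small gazetteer are hypothetical.

```python
from collections import Counter

ABSTAIN = None
KNOWN_MALWARE = {"emotet", "wannacry"}  # toy gazetteer, for illustration only


def lf_known_malware(token):
    return "MALWARE" if token.lower() in KNOWN_MALWARE else ABSTAIN


def lf_locker_suffix(token):
    return "MALWARE" if token.lower().endswith("locker") else ABSTAIN


def lf_exe_suffix(token):
    return "FILE" if token.lower().endswith(".exe") else ABSTAIN


def lf_path_separator(token):
    return "FILE" if "\\" in token or "/" in token else ABSTAIN


LABELING_FUNCTIONS = [lf_known_malware, lf_locker_suffix, lf_exe_suffix, lf_path_separator]


def weak_label(token, lfs=LABELING_FUNCTIONS):
    """Combine labeling-function votes by majority; abstain if no function fires.
    Ties resolve to the label that was voted first."""
    votes = [lf(token) for lf in lfs]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]
```

Each labeling function encodes one weak heuristic and may abstain, so annotations for training data can be synthesized without hand-labeling every token.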
Threat Knowledge Graph Construction.
ThreatKG constructs the threat knowledge graph from the UTKRs and stores it in the backend database for persistence.
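A minimal in-memory sketch of graph construction is shown below. It assumes each UTKR is a dictionary with `entities` and `relations` lists and that entities are merged by type and case-insensitive name; both the input format and the merge rule are assumptions for this example. Persisting the result to the backend database is left out.

```python
class ThreatGraph:
    """Toy graph: nodes keyed by (type, normalized name), edges stored as a set."""

    def __init__(self):
        self.nodes = {}     # (type, normalized name) -> node properties
        self.edges = set()  # (source key, relation, target key)

    def add_entity(self, etype, name):
        key = (etype, name.strip().lower())
        # setdefault keeps the first spelling seen and avoids duplicate nodes.
        self.nodes.setdefault(key, {"type": etype, "name": name.strip()})
        return key

    def add_relation(self, src, relation, dst):
        self.edges.add((self.add_entity(*src), relation, self.add_entity(*dst)))


def build_graph(utkrs):
    """Merge entities and relations from all UTKRs into one graph."""
    graph = ThreatGraph()
    for utkr in utkrs:
        for etype, name in utkr["entities"]:
            graph.add_entity(etype, name)
        for src, relation, dst in utkr["relations"]:
            graph.add_relation(src, relation, dst)
    return graph
```

Because nodes and edges are keyed on normalized values, the same malware mentioned in two reports with different capitalization ends up as a single node.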
Scalability and Extensibility.
To make the system scalable, we parallelize the system components for the processing steps (e.g., crawlers, parsers, checkers, extractors). To make the system extensible, we adopt a modular design, allowing multiple system components in the same processing step to work together with the same input/output interface.
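The following sketch shows one way the two design goals could fit together, under stated assumptions: every component in a step is a callable that takes a report dictionary and returns a report dictionary (the shared interface), and items are processed in parallel with a thread pool. The function names and the choice of `ThreadPoolExecutor` are illustrative and not taken from the ThreatKG implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def run_step(components, items, max_workers=4):
    """Apply all components of one processing step to each item.
    Items are processed in parallel; results keep the input order."""
    def process(item):
        for component in components:
            item = component(item)
        return item

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process, items))


# Two hypothetical components sharing the same dict-in, dict-out interface.
def tag_cves(report):
    cves = [w for w in report["text"].split() if w.startswith("CVE-")]
    return {**report, "cves": cves}


def tag_length(report):
    return {**report, "n_words": len(report["text"].split())}
```

Adding a new component to a step then amounts to appending another callable with the same signature to the `components` list.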
Continuous Knowledge Integration.
To provide the latest threat knowledge in a timely manner, ThreatKG is fully automated and runs continuously: new reports are collected, and new knowledge is extracted and integrated into the threat knowledge graph.
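One round of such continuous integration can be sketched as an incremental pass that processes only reports not seen before. The `fetch` and `process` callables, the `id` field, and the in-memory `seen_ids` set are assumptions for illustration; a scheduler would invoke this function repeatedly.

```python
def integrate_new_reports(fetch, process, seen_ids):
    """Run one integration round: process each unseen report exactly once.
    Returns the number of newly integrated reports."""
    new_count = 0
    for report in fetch():
        if report["id"] in seen_ids:
            continue  # already integrated in an earlier round
        process(report)
        seen_ids.add(report["id"])
        new_count += 1
    return new_count
```

Tracking the identifiers of integrated reports keeps repeated rounds idempotent, so re-crawling a source does not duplicate knowledge in the graph.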
To facilitate threat search and knowledge graph exploration, we built a web GUI using React and Elasticsearch. The GUI interacts with the Neo4j database and provides various types of interactivity.
Virginia Tech
UC Berkeley
UC Berkeley
UC Berkeley
Virginia Tech
Virginia Tech
SJTU
UC Berkeley