Acing the IOC Game: Toward Automatic Discovery and Analysis of Open-Source Cyber Threat Intelligence

Abstract

To adapt to the rapidly evolving landscape of cyber threats, secu- rity professionals are actively exchanging Indicators of Compro- mise (IOC) (e.g., malware signatures, botnet IPs) through public sources (e.g. blogs, forums, tweets, etc.). Such information, of- ten presented in articles, posts, white papers etc., can be converted into a machine-readable OpenIOC format for automatic analysis and quick deployment to various security mechanisms like an in- trusion detection system. With hundreds of thousands of sources in the wild, the IOC data are produced at a high volume and veloc- ity today, which becomes increasingly hard to manage by humans. Efforts to automatically gather such information from unstructured text, however, is impeded by the limitations of today’s Natural Lan- guage Processing (NLP) techniques, which cannot meet the high standard (in terms of accuracy and coverage) expected from the IOCs that could serve as direct input to a defense system. In this paper, we present iACE, an innovation solution for fully automated IOC extraction. Our approach is based on the obser- vation that the IOCs in technical articles are often described in a predictable way: being connected to a set of context terms (e.g., “download”) through stable grammatical relations. Leveraging this observation, iACE is designed to automatically locate a putative IOC token (e.g., a zip file) and its context (e.g., “malware”, “down- load”) within the sentences in a technical article, and further an- alyze their relations through a novel application of graph mining techniques. Once the grammatical connection between the tokens is found to be in line with the way that the IOC is commonly pre- sented, these tokens are extracted to generate an OpenIOC item that describes not only the indicator (e.g., a malicious zip file) but also its context (e.g., download from an external source). Running on 71,000 articles collected from 45 leading technical blogs, this new approach demonstrates a remarkable performance: it gener- ated 900K OpenIOC items with a precision of 95% and a coverage over 90%, which is way beyond what the state-of-the-art NLP tech- nique and industry IOC tool can achieve, at a speed of thousands of articles per hour. Further, by correlating the IOCs mined from the articles published over a 13-year span, our study sheds new light on the links across hundreds of seemingly unrelated attack instances, particularly their shared infrastructure resources, as well as the im- pacts of such open-source threat intelligence on security protection and evolution of attack strategies.

Publication
The 23rd ACM Conference on Computer and Communications Security (CCS)
Date