December 8, 2016 • Staffan Truvé
The ever-increasing scale and complexity of cyber threats is bringing us to a point where human threat analysts are approaching the limit of what they can handle. We believe the next-generation of cyber threats must be tackled by a combination of machines equipped with artificial intelligence (AI) and human analysts — what we call centaur threat analysts.
One example of this is presented here: a new approach to forecasting malicious IP infrastructure by using machine learning.
Cyber defenders would benefit tremendously from being informed of threats ahead of time, instead of just reacting to identified, ongoing, or already completed attacks. Recorded Future has developed a method for predicting tomorrow’s threats. Such anticipatory intelligence is produced by continuously training a machine-learning model with data sets derived from both technical intelligence (threat lists, etc.) and context from open source intelligence (OSINT).
All threat actors, state-run intelligence agencies to cyber criminals, need to manage their resources and re-use assets (e.g., technical infrastructure and associated IP address ranges). This “business logic” is the underlying reason that predicting future maliciousness is possible, based on large-scale observations of historic and current activity.
The following Hilbert curve visualizations show 2,856 IPv4 addresses with a Recorded Future risk score greater than 70. The mapping keeps addresses close to each other in IPv4 address space close in the 2D visualization. Note the clusters indicating IP address blocks representing infrastructure controlled by threat actors (see picture below for detail).
In recent years, humanity’s ability to make accurate predictions has improved in many fields thanks to a combination of augmented sensor capabilities and new prediction algorithms.
As an example, today’s weather forecasts benefit both from improved sensing by weather satellites and from new algorithms run on powerful parallel computers. In a similar way, web intelligence provides new sensing capabilities which can be combined with novel algorithms to predict future cyber threats.
To identify future malicious activity, we trained a Support Vector Machine model using historic threat lists combined with contextual OSINT data. The resulting model takes into consideration not only CIDR neighborhoods, but also the context in which the neighbors of an IP address are being discussed (e.g., identified as command and control servers or associated with known threat actors, geographies, or malware — some 50 factors in total).
The output from the SVM model is used to assign a predictive risk score to IP addresses, and this score can then be used to alert security operations center (SOC) operators or threat analysts, or even to automatically block the address in firewalls and other network security systems.
The out-of-sample results are very promising so far.
As a concrete example, IP address 88[.]249[.]184[.]71 was flagged by our predictive risk scoring on October 4, 2016 (using a 0.75 threshold), and not until October 14 did it first occur on any threat list, identified as a DarkComet RAT controller:
In a quantitative study, over 25% of 500 previously unseen IPs with a predictive risk score turned malicious (as reported by OSINT) within seven days of being flagged by our algorithms.
In comparison, out of the entire IPv4 address space, typically less than 0.02% of all addresses at any given time have a Recorded Future risk score.
In a larger study, we trained an SVM model on historic data on a Wednesday morning and applied it to the entire IPv4 address space. Of the IPs with a predictive risk score, 327,549 occurred on any external threat lists in the following seven days. Out of these, 185,209 had not appeared on any threat lists in the past three weeks (21 days total). Our model predicted 242,206 (74%) of future threat-listed IP addresses while maintaining greater than 99% precision on a balanced test set.
In general, IP addresses are being flagged as potentially malicious by our predictive risk score 3-5 days before actually occurring on threat lists — giving the defenders the heads-up needed to prepare their defenses ahead of time, instead of just being reactive.
Predictive risk scoring of IP addresses is now available in Recorded Future Intel Cards, and through our API, allowing for easy integration with other tools and resources like security information and event management (SIEM) systems. This is the first of what we anticipate to be a series of novel ways of applying artificial intelligence to improve cyber security.