Building Threat Analyst Centaurs Using Artificial Intelligence
January 14, 2016 • Staffan Truvé
In chess, a “centaur” is a human and a computer playing together as a team to take advantage of their complementary strengths: the speed and storage capacity of the machine and the creativity and strategic eye of the human. In the nearly two decades since IBM’s Deep Blue computer defeated Garry Kasparov, the reigning world chess champion, in 1997, the combination of human and computer has produced the strongest chess players of all, far surpassing top human players and beating even the best chess computers.
Some claim that computers have become fast enough that the efficiency afforded by human collaboration on a centaur team may no longer make a difference in chess, but the space of possible actions for threat actors is far more expansive than a chess game and there is little tolerance for unexpected blind spots. At Recorded Future we look to the centaur model to create the best possible threat analysts, combining the speed and depth of artificial intelligence with the strategic vision of a human expert.
What Is Artificial Intelligence, Really?
Webster’s Dictionary defines artificial intelligence (AI) as “an area of computer science that deals with giving machines the ability to seem like they have human intelligence.” Although that definition is fairly vague, it captures the difficulty of defining AI quite nicely.
AI has been applied to a variety of problem domains, such as natural language processing, robotic navigation, and computer vision, and relies on a number of underlying technologies, such as rule-based systems, logic, neural networks, and statistical methods. As in other fields, fashion influences which techniques are preferred, as exemplified by the recent hype around deep learning and recurrent neural networks. Our experience is that a mix of different techniques is needed to tackle a complex problem domain.
In the end, as with all other computer systems, it boils down to two things: data structures and algorithms. Whether what you build using those components is “AI” or not really depends on whether a human observer believes the system behaves “intelligently” in its target application domain.
Unlike older “AI systems” that primarily performed one task using one AI technology (such as Deep Blue playing chess, or MYCIN and Dendral giving expert advice in very narrow medical and chemical domains, respectively), today’s systems must use AI technology in many different places to automate or facilitate the tasks at hand. A successful use of AI technology today is to a large extent invisible, as in Google Photos, where images are categorized, augmented, and merged into stories using technologies the end user does not and need not be aware of. Another example is Siri, which of course does speech recognition, but also uses other underlying systems, like Wolfram|Alpha, to answer questions in different domains.
It is also worth emphasizing that building an AI-based product is in almost all cases a systems engineering challenge, requiring not only a few clever algorithms but also a massive investment in supporting technologies like scalable computing infrastructure, monitoring systems, quality control, and data curation. These more mundane aspects may not be immediately visible to an end user, but they are essential for a working solution.
The reason AI has become such a focal point of attention for both researchers and entrepreneurs during the last few years is that several factors are contributing to a “perfect storm”:
- Never before has so much information been available in digital form, ready for use. All of humanity is, on a daily basis, providing more information about the world for machines to analyze. Not only that — through crowdsourcing and online communities we are also able to give feedback on the quality of the machines’ work on an unprecedented scale.
- Computing power and storage capacity continue to grow exponentially, and the cost of accessing these resources in the cloud is continuously decreasing. Incredible resources are now available not only to the world’s largest corporations but to garage startups as well.
- Research in algorithms has taken huge strides in giving us the ability to use these new computing resources on the massive data sets now available.
Even though we have not yet reached Ray Kurzweil’s singularity, we are living in a time when the limit of what machines can accomplish is being pushed every day.
Artificial Intelligence at Recorded Future
At Recorded Future, we use what’s usually referred to as AI techniques in four major ways:
- For representation of structured knowledge of the world, using ontologies and events.1
- For transforming unstructured text into a language-independent, structured representation, using natural language processing.
- For classifying events and entities, primarily to help decide if they are important enough to require a human analyst to perform a deeper investigation.
- To forecast events and entity properties by building predictive models from historic data.
We use a combination of rule-based, statistical, and machine-learning techniques to deliver these capabilities, as described below.
1. Knowledge Representation
At the heart of Recorded Future is a structured representation of the world, separated into two parts: ontologies and events.
Our ontologies are used to represent entities such as persons, organizations, places, technologies, and products. The ontologies also contain information about relationships between these entities, such as hierarchies (“X is part of Y” — e.g., “Paris” is a City in a Country called “France,” and “Zeus” is a Malware and a member of Categories “Botnet” and “Banking Trojan”). These ontologies provide, among other things, a powerful way of searching over categories, like “find all ‘Heads of State’ who traveled to ‘Africa’ in 2015,” where “Heads of State” will mean all Person entities with that attribute, and “Africa” captures all geographic entities (Country, City, etc.) on that Continent.
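As a rough illustration of how such a category search could work, the sketch below models a tiny ontology with typed entities, “part of” links, and category membership. The class name, method names, and sample data are assumptions made for this example and do not describe Recorded Future’s actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """Toy ontology: typed entities, part-of links, and category membership."""
    types: dict = field(default_factory=dict)       # entity -> type, e.g. "Paris" -> "City"
    part_of: dict = field(default_factory=dict)     # entity -> parent entity
    categories: dict = field(default_factory=dict)  # entity -> set of category names

    def add(self, name, type_, parent=None, categories=()):
        self.types[name] = type_
        if parent is not None:
            self.part_of[name] = parent
        self.categories[name] = set(categories)

    def is_within(self, name, ancestor):
        """True if `name` is `ancestor` or lies below it in the part-of hierarchy."""
        seen = set()
        while name is not None and name not in seen:
            if name == ancestor:
                return True
            seen.add(name)
            name = self.part_of.get(name)
        return False

    def entities_within(self, ancestor):
        """All entities at or below `ancestor`, sorted by name."""
        return sorted(n for n in self.types if self.is_within(n, ancestor))

    def members_of(self, category):
        """All entities tagged with `category`, sorted by name."""
        return sorted(n for n, cats in self.categories.items() if category in cats)


onto = Ontology()
onto.add("Europe", "Continent")
onto.add("Africa", "Continent")
onto.add("France", "Country", parent="Europe")
onto.add("Paris", "City", parent="France")
onto.add("Kenya", "Country", parent="Africa")
onto.add("Nairobi", "City", parent="Kenya")
onto.add("Zeus", "Malware", categories={"Botnet", "Banking Trojan"})

print(onto.entities_within("Africa"))  # ['Africa', 'Kenya', 'Nairobi']
print(onto.members_of("Botnet"))       # ['Zeus']
```

A query such as “traveled to ‘Africa’” can then be expanded into the set of entities returned by `entities_within("Africa")` before matching against events.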
Recorded Future events are used to represent real-world events in a language-independent, structured way. They range from person- and corporate-related to geopolitical and environmental events. Event detectors abstract away from the exact wording used to describe an event. For example, “John flew to Paris,” “John visited Paris,” “John took a trip to Paris,” “Джон прилетел в Париж,” and “John a visité Paris” are all different ways of expressing the same event: a PersonTravel event where “John” is the traveler and “Paris” is the destination. This is the core idea behind events: By being able to search for an event instead of just using keywords, the analyst can focus on abstract concepts and not the many ways and different languages in which an author might talk about them.
Each event type has a set of (sometimes optional) named attributes. For example, a CyberAttack event relates an attacker to a target, and includes additional information about the attack method used and related hacktivist operation hashtags. At least an attacker or a target must be specified; the rest is optional. Multiple mentions of an event are grouped together to simplify analysis, even if the original text is in different languages or uses different words for the attack.
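The structure described above can be sketched as a small record type. This is a minimal illustration only: the field names, the `group_mentions` helper, and the sample texts (with the made-up names “ExampleGroup” and “ExampleBank”) are assumptions, not the product’s schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CyberAttack:
    """Hypothetical structured form of a CyberAttack event."""
    attacker: Optional[str] = None
    target: Optional[str] = None
    method: Optional[str] = None
    hashtag: Optional[str] = None

    def __post_init__(self):
        # At least an attacker or a target must be specified.
        if self.attacker is None and self.target is None:
            raise ValueError("a CyberAttack needs at least an attacker or a target")


def group_mentions(mentions):
    """Group (event, source_text) pairs so each distinct event lists all its mentions."""
    grouped = defaultdict(list)
    for event, text in mentions:
        grouped[event].append(text)
    return dict(grouped)


# Two differently worded mentions (English and French) of the same event.
event = CyberAttack(attacker="ExampleGroup", target="ExampleBank")
grouped = group_mentions([
    (event, "ExampleGroup hacked ExampleBank"),
    (CyberAttack(attacker="ExampleGroup", target="ExampleBank"),
     "ExampleGroup a piraté ExampleBank"),
])
print(len(grouped))  # 1
```

Because the record is frozen and compared by value, two mentions that resolve to the same attributes fall into the same group regardless of the wording or language of the source text.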
2. Natural Language Processing
Natural language processing (NLP) transforms an unstructured, natural language text into a structured, language-independent representation. In our system, this means identifying entities, events, and time associated with those events. There are several steps in this, using different AI techniques:
- To extract the relevant text from an HTML Web page, we have developed a machine-learning-based module using Gibbs sampling to extract the actual content (e.g., to decide what text should be ignored, such as advertising).
- We use supervised machine-learning algorithms to classify texts, such as determining in which language a text is written or if a text is prose, a data log, or programming code.
- Supervised machine learning (e.g., using Conditional Random Fields) is also used to do named entity extraction from text.
- We are also using machine-learning classifiers to automatically disambiguate between different entities that have the same name (e.g., “Zeus” the Greek god vs. “Zeus” the malware) based on the context in which they are mentioned.
- For analyzing the structure of sentences, we use a data-driven dependency parser, MaltParser. This is an implementation of inductive dependency parsing, where the syntactic analysis of a sentence amounts to the derivation of a dependency structure, and where inductive machine learning is used to guide the parser.
- For extracting events and temporal information we have developed two distinct proprietary rule-based systems.
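To make the disambiguation step in the list above more concrete, here is a minimal sketch of a context-based classifier that separates “Zeus” the malware from “Zeus” the Greek god. The text does not say which algorithm is used for this; multinomial Naive Bayes over a bag of words is an assumption chosen for brevity, and the four training sentences are invented for the example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words context, with add-one smoothing."""

    def fit(self, examples):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter()
        self.vocab = set()
        for text, label in examples:
            words = text.lower().split()
            self.label_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for word in text.lower().split():
                if word in self.vocab:   # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


training = [
    ("zeus botnet steals banking credentials", "malware"),
    ("new zeus variant infects windows hosts", "malware"),
    ("zeus ruled the gods from mount olympus", "god"),
    ("the temple of zeus in ancient greece", "god"),
]
model = NaiveBayes().fit(training)

print(model.predict("zeus trojan infects banking customers"))  # malware
print(model.predict("statue of zeus in the temple"))           # god
```

A production system would need far more annotated data and richer features than this, which is the annotation cost the hybrid approach is meant to contain.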
The combination of statistical/machine-learning components and rule-based components has allowed us to build a system that maximizes precision and recall, given a certain development budget. In the future we foresee more components being based on machine learning, but the significant costs associated with producing the annotated data needed for training such components have so far motivated the hybrid approach.
The use of NLP has allowed us to build a system capable of analyzing millions of documents per day, in seven different languages (English, French, Spanish, Russian, Farsi, Arabic, and Chinese), and to transform that into a representation that gives an analyst access to this information, independent of language skills. This use of AI thus addresses two of the major challenges an analyst faces: the need to find information written in several languages, and the capacity to read and organize the massive amounts of security-related information being published every day.
3. Event and Entity Classification
The third area where AI techniques are used is for classification of entities and events.
We classify the importance of events to help analysts prioritize what they should focus their attention on, and we classify the maliciousness of technical entities like IP addresses both to support analysts and to enable the automatic configuration of network equipment and network management systems.
Event classification is done using statistical methods to detect anomalies (e.g., an unusual number of event references related to a certain cyber attacker or target).
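One simple way to detect such an anomaly is to compare today’s reference count with the recent history and flag large deviations. The sketch below uses a z-score test; the threshold of three standard deviations and the sample counts are assumptions for illustration, and the text does not specify which statistical method is actually used.

```python
from statistics import mean, stdev


def is_anomalous(history, today, threshold=3.0):
    """Flag `today` if it lies more than `threshold` sample standard
    deviations above the mean of `history` (needs at least two values)."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat history: any increase counts as unusual.
        return today > mu
    return (today - mu) / sigma > threshold


# Made-up daily counts of event references for one attacker or target.
daily_references = [4, 6, 5, 7, 5, 6, 4]

print(is_anomalous(daily_references, 5))   # False
print(is_anomalous(daily_references, 40))  # True
```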
Entity classification is used to decide the maliciousness of an entity — an IP address, for example — and to assign a Risk Score that can be used by an analyst or an automated system to decide how to act. Risk Scores are assigned using two different systems, one rule based and one machine-learning based. The rule-based system is based on human intuition about which contexts, sources, co-occurrences, threat list mentions, etc. are useful in deciding if an entity is associated with some kind of risk. The machine-learning-based classifier, on the other hand, has been trained on a large data set, using trusted threat list sources as ground truth for what constitutes a malicious entity.
One important aspect of both event and entity classifiers is that they must provide not only a judgement (“this event is critical” or “this IP address is malicious”), but also a human-readable motivation for that judgement (“this IP address is considered malicious because it has been called out as a malware Command & Control center by three independent sources”).
Rule-based systems can fairly easily generate motivations like these, but it is equally important in machine-learning systems to be able to generate motivations, at least in terms of which significant features of an entity contributed the most to a judgement stating that it has a certain property.
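The pairing of a judgement with its motivation can be sketched for the rule-based case as follows. The two rules, their point values, the 0 to 100 scale, and the entity field names are all hypothetical; the IP address comes from a range reserved for documentation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    points: int                      # contribution to the score when the rule fires
    applies: Callable[[dict], bool]  # does the rule fire for this entity?
    explain: Callable[[dict], str]   # human-readable reason when it fires


RULES = [
    Rule(
        points=60,
        applies=lambda e: len(e.get("c2_sources", [])) >= 3,
        explain=lambda e: "called out as a malware Command & Control center by "
                          f"{len(e['c2_sources'])} independent sources",
    ),
    Rule(
        points=25,
        applies=lambda e: e.get("threat_list_mentions", 0) > 0,
        explain=lambda e: f"appears on {e['threat_list_mentions']} threat list(s)",
    ),
]


def score(entity):
    """Return (risk_score, reasons): capped sum of fired rule points, one reason per rule."""
    fired = [r for r in RULES if r.applies(entity)]
    risk = min(100, sum(r.points for r in fired))
    return risk, [r.explain(entity) for r in fired]


entity = {
    "ip": "203.0.113.7",
    "c2_sources": ["source-a", "source-b", "source-c"],
    "threat_list_mentions": 2,
}
risk, reasons = score(entity)
print(risk)  # 85
for reason in reasons:
    print("-", reason)
```

Keeping the explanation next to the rule that produces it means every point in the score can be traced back to a stated reason, which is harder to guarantee for a learned model.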
Automated classification of entities and events allows our system to support the analyst, who can spend significantly less time on deciding what topics to focus on, and instead use that time for improved analysis of prioritized threats.
Another application of machine learning is to generate predictive models that can be used to forecast events or classify entities. We have, for example, created models to predict the future risk of social unrest and the likelihood of product vulnerabilities being exploited, and to assess the risk that an IP address will behave maliciously in the future, even though no such activity has yet been observed.
The challenge in all these cases is to identify relevant features on which to base the predictions, and most of all to get access to enough ground truth training data to be able to generate models that can be used to make predictions with the required accuracy.
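As a minimal sketch of building a predictive model from historical data, the example below fits a logistic regression by gradient descent and scores an unseen entity. The choice of logistic regression, the two features (share of neighboring addresses already on threat lists, and whether the address was recently registered), and the six training rows are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, x):
    """Estimated probability of the positive class; weights[0] is the bias."""
    return sigmoid(weights[0] + sum(w * xi for w, xi in zip(weights[1:], x)))


def train(rows, labels, epochs=5000, lr=0.1):
    """Fit logistic regression weights (bias first) by batch gradient descent."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grads = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            error = predict(weights, x) - y
            grads[0] += error
            for i, xi in enumerate(x):
                grads[i + 1] += error * xi
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grads)]
    return weights


# Hypothetical history: [share of neighbors on threat lists, recently registered]
rows = [[0.9, 1], [0.8, 1], [0.7, 0], [0.2, 0], [0.1, 0], [0.3, 1]]
labels = [1, 1, 1, 0, 0, 0]  # 1 = later observed behaving maliciously

weights = train(rows, labels)
high = predict(weights, [0.85, 1])
low = predict(weights, [0.15, 0])
print(f"risky-looking address: {high:.2f}, benign-looking address: {low:.2f}")
```

With only six rows the model is a toy; as noted above, the practical difficulty lies in finding relevant features and enough ground truth to reach the required accuracy.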
What Does the Future Hold?
As can be seen from the description above, Recorded Future is not an “AI application” in the old sense, but instead a system where we use AI techniques in multiple places to create functionality that mimics human intelligence, much as Google Photos and Siri do. This enables analysts to work together with the machines, creating extremely capable centaurs.
We have come quite far in using artificial intelligence techniques to improve the capabilities of human analysts, but there is much more to be done in terms of automating simpler tasks as well as providing support for the more complex parts of threat analysis.
Even though machines and software will continue to evolve with dazzling speed, the complexity of threat analysis means there will be plenty of challenging opportunities for human analysts for a very, very long time. Therefore, I, for one, welcome our new robot colleagues!
1 Ontologies are formal representations of a set of concepts/entities within a domain and the relationships between those concepts/entities.