September 5, 2018 • Zane Pokorny
A big challenge in collecting and analyzing intelligence has always been scalability. Good, actionable intelligence takes expertise to develop. Let’s say you’re a government trying to gather information on a foreign power. You’ll need experts who speak the language, know the culture well enough to blend in, have the right skill sets, and are sympathetic to your goals. Finding enough experts who meet those criteria will be difficult — and even then, it still might not be enough to get regular, actionable intelligence.
You don’t have to be a national government to share these problems — anybody trying to figure out what hackers and other threat actors are up to on dark web forums will face the same information-gathering challenges. And these challenges are only getting worse; yearly ESG research has charted a growing trend of staffing shortages in the cybersecurity industry, with 51 percent of organizations surveyed saying they had a problematic staffing shortage in 2018, up from 23 percent in 2014.
One possible solution to this problem of scale and expertise is the application of machine learning techniques to evaluate large sets of data. Machine learning can process far more data than any group of human analysts could on their own. But more data can also get in the way, giving analysts more to sort through further down the line and raising the chances of false positives.
That’s why analysts at IDC asked Recorded Future’s customers whether their cybersecurity teams had actually seen time and money savings when they started using threat intelligence. Was all of that extra data actually helping them stay safe and work smarter? And if so, how?
The organizations interviewed by IDC made particular note of the machine learning processes that drive the relevant and timely threat intelligence provided by Recorded Future. For example, that intelligence helps Recorded Future users identify threats 10 times faster on average and find 22 percent more security threats before they have an impact.
More data doesn’t equal better results — sometimes, it just means more work. What we’re all looking for is more good data, leading to threat intelligence that you can actually follow through on.
Data processing takes place at a scale today that requires automation to be comprehensive. Not only that, but data processing should also include the combination of data points from many different types of sources — including open, dark web, and technical sources — to form the most robust picture possible.
It’s worth looking a little closer at how Recorded Future’s machine learning processes work under the hood to understand why. We use machine learning techniques in four ways:
1. Structuring data into ontologies and events.
Ontology has to do with how we split concepts up and how we group them together. In data science, ontologies represent categories of entities based on their names, properties, and relationships to each other, making them easier to sort into hierarchies of sets. For example, Boston, London, and Gothenburg are all distinct entities that will also fall under the broader “city” entity.
If ontologies represent a way to sort physically distinct concepts, then events sort concepts over time. Recorded Future events are language independent: sentences like “John visited Paris,” “John took a trip to Paris,” “Джон прилетел в Париж,” and “John a visité Paris” are all recognized as the same event.
Ontologies and events enable powerful searches over categories, letting analysts focus on the bigger picture rather than having to manually sort through data themselves.
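To make the idea concrete, here is a minimal sketch of a category lookup and a language-independent event. It is only an illustration, not Recorded Future's implementation: the `ONTOLOGY` table, the `Event` class, the `ALIASES` table, and the hand-written patterns are all hypothetical stand-ins for what a real system would learn or curate at scale.

```python
import re
from dataclasses import dataclass

# Hypothetical toy ontology: each entity points to its parent category.
ONTOLOGY = {
    "Boston": "City",
    "London": "City",
    "Gothenburg": "City",
    "Paris": "City",
    "City": "Location",
}

def ancestors(entity):
    """Return the categories above an entity, nearest first."""
    chain = []
    while entity in ONTOLOGY:
        entity = ONTOLOGY[entity]
        chain.append(entity)
    return chain

def is_a(entity, category):
    """True if the entity falls under the given category."""
    return category in ancestors(entity)

@dataclass(frozen=True)
class Event:
    kind: str
    actor: str
    target: str

# Hypothetical surface patterns in three languages, all meaning "visit".
VISIT_PATTERNS = [
    re.compile(r"^(?P<actor>\w+) visited (?P<target>\w+)$"),
    re.compile(r"^(?P<actor>\w+) took a trip to (?P<target>\w+)$"),
    re.compile(r"^(?P<actor>\w+) прилетел в (?P<target>\w+)$"),
    re.compile(r"^(?P<actor>\w+) a visité (?P<target>\w+)$"),
]

# Hypothetical alias table mapping foreign-language names to one canonical entity.
ALIASES = {"Джон": "John", "Париж": "Paris"}

def extract_event(sentence):
    """Map a sentence to a canonical Visit event, or None if no pattern matches."""
    text = sentence.strip().rstrip(".")
    for pattern in VISIT_PATTERNS:
        match = pattern.match(text)
        if match:
            actor = ALIASES.get(match["actor"], match["actor"])
            target = ALIASES.get(match["target"], match["target"])
            return Event("Visit", actor, target)
    return None
```

Because all four example sentences reduce to the same `Event("Visit", "John", "Paris")`, a search for visits to any entity where `is_a(target, "City")` holds would find them regardless of the source language.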
2. Structuring text in multiple languages through natural language processing.
With natural language processing, ontologies and events are able to go beyond bare keywords, turning unstructured text from sources across different languages into a structured database.
The machine learning driving this process can separate advertising from primary content, classify text into categories like prose, data logs, or code, and disambiguate between entities with the same name (like “Apple” the company, and “apple” the fruit) by using contextual clues in the surrounding text.
This way, the system can parse text from millions of documents daily across seven different languages, a task that would otherwise require an impractically large and skilled team of human analysts. Features like this help explain why the organizations interviewed by IDC found that their IT security teams worked 32 percent more efficiently with Recorded Future.
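The “Apple” versus “apple” case can be sketched with a simple context-overlap heuristic. This is a deliberately simplified stand-in for the trained models described above, and the `SENSES` clue sets are invented for illustration; a production system would learn such signals from data rather than rely on hand-picked words.

```python
import re

# Hypothetical context clues for each sense of the ambiguous name.
SENSES = {
    "Apple (company)": {"iphone", "mac", "shares", "ceo", "software", "device"},
    "apple (fruit)": {"eat", "pie", "tree", "orchard", "juice", "fruit"},
}

def disambiguate(text, senses=SENSES):
    """Pick the sense whose clue words overlap most with the surrounding text.

    Returns None when no clue word appears, so that the caller can fall
    back to another method instead of guessing.
    """
    tokens = set(re.findall(r"\w+", text.lower()))
    scores = {sense: len(tokens & clues) for sense, clues in senses.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Returning `None` when the context offers no clues is a design choice in this sketch: an explicit “unknown” is easier for an analyst to handle than a confident wrong answer.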
3. Classifying events and entities, helping human analysts prioritize alerts.
Machine learning and statistical methodology are used to further sort entities and events by importance — for example, by assigning risk scores to malicious entities.
Risk scores are calculated through two systems: one driven by rules based on human intuition and experience, and the other driven by machine learning trained on an already vetted dataset.
Classifiers like risk scores provide both a judgment (“this event is critical”) and context explaining the score (“because multiple sources confirm that this IP address is malicious”). Automating how risks are classified saves analysts time sorting through false positives and deciding what to prioritize.
The context and sourcing provided by the explanation behind these risk scores help IT security staff spend 34 percent less time compiling reports, according to IDC’s research.
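A rough sketch of a score that carries its own explanation might look like the following. Everything here is assumed for illustration: the rule list, the point values, the `WEIGHTS` and `BIAS` of the stand-in logistic model, and the choice to take the higher of the two scores. The source does not say how Recorded Future combines its rule-driven and machine-learned systems.

```python
import math
from dataclasses import dataclass

@dataclass
class RiskAssessment:
    score: int      # 0 to 100, higher means riskier
    evidence: list  # human-readable reasons behind the score

# Hypothetical analyst-written rules: (explanation, test, points).
RULES = [
    ("multiple sources report this entity as malicious",
     lambda e: e["malicious_reports"] >= 2, 40),
    ("mentioned on a dark web forum",
     lambda e: e["dark_web_mentions"] > 0, 25),
    ("linked to known malware",
     lambda e: e["linked_malware"], 30),
]

# Hypothetical weights standing in for a model trained on a vetted dataset.
WEIGHTS = {"malicious_reports": 0.8, "dark_web_mentions": 0.5}
BIAS = -2.0

def model_probability(entity):
    """Logistic model: probability that the entity is malicious."""
    z = BIAS + sum(weight * entity[name] for name, weight in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))

def assess(entity):
    """Combine both systems into one score plus the context explaining it."""
    fired = [(reason, points) for reason, test, points in RULES if test(entity)]
    rule_points = min(100, sum(points for _, points in fired))
    probability = model_probability(entity)
    model_points = round(100 * probability)
    evidence = [reason for reason, _ in fired]
    evidence.append(f"model estimates a {probability:.2f} probability of malice")
    # Assumption: report the more cautious (higher) of the two scores.
    return RiskAssessment(max(rule_points, model_points), evidence)
```

For example, `assess({"malicious_reports": 3, "dark_web_mentions": 1, "linked_malware": False})` returns both a score and the list of reasons that produced it, which is the part an analyst can paste into a report.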
4. Forecasting events and entity properties through predictive models.
Machine learning can also generate models that predict future events, often much more accurately than human analysts can, by drawing on the deep pools of data previously mined and categorized.
This is a particularly strong “law of large numbers” application of machine learning, and the big challenge is to make sure that the predictions are based on the right assumptions.
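The “law of large numbers” point can be illustrated with a toy simulation: an event's observed frequency settles toward its underlying rate as the history grows. This is not a forecasting model, only a sketch of why volume matters. The 30 percent `TRUE_RATE` and the simulated history are invented for the example.

```python
import random

def estimate_rate(outcomes):
    """Estimate an event's probability as its observed frequency."""
    return sum(outcomes) / len(outcomes)

def simulate_history(true_rate, n, rng):
    """Simulate n past observations of an event that occurs with true_rate."""
    return [rng.random() < true_rate for _ in range(n)]

# Hypothetical underlying rate, e.g. the share of flagged items later exploited.
TRUE_RATE = 0.3

rng = random.Random(0)  # fixed seed so the run is repeatable
for n in (10, 1_000, 100_000):
    history = simulate_history(TRUE_RATE, n, rng)
    print(n, round(estimate_rate(history), 3))
```

The caveat about assumptions shows up here too: the estimate only converges to something useful if future events follow the same process as the history. If attacker behavior shifts, more data simply produces a more precise estimate of an outdated rate.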
This strong focus on applying machine learning techniques to problems that are not humanly scalable is one of the reasons that organizations interviewed by IDC consistently found that the threat intelligence produced by Recorded Future was relevant and timely.
As noted in the white paper, “instead of spending significant time investigating potential threats and remediating them, organizations can be more proactive in their approach to security threats and concerns.”
Organizations saw their threat intelligence compilation and threat investigation procedures become more efficient with Recorded Future in three key ways. To look more closely at these and IDC’s other findings, download your free copy of the new white paper, “Organizations React to Security Threats More Efficiently and Cost Effectively With Recorded Future.”