Predicting Future Threats With Machine Learning

December 11, 2017 • Amanda McKeon

In this episode, we take a closer look at some of the specifics of artificial intelligence and machine learning, and how cybersecurity professionals can benefit from including these tools in their threat intelligence arsenals. We’ll discuss clustering, natural language processing or NLP, and supervised learning, and we’ll find out why combining the talents of humans with the speed and analytical capabilities of computers, the so-called digital centaurs, could provide even more powerful solutions in the future.

Joining us are two experts in machine learning. Christopher Sestito is manager of threat research at Cylance, a company that’s all-in when it comes to AI technology, and Staffan Truvé is co-founder and chief technology officer at Recorded Future.

This podcast was produced in partnership with the CyberWire and Pratt Street Media, LLC.

For those of you who’d prefer to read, here’s the transcript:

This is Recorded Future, inside threat intelligence for cybersecurity.

Dave Bittner:

Hello everyone, and thanks for joining us. I’m Dave Bittner from the CyberWire, and this is episode 35 of the Recorded Future podcast. In this episode, we take a closer look at some of the specifics of artificial intelligence and machine learning, and how cybersecurity professionals can benefit from including these tools in their threat intelligence arsenal. We’ll discuss clustering, natural language processing (or NLP), and supervised learning, and we’ll find out why combining the talents of humans with the speed and analytical capabilities of computers, the mythical digital centaurs, could provide even more powerful solutions in the future.

Joining us are two experts in artificial intelligence and machine learning, Christopher Sestito from Cylance, a company that’s all-in when it comes to AI technology, and Staffan Truvé from Recorded Future. Stay with us.

Christopher Sestito:

Artificial intelligence is really trying to utilize computer technology to make human-level decisions.

Dave Bittner:

That’s Chris Sestito. He’s the manager of threat research at Cylance.

Christopher Sestito:

So, in our case, that would be a good-or-bad decision on software: Is this software malicious, or is this software benign?

Staffan Truvé:

My favorite definition is when we’re trying to have machines do something that if a human did the same thing, we would say that, “Well, that’s an intelligent thing to do,” or, “That’s something which requires some kind of intelligence.”

Dave Bittner:

That’s Staffan Truvé. He’s one of the co-founders and chief technology officer at Recorded Future.

Staffan Truvé:

The challenge, sometimes, is that when we hear the word “intelligence” and think about people, we usually think of very sophisticated things which people do, whereas a lot of the things we actually do with AI today are things that humans do without thinking. You don’t say that someone is intelligent if they can drive a car, whereas that’s sort of the leading edge of AI, to have autonomous cars, and the same thing goes for a lot of what we do at Recorded Future. I mean, the core AI part of our product is our natural language processing, which, again, covers something humans do without effort: we read and understand text without thinking about it. We don’t think of that as intelligent, but it does require both very sophisticated reasoning and an understanding of the words, so it’s technically AI that’s required to do that.

Christopher Sestito:

We approach threat intelligence utilizing machine learning. Machine learning really thrives in an environment where there’s a lot of data. You see it in a lot of different industries, whether it’s the medical field, whether it’s finance, and in security. It’s an environment where we can learn from quite a bit of data that is available to us. So, in our case, we can see quite a bit of the story of what’s happening with a threat through things like data logs, network sensors, and attributes about a binary, and there’s quite a bit of data there. And when you extrapolate that out over the millions of endpoints that we’re absorbing this data from, the best approach, really, is a machine-learning approach, where we can use the advantage of computational power to help us make decisions across such vast amounts of data, which a human would not be able to do.

Staffan Truvé:

So, essentially, I like to think that we’re doing two very separate things with AI and the Recorded Future system. So, one is in terms of automation, so that’s really what the whole NLP stuff is. What we’re having the machines do is exactly the same thing as we would want a human to do, but we want to do it at a much larger scale.

So, the whole NLP thing is for us to be able to scale. Today, I think we harvest around 30 million documents per day. If you were to do that with humans, you would need an awful lot of humans to do the same thing. And the beauty with machines, of course, is that whereas with humans, you hire one human and train that human to do something, and then you take the next one and it takes just as long to train that one, with machines, once you have your software, you can scale it out over a thousand or a million servers. There’s no additional training. They’re just using the same model. You can copy it, so that’s a big scale advantage. That’s really taking things which humans can do and automating them so that we can scale up.

And then there’s the other part, which is, in a way, more thrilling. It’s what you could call augmentation, where instead, what we do is we actually can have the algorithms tackle problems which are too complex for people to do. Essentially, because they are high-dimensional problems, there are a lot of variables, and we, with our brains, we’re not very good at understanding those very complex connections between things, whereas for machines, the algorithms scale nicely to higher dimensions.

The thing is that humans are tremendous when it comes to language. We understand very fine nuances, and sometimes, without thinking about it, we incorporate a lot of our knowledge about how the world works into that analysis. A machine has no common sense; it really has to start from scratch with everything. And also, we are very resilient to errors. I can say something which is completely ungrammatical to you, or I can drop a word or mix up a vowel, and you’re very good at understanding what I mean anyway. To actually have the machines be capable of understanding language, even in the presence of those mistakes, is hard.

Christopher Sestito:

Say you’re trying to take a whole bunch of files and determine which are good and which are bad. You would do that through a couple of different types of learning. Supervised learning is one that we definitely utilize: you take millions of files that we have identified as good, millions of files we’ve identified as bad, and then put those labeled sets through a model that was developed in order to try and identify similarities between what makes a file good and what makes a file bad, based on the files themselves. That starts to be very useful when you have large amounts of data, as you really start to identify patterns at a binary level.
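The supervised approach Sestito describes (labeled good and bad files, and a model that learns what separates them) can be sketched with a toy nearest-centroid classifier. The features below, a normalized file size, an entropy score, and a suspicious-import count, are invented for illustration, and nearest-centroid is a deliberately simple stand-in for any real production model:

```python
# Toy supervised learning on labeled file features (all data invented).

def centroid(vectors):
    """Element-wise mean of a list of feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def train(good, bad):
    """'Train' a nearest-centroid model from labeled examples."""
    return {"good": centroid(good), "bad": centroid(bad)}

def classify(model, sample):
    """Label a new sample by the closer class centroid."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda label: dist(model[label], sample))

# Hypothetical features: (normalized size, entropy, suspicious-import count)
good_files = [(0.2, 0.3, 0.0), (0.3, 0.4, 0.1)]
bad_files = [(0.8, 0.9, 0.9), (0.7, 0.8, 1.0)]

model = train(good_files, bad_files)
print(classify(model, (0.75, 0.85, 0.95)))  # → bad
```

The same idea scales: with millions of labeled samples and richer features, the learned boundary between the two classes starts to capture the binary-level patterns described above.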

So, my team utilizes clustering, which is a concept inside of machine learning, in order to combine like samples. That will allow us to take samples, put them in subset groups based on their functionality, and we can classify them accordingly. So, a very powerful tool when you’re talking about large amounts of data, and that’s something that we deal with every day.
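The clustering of like samples described above can be illustrated with a minimal k-means implementation. The two-dimensional “sample” features here are invented; real malware clustering would use far richer feature vectors:

```python
# Minimal k-means sketch for grouping like samples (data invented).

def kmeans(points, k, iters=20):
    centers = points[:k]  # naive init: first k points as starting centers
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest center
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # recompute each center as its cluster's mean
        centers = [tuple(sum(v[d] for v in cl) / len(cl)
                         for d in range(len(points[0]))) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return clusters

# Hypothetical sample features: (file-size score, network-activity score)
samples = [(0.1, 0.2), (0.15, 0.25), (0.9, 0.8), (0.85, 0.9)]
for group in kmeans(samples, k=2):
    print(group)
```

Each resulting group holds samples with similar feature profiles, which is the property that lets analysts classify a whole subset at once.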

Dave Bittner:

Does it happen from time to time that the results from the machines surprise you?

Christopher Sestito:

Whenever you’re working with a large amount of data, and really, whenever you look at large numbers in general, you expect certain things to pan out. Like, if you’re trying to identify a specific type of malware, or you’re trying to identify a certain threat function, you end up seeing patterns that you didn’t know were there, and that’s because the machine learning is better at identifying those patterns than you are, just for the sake of being able to look across so many samples. So, it may be an example of, we were trying to find patterns in a specific family of malware, and all of a sudden, we learned that it was something as simple as the file size of something that was dropped, or a specific call out, or a specific function that was used that you weren’t even really looking for. Then, after using a clustering technique, you would learn that this was present across all of them. So, it’s definitely surfaced some simple solutions that were surprising, that we may or may not have identified if we were just looking at samples ourselves.

Staffan Truvé:

The best example, where you are almost always surprised, is one component we have in our systems, which is not the NLP part. Over the last two years, we’ve started to do what we call predictive analytics. So, we have one component of the software now which tries to predict next week’s malicious IP addresses. The way it does that is that we train it on historical threat list or blacklist data, combined with all the context we have from Recorded Future’s text analysis and NLP processing, and then we add years of historic data. What we can do then is build a machine-learning model where you can use that historic data to train a model, which can allow you to take an IP address, which we haven’t even seen in our system, and actually give it a risk score — how likely that address is to become malicious, say, in the next five to seven days. Every time that works out, it’s a bit magic, actually.
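Truvé’s predictive analytics, training on historical blacklist data to score an IP address never seen before, can be sketched as a tiny logistic regression. The features here (share of blacklisted neighbors, volume of threat-report mentions) and the training data are purely hypothetical, not Recorded Future’s actual model:

```python
# Sketch of predictive risk scoring for IPs (features and data invented).
import math

def train_logistic(X, y, lr=0.5, epochs=500):
    """Fit a logistic model to labeled historical examples via SGD."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = 1 / (1 + math.exp(-z)) - yi  # prediction minus label
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def risk_score(w, b, x):
    """Probability-like score that an IP turns malicious soon."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 / (1 + math.exp(-z))

# Historical data: [blacklisted-neighbor share, report mentions] -> label
X = [[0.9, 0.8], [0.7, 0.9], [0.1, 0.0], [0.2, 0.1]]
y = [1, 1, 0, 0]
w, b = train_logistic(X, y)
print(risk_score(w, b, [0.8, 0.7]))  # unseen IP with bad-looking context
```

The point is the generalization: the model never saw this exact address, but its contextual features place it near known-bad history, so it gets a high score.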

Dave Bittner:

So, I guess, over time, that system becomes more and more accurate?

Staffan Truvé:

Yeah. As we add more data over time, we can train and retrain these models continuously, so over time, they definitely learn. In one sense, you could say we have so much historic data in Recorded Future today that, while I wouldn’t say we have come to the best possible model, I think we’re at least plateauing in terms of what that specific model can do. But looking forward, as we get more data and more historic information, we will be able to do more predictions of this kind for other parts of the data.

Christopher Sestito:

Any time you introduce a new variable from a human, you would definitely want to compare the results before and after there was any human influence, but you also want to consider that there are strengths on the human side of things, as well. An analyst is going to be able to identify some things much quicker than a machine-learning model. For example, if a sample of ransomware came along with a ransom message, that’s something a human is going to identify right away, while machine learning would want to see millions of those samples to identify the pattern. So, there really are different angles in terms of where the strengths lie in classification, and you’re going to want to use a combination of both. While machine learning is excellent in a lot of ways, you do want that human element there, as well.

Staffan Truvé:

Personally, I love the analogy with computer chess. Kasparov was the first human grandmaster to be beaten by a chess machine, back in the mid-90s (’97, actually), but ever since then, the best chess player in the world has not been a machine or a human. It’s really been a human, a reasonably good human chess player, working together with a good chess machine. In computer chess, they actually call that centaurs, after the horse-human combination, and I like to talk about us creating threat analyst centaurs by having the machines augment the capabilities of the human analyst. Have the two work together.

Dave Bittner:

Yeah. Let’s explore that a little bit. I mean, what do you see on the horizon? What are the next things that you would like to be able to take on with these technologies?

Staffan Truvé:

So, for me, the predictive part is really the next big challenge for us. As I said, we started doing it in a very specific domain, and we’re now broadening that to more domains, so not just predicting the risk of individual IP addresses but, for example, predicting the risk for a domain. You can imagine, let’s say, that you notice a new domain being used for the first time, and you want to be able to give it a risk score. That would be tremendously helpful, because it would be a way to detect and warn about potentially malicious domains in, say, phishing emails or in other contexts like that.

Then, beyond the predictive part, I like to think the next step would be the prescriptive part, where the machines will not only detect and predict these things, but also be able to suggest what you should do to handle the problem. Essentially, have the machines at least start off by giving advice, like, “You should probably block this domain in your IDS system, because we think there’s a 50 percent likelihood that it’s something bad.” As we move forward, the machines will be able to come up with more suggestions about what to do, and eventually, of course, we will be confident enough to actually give them the responsibility to take some of those actions themselves.
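The prescriptive step Truvé describes, moving from a risk score to a suggested (and eventually automated) action, can be illustrated as a simple thresholded policy. The thresholds and the indicator name are hypothetical:

```python
# Sketch of a prescriptive policy: risk score in, recommended action out.
# Thresholds and indicator names are hypothetical.

def recommend(indicator, risk):
    """Map a 0..1 risk score to a suggested defensive action."""
    if risk >= 0.9:
        return f"block {indicator} automatically"
    if risk >= 0.5:
        return f"suggest analyst review and likely block of {indicator}"
    return f"monitor {indicator}"

print(recommend("bad-domain.example", 0.55))
print(recommend("benign.example", 0.1))
```

In practice, the high-confidence tier is where organizations would gradually hand the machines responsibility to act on their own, as described above.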

Christopher Sestito:

My advice to any security team, or anyone trying to create a solution for threats in today’s world, would be that you have to choose a practical solution. I believe that a machine-learning approach is really the only practical solution that’s going to help you defend your organization. Most of the threats we see today have not been seen before. In my team’s experience, only about 20 percent of the threats we see have been identified before. Because of that, you need a solution that can make a decision on the spot. If you use machine learning, you need to have an approach with a model that’s ready to make that decision.

We’ve come across the kind of notion that artificial intelligence is hype, and some people will call it new-age techniques, but the reality is, it’s not hype. It’s a tool that’s being used across many different industries to process data and to take large amounts of data and use that to your advantage. Security is no exception. I would recommend, to anyone, that you utilize technology in any form, and artificial intelligence and machine learning are kind of the newest ways to attain the same goal of security. Utilizing machine learning is really about allowing computers to do what computers do well, and allowing humans to do what humans do well. You read horror stories about artificial intelligence taking over the world, and that’s really not the case. It’s about computation, working through large amounts of data in order to look at trends, in order to find similarities and patterns, and allow humans to use that information to protect an environment, in our case, or really, make any decision, whatever you’re applying machine learning to.

Something that’s important to note is that enterprise systems are always changing. There are constantly new updates going out and modifications being made, and enterprise systems may have to expand for new business cases. Machine learning allows you to learn about the good at the same time, so you pick up trends in legitimate software and what it’s trying to do.

Staffan Truvé:

What you need and what you can use depends so much on the maturity of the organization, and we like to talk about intelligence goals. That’s actually what we always do when we sit down with a new company we meet. We sit down and we discuss with them, what are the goals. Are they looking just, for example, to find leaked credentials, or are they interested in doing broader threat analysis, other threats? Are they mostly worried about things like hacktivism, or are they worried about theft of IP, or data breaches, and so on? So, there’s a whole variety of different goals, and I think just having that discussion, first of all, about what it is you are after, I think that helps a lot in defining which tools and which processes you should go on and develop.

Christopher Sestito:

So, threat intelligence is really about understanding attacks and understanding the different methods that hackers, or anyone trying to exploit your environment, are taking. It’s not enough just to stop a threat. You need to understand what it was trying to accomplish, as malware is always changing and new techniques are constantly being developed. If you understand what a threat is attempting to gain and how it’s attempting to exploit your environment, there’s a much better return on whatever you’ve invested in that threat intelligence team or those threat intelligence products. You can target your resources on defenses that are useful and meaningful, rather than just using blanket, generic solutions to protect yourself.

Dave Bittner:

Our thanks to Chris Sestito from Cylance and Staffan Truvé from Recorded Future for joining us today.

Be sure to check out the free white paper, “4 Ways Machine Learning Is Powering Smarter Threat Intelligence,” that’s on the Recorded Future website at go.recordedfuture.com/machine-learning. Check it out.

Don’t forget to sign up for the Recorded Future Cyber Daily email, where every day you’ll receive the top results for trending technical indicators that are crossing the web, cyber news, targeted industries, threat actors, exploited vulnerabilities, malware, suspicious IP addresses, and much more. You can find that at recordedfuture.com/intel.

We hope you’ve enjoyed the show and that you’ll subscribe and help spread the word among your colleagues and online. The Recorded Future podcast team includes Coordinating Producer Amanda McKeon, Executive Producer Greg Barrette. The show is produced by Pratt Street Media, with Editor John Petrik, Executive Producer Peter Kilpe, and I’m Dave Bittner.

Thanks for listening.