Feeding Frenzy: The Inside Scoop on Threat Intelligence Feeds
April 17, 2017 • Amanda McKeon
Threat intelligence feeds have become a staple in the diet of analysts and security professionals at organizations large and small. Some feeds are free; others are sold by security vendors. They come in a dizzying array of formats and sizes, and they include threat information that may or may not add value to your organization.
In this episode, we give you the inside scoop on threat intelligence feeds. We’ll tell you what they are, how to select the right ones for your organization, and how to separate the signal from the noise. Join us as we talk about turning those streams of raw information into actionable intelligence. Our guest today is Matt Kodama, Vice President of Product at Recorded Future.
This is Recorded Future, inside threat intelligence for cybersecurity.
Hello everyone and thanks for joining us, I’m Dave Bittner from the CyberWire. This is episode two of the Recorded Future podcast. Today, we’re going to talk about threat intelligence feeds: what they are, how to select the right ones, how to separate the signal from the noise, and turn streams of raw information into actionable intelligence.
Joining me is Matt Kodama, vice president of product at Recorded Future. To start our conversation, I asked Matt to give us the basics and help us understand threat intelligence feeds.
One of the challenges is that these things don’t really have any proper standards definition. They tend to be lists of technical information. They’re structured to be read by a machine, they’re usually produced by a machine, some sort of scanning or detection technique. And what they usually are is sort of a header that says, “This is what this list is about.” Maybe it’s a command and control infrastructure for some particular malware family. Or maybe it’s URLs where there’s been a phish reported recently. And then the technical values. And sometimes that’s all you get. It’s going to be essentially a bare list of IP addresses with some context information around what this list is. Other times it’s going to be more like a data table. You know, you’ll have not just the head value.
I could take the phishing example. You might get the phishing URL, but you might also get the IP address that it resolved to. And you might get the timestamp that whoever is reporting information observed that that particular URL resolved, that when the DNS name resolved, it resolved to that IP. So, some threat feeds have a lot of columns, and some have only one. It’s a pretty broad spectrum.
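As a rough illustration of the two shapes described above (a bare list versus a multi-column table under a descriptive header), here is a minimal Python sketch. The feed text, the `#` comment convention, the column names, and the `parse_feed` helper are all assumptions made for this example; real feeds vary widely in format.

```python
import csv
import io

# Hypothetical sample feed: comment header lines, then comma-separated rows.
SAMPLE_FEED = """\
# Phishing URLs reported recently (illustrative data only)
# columns: url,resolved_ip,observed_at
http://login.example.test/verify,203.0.113.10,2017-04-01T08:15:00Z
http://pay.example.test/update,198.51.100.7,2017-04-01T09:40:00Z
"""


def parse_feed(text, columns):
    """Return (header_comments, rows); rows are dicts keyed by column name."""
    comments, data_lines = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        if stripped.startswith("#"):
            comments.append(stripped.lstrip("# ").strip())
        else:
            data_lines.append(stripped)
    reader = csv.DictReader(io.StringIO("\n".join(data_lines)), fieldnames=columns)
    return comments, list(reader)


# A feed with several columns.
comments, rows = parse_feed(SAMPLE_FEED, ["url", "resolved_ip", "observed_at"])
print(comments[0])
for row in rows:
    print(row["url"], "->", row["resolved_ip"], "at", row["observed_at"])

# A bare list is the same thing with a single column.
_, bare_rows = parse_feed("# C2 addresses (illustrative)\n192.0.2.1\n192.0.2.2\n", ["ip"])
print([r["ip"] for r in bare_rows])
```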
So of course, the point of this is somebody’s out there seeing suspicious or malicious activity, and thankfully what they’re doing is they’re setting up some automation to report that off to the rest of the world, so that everybody else can get some security or threat intelligence advantage out of that information.
Are most of the feeds out there provided for free? Do you have to pay for them?
It’s a mix. There’s a lot that are out there for free, and if I wanted to be snarky I guess I could say they’re free and worth it, but that’s actually not true. Actually, a lot of them are quite good. I think the challenge is, though … If you think about what’s going to be hard in putting together a feed, part of it is sort of the “software or service development” hard. Are you going to just put a file out there and say, “Hey you can access the file and download it,” or are you going to provide some sort of interfaces so it can be more easily pulled through software that reads a web service or creates some sort of standard?
And then the other hard part is … Okay, so you see something suspicious. How sure are you that it’s not a false positive? Or maybe you see something suspicious through your method at time zero, but then it gets cleaned up. How sure are you that you’re gonna see the fact that it gets cleaned up through your method, also? So, are you actually even able to age things out from the list, and do you?
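When a feed does not age entries out itself, one possible consumer-side approach is to track when each indicator was last seen and drop anything older than a cutoff. The sketch below assumes that approach; the `age_out` helper, the sample addresses, and the 30-day window are illustrative choices, not recommendations from the conversation.

```python
from datetime import datetime, timedelta, timezone


def age_out(last_seen, now, max_age):
    """Keep only indicators whose most recent sighting is within max_age of now."""
    return {ind: seen for ind, seen in last_seen.items() if now - seen <= max_age}


now = datetime(2017, 4, 17, tzinfo=timezone.utc)
last_seen = {  # hypothetical indicators and sighting times
    "203.0.113.10": now - timedelta(days=2),
    "198.51.100.7": now - timedelta(days=45),
}

# The 30-day window is an arbitrary assumption for this example.
active = age_out(last_seen, now, max_age=timedelta(days=30))
print(sorted(active))
```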
And here’s where I think you get to the problem. A lot of that stuff is pretty darn hard. And the person who’s putting a free threat feed out there, I mean, usually they’re doing it out of personal interest or it’s derivative from their main job, or … And it’s great that they’re putting that information out there but they just don’t have time to work through all the rest of that hard stuff. So what you get with a lot of the community threat feeds, the things that are free, is that often they’re good information, but they don’t have the rest of that work done around making it easy to consume and making it easy for you as the consumer to understand how long should you keep the information around.
Take us through the process of how someone should go about evaluating what kind of feeds they’d want to use.
Yeah, that’s a tricky one. I think you’d have to start with thinking about what’s at risk, because there’s so many things out there. It doesn’t make any sense to just go out there and say, “Why don’t I take a census and grab a sample of everything that’s available and start to do something in that kind of an approach?” You’re just never gonna finish. And the truth of it is that a lot of those feeds are very important for somebody else, but not at all important to you, because the risks that we’re exposed to in different organizations are so different.
So, I think the starting point is to be thinking about what types of risks you’re trying to defend yourself against. The feeds tend to be about technical topics. So then you’re going to come into, “Okay, if this is what I’m worried about, what type of malware tool, or what type of attack vector or intrusion method … How is that gonna manifest in ways that I could detect it technically and use security controls to block it?”
And once I’m there, now I can go out there and look more systematically around … Okay, if I feel I’m very exposed to ransomware or botnets or remote access trojans, or if there are feeds … Or phishing. I mean, who’s not exposed to phishing? Then you can be more specific around what you’re looking for.
Now, then you get into the next part of it, and here I have to give some props out to Alex Pinto, who I know, anybody who’s looked at this stuff is gonna be familiar with the stuff he’s done around more quantitative or statistical analysis. Because there is, then … You get to the point where you say … Okay, so let’s say I’m interested in infrastructure and botnet infections. There’s a lot. Some of them are pretty static, some of them have a very high turnover rate, some of them are cumulative — things go on to the list, they never come off. It turns out that a lot of them don’t have that much intersection. The underlying scanning method or detection method is pretty different than what other people are doing, so they see different infrastructure.
So, you can go through a big data analysis process, and Alex has talked about work that he, in his research, has done, and the kind of quantitative results that he gets. The punchline at the end of it is that if you wanna get good coverage, you’re gonna need a lot of feeds. Because there’s not one or two feeds that are the master feeds and all the others are redundant. It would be nice if it was that way, because it would make our jobs easier, but it actually isn’t like that.
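To make the idea of low intersection between feeds concrete, here is a small sketch that measures pairwise overlap and counts indicators unique to each feed. It uses Jaccard similarity as one simple overlap measure; that choice, the feed names, and the sample data are assumptions for illustration and are not necessarily the method used in the research mentioned above.

```python
from itertools import combinations

# Hypothetical feeds, each reduced to a set of indicator values.
feeds = {
    "feed_a": {"203.0.113.10", "203.0.113.11", "198.51.100.7"},
    "feed_b": {"198.51.100.7", "192.0.2.55"},
    "feed_c": {"192.0.2.99"},
}


def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Pairwise overlap: low values mean the feeds mostly see different things.
for (name_a, set_a), (name_b, set_b) in combinations(feeds.items(), 2):
    print(f"{name_a} vs {name_b}: overlap={jaccard(set_a, set_b):.2f}")

# How much would be lost by dropping each feed?
all_indicators = set().union(*feeds.values())
unique_counts = {}
for name, indicators in feeds.items():
    others = set().union(*(s for n, s in feeds.items() if n != name))
    unique_counts[name] = len(indicators - others)
    print(f"{name}: {unique_counts[name]} of {len(all_indicators)} indicators seen nowhere else")
```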
It seems like you have a signal to noise problem, sort of an information-overload problem. How do you then filter all of the incoming feeds so that you’re not chasing your own tail?
I think you’re absolutely right that that’s one of the key problems. One of the methods that a lot of practitioners talk about is basically quantifying the hit rate, and then saying, “So, if I bring a feed into my environment … ” And obviously I’m not going to bring in some data stream and suddenly start … Turn it into a blocking control rule. That would be madness.
But let’s say I bring it in and run it as … Feed it into some sort of detect control rule. And then I look at, how many detections do I actually get? How many of them turn out to be true incidents versus, they’re correct rule matches but they aren’t indicative of any incident or malicious activity? And then I sort of score that, right?
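The scoring step described above can be sketched as a per-feed ratio of confirmed incidents to total detections. The record layout, feed names, and `score_feeds` helper are hypothetical; a real process would pull these outcomes from whatever system tracks alert triage.

```python
from collections import Counter

# Hypothetical triage outcomes: (feed_name, was_true_incident)
detections = [
    ("feed_a", True), ("feed_a", False), ("feed_a", False), ("feed_a", True),
    ("feed_b", False), ("feed_b", False), ("feed_b", False),
]


def score_feeds(detections):
    """Return {feed: fraction of its detections that were true incidents}."""
    total, confirmed = Counter(), Counter()
    for feed, is_incident in detections:
        total[feed] += 1
        if is_incident:
            confirmed[feed] += 1
    return {feed: confirmed[feed] / total[feed] for feed in total}


scores = score_feeds(detections)
for feed, rate in sorted(scores.items()):
    print(f"{feed}: {rate:.0%} of detections were true incidents")
```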
Now, if you’ve got a mature process around those types of correlations and alarms and being able to track outcomes and so forth, then that’s terrific. I mean, now you’re in a position where you can bring in information, threat feed-style information, pretty rapidly get a sense of how effective it is, and if it is, great, and if it isn’t, rotate it out.
I tend to think that there are not a lot of people who are at that level of automation and sophistication in those processes. So it’s great if you’re there, but if you’re not, what do you do then?
Then I think looking at it from absolutely the other end of the spectrum, you can just look at the data. Some of these things update minute to minute or hour to hour. But whatever their tempo is, sample it over a period of time so you can see how it changes, and then figure out: Does this thing change size dramatically? Is it twice as big one day and then much smaller the next, and that’s normal? Again, some of them are actually cumulative lists. Things go on to it and they never come off. But when you just look at the webpage from the folks at the project who are publishing it, it’s often very unclear.
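A minimal sketch of that sampling idea: keep periodic snapshots of a feed, compare consecutive ones, and check whether anything ever comes off the list. The snapshot contents and the `churn` helper are invented for illustration, and a short sample can only suggest, not prove, that a list is cumulative.

```python
# Hypothetical snapshots of one feed, taken at regular intervals.
snapshots = [
    {"a", "b", "c"},
    {"a", "b", "c", "d"},
    {"a", "b", "c", "d", "e", "f"},
]


def churn(snapshots):
    """For each consecutive pair of snapshots, report size, additions, and removals."""
    report = []
    for prev, curr in zip(snapshots, snapshots[1:]):
        report.append({
            "size": len(curr),
            "added": len(curr - prev),
            "removed": len(prev - curr),
        })
    return report


report = churn(snapshots)
for step in report:
    print(step)

# If nothing is ever removed across the sample, the list may be cumulative.
looks_cumulative = all(step["removed"] == 0 for step in report)
print("looks cumulative:", looks_cumulative)
```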
So you can start from that and just do some very, very simple … I hesitate to even call them analytics, because we’re really just talking about counts and frequencies and dips, to get a sense of what you’re looking at there, and then you can do sort of like a one time … Everybody’s got some data set around their historical alarms, or artifacts in their incidents that turned out to be the malicious infrastructure involved in those incidents. So you can correlate it against things that are known bad from your history to say …
Let’s say I had this in place in the past. What type of detection rate around useful incidents would I have achieved? So that’s another way to go after it that’s, you know, it’s pretty low tech and less effort. In this case we’re not really talking about bang for the buck in terms of buying the data, but we’re talking about bang for the buck in terms of how much of your time you’re going to have to spend putting it into operational use.
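That one-time check against your own history can be sketched as a set intersection. The indicator values and the `backtest` helper below are hypothetical, and the sketch deliberately ignores timing: it does not check whether the feed listed an indicator before the incident happened, which a more careful version would need to do.

```python
def backtest(feed_indicators, historical_bad):
    """Return (hits, coverage): known-bad indicators the feed contains, and their share."""
    hits = feed_indicators & historical_bad
    coverage = len(hits) / len(historical_bad) if historical_bad else 0.0
    return hits, coverage


# Hypothetical data: indicators confirmed malicious in past incidents.
historical_bad = {"203.0.113.10", "198.51.100.7", "192.0.2.55", "192.0.2.99"}
candidate_feed = {"203.0.113.10", "203.0.113.200", "203.0.113.201"}

hits, coverage = backtest(candidate_feed, historical_bad)
print(f"feed would have matched {len(hits)} of {len(historical_bad)} known-bad indicators ({coverage:.0%})")
```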
So let’s talk about the difference between data that you get over a feed, information, and transitioning that information into being true intelligence.
I think that question gets at something really important, which is, there’s a distinction between threat data or really just data about some sort of suspicious or malicious infrastructure, and then coming up a level to … If we wanna think of it as information, I need to know a little bit more about who’s publishing it and what types of risks or threats it’s likely to give me a window into, to have it structured so that I can actually apply it to my workflows and security and intel.
By the time we’re gonna actually call it intelligence, there has to be a lot of context around, not just what this incoming information is, but how it relates to me, because at intel I’m saying, “I’m actually knowledgeable enough through this data and information to make a determination.” When I get a hit on this, that’s actually indicative of an infection, or an intrusion, or some communication out to get control information or even to exfiltrate data. So it’s a whole different level.
And I think that’s where the rub comes in. The data that you get in threat feeds is really useful, but it’s at the bottom of that value pyramid of building your way up from data to information to intelligence to security actions.
Describe to me the process of how people can go about designing custom feeds.
The hard way that you could go about this is to really deep-dive into feed number one, then feed number two, feed number three, and so forth. You certainly can do it. One of the challenges is that, the space of what feeds exist out there is actually pretty fluid. Whatever the underlying scanning method is, or whatever the detection method is that generates the feed might stop working, because the bad guys get a vote too. Maybe they stop doing whatever it was they were doing that made them susceptible to that detection method. Maybe the group that’s operating the feed, they decide to shut down or merge into something else and the feed just goes away. And new ones are constantly popping up. So, the shelf life on your time spent deep-diving a particular feed, you might not get paid back.
The other approach that you can take is to say … Well, the advice used to be, before we all had GPS, that if you were in a place that you didn’t know and you were trying to get directions, you’d ask two, three, four people. And when you started to get enough people saying the same thing to you in terms of street directions to get to a place, now it was sufficiently trustworthy and credible, even though they’re all strangers, that you could say, “Okay, well I’m gonna walk and I’m gonna follow those directions and be confident I get where I’m going.”
That’s the other approach, that you could say, if … And bringing in a larger number of feeds, none of them have I given the complete examination to, but also none of them are completely junk, right? I know that none of them are sort of just a whole lot of false positives cumulatively forever.
Now, if I’m seeing, if I’m correlating those feeds together and saying, “I’m seeing the same piece of infrastructure observed across multiple feeds.” Now I don’t necessarily know a ton about their methods. Some of them give me almost no context, others give me more. But I can have higher confidence in the intersection of all of those.
And I think that when you talk about making a custom feed, if you take those sorts of approaches, right? Now I’ve got a lot richer context by joining them all together, that I’ve actually got some ways to select out and say, “I don’t really care about every ransomware under the sun but I care about these three or four families a whole lot.”
And you can also use the multiple points of observations. It’s not the type of confidence that you would get from a human analyst looking at the data and really validating it with their eyeballs and their brain. It’s still a useful confidence metric even though it’s just coming from unattended compute.
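Putting those two ideas together, a composite feed can be sketched as: keep only the malware families you care about, then keep only indicators observed by several independent feeds, using the source count as a rough confidence signal. The record layout, family names, `build_composite` helper, and the threshold of two sources are all assumptions for this example.

```python
from collections import defaultdict

# Hypothetical records: (feed_name, indicator, malware_family)
observations = [
    ("feed_a", "203.0.113.10", "family_x"),
    ("feed_b", "203.0.113.10", "family_x"),
    ("feed_c", "203.0.113.10", "family_x"),
    ("feed_a", "198.51.100.7", "family_y"),
    ("feed_b", "192.0.2.55", "family_x"),
]


def build_composite(observations, families, min_sources):
    """Keep indicators from the chosen families that at least min_sources feeds report."""
    seen_in = defaultdict(set)
    family_of = {}
    for feed, indicator, family in observations:
        if family in families:
            seen_in[indicator].add(feed)
            family_of[indicator] = family
    return [
        {"indicator": ind, "family": family_of[ind], "sources": len(seen_in[ind])}
        for ind in sorted(seen_in)
        if len(seen_in[ind]) >= min_sources
    ]


composite = build_composite(observations, families={"family_x"}, min_sources=2)
for entry in composite:
    print(entry)
```

Raising `min_sources` trades coverage for confidence: fewer indicators survive, but each one has been seen independently more often.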
Those are the ways that you can work towards starting out with primary collection of this raw threat data and get to where you have essentially a custom composite feed that you built for yourself, and can have a good level of confidence that you can sort of, not set it and forget it later on forever, but on a day to day basis you’re not gonna have to be watching the flow of indicators, because you’ve got more important things to do. It is something that you’ll run and look in on from time to time to make sure that it’s still in good shape and tune it.
For someone who’s starting up on this journey, for someone who’s just getting started, you know, figuring out how feeds are going to work into their system and into their defensive strategy, what would your advice be?
I think my advice would be to start out with some of the feeds that are moderate volume, that you can tell from comparing them to your historic incidents, are things that you would care about but you don’t already have complete coverage for. Because if you can identify data like that — and it’ll take a little while, but you can — and then start to bring that into your security workflows …
And part of the challenge with this thing is figuring out how you get from soup to nuts, right? You can get data externally, this type of threat feed data or risk list data, whatever you wanna call it. And there are places you can get some of it free, you can get some of it commercially. But then figuring out how you’re gonna bring it into your environment and fit it into either detection or intelligence workflows and get it all the way out to where it’s actually driving useful security actions? That’s the work that you gotta puzzle through.
Now, once you’re there, now you’re in a place to start scaling up and saying, “Do I need more feeds, bigger feeds, feeds with more context, feeds with coverage into areas that I have no insight around?” But you know, when I source that additional data or information into my organization? Now it’s actually gonna have useful security effect. Otherwise, you got a whole lot of stuff sitting on the doorstep and you’re still not clear how you’re gonna turn information into action.
My thanks to Matt Kodama for joining us today.
Before we let you go, don’t forget to sign up for the Recorded Future Cyber Daily email, and every day you’ll receive the top results for trending technical indicators that are crossing the web, cyber news, targeted industries, threat actors, exploited vulnerabilities, malware and suspicious IP addresses, and much more. You can find that at recordedfuture.com/intel.
You can also find more intelligence analysis at recordedfuture.com/blog.
We hope you’ve enjoyed this show and that you’ll subscribe and help spread the word among your colleagues and online. The Recorded Future Podcast team includes Coordinating Producer Amanda McKeon and Executive Producer Greg Barrette. The show’s produced by Pratt Street Media, with Editor John Petrik, Executive Producer Peter Kilpe, and I’m Dave Bittner from the CyberWire.
Thanks for listening.