Does It Burn When You Leak? Using Threat Intelligence to Find Lost Data

Posted: 19th December 2018

By: ZANE POKORNY

Editor’s Note: The following blog post is a summary of a presentation from RFUN 2018 featuring Zachary Hinkel, global cyber threat manager at Hogan Lovells.

Cybersecurity professionals are always on the lookout for intentional exfiltration, but what about accidental leakage?

At this year’s Recorded Future User Network (RFUN) conference, security expert Zachary Hinkel gave a presentation on how Recorded Future can be used to baseline company trends to detect outliers that need to be investigated. He looked at one uncommon example in particular of how credentials and data can leak out of networks — a major leak of credentials on GitHub a few years ago.

‘You Turned Them on and the Lights All Dimmed’

Now the global cyber threat manager at Hogan Lovells, a major international law firm, Hinkel has been in the cybersecurity business for a long time. He got hooked on computers from a young age, getting an Apple IIe when he was just six years old. Apple computers these days are instantly recognizable for their sleek, Space Age looks and tamper-resistant assemblages, a design aesthetic perhaps meant to elevate their status to that of Platonic Ideal — witness their perfection, sprung fully formed from the Jobhead, unrequiring of alteration. But back in the Reagan Eighties, computers — even Apple ones — were a more unregulated affair.

One of Hinkel’s earliest computer-related memories was watching his stepfather solder extra RAM into his Apple IIe. That kind of modularity is unimaginable in Apple products today, with nearly every component already soldered directly to the logic board in an effort to save microns of space, and bespoke Pentalobe security screws (“Pentalobe” being an invented word that evokes, as blandly as possible, how their drive slots resemble a flower with five petals in full bloom) installed to defeat any amateur tampering efforts.

The DIY attitude of Hinkel’s stepfather, who ran a computer company selling automation software to convenience stores, set him on an early path toward cybersecurity. His stepfather’s work meant that he brought home plenty of old computers for Hinkel to tinker with, old leviathans like monochrome IBM machines “that you turned on and the lights all dimmed” and “just filled the house full of, like, tar and nicotine.”

That interest in tinkering quickly developed into a desire to connect with other like-minded people, and as a teenager, Hinkel started seeking them out on bulletin board systems (BBSes), early online forums that users connected to through a terminal program. In many ways, BBSes served as precursors to features we associate with the modern internet, providing services that resembled primitive versions of the social media and messaging that we now take for granted.

Hacking in the Reagan Eighties

The internet of Hinkel’s childhood was not one with an abundance of internet service providers running high-speed lines out to every podunk town in middle America — well, maybe not a lot has changed in that regard. That meant Hinkel had to go to his local library, which had a computer with a dial-up connection, to get online. It’s there that he got his first taste of cybersecurity.

“I learned [that], by dialing to the local library, I could actually dial in to the local university, and in the original Linux distribution, the root password was blank,” Hinkel says. “So I would find machines running Linux and I would start a PPP session” — point-to-point protocol, an older communications protocol used over telephone lines that can directly link two nodes without an intermediary — “all the way back through the library, back to my home computer; and that’s [how] I figured out how to get on the internet.”

That connection led him to services using Internet Relay Chat (IRC), another protocol for text communication that was more popular back then. IRC is a system that allows for group chats over channel forums, as well as private, one-to-one messaging and file sharing: basically Slack, but open source, and beating Slack to the marketplace by a few decades. IRC use has steadily declined since the 2000s in favor of more modern protocols, but it still has its fair share of ardent users, particularly among programmers.

During its heyday, the moderators of a single IRC server or channel were gods, able (and willing) to kick out or ban users for minor or even perceived infractions. One quirk about IRC, however, was its vulnerability to netsplitting, a phenomenon where a sudden disruption between any two nodes in the network would split the entire thing into multiple pieces before it would try to patch itself back together. During a netsplit, everybody is momentarily kicked off the network, and when it attempts to reunify, some forms of redundant data (say, people with the same nicknames) will also be merged. When that happens to users, both are disconnected from the server, regardless of who the original is and who’s imitating them. Clever and unscrupulous users would intentionally cause netsplits, using techniques that were essentially an early form of distributed denial-of-service (DDoS) attacks, to dethrone moderators and take over channels for themselves. It was in this environment that Hinkel was introduced to many cybersecurity concepts that he applies in his work today.

Learning to Love Metrics

This early start in cybersecurity took an abrupt turn after college into a short career as a pilot — but Hinkel quickly got pulled back into the cybersecurity world. “I loved flying, but at the same time, I felt kind of like — not to bash anybody that's a pilot — but I was just getting people from point A to point B,” Hinkel says. “It was the job that, if I called in sick, there were a thousand other people that could do it.”

So he returned to his roots. After some contract work with a red-team group, Hinkel eventually heard from a friend who was starting up a company and sought his security expertise. It would be a defense role — blue-team business.

“Oh, that sucks. That’s a no. That’s boring,” Hinkel recalls saying. “I love breaking into things, I love the challenge of getting into something. Defending is just not what I’m interested in.”

But Hinkel said “yes” anyway. While blue-teaming never fully became his thing, it led to a series of other jobs in cybersecurity, including contracting work for the government and Fortune 100 companies. He eventually ended up working in the financial sector, where he was first introduced to Recorded Future — as well as “a horrible thing called ‘metrics,’” he says. “I don’t know how many people like metrics, but at the time, it felt like job justification.”

He quickly changed his mind, however, becoming obsessed with the visualization tools offered in Grafana, an analytics and data monitoring platform he used at work. Soon, everything in his life became a metric needing to be measured. Everything. Here are some things Hinkel used Grafana to visualize in his home:

His home network traffic
The number of VPN users on his network
The number of times his young children used the bathroom daily, subdivided into two helpful content categories — number one and number two
The battery levels of the electronic devices in his house
The battery levels of his RV as it sat in his driveway

“I literally ended up building a four-video wall screen in my office just of Grafana dashboards,” he says.

Of course, he also focused on metrics in his actual work. He notes the importance of effective visualizations and narratives when sharing research with other members of your organization, particularly upper-level management, like executives, who may not have a strong technical background. Visualizations can clarify, or they can deceive — but whatever narrative you present, Hinkel notes, it’s important that you present it in a way you can answer to. “[Executives] are going to ask questions, and you can build the most beautiful graphs in the world, but you have to be ready for those questions. You have to be able to provide answers.”

GitHub’s Credentials Leak

One key metric Hinkel kept track of was how often his company was discussed on the web, and who was doing the talking. A few years ago, he noticed a huge spike in mentions of his company’s name on sites like GitHub.
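As a rough illustration of that kind of baselining, here is a minimal Python sketch that flags days where mention counts jump well above a trailing average. It is not how Recorded Future or Hinkel's dashboards compute anything; the function name `find_spikes`, the seven-day window, the three-sigma threshold, and the sample counts are all assumptions made up for the example.

```python
from statistics import mean, stdev


def find_spikes(daily_mentions, window=7, threshold=3.0):
    """Return indices of days whose count sits far above the trailing baseline."""
    spikes = []
    for i in range(window, len(daily_mentions)):
        baseline = daily_mentions[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        # Use a floor of 1.0 so a perfectly flat baseline does not flag tiny changes.
        limit = mu + threshold * max(sigma, 1.0)
        if daily_mentions[i] > limit:
            spikes.append(i)
    return spikes


# Hypothetical daily mention counts; the last value mimics a sudden surge.
counts = [4, 6, 5, 7, 5, 6, 4, 5, 6, 120]
print(find_spikes(counts))  # prints [9]
```

A trailing window keeps the baseline local, so a slow, legitimate rise in mentions is less likely to trip the alarm than a one-day surge.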

What was the cause of the spike? In exploring the possibilities, Hinkel posed a quick thought experiment to the audience: “Think about your network and where people put in their username and password the most, besides logging in to their individual assets.” He notes that there’s “no right or wrong answer” because “everybody’s set up differently.” But for everyone, there is an answer.

In the case of his company, what employees were logging in to the most was a proxy server. Because many tools they were using didn’t automatically read proxy configurations, people at the company needed to set them manually. They most commonly did that when using Android Studio, an integrated development environment for making Android applications. Now, the sorts of people who would regularly use Android Studio would also probably spend some time on GitHub, a hosting service and code repository where programmers regularly share code. But why his company would suddenly start being mentioned so many more times on GitHub — something like 230 pages of results with 50 results a page of credentials — was a mystery.

GitHub wasn’t helping solve that enigma, either. It had previously let anyone search across all of its code repositories for just about any piece of code, so curious users could look up a domain, a piece of software, or any code that mentioned them. It later rolled those search capabilities back to only allow for searching repositories by name (it has since reached a middle ground, allowing for a more robust search but still excluding some queries). For Hinkel, that meant he had to turn to Recorded Future to dive deeper.

Recorded Future allowed for much richer and more intelligent searches over data scraped from the bowels of GitHub. This led Hinkel to discover a serious design flaw in Android Studio where, after users put in their proxy information, every project started in Android Studio copied that same configuration into the source code of the project. That meant that every time the code was uploaded onto GitHub, it contained those same proxy credentials. It was a flaw that affected the data of many major companies, including power players like Google, IBM, and Microsoft. “Every single Android Studio project that you could find online at the time, if they used a proxy on there, it actually had it in [GitHub],” Hinkel says.
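To make the risk concrete, here is a small Python sketch of a check a team could run over a project file before pushing it anywhere public. The talk does not say which file held the proxy settings. This example assumes it looks like a Gradle properties file using Gradle's `systemProp.http.proxyUser` and `systemProp.http.proxyPassword` keys, and the function name `find_proxy_credentials` and the sample contents are invented for illustration.

```python
import re

# Assumption: credentials appear as Gradle-style proxy properties.
# Host and port lines are ignored; only user and password lines are flagged.
PROXY_KEY = re.compile(
    r"^\s*systemProp\.(https?)\.proxy(User|Password)\s*=\s*(\S.*)$"
)


def find_proxy_credentials(text):
    """Return (line_number, key) pairs for lines that set a proxy user or password."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = PROXY_KEY.match(line)
        if match:
            scheme, field, _value = match.groups()
            # Report the key only, so the secret itself never lands in a log.
            hits.append((number, f"{scheme}.proxy{field}"))
    return hits


# Hypothetical file contents, not taken from any real project.
sample = """\
org.gradle.jvmargs=-Xmx1536m
systemProp.http.proxyHost=proxy.example.com
systemProp.http.proxyPort=8080
systemProp.http.proxyUser=jdoe
systemProp.http.proxyPassword=hunter2
"""
print(find_proxy_credentials(sample))
# prints [(4, 'http.proxyUser'), (5, 'http.proxyPassword')]
```

Run as a pre-commit step, a check like this would catch the copied configuration before it ever reached a public repository.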

Now, as mentioned before, many of the people who use Android Studio also tend to share their code on sites like GitHub, and here, Hinkel identified a clear connecting line between the two that would explain the leaks of credentials on GitHub. But one thorny issue remained: Hinkel’s company wasn’t using GitHub to store its code.

The Long Shelf Life of Data

In the age of the internet, it’s easy to become neglectful of just how long the shelf life of data can be. As the access and transfer of knowledge has gotten faster and easier over time, it’s also seemingly become more ephemeral and less significant. Our distant and mighty ancestors painted records of their hunts on stone walls and carved tabulations of grain harvests on clay tablets, leaving indelible records for us to find thousands of years later. Their descendants strove to make things a little easier, writing with ink on papyrus and vellum and eventually stamping out books en masse on printing presses. And the transfer of knowledge got a little cheaper and quicker. Some things were lost along the way: when you’re chiseling a message into stone letter by letter, you probably spend a little more time deciding what’s worth saying and how you’re going to say it than when you can just crumple up a sheet of paper and try again, let alone when you can just tap the backspace button and rearrange a few electrons before you hit send, letting all your Twitter followers know in an angry diatribe just how bad the service was last night at the new Italian restaurant in the North End. Because nobody’s really going to read that stuff more than a couple days after we shared it, right? It’s lost in the electric ether.

Yes and no. But also, yes. Lacking a physical form, digital data has that uncanny and unnatural property of being able to be duplicated infinitely (storage issues notwithstanding). Once something’s out there, especially if it’s something that has value — your opinions on Italian food, maybe not so much, but your other personally identifiable information, very much so — it’s probably out there for good. Individual archives may be expunged, but things of value will be duplicated and collected by those who value them. In the case of personally identifiable information, everyone from advertisers to cybercriminals is keenly interested in keeping track.

But if you’ve got the money and power, you can make some data very difficult to find. This is especially true if you’re one of the internet’s gatekeepers, like Google, which dealt with the GitHub credential leaks by simply redacting as much of that information from its searches as possible. Go search for any of it, and “you’ll still see things that are educational, like how to set it up, how to set up your projects, anything that might reference it,” Hinkel explains, “but anything that actually has a password in it, or actually has proxy gateways, they've all been pulled out.”

Searching With Recorded Future

This provided Recorded Future with another opportunity to set itself apart — that data had already been trawled and incorporated into Recorded Future’s own databases. Recorded Future gathers data from technical sources and places on the dark web alongside the usual open sources, but even if you really only care about open source information, you still may not find everything you’re looking for if you only rely on search engines like Google. “Why do I need Recorded Future when I can go to use Google? This is one of those best cases to demonstrate,” Hinkel explains. “There’s things out there that you can find [using Google], but you’re not getting the depth of the search. You’re not covering everything that Recorded Future can find.”

He gives the example of finding malicious content that sometimes quickly — even automatically — gets taken down on code repositories. “We monitor Pastebin very, very well [...] for malware configurations,” Hinkel says. “Sometimes they’re up there for a minute and Recorded Future will grab it even when the Pastebin search won’t.”

However speedy and robust the searches in Recorded Future might be, Hinkel warns that you should never just use searches as a mirror to gaze at yourself. “One thing that many people fail to look at when they start doing things in Recorded Future is, they're just looking about information for themselves,” he says. “But if you have supply lines or vendors or people that regularly interact with you, it's also a good indicator to look at them. [...] We have Recorded Future alerts and other things that monitor for not only our names, but also our suppliers.”

It’s that kind of searching that eventually revealed the true source of the GitHub leaks — Atlassian, a company that produces project management software. Atlassian had made a special agreement with GitHub so that when you logged in to one, you had to make an account for the other. Hinkel’s company was only using Atlassian products for internal repositories, but the default setting for Atlassian was to push all the code produced for Android Studio to GitHub automatically. Simple, but easy to miss — and about as secure as a sieve.