Culturomics 2.0: A Response to UIC
By Chris on September 14, 2011
We read with interest the news analysis work published by University of Illinois researcher Kalev Leetaru that was syndicated by several outlets, including the BBC. The premise is very much in line with the Recorded Future thesis – media is moving online and there is data to be extracted from such media. At a high level, Leetaru’s methodology is not unlike our own: extracting entities, calculating sentiment, etc. His research does not appear to extract events or time references from the text, but it is still in the same vein. So, at headline level, we like it!
At Recorded Future we’ve been very careful to avoid claiming “predictions of the past”. We’ve stuck our necks out by making a few predictions, sometimes successful, sometimes not. We proudly herald our work on out-of-sample trading signals because there’s a clear outcome to measure against, and it’s pretty binary whether it works or not. But otherwise, we’re careful.
That said, there are several points in Leetaru’s work that we take issue with and that warrant closer inspection.
Sentiment and Tone
Leetaru calculates sentiment over the whole document, an approach that is heavily debated.
Consider this typical piece of news coverage from April 2011, which reports on updates from across the Middle East. This single document includes mentions of Libya, Tunisia, Egypt, and Syria, and contains many positive and negative phrases.
Here are several examples:
- “Tik has been released in Damascus, and wanted to pass on the good news!”
- “This will be brief, but at very very long last, we have good news to share…”
- “…well-armed, seemingly well-trained soldiers in full military attire…their very presence has boosted morale on the front line”
- “the 32nd Brigade, one of the best-equipped and trained units, had been sent early on Friday”
- “The destruction cannot be described.”
- “This graphic video is said to show protesters suffering violent physical reactions”
- “…beleaguered rebels battling Col. Muammar el-Qaddafi’s forces…”
- “…dozens of protesters have been killed recently.”
What will sentiment and tone in such a document tell us about Egypt in particular? We have some ideas for how to actually prove out this point and will be back on the subject (if we were good academics, we wouldn’t blog about it ahead, but…). 🙂
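To make the concern concrete, here is a minimal sketch in Python. It is not Leetaru's method and not ours: the word lists, the `ENTITY_HINTS` keyword map, and the scoring rule are all invented for illustration, and only four of the snippets quoted above are used. It compares one score for the whole document with scores attributed to the country each snippet actually mentions.

```python
import re

# Toy lexicon, invented for illustration only (an assumption, not a real model).
POSITIVE = {"good", "released", "boosted"}
NEGATIVE = {"destruction", "violent", "killed", "beleaguered"}

# Hypothetical keyword hints used to attribute a snippet to a country.
ENTITY_HINTS = {
    "Syria": ("damascus",),
    "Libya": ("qaddafi",),
    "Egypt": ("egypt", "cairo"),
}

def score(text):
    """Count positive lexicon hits minus negative lexicon hits."""
    words = re.findall(r"[a-z\-]+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def entity_scores(snippets):
    """Sum snippet scores per country; None means no snippet mentions it."""
    result = {}
    for entity, hints in ENTITY_HINTS.items():
        hits = [score(s) for s in snippets if any(h in s.lower() for h in hints)]
        result[entity] = sum(hits) if hits else None
    return result

snippets = [
    "Tik has been released in Damascus, and wanted to pass on the good news!",
    "The destruction cannot be described.",
    "beleaguered rebels battling Col. Muammar el-Qaddafi's forces",
    "dozens of protesters have been killed recently.",
]

document_score = score(" ".join(snippets))   # one number for everything
per_snippet = [score(s) for s in snippets]   # one number per snippet
per_entity = entity_scores(snippets)         # one number per country, if any

print(document_score, per_snippet, per_entity)
```

In this toy setup the single document-level number blends the good news from Damascus with the bad news elsewhere, and Egypt ends up with no evidence of its own at all.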
Despite this criticism, Leetaru makes some other very good points on sentiment and tone. So, plenty to learn from the paper.
Pinpointing Bin Laden I
The paper makes a very strong claim on where Bin Laden would have been located based on open sources:
“global news content would have suggested Northern Pakistan in a 200 km. radius around Islamabad and Peshawar as his most likely location, and that he was nearly twice as likely to be making his residence in Pakistan as Afghanistan”.
“Indeed nearly 49 percent of all articles mentioning Bin Laden included a city in Pakistan and both Islamabad and Peshawar rank in the top five non–Western cities associated with him.”
Now, remember Islamabad is the capital of Pakistan. A typical news story by the Dawn, Pakistan’s oldest and most widely read English-language newspaper, starts like this:
“ISLAMABAD, Sept 13: A number of National Assembly employees are reluctant to present their academic certificates for necessary verification as directed by the federal government.”
Islamabad is very likely to show up in Dawn stories simply as the dateline, whatever the story is about, and thus its presence is a weak signal when computing co-occurrences at the document level.
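One way to reduce this effect, sketched below in Python, is to strip the dateline before counting city mentions. The `DATELINE` pattern, the `CITIES` list, and the `city_mentions` helper are assumptions made up for this example; real datelines vary far more than this pattern allows, and this is not a description of how Leetaru's pipeline handles them.

```python
import re
from collections import Counter

# Hypothetical dateline pattern: "CITY, Mon DD:" at the very start of a story.
DATELINE = re.compile(r"^[A-Z][A-Z .'-]+,\s+[A-Z][a-z]+\.?\s+\d{1,2}:\s*")

# Illustrative city list (an assumption for this sketch).
CITIES = ("Islamabad", "Peshawar", "Abbottabad")

def city_mentions(story, strip_dateline=True):
    """Count city mentions in a story, optionally ignoring the dateline."""
    body = DATELINE.sub("", story) if strip_dateline else story
    lowered = body.lower()
    counts = Counter()
    for city in CITIES:
        n = lowered.count(city.lower())
        if n:
            counts[city] = n
    return counts

story = (
    "ISLAMABAD, Sept 13: A number of National Assembly employees are "
    "reluctant to present their academic certificates for necessary "
    "verification as directed by the federal government."
)

with_dateline = city_mentions(story, strip_dateline=False)
without_dateline = city_mentions(story)

print(with_dateline, without_dateline)
```

For the Dawn story quoted above, the only mention of Islamabad is the dateline itself, so the count drops from one to zero once the dateline is removed.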
Pinpointing Bin Laden II
Now, let’s assume that the approach on co-occurrences actually works. Consider this map:
Peshawar is about as far from Kabul (a bit farther, in fact) as it is from Abbottabad. Peshawar has 3.6M people, Islamabad 1.3M, and the Rawalpindi/Islamabad Metropolitan Area is the third largest in Pakistan, with over 4.5 million inhabitants.
What do I do with such information? It’s not very helpful operationally in my humble opinion. I was trained as an army ranger a long (very long) time ago, and this would not exactly qualify for targeting.
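For a sense of scale, a quick back-of-the-envelope calculation in Python gives the area covered by a single circle of that radius. This assumes a flat circle around one city and ignores terrain, borders, and the overlap between the Islamabad and Peshawar circles.

```python
import math

RADIUS_KM = 200  # radius quoted in the paper

# Area of a flat circle: pi * r^2
area_km2 = math.pi * RADIUS_KM ** 2

print(f"Area of one {RADIUS_KM} km circle: {area_km2:,.0f} km^2")
```

That comes to roughly 125,000 square kilometers to search.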
Pinpointing Bin Laden III
The big point on Bin Laden and Abbottabad from an open source analysis perspective, again from my point of view, is that before May 2, 2011, there were basically no documents on the internet mentioning Abbottabad and Bin Laden together. That’s actually pretty amazing.
Now of course, you may find that aggregation of information will show you interesting patterns beyond individual tidbits, but a 200 km radius is not that.
In conclusion, I’d like to say that Leetaru has presented overall awesome work, but a few adjustments would make it so much better. Open source or even web source intelligence has lots of promise for essentially every type of analysis. We’re hard at work on this at Recorded Future.