Recorded Future White Paper
By Jason Hines on August 31, 2009
Thy letters have transported me beyond
This ignorant present, and I feel now
The future in the instant. (Macbeth, Act 1 Scene 5)
This white paper describes the underlying philosophy and overall system architecture of Recorded Future and its products.
Search, the Third Generation
The history of search goes back to at least 1945, when Vannevar Bush published his seminal article “As We May Think”, where among other things he pointed out that:
The difficulty seems to be, not so much that we publish unduly in view of the extent and variety of present day interests, but rather that publication has been extended far beyond our present ability to make real use of the record. The summation of human experience is being expanded at a prodigious rate, and the means we use for threading through the consequent maze to the momentarily important item is the same as was used in the days of square-rigged ships.
In the decades to follow, a lot of work was done on information management and text retrieval / search. With the emergence of the World Wide Web, both the need and the ability for almost everyone to use a search engine became obvious.
An explosion of search engines followed, with names such as Excite, Lycos, Infoseek, and AltaVista. All of these first generation search engines focused on traditional text search, using various algorithms but looking at individual documents in isolation.
Google changed that with its public debut in 1998. Google’s second generation search engine is based on ideas from an experimental search engine called BackRub. At its heart is the PageRank algorithm [REF], and this is the core of Google’s success (together with clever advertising-based revenue models!). The main idea of the PageRank algorithm is to analyze links between web pages, and to rank a page based on the number of links pointing to it and (recursively) the rank of the pages pointing to it. This use of explicit link analysis has proven to be tremendously useful and surprisingly robust (even though Google continuously has to tweak its algorithms to combat attempts to manipulate the ranking).
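The core PageRank idea can be sketched in a few lines. This is an illustrative toy implementation of the published algorithm (power iteration with a damping factor), not Google's production system; the damping value and iteration count are conventional defaults.

```python
# Toy PageRank sketch: a page's rank is distributed among the pages it
# links to, and the process is iterated until the ranks stabilize.

def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline rank (the "random surfer").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:
                continue
            # A page shares its current rank equally among its out-links.
            share = damping * rank[page] / len(outgoing)
            for target in outgoing:
                new_rank[target] += share
        rank = new_rank
    return rank

# A page with many incoming links ("a" here) ends up ranked highest.
ranks = pagerank({"a": ["b"], "b": ["a"], "c": ["a"], "d": ["a"]})
```

Note the recursion the text describes: "a" ranks highly not only because three pages link to it, but because "b", which it links to, inherits rank from "a" and feeds it back.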
Recorded Future is developing a third generation search engine, which goes beyond explicit link analysis and adds implicit link analysis, by looking at the “invisible links” between documents that talk about the same, or related, entities and events. We do this by separating the documents and their content from what they talk about – the “canonical” entities and events (yes, this model is heavily inspired by Plato and his distinction between the real world and the world of ideas). Documents contain references to these canonical entities and events, and we use these references to rank canonical entities and events based on the number of references to them, and on the credibility of the documents (or document sources) containing these references. Co-occurrence of different events and entities in the same or in related documents is also used for ranking – i.e. an event co-occurring with an important event is likely to be of some importance too.
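To make the "implicit link" idea concrete, here is a deliberately simplified sketch. Documents are not linked to each other; instead each one references the canonical entities it mentions. An entity's score combines the credibility of the documents referencing it with a boost from the entities it co-occurs with. The scoring formula and the `co_weight` parameter are illustrative assumptions, not Recorded Future's actual ranking algorithm.

```python
from collections import defaultdict

def rank_entities(documents, co_weight=0.3):
    """documents: list of (source_credibility, set_of_entities_mentioned)."""
    # Base score: sum of the credibilities of the referencing documents.
    base = defaultdict(float)
    for cred, mentioned in documents:
        for entity in mentioned:
            base[entity] += cred
    # Co-occurrence boost: an entity gains a fraction of the base score
    # of every entity it appears together with, weighted by the
    # credibility of the document in which they co-occur.
    score = dict(base)
    for cred, mentioned in documents:
        for entity in mentioned:
            for other in mentioned:
                if other != entity:
                    score[entity] += co_weight * cred * base[other]
    return score

docs = [
    (0.9, {"Google", "PageRank"}),   # high-credibility source
    (0.4, {"Google", "AltaVista"}),  # lower-credibility source
    (0.4, {"AltaVista"}),
]
scores = rank_entities(docs)
```

Note how "PageRank", mentioned only once, still outscores "AltaVista", mentioned twice: it co-occurs with a highly ranked entity in a credible document.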
In addition to extracting event and entity references, Recorded Future also analyzes the “time and space dimension” of documents – references to when and where an event has taken place, or even when and where it will take place – since many documents actually refer to events expected to take place in the future.
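A minimal sketch of this time-dimension analysis: pull explicit date references out of a text and classify each one as past or future relative to the document's publication date. A production system would handle far richer temporal expressions ("next quarter", "by the end of 2010", relative dates, and so on); the single ISO-date pattern here is an assumption for demonstration only.

```python
import re
from datetime import date

# Illustrative pattern: only matches explicit ISO-style dates.
DATE_PATTERN = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")

def extract_time_references(text, published):
    """Return (date, 'past'|'future') for each date mentioned in text,
    classified relative to the document's publication date."""
    refs = []
    for match in DATE_PATTERN.finditer(text):
        y, m, d = (int(g) for g in match.groups())
        when = date(y, m, d)
        tense = "future" if when > published else "past"
        refs.append((when, tense))
    return refs

snippet = "The merger closed on 2009-03-01 and the product launches on 2010-06-15."
refs = extract_time_references(snippet, published=date(2009, 8, 31))
```

Even this crude version captures the key point: a document published today can carry information about events expected tomorrow.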
The semantic text analysis needed to extract these entity, event, time, and location references can be seen as an example of a larger trend towards creating “the semantic web”.
The time and space analysis described above is the first way in which Recorded Future can make predictions about the future – by aggregating weighted opinions about the likely timing of future events using algorithmic crowdsourcing. In addition to this, we can use statistical models to predict future happenings based on historical records of chains of events of similar kinds.
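The "algorithmic crowdsourcing" of timing opinions can be sketched as follows: several sources state when they expect a future event to happen, and we aggregate those opinions into one estimate by weighting each one by the credibility of its source. The credibility-weighted-mean scheme below is a simple assumption chosen for illustration; the source labels are hypothetical.

```python
from datetime import date, timedelta

def aggregate_timing(opinions):
    """opinions: list of (source_credibility, predicted_date).
    Returns the credibility-weighted mean of the predicted dates."""
    origin = date(2000, 1, 1)  # arbitrary fixed reference point
    total_weight = sum(cred for cred, _ in opinions)
    weighted_days = sum(
        cred * (predicted - origin).days for cred, predicted in opinions
    )
    return origin + timedelta(days=round(weighted_days / total_weight))

opinions = [
    (0.9, date(2010, 3, 10)),  # e.g. a trusted newswire
    (0.5, date(2010, 3, 20)),  # e.g. a blog post
    (0.1, date(2010, 5, 1)),   # e.g. a forum rumor
]
estimate = aggregate_timing(opinions)
```

The low-credibility outlier barely moves the estimate, which stays close to the dates given by the more trusted sources.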
The combination of automatic event/entity/time/location extraction, implicit link analysis for novel ranking algorithms, and statistical prediction models forms the basis for Recorded Future’s third generation, analytic, search engine. Our mission is not to just help our customers find documents, but to enable them to get evidence for things happening in the world.
To illustrate these ideas, we’ll present a simple example. Assume we have a set of different sources from the net, as illustrated in this picture:
From these sources, we harvest documents, either from RSS feeds or other forms of web harvesting. An example data set might contain the following documents with short text snippets in them:
Our analysis first detects entities mentioned in the document, and decides which entity category they belong to (in this example, blue for Companies, orange for Persons, and green for Cities):
Next, events involving these entities are detected; in this example five different kinds of events:
These are the canonical events; we now add event references / instances derived from the different documents (and the same for entity instances, but for the sake of graphical clarity these are not included in these pictures):
Once this analysis is completed, we can actually dispose of the original texts, since we have completed the transition from the text to the data domain:
Since we have information about time, we can add more relations to our database, e.g. about which event instances precede others, as indicated by green arrows in this picture:
This completes the transition from the text domain of documents to the “idea world” of canonical events and entities, references to/instances of these, and relationships between these instances. Once this vital step is taken, all kinds of analysis can be used to further enrich the data set, and allow both algorithmic models and human users to explore the data and its implications in various ways.
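The "idea world" the walkthrough arrives at can be sketched as a small data model: canonical entities and events, instances of them tied back to source documents, and time-ordering relations between the instances (the green arrows in the pictures). All field names here are illustrative assumptions, not Recorded Future's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Entity:
    name: str
    category: str            # e.g. "Company", "Person", "City"

@dataclass
class EventInstance:
    canonical_event: str     # e.g. "Acquisition"
    entities: list           # the entities involved in this instance
    document_id: str         # reference back to the source document
    when: str                # extracted time reference (ISO date)

# Relations between instances: (earlier_instance, later_instance) pairs.
precedes = []

merger = EventInstance("Acquisition", [Entity("Acme", "Company")],
                       document_id="doc-17", when="2009-03-01")
launch = EventInstance("ProductLaunch", [Entity("Acme", "Company")],
                       document_id="doc-42", when="2010-06-15")

# ISO dates sort lexicographically, so string comparison orders them.
if merger.when < launch.when:
    precedes.append((merger, launch))
```

Note that each instance keeps only a `document_id`, not the document text itself – matching the point above that the original texts can be disposed of once the analysis is complete.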
The Recorded Future system contains many components, which are summarized in the following diagram:
The system is centered around the database, which contains information about all canonical events and entities, together with information about event and entity references (sometimes also called instances), the documents containing these references, and the sources from which these documents were obtained.
There are four major blocks of system components working with this database:
– Harvesting – in which text documents are analyzed to detect event and entity instances, time and location, text sentiment etc. This is the step that takes us from the text domain to the data domain.
– Processing – in which data is analyzed to obtain more information; this includes ranking of events, entities, documents and even sources, as well as synonym detection, ontology analysis and anomaly detection. The rankings we compute for events and entities are, of course, also used in the presentation/user interfaces.
– Prediction – in which different statistical and AI based models are applied to the data to generate predictions about the future, based either on actual statements in the texts or other models for generalizing trends or hypothesizing from previous examples.
– Experience – the different user interfaces to the system, including the search interface, overview dashboard, alert mechanisms, and the high level time oriented query language TSQL.
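The four blocks above can be sketched as a pipeline around the central database. Every function body below is a placeholder assumption – the point is the flow of data from text to presentation, not the actual implementations (real harvesting, ranking, and prediction are far richer, and the naive first-word "entity extraction" here is purely illustrative).

```python
# Central database: canonical data shared by all four component blocks.
database = {"documents": [], "references": [], "rankings": {}, "predictions": []}

def harvest(raw_texts):
    """Harvesting: text domain -> data domain (placeholder extraction)."""
    for i, text in enumerate(raw_texts):
        database["documents"].append({"id": i, "source": "rss"})
        # Toy stand-in for entity extraction: take the first word.
        database["references"].append({"doc": i, "entity": text.split()[0]})

def process():
    """Processing: enrich the data, e.g. rank entities by reference count."""
    for ref in database["references"]:
        rankings = database["rankings"]
        rankings[ref["entity"]] = rankings.get(ref["entity"], 0) + 1

def predict():
    """Prediction: apply models to the enriched data (placeholder)."""
    if database["rankings"]:
        top = max(database["rankings"], key=database["rankings"].get)
        database["predictions"].append(f"{top} likely in the news again")

def experience():
    """Experience: the user-facing views over the database."""
    return database["rankings"], database["predictions"]

harvest(["Google announces ...", "Google acquires ...", "Lycos fades ..."])
process()
predict()
rankings, predictions = experience()
```

The design point this illustrates is that the blocks never talk to each other directly: each one reads from and writes to the shared database, which is what makes the system extensible.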
This has been a brief introduction to how Recorded Future works – we will return to dive deeper into algorithms, ranking methodologies, predictions, extensibility, etc.
Staffan Truvé, [email protected]
 We do keep references to the original documents, but we do not need to store the entire text – this is good both from a storage cost point of view and for IPR reasons.