What Does Watson Predict for the Databases of the Future?

One of the greatest achievements in artificial intelligence occurred earlier this year when IBM's Watson supercomputer defeated the two reigning human champions in the popular Jeopardy! TV show. Named after IBM founder Thomas Watson and not, as you may have thought, Sherlock Holmes' famous assistant, Watson was the result of almost 5 years of intensive effort by IBM, and the intellectual successor to "Deep Blue," the first computer to beat a reigning world chess champion.

Finding information today may seem as easy as typing the question into Google or Bing. We take for granted the sometimes very complex processing that these systems undertake to return our search results, but it's pretty obvious that the results are primarily based on matching the words in our queries with the words in the documents.
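
To make the idea of word matching concrete, here is a minimal Python sketch. It is only an illustration: the function names and the scoring rule (count the distinct words a query and a document share) are assumptions made for this example, and real search engines do far more than this.

```python
def keyword_score(query, document):
    """Count how many distinct query words also appear in the document."""
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    return len(query_words & doc_words)


def keyword_search(query, documents):
    """Return the documents ordered from most to fewest shared words."""
    return sorted(documents, key=lambda doc: keyword_score(query, doc), reverse=True)
```

A search like this has no notion of what the words mean, which is why a subtly worded clue defeats it.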

Winning at Jeopardy! requires a more sophisticated approach. The solution to a Jeopardy! "anti-question" (Jeopardy! requires you to specify the question that matches the provided answer) can rarely be found by a simple web search. A winning player needs an encyclopedic knowledge of history, science and trivia, together with the ability to make associations between these facts and to interpret the subtly worded Jeopardy! clue.

Watson incorporates many unique and significant artificial intelligence (AI) innovations. At the core is a massively parallel natural language search capability built on top of a huge embedded database of facts - Watson is not permitted to search the Internet during its Jeopardy! game. The database includes a copy of Wikipedia, as well as numerous dictionaries, thesauruses and other reference material.

Watson uses both finesse and brute force to attempt to resolve Jeopardy! challenges. Over time, the raw facts in Watson's database have been mined to create a network of assumptions and associations that can be used to formulate possible interpretations of the challenge. Each of these hypotheses is refined and evaluated until the most likely solution is found.

However, there is a tremendously large number of possible answers for any given question, and sorting through them requires a lot of parallel computing power. Watson uses 90 IBM servers comprising almost 3,000 CPU cores and 16 terabytes of RAM, so it can perform thousands, or millions, of evaluations to quickly narrow down to the most likely answer.
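
The generate-and-evaluate approach described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not Watson's actual algorithm: the EVIDENCE table, the scoring rule and the function names are all hypothetical, and a small thread pool stands in for a cluster of servers.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical evidence store: candidate answer -> terms associated with it.
EVIDENCE = {
    "Thomas Watson": {"ibm", "founder", "business"},
    "Dr. Watson": {"sherlock", "holmes", "assistant"},
    "Deep Blue": {"ibm", "chess", "computer"},
}


def score_hypothesis(clue_terms, candidate):
    """Toy scorer: fraction of the clue terms supported by the candidate's evidence."""
    evidence = EVIDENCE[candidate]
    return candidate, len(clue_terms & evidence) / len(clue_terms)


def best_answer(clue_terms, candidates, workers=4):
    """Score every candidate concurrently and return the highest-scoring pair."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scored = list(pool.map(lambda c: score_hypothesis(clue_terms, c), candidates))
    return max(scored, key=lambda pair: pair[1])
```

Because each candidate is scored independently of the others, the work divides naturally across as many workers as are available, which is the property that makes brute force practical at this scale.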

There are many uses for applications that can process vague and ambiguous human language queries and find the correct solution from vast amounts of data. It's quite possible to imagine a Watson-type system providing technical support, or being built into an application's online help. There are also obvious applications in web search, and in medical diagnosis (imagine TV's Dr. House in a box).

From a database perspective, however, Watson is perhaps most interesting as an example of an important new application type that probably can't be constructed using relational database techniques. Relational databases can store facts - Wikipedia, for instance, is stored in a fairly traditional relational database schema - but Watson needs to be able to apply thousands of simultaneous non-trivial computations to the data. Relational databases can support parallel processing, but typically only by parallelizing relatively simple data comparisons, and almost always at much lower levels of parallelism. Each thread of Watson processing performs complex semantic evaluations - much more than simply scanning a table looking for matching values.

To support this huge amount of parallel processing, Watson's designers chose a variation of the Hadoop framework - the massively parallel open source framework that is increasingly popular for managing large amounts of unstructured data.
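
For readers unfamiliar with the programming model that Hadoop popularized, the following Python sketch shows the map, shuffle and reduce steps in miniature. It runs in a single process and does not use the Hadoop API; the function names and the task (counting how many documents mention each clue term) are assumptions chosen purely for illustration.

```python
from collections import defaultdict


def map_phase(document, clue_terms):
    """Emit a (term, 1) pair for each clue term found in one document."""
    words = set(document.lower().split())
    return [(term, 1) for term in clue_terms if term in words]


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the values for each key into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def run_job(documents, clue_terms):
    """Run the three steps in sequence over every document."""
    pairs = []
    for document in documents:  # on a cluster, each document would be mapped on a separate node
        pairs.extend(map_phase(document, clue_terms))
    return reduce_phase(shuffle(pairs))
```

The map step needs nothing but its own document, so a framework can spread it over many machines and only bring the results together in the reduce step.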

Watson represents one possible model for the seemingly intelligent computers that will power increasingly sophisticated web searches and expert systems in the near future - and one in which non-relational databases play a key role.