Data and its analysis have become an important economic battleground for many industries, and nowhere is this more apparent than in the financial industry. Regulation is mandating greater data transparency across firms and trading practices. The increase in automated trading and the continuing search for new trading opportunities have led to exponential increases in the amount of data that must be captured, cleaned, managed and analyzed within a financial institution. To give some idea of the scale of the problem, the Options Price Reporting Authority (OPRA) in the U.S. is anticipating trade volumes at peak levels of around one million messages per second by mid-2008. Real-time data processing and the ability to store that data for historic analysis have become particular pressure points for many investment banks, asset managers and hedge funds.
Vast volumes of real-time data may seem very specific to financial markets, but the importance and use of real-time data are growing in other industries too. Consider a supermarket: not exactly a hotbed of financial dealings (although it may seem equally tense, especially on a Saturday morning), but large chains are increasingly concerned with real-time inventory management to ensure the shelves are always stocked. Although this is not a large concern for an individual store, managing it across hundreds of stores in a chain is more challenging and interesting. The data is also analyzed to see what is selling well or poorly, which promotions are working and whether the latest store layout is bringing in more business. The trend toward real-time business intelligence is challenging end-of-day business intelligence solutions built upon more traditional data warehousing technology.
Data Getting Ahead of Technology?
Despite the obvious increase in processing power over the past 20 years, it is still frustrating that some technologies don’t seem to change. For example, loading a spreadsheet still seems to take the same amount of time, if not more, than it did in the early 1990s. The things of real value in an application do not seem to have improved much at all, yet new versions of applications are constantly released that feign improvement. The discussion over “software bloat” is ongoing, and there is some truth to it: software engineers find it easy to consume processing power on new features that users may not perceive as benefits.
In contrast, the financial markets are in an unusual period, with the speed of data temporarily overtaking the speed of technology. Traditional database technology struggles to keep up with the data update rates, especially in conjunction with the need to query data in real time. Faced with these challenges, there are a number of approaches that the financial industry is adopting:
* Hosting conventional databases in memory, rather than on disk
* Installing specialist, high-performance (but proprietary) database technology
* Using high-performance computing (HPC) to distribute analysis load
* Distributed data caching, also known as “data fabrics”
Currently, the most common solution for storing large amounts of historical real-time data seems to be to split the problem across two databases - one held in memory for data in use and the other on disk for historical data not currently in use. In this scenario, data is bulk-copied from the in-memory database to the historic database on disk. Hence, the issue of disk input/output is avoided when capturing the real-time updates, but at the cost of having to join across two databases if a desired query involves analyzing real-time data against historic trends.
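As a rough illustration of this two-tier pattern, here is a minimal Python sketch using SQLite: an in-memory database absorbs the live updates, a bulk copy flushes them to an on-disk history file, and a cross-database query pays the join cost described above. The table layout and file name are illustrative assumptions, not a description of any particular product.

```python
import os
import sqlite3

HIST_PATH = "tick_history.db"  # assumed on-disk location for the demo
if os.path.exists(HIST_PATH):
    os.remove(HIST_PATH)       # start the illustration from an empty history

# Tier 1: in-memory database capturing real-time updates.
# (Autocommit mode keeps the sketch short.)
realtime = sqlite3.connect(":memory:", isolation_level=None)
realtime.execute("CREATE TABLE ticks (instrument TEXT, ts INTEGER, price REAL)")

# Tier 2: on-disk database holding data not currently in use.
historic = sqlite3.connect(HIST_PATH, isolation_level=None)
historic.execute("CREATE TABLE ticks (instrument TEXT, ts INTEGER, price REAL)")

def capture(instrument, ts, price):
    # Hot path: real-time inserts never touch the disk.
    realtime.execute("INSERT INTO ticks VALUES (?, ?, ?)", (instrument, ts, price))

def flush_to_disk():
    # The periodic bulk copy from the in-memory tier to the historic tier.
    rows = realtime.execute("SELECT * FROM ticks").fetchall()
    historic.executemany("INSERT INTO ticks VALUES (?, ?, ?)", rows)
    realtime.execute("DELETE FROM ticks")

def query_all(instrument):
    # The cost of the split: a query spanning live and historic data must
    # join across both databases (via SQLite's ATTACH in this sketch).
    realtime.execute("ATTACH DATABASE ? AS hist", (HIST_PATH,))
    try:
        return realtime.execute(
            "SELECT ts, price FROM ticks WHERE instrument = ? "
            "UNION ALL "
            "SELECT ts, price FROM hist.ticks WHERE instrument = ? "
            "ORDER BY ts",
            (instrument, instrument),
        ).fetchall()
    finally:
        realtime.execute("DETACH DATABASE hist")
```

In practice the flush would run on a schedule or at a capacity threshold, but the shape of the trade-off is the same: cheap captures, more expensive unified queries.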
Many of the database products in this area are proprietary in nature, having been designed explicitly for historic storage to optimize update and retrieval times (at speeds several orders of magnitude greater than mainstream relational solutions). These proprietary solutions deliver high performance, but often at the cost of usability, ease of replication and other everyday tasks that technologists know and expect from more mainstream database technologies.
Not quite so mature is the use of HPC (grid/clustering) in conjunction with high-performance database technology to achieve real-time parallelization of data/calculation load across a group of computers. Here I am talking about real-time distribution of ad hoc query loads, not the traditional HPC usage of huge-scale batch processing of a mathematical problem. Strongly related to HPC usage is the increasing popularity of “data fabrics” - in-memory caches of data used to ensure that a cluster does not become “data-bound,” with its overall performance inhibited by the slow provision of data to each node in the cluster.
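The fan-out-and-merge idea behind distributing ad hoc queries can be sketched in a few lines. The following Python sketch fakes a three-node cluster with thread workers, each holding its own partition of data in memory (the data-fabric idea, so no node waits on a central store); the node layout, instrument symbols and query are illustrative assumptions, not a real product API.

```python
from concurrent.futures import ThreadPoolExecutor

# Three "nodes", each an in-memory cache holding its partition of
# price histories, keyed by instrument symbol (illustrative data).
nodes = [
    {"VOD": [100.0, 100.5, 101.0], "BP": [520.0, 519.5]},
    {"HSBA": [910.0, 912.0]},
    {"AZN": [2650.0, 2648.0, 2655.0]},
]

def local_query(cache, predicate):
    # Runs on one node, against its own in-memory partition only.
    return {sym: prices for sym, prices in cache.items() if predicate(sym, prices)}

def distributed_query(predicate):
    # Fan the ad hoc query out to every node in parallel, then merge.
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(lambda cache: local_query(cache, predicate), nodes))
    merged = {}
    for part in partials:
        merged.update(part)
    return merged

# Ad hoc query: instruments whose latest price exceeds their first price.
rising = distributed_query(lambda sym, prices: prices[-1] > prices[0])
```

A real deployment would use processes on separate machines rather than threads, but the structure - local scans over in-memory partitions, followed by a merge - is the point of the sketch.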
Data Getting Ahead of Users?
So while real-time data is keeping the technologists busy, spare a thought for the consumers of both the data and the software - the end-users. In finance, as in other fields, a level of distrust often exists between the end-users and the technology staff delivering systems. This tension is exacerbated by the profits at stake, short deadlines and the market expertise that users have and IT staff often only partially understand.
As a result of this disconnect, end-users resort to Excel spreadsheets to store, analyze and report on data. Excel acts as a pressure-relief valve for end-users, working around the fact that IT delivery timelines do not meet business needs. It is not necessary to be a technologist to use Excel, and as such, it has become the definitive tool of “end-user” computing.
Traders have been using spreadsheets for years, but in the same way that technologists have been challenged by rising data volumes, so has this traditional tool of the end-user. The latest version of Excel has one million rows, but even this is not enough to hold more than a few days of market data for particular financial instruments. In addition, traders may have to look across many thousands of instruments spread across many markets. Even if technologists can develop systems to handle the data volumes, the end-users cannot use their traditional analysis tool to get a complete view of the data and the opportunities available.
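The arithmetic behind that limit is easy to check. A minimal sketch, assuming an illustrative (not sourced) rate of 100 updates per second for a single instrument over a 6.5-hour trading session:

```python
# Illustrative arithmetic: how quickly market data outgrows a spreadsheet.
# The update rate and session length below are assumptions for the example.
EXCEL_ROW_LIMIT = 1_048_576          # rows in the latest Excel grid
updates_per_second = 100             # a modest rate for one liquid instrument
session_seconds = 6.5 * 3600         # one 6.5-hour trading session

rows_per_day = updates_per_second * session_seconds
days_until_full = EXCEL_ROW_LIMIT / rows_per_day

print(f"{rows_per_day:,.0f} rows per day")                        # 2,340,000
print(f"Spreadsheet full in {days_until_full:.2f} trading days")  # 0.45
```

So even one moderately active instrument exhausts the grid in under half a trading day - before considering the thousands of instruments a desk may follow.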
Time for Visualization?
Software engineers must continue to address the direct technological challenges of rising real-time data volumes through an optimized combination of the four approaches outlined earlier. A greater but more subtle challenge is how technologists can best present this data to end-users, particularly when the end-users’ favorite tool - the spreadsheet - has been overwhelmed by the amount of data that needs to be analyzed. For around 20 years now, visualization software vendors have been trying to sell into the financial services industry, with very limited success. Perhaps now is the time when data visualization can finally come of age, bringing new levels of data transparency and understanding to hard-pressed users of real-time data.