IBM Aids British Library in Preserving Information on the Web for Future Generations

IBM is working with the British Library on a project that will preserve and analyze terabytes of information on the web, speeding up the archival process and preventing information from being lost forever. The new analytics software project, called IBM BigSheets, helps extract, annotate and visually analyze vast amounts of web information using a web browser.

IBM BigSheets is an insight engine that helps businesses obtain insights from extremely large data sets easily and in a timely manner. By building on top of the Apache Hadoop framework, IBM BigSheets is able to process large amounts of data quickly and efficiently. BigSheets is an extension of the mashup paradigm that integrates gigabytes, terabytes, or petabytes of unstructured data from web-based repositories; collects a wide range of unstructured web data stemming from user-defined seed URLs; extracts and enriches that data using an unstructured information management architecture; and lets the user explore and visualize this data in specific, user-defined contexts. For example, users can see search results in a pie chart and look at the data in a tag cloud.
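The collect, extract and visualize flow described above can be illustrated with a minimal sketch. This is not BigSheets code and does not use the Hadoop API. It is a plain Python imitation of the map and reduce steps that Hadoop's MapReduce model is built around, applied to counting terms for a tag cloud. The page texts, the URLs, the stopword list and all function names are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter
from itertools import chain

# Hypothetical stand-in for text extracted from archived pages (not real data).
PAGES = {
    "http://example.org/a": "Election campaign news: the campaign begins",
    "http://example.org/b": "Campaign blog on the election and the web",
}

# Assumed, deliberately tiny stopword list.
STOPWORDS = {"the", "and", "on", "a", "of"}


def map_page(text):
    """Map step: emit a (term, 1) pair for each non-stopword term in one page."""
    for term in re.findall(r"[a-z]+", text.lower()):
        if term not in STOPWORDS:
            yield term, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each term across all pages."""
    totals = Counter()
    for term, count in pairs:
        totals[term] += count
    return totals


def tag_cloud_weights(totals, top=3):
    """Scale the most frequent terms to relative weights in (0, 1]."""
    most_common = totals.most_common(top)
    if not most_common:
        return {}
    peak = most_common[0][1]
    return {term: count / peak for term, count in most_common}


totals = reduce_counts(chain.from_iterable(map_page(t) for t in PAGES.values()))
weights = tag_cloud_weights(totals)
```

In a real deployment the map and reduce steps would run in parallel across many machines, which is what makes the approach practical for terabytes of archived pages. The resulting weights could then drive the font sizes in a tag cloud.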

"IBM BigSheets does for big data what spreadsheets did for personal computing," explains Rod Smith, vice president, Emerging Internet Technologies, IBM. "Within a matter of minutes, researchers, academics and students will be able to search many terabytes of archived web pages from the U.K. domain, analyze the results and effortlessly visualize the results of the search."

The web changes rapidly, with new pages created every day, producing an explosion of data that disappears almost as quickly as it is published. Recent research estimates the average life expectancy of a website at just 44 to 75 days. As a result, 10% of web pages on the U.K. domain are lost every six months.

"We estimate the U.K. web space will contain over 11 million websites by 2011. To take on the enormous challenge of capturing this content, we need a system capable of taking the U.K. web archive to web-scale," adds Helen Hockx-Yu, web archiving program manager for the British Library. Without a solution such as IBM BigSheets, important data would be lost forever. For example, the 2005 election marked the first attempts by U.K. politicians to use the web as a campaigning tool. With the use of web campaigns expected to explode during the 2010 election, the 2005 collection will enable researchers studying the evolution of politics and the web to access valuable primary source material. "IBM can help us analyze the web archive containing millions of pages and unlock embedded knowledge which otherwise is difficult to discover using traditional search methods," says Hockx-Yu.

For more information about IBM's Emerging Technology projects, go here.