How to Harvest the World’s Largest Source of Data—Web Data

Jun 10, 2019

By Gary Read

Web data is the world’s largest untapped source of data. Research firm Opimas speaks of the sweeping power of web data, as it “comprises a valuable portion of alternative datasets revolutionizing the decision-making process for corporations.” It’s the completer piece to the datasets that organizations need to run their businesses. And now web data in large volumes can be captured quickly, efficiently and cost-effectively with the introduction of web data integration (WDI), a practice that identifies, extracts, prepares, and integrates data for consumption by business applications, analytics, and processes. WDI delivers high-quality web data at enterprise scale, without requiring expensive engineering teams to constantly write code, monitor quality and maintain logic.

Automation enables extracting, preparing and integrating data so that it can be orchestrated, reused for different purposes and consistently monitored. What’s more, machine learning complements the automation: as the name implies, it allows web data extractors to learn from past experience and perform tasks more efficiently.

Web data is a formidable asset and tool. According to Gartner, “Your company’s biggest database isn’t your transaction, CRM, ERP or other internal database. Rather, it’s the web itself.”

But why is WDI evolving to more automated forms of data collection? Think of it this way: Analytics is the engine for turning raw data into actionable data, and the more data you collect, the more data-informed your actions become. “It's not who has the best algorithm that wins, it's who has the most data,” said Andrew Ng, co-founder of Google Brain and an adjunct professor at Stanford. A further benefit of automating WDI is that workflows can be created by subject matter experts, departmental personnel or business unit personnel without requiring help from software engineers.

In that light, the metamorphosis from “web scraping” to automated web data extraction and integration has become a natural process, and it’s used across industries and specialties, such as:

Market research
Competitive analysis
Content aggregation
Brand monitoring
Sentiment analysis

Web Scraping: Slow, Painful, Unrewarding

Let’s take a quick look at the world before WDI. Web scraping, the legacy solution, is nearly as old as the web itself. It became a rudimentary means of automating data extraction, but it has not evolved much over the years. “Web scraping projects are notoriously complicated, expensive and labor-intensive,” Ovum analyst Tony Baer has noted.

Even then their results are limited essentially to parsing HTML documents that are visible on websites. Web scraping requires custom scripts for every type of web page that an organization or individual wants to target for automated data extraction. Teams of skilled programmers must code crawlers, deploy them on servers, debug and monitor them, and perform post-processing of the extracted data.

Web scraping projects are often not resilient to changes on the target website, which means they can break easily. This is because data extraction rules are hard-coded and are informed by a sample of pages from the website. “If the website changes, or the engineer did not sample a sufficient number of web pages when writing the extraction rules, the organization is left with incomplete, poor quality, unreliable and out-of-date data,” according to Baer.
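To make that brittleness concrete, here is a minimal sketch in Python. It assumes a hypothetical product page whose price sits in a span with the class "price"; the markup, the rule and the redesign are all invented for illustration and are not taken from any real site or tool.

```python
import re

# Hypothetical rule written against a sample of pages: it assumes the
# price always sits in <span class="price">.
PRICE_RULE = re.compile(r'<span class="price">\$([\d.]+)</span>')

def extract_price(html):
    """Return the price as a float, or None if the hard-coded rule misses."""
    match = PRICE_RULE.search(html)
    return float(match.group(1)) if match else None

page_before = '<div><span class="price">$19.99</span></div>'
# The same data after a hypothetical redesign renames the CSS class.
page_after = '<div><span class="product-price">$19.99</span></div>'

print(extract_price(page_before))  # 19.99
print(extract_price(page_after))   # None
```

The price never left the page, but the rule silently returns nothing once the class name changes, which is how a scraping project ends up with incomplete data without anyone noticing.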

And time is the enemy of web scraping: By the time a subject matter expert finally gets a chance to review the data, the web page in question may have changed, and screenshots of the page may not be available.

With all the limitations of web scraping, it was inevitable that WDI would leapfrog that technique.

Website Complexity Puts Web Data Extractors to the Test

The nature of websites can make data extraction either reliable and easy, or complex and challenging. The easiest data to extract is static data on websites that don’t change very often. Similarly, extracting data at low speeds or volumes is relatively easy. But things get more complex and more problematic as a website’s hierarchy deepens and nested menus multiply.

If there is a prevailing truth about data extraction, it’s this: The success of an extractor is proportional to how successfully it interacts with a website during the extraction process.

Let’s look at some examples that put data extractors to the test:

Extracting data at large scale
Extracting data that’s not visible on web pages

Extracting Data at Large Scale

This is an example of where interaction with a website is the necessary precursor to locating and displaying the required data. Say I want to understand something about every hotel in the United States: each hotel’s amenities, availability, room pricing and other information. That could require me to pull data from more than a million web pages. The problem is, I can’t take a week to do that because, by that time, the data will have changed. Clearly, the scale of the task and the volume of data involved can create barriers to collecting the data needed for the application.
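A quick back-of-the-envelope calculation shows why scale and freshness pull against each other. The sketch below uses the million-page figure from the hotel example; the one-day and one-hour windows are assumptions added only for comparison.

```python
def required_rate(pages, window_hours):
    """Average pages per second needed to finish within the window."""
    return pages / (window_hours * 3600)

PAGES = 1_000_000  # "more than a million web pages" from the hotel example

# The week comes from the example; the shorter windows are assumed.
for label, hours in [("one week", 7 * 24), ("one day", 24), ("one hour", 1)]:
    print(f"{label}: {required_rate(PAGES, hours):.1f} pages/second")
```

Even the week-long window needs a sustained average of about 1.7 pages every second, around the clock. Refreshing the same data daily pushes that to roughly 11.6 pages per second, and hourly to about 277.8.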

Or take a real estate site, such as Redfin or Zillow. Those sites are not designed to reveal how to find the data you need, and consequently how the extractor must interact with the site to find and extract that data.

The interaction process, whether for a human or an automation engine, might go something like this. Enter a ZIP code to find a list of properties within that ZIP code. Extend your search to include all properties in California, which requires entering every ZIP code in California. Now, further refine your search by filtering on single-family homes.

And it gets more challenging. Let’s say you need to use a colocator to find the number and addresses of homes within an x-mile radius of a particular ZIP code. That means entering all the ZIP codes and all the different radii you need to get complete coverage of the state of California.

Finally, you get the data you need, but you’re going to have a lot of duplicates. Let’s say a home happens to be within a 10-mile radius of both ZIP code A and ZIP code B. That means you must deduplicate the data.
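The deduplication step can be sketched in a few lines of Python. The records and field names below are hypothetical, and matching on a normalized address is an assumed rule chosen for illustration; real listings may need a stronger key, such as a listing ID.

```python
def dedupe_listings(results):
    """Keep the first record seen for each normalized address."""
    seen = set()
    unique = []
    for record in results:
        # Assumed matching rule: ignore case and extra whitespace.
        key = " ".join(record["address"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

# Hypothetical results from two overlapping radius searches.
search_a = [{"address": "12 Oak St, Fresno, CA", "source_zip": "A"},
            {"address": "48 Elm Ave, Fresno, CA", "source_zip": "A"}]
search_b = [{"address": "12 oak st,  Fresno, CA", "source_zip": "B"},
            {"address": "7 Pine Rd, Clovis, CA", "source_zip": "B"}]

unique = dedupe_listings(search_a + search_b)
print(len(unique))  # 3
```

The home on Oak St turns up in both searches with slightly different formatting, so the combined four records collapse to three.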

It’s data mining challenges such as these that test the limits of traditional web data extraction tools.

Extracting Data That’s Not Visible on Web Pages

When you want to buy a product from an e-commerce retailer, you search for the product, add it to your cart, note that you want, let’s say, four of the product, and proceed to check-out. But when you go to check out, you get a message that the retailer doesn’t have four of the product in stock.

Now, look behind the scenes. The website “knows” that only three of the product (in this example) are in stock. Most retailers don’t display that information on their websites. Only after taking a labyrinthine manual path through the site do you find that information.

If you use the web data extraction tools available to you, you can quickly find that data, and that’s because it’s there! It’s just not displayed on the page. Using web scraping, you would have to “dive down” to find the data you need because scraping gives you only what you see on the screen.

But the more sophisticated tools that make up web data extraction can quickly uncover that data.
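How a retailer holds the stock count varies from site to site, and nothing here describes a specific retailer or WDI product. As one illustration, assume the page source carries a JSON payload in a script tag that is never rendered on screen. A minimal sketch using only Python’s standard library might read it like this:

```python
import json
from html.parser import HTMLParser

class EmbeddedJSONParser(HTMLParser):
    """Collect JSON payloads from <script type="application/json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_json = False
        self._buffer = []
        self.payloads = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self._in_json = True

    def handle_data(self, data):
        if self._in_json:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json:
            self.payloads.append(json.loads("".join(self._buffer)))
            self._buffer = []
            self._in_json = False

# Hypothetical page: the stock level is in the source but never displayed.
page = """
<html><body>
  <h1>Widget</h1>
  <button>Add to cart</button>
  <script type="application/json">{"sku": "W-100", "stock": 3}</script>
</body></html>
"""

parser = EmbeddedJSONParser()
parser.feed(page)
parser.close()
print(parser.payloads[0]["stock"])  # 3
```

A shopper sees only the heading and the button, while the extractor reads the figure of three directly from the source.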

Today, extraction methods have improved. Users can create workflows that allow automated extraction agents to navigate through websites, extracting data through techniques involving pagination, infinite scroll, list-detail patterns, form-filling, click interaction, authentication, and others.
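Pagination is the simplest of these techniques to illustrate. The sketch below is not how any particular WDI platform works; it assumes each page exposes its items and a link to the next page, and it uses a small in-memory stand-in for a website so it runs without a network connection.

```python
def crawl_paginated(fetch_page, start_url, max_pages=100):
    """Follow 'next' links until they run out, collecting every item."""
    items, url, visited = [], start_url, set()
    # The visited set guards against loops; max_pages caps runaway crawls.
    while url and url not in visited and len(visited) < max_pages:
        visited.add(url)
        page = fetch_page(url)
        items.extend(page["items"])
        url = page.get("next")
    return items

# Stand-in for a real site: three hypothetical result pages.
FAKE_SITE = {
    "/results?page=1": {"items": ["a", "b"], "next": "/results?page=2"},
    "/results?page=2": {"items": ["c", "d"], "next": "/results?page=3"},
    "/results?page=3": {"items": ["e"], "next": None},
}

print(crawl_paginated(lambda url: FAKE_SITE[url], "/results?page=1"))
# ['a', 'b', 'c', 'd', 'e']
```

Passing the fetch function in as a parameter keeps the navigation logic separate from the way pages are retrieved, so the same loop could sit on top of a real HTTP client or a headless browser.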

The Completer Piece: Integrating the Data

One tenet of WDI is that it’s only as effective as the accuracy, validity and comprehensiveness of the data it uses. Ensuring that is the role of the preparation step in WDI. The Prepare process involves data wrangling, performing tasks such as splitting or combining columns, de-duplicating rows, interpolating gaps in sparse datasets, harmonizing data formats, schemas or structures, and generally cleaning the data, according to Ovum.
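Three of those wrangling tasks, splitting a column, harmonizing formats and de-duplicating rows, can be sketched together. The hotel records and field names are hypothetical, and the cleaning rules are assumptions chosen for illustration rather than a description of any vendor’s Prepare step.

```python
def prepare(rows):
    """Split a combined column, harmonize prices, and drop duplicate rows."""
    cleaned, seen = [], set()
    for row in rows:
        # Split the combined "City, ST" column into two fields.
        city, state = [part.strip() for part in row["location"].split(",")]
        # Harmonize "$1,200.00" and "1200" into one numeric format.
        price = float(row["price"].replace("$", "").replace(",", ""))
        record = (row["name"].strip(), city, state.upper(), price)
        # De-duplicate on the fully cleaned record.
        if record not in seen:
            seen.add(record)
            cleaned.append(dict(zip(("name", "city", "state", "price"), record)))
    return cleaned

raw = [
    {"name": "Hotel A ", "location": "Austin, tx", "price": "$1,200.00"},
    {"name": "Hotel A", "location": "Austin,TX", "price": "1200"},
    {"name": "Hotel B", "location": "Reno, NV", "price": "89.5"},
]

for record in prepare(raw):
    print(record)
```

The first two raw rows look different but describe the same hotel at the same price. Because cleaning happens before the duplicate check, they collapse into a single record.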

With the data in order, business and data analysts can then perform discovery, analysis and visualization on the data to derive findings and discern results.

The final prerequisite to consuming the data is to integrate data from multiple sources. That’s where APIs play a key role: they make the web data accessible to the various tools used by analysts, as well as to dashboards. They can also make the data consumable by query tools, as well as usable by other BI solutions or enterprise applications. Perhaps the greatest benefit is that multiple data points for any use case can be easily correlated with each other to derive new insights on a problem.

The more sophisticated granularity of the data integration process now makes it possible for providers to offer their customers service level agreements on the quality of the data that the WDI platform provides.

The Road to Higher Quality Data

Companies must treat quality and accuracy as utmost priorities. The extraction of web data is an ever-changing and highly sensitive method of data acquisition, and data quality issues cost the U.S. economy an estimated $3.1 trillion per year, according to IBM. With WDI, users get an accurate picture of their web data pipelines’ health and can easily review the quality of the data being extracted in order to take immediate action to maintain a high-quality web data pipeline. Users now have the ability to see and act on data quality issues at the source, before those problems reach downstream applications and integrations. Another point of consideration is the timeliness and completeness of the data. If it takes too long to conduct quality assurance on the web data, the data might grow outdated before it is even ready to use.
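One simple way to catch a quality problem at the source is to measure how complete each field is in every extracted batch. The sketch below is an assumption about what such a check could look like, not a description of a particular platform; the 90 percent threshold and the sample records are invented.

```python
def fill_rates(records, fields):
    """Share of records with a non-empty value for each field."""
    if not records:
        return {field: 0.0 for field in fields}
    total = len(records)
    return {
        field: sum(1 for r in records if r.get(field) not in (None, "")) / total
        for field in fields
    }

def quality_alerts(records, fields, threshold=0.9):
    """Fields whose fill rate falls below the assumed threshold."""
    rates = fill_rates(records, fields)
    return [field for field in fields if rates[field] < threshold]

# Hypothetical batch in which the price field has started coming back empty.
batch = [
    {"name": "Hotel A", "price": 120.0, "rating": 4.5},
    {"name": "Hotel B", "price": None, "rating": 4.1},
    {"name": "Hotel C", "price": None, "rating": None},
    {"name": "Hotel D", "price": 99.0, "rating": 3.9},
]

print(quality_alerts(batch, ["name", "price", "rating"]))  # ['price', 'rating']
```

A sudden drop in a field’s fill rate is often the first sign that a target page has changed, and flagging it per batch keeps the check fast enough that the data is still current when it passes.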

Be a Good Citizen

WDI is used today to extract targeted web content that may indicate product or pricing fluctuations, consumer sentiment about a product or service, and more. Data on websites is publicly accessible and for years has been accessed manually, allowing organizations and individuals to make data-based decisions of all kinds.

Now that automated technologies for data extraction have become widespread, the power to access and collect this data is thousands of times what it was in the days of manual procedures. That’s why companies employing WDI need to make this their mantra: Be a good web data citizen.

Here’s why and how, in a single example. eBay’s infrastructure is designed to handle thousands of requests per second. It’s the ultimate scalable network. But smaller organizations, by their nature, typically cannot handle the traffic volume that eBay endures without even the hint of an impact on its performance. That means that adding one more load, such as a legacy web scraping crawler, to a website could have a small, but measurable, impact on its performance—even if only momentarily.
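In practice, limiting that load usually starts with two habits: honoring a site’s robots.txt file and pausing between requests. The sketch below uses Python’s standard urllib.robotparser module. The robots.txt content, the bot name and the one-second fallback delay are all assumptions, and Crawl-delay is a non-standard directive that not every site publishes.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, parsed from a string so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
Crawl-delay: 2
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

USER_AGENT = "example-wdi-bot"  # assumed name, for illustration only

def polite_fetch(paths, fetch, sleep=time.sleep):
    """Fetch only the allowed paths, pausing between requests."""
    delay = robots.crawl_delay(USER_AGENT) or 1  # assumed 1-second fallback
    fetched = []
    for path in paths:
        if not robots.can_fetch(USER_AGENT, path):
            continue  # the site has asked crawlers to stay out
        if fetched:
            sleep(delay)
        fetched.append(fetch(path))
    return fetched

pauses = []  # record the pauses instead of actually waiting
pages = polite_fetch(
    ["/products/1", "/checkout/cart", "/products/2"],
    fetch=lambda path: f"content of {path}",
    sleep=pauses.append,
)
print(pages)   # ['content of /products/1', 'content of /products/2']
print(pauses)  # [2]
```

The checkout path is skipped because the site disallows it, and the crawler waits the requested two seconds between the two product pages. Leaving out the sleep argument makes the function pause for real.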

And so be a good citizen on the Internet by practicing the principles of fair use. Don’t create “bad bots.” And remember, publishing any of the data you extract could be a violation of copyright laws. I believe that website owners and website users have largely ethical, respectable goals. Let’s all do our part to keep it that way.