Using Web Scraping as a Data Science Tool

By Joyce Wells

Dec 4, 2017

Web scraping can be an invaluable skill to possess when working on data-related projects because many interesting analytics projects start not with over-explored internal data, but with the treasure trove of information found on the web, according to authors, lecturers, and data scientists Seppe vanden Broucke and Bart Baesens. However, while the web holds a wealth of information, collecting and structuring web data can be a daunting prospect for many data practitioners, believes Baesens, who has written a new book on the topic with vanden Broucke titled Web Scraping for Data Science with Python. Here, Baesens expands on the techniques and uses for web scraping.

Why did you choose to focus on web scraping in this book?

Bart Baesens: We felt that there was a need for a modern guide specifically geared toward a data science audience. We know that there are a lot of guides and tutorials already available online, but often found it hard for practitioners and students to link concepts together or to find a reference that can give an up-to-date overview of what’s available. In addition, most existing guides gloss over a lot of best practices and tips, including those regarding the managerial and legal aspects of web scraping. We hope this book can help to fill this gap and that readers will find it useful.

What is your experience with web scraping?

BB: As data scientists and lecturers of various analytics-related courses, we’ve very often found web scraping to be an invaluable skill to possess when working on data-related projects. Typically, a data science exercise will start with the first step of identifying appropriate data sources and extracting the data from them. In ideal situations, such data will be readily provided by your company’s data warehouse, a business partner, an external data provider, or your academic supervisor, preferably neatly structured and cleaned as well. Of course, the real world is more challenging and interesting: you might not have all the data you need, but might know of some websites that can help you to enrich your dataset. Many truly interesting analytics projects hence start not with the well-known and often over-explored internal data, but with the treasure trove of information found on the web. It goes without saying that the web holds a lot of information, but collecting and structuring it is challenging for many data practitioners.

Who is this book directed at?

BB: The book was primarily written with a data science audience in mind, including practitioners already using Python or another programming language, lecturers and students, citizen data scientists, and data managers, though it is accessible to a more general audience as well. Basically, the content of the book is split into a technical part giving a complete overview of web scraping, a managerial part discussing how web scraping fits into a more general analytics setup, and a series of examples that provide some fun insights into using what you’ve learned in data science projects.

What is it about the technique that you feel is not fully understood or appreciated?

BB: Mainly two aspects: one technical and one more governance-related. On the technical side, we know that many people who are just getting started with web scraping will often get stuck on a particular use case, not knowing where their setup is failing. The web is an incredibly messy place, and one is often confronted with a particular website that just doesn’t seem to work. We’ve made sure to deal with those edge cases in this book. Instead of providing optimistic examples only, we also challenge you by showing you where things might go wrong, and how to fix them.

What is the more governance-related aspect?

BB: The second aspect that is often overlooked comes once an organization wants to get serious with a web scraping project and include scraped data in a dashboard, reports, or a predictive model. Sure, it might be possible to scrape the data once, but have you thought about whether you’ll need to update this extract later on? If so, do you have a contingency plan for when a site fails or changes? Who will maintain the scraper? These are difficult questions which should be considered at the start of a new project.

Why is Python the language that you recommend?

BB: We don’t have a general preference toward Python, but have found that its ecosystem comes with powerful web scraping libraries which are very well thought out and generally easy to use. Even when one is not using Python for the data analysis part itself, we oftentimes see the web scraping part being handled by Python regardless. That’s not to say that Python is your only option. In fact, we include a section in the book where we provide pointers on which libraries to look at in case you want to make the switch to other languages like R, Java, and JavaScript.

Can you provide examples of scenarios where web scraping is beneficial?

BB: Lots of them, and we include a large overview in the introductory section of our book. To give some examples: Scraping is being heavily applied in HR and employee analytics. The San Francisco-based startup hiQ specializes in selling employee analyses by collecting and examining public profile information, for instance from LinkedIn. Banks and financial institutions are using web scraping for competitor analysis (to check what rates a competitor is offering, for instance). Researchers use web-scraped data in amazing ways as well, for instance, to develop a model which is able to spot patterns of depression, trained on a collection of scraped tweets. Retail “aggregators” will also often use web scraping to present a portal site bringing information from various sources together in one overview, and so do hotel aggregation sites.

What are some of the technical problems that users may encounter in trying to access web data via web scraping?

BB: This links back to a point we made earlier: the messy and ever-evolving nature of the web. Unstructured or badly formatted HTML code is one (though still easy) problem. Dealing with cookies and login screens is another. For sites that depend heavily on JavaScript, one may find that traditional tools will fail to work, so that we have to look for a more advanced approach where we automate a complete browser stack. Finally, anti-scraping techniques such as CAPTCHA checks or scraper detection and blocking techniques can also pose trouble.
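
To illustrate the badly formatted HTML problem mentioned above, here is a minimal sketch that pulls links out of sloppy markup (unclosed tags, mixed-case tag names, unquoted attributes). It uses only Python's standard-library `html.parser` module, which is our choice for a self-contained example and not necessarily the tooling used in the book. The `LinkCollector` class, the `extract_links` helper, and the sample markup are hypothetical names and data invented for this illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs from anchor tags, tolerating sloppy markup."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the anchor currently being read, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased, so <A HREF=...> also matches.
        if tag == "a":
            self._flush()  # treat a new <a> as the end of an unclosed one
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_endtag(self, tag):
        if tag == "a":
            self._flush()

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def _flush(self):
        # Store the pending anchor (if any) and reset the state.
        if self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
        self._href = None
        self._text = []

    def close(self):
        super().close()
        self._flush()  # pick up an anchor left open at the end of the input


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical messy markup: missing </a>, missing </li>, uppercase tags,
# single-quoted and unquoted attribute values, and no closing </ul>.
messy_html = """
<ul>
  <li><a href="/rates">Savings rates
  <li><A HREF='/loans'>Loans</a>
  <li><a href=/contact>Contact</A></li>
"""

print(extract_links(messy_html))
```

The sketch only covers static HTML. Pages that build their content with JavaScript would still need the browser-automation approach described above.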

Are there ethical or legal issues that must be considered in web scraping?

BB: Definitely. There is a reason that site owners continue to fight against forms of automated access. The legality of web scraping is something that continues to be debated and many related laws have not aged well in our digital age. We provide a lengthy discussion on the legalities regarding web scraping, including breach of terms and conditions, copyright law, the Computer Fraud and Abuse Act, and so on, though the basics boil down to the following: Don’t cause damage by hammering a site, try to limit your scraping to public information only, don’t use scraping to steal copyrighted material, and do consider contacting the site owner with a kind request.
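
The advice not to hammer a site can be sketched in code as well. The example below is one possible approach, not a method taken from the book and not legal guidance: it checks a site's robots.txt rules with the standard-library `urllib.robotparser` module and enforces a minimum pause between requests. The `PoliteGate` class, the two-second default delay, the `example-research-bot` user agent, and the sample robots.txt text are all assumptions made for illustration.

```python
import time
from urllib.robotparser import RobotFileParser


class PoliteGate:
    """Decide whether a URL may be fetched and space out successive requests."""

    def __init__(self, robots_txt, user_agent, min_delay=2.0,
                 clock=time.monotonic, sleep=time.sleep):
        # robots_txt is the text of the site's robots.txt file; a real scraper
        # would download it once before starting.
        self._robots = RobotFileParser()
        self._robots.parse(robots_txt.splitlines())
        self._agent = user_agent
        self._min_delay = min_delay  # assumed pause in seconds, tune per site
        self._clock = clock
        self._sleep = sleep
        self._last = None            # time of the previous request, if any

    def allowed(self, url):
        return self._robots.can_fetch(self._agent, url)

    def wait_turn(self):
        # Sleep just long enough to keep min_delay seconds between requests.
        now = self._clock()
        if self._last is not None:
            remaining = self._min_delay - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Hypothetical robots.txt content for an example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

gate = PoliteGate(ROBOTS_TXT, user_agent="example-research-bot", min_delay=0.2)
for url in ["https://example.com/rates", "https://example.com/private/report"]:
    if not gate.allowed(url):
        print("skipping", url)
        continue
    gate.wait_turn()
    print("would fetch", url)  # a real scraper would issue the request here
```

Respecting robots.txt and throttling requests addresses only the "don't cause damage" point. Terms and conditions, copyright, and the other legal questions raised above still need separate consideration.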

Web Scraping for Data Science with Python is available for download from Amazon.

Bart Baesens is a professor of Big Data and Analytics at KU Leuven (Belgium) and a lecturer at the University of Southampton (United Kingdom), and author of the Big Data Quarterly column "Data Science Deep Dive." He has done extensive research on Big Data & Analytics, Credit Risk Modeling, Fraud Detection and Marketing Analytics. He has written more than 200 scientific papers, some of which have been published in well-known international journals and presented at top international conferences. He has received various best paper and best speaker awards.

Baesens is the author of six books: Credit Risk Management: Basic Concepts (Oxford University Press, 2009), Analytics in a Big Data World (Wiley, 2014), Beginning Java Programming (Wiley, 2015), Fraud Analytics using Descriptive, Predictive and Social Network Techniques (Wiley, 2015), Credit Risk Analytics (Wiley, 2016) and Profit Driven Business Analytics (Wiley, 2017). He has sold more than 15,000 copies of these books worldwide, some of which have been translated into Chinese, Russian and Korean. His research is summarized at www.dataminingapps.com. He also regularly tutors, advises and provides consulting support to international firms regarding their big data, analytics and credit risk management strategy.

Seppe vanden Broucke is an assistant professor at the Faculty of Economics and Business, KU Leuven, Belgium. His research interests include business data mining and analytics, machine learning, process management, and process mining. His work has been published in well-known international journals and presented at top conferences.

vanden Broucke's teaching includes Advanced Analytics, Big Data and Information Management courses. He also frequently teaches for industry and business audiences. See http://seppe.net for further details.