The Fast World of Big Data Streaming

By Tomer Ben Moshe

Dec 9, 2015

What happens when your data just keeps growing? And what challenges do big data companies face when dealing with live data? There are many, from scalability and availability to real-time demands and structure. So how do big data companies deal with these challenges? This is where data streaming comes into play. The world of data streaming is a complex one, as it involves millions of events gathered per day.

The data streaming process begins when an SDK is installed on a website, after which each user action is tracked as an event. Each event (such as a share or a click) is serialized as a JSON string together with its properties. For example, if an article is shared, the event is the share and its properties could include the article that was shared. Next comes the enrichment process, which automatically adds more information to the event (session IP, user event properties, a user profile, gender, email and so on). Collection starts with the SDK, and the ETL process starts as soon as the data arrives at the server.
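To make the event-and-enrichment idea concrete, here is a minimal Python sketch. The field names (`event`, `properties`, `session_ip`, `user`, `received_at`) and both helper functions are illustrative assumptions, not the format of any particular SDK.

```python
import json
from datetime import datetime, timezone


def build_event(name, properties):
    """Serialize a tracked action as a JSON string, roughly as an SDK might."""
    return json.dumps({"event": name, "properties": properties})


def enrich(raw_event, session_ip, user_profile):
    """Add server-side context to an event received from the SDK.

    The added fields are hypothetical examples of enrichment data.
    """
    event = json.loads(raw_event)
    event["session_ip"] = session_ip
    event["user"] = user_profile
    event["received_at"] = datetime.now(timezone.utc).isoformat()
    return event


# An article was shared: the event is the share, the property is the article.
raw = build_event("share", {"article_id": "article-123"})
enriched = enrich(raw, "203.0.113.7", {"gender": "f", "email": "reader@example.com"})
print(json.dumps(enriched, indent=2))
```

The SDK side only knows about the action itself; everything added in `enrich` is context that the server can attach once the event arrives.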

ETL, short for Extract, Transform and Load, is a common process in data warehousing. This process extracts data from external sources, transforms it to fit certain needs, and loads it into the end target. Usually all of these steps work simultaneously, hand-in-hand. While “Extract -> Transform -> Load” sounds pretty simple, it really all depends on how much data has to be collected and analyzed. There are ETL systems which are pretty basic and only use one server, and there are ETL systems which collect billions of data points per second. When talking about big data, the ETL system needs to be scalable, smart and robust, and that is not easy to achieve. One of the most common problems a developer faces very early in the process is scaling out. A system may work correctly now, but when it has to handle more streams it begins to strain, and if a business has not prepared in advance with a strong architecture, it could spell trouble. Some of the key problems are below:
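As a rough illustration of the three stages working hand-in-hand, the sketch below chains them with Python generators, so each event flows through extract, transform and load without waiting for the whole batch. The function names, the event fields and the in-memory list standing in for the end target are all assumptions made for this example.

```python
import json


def extract(lines):
    """Extract: parse raw JSON strings, skipping malformed input."""
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # a real system would log or quarantine bad input


def transform(events):
    """Transform: keep only the fields the target needs and normalize names."""
    for event in events:
        if not isinstance(event, dict) or "event" not in event:
            continue
        yield {
            "event": event["event"].lower(),
            "properties": event.get("properties", {}),
        }


def load(events, target):
    """Load: append each event to the target and return how many were loaded."""
    loaded = 0
    for event in events:
        target.append(event)
        loaded += 1
    return loaded


raw_lines = [
    '{"event": "Share", "properties": {"article_id": "a1"}}',
    "not json",
    '{"event": "Click"}',
]
warehouse = []  # stand-in for the real end target
loaded = load(transform(extract(raw_lines)), warehouse)
print(loaded, warehouse)
```

This single-process version corresponds to the "pretty basic" end of the spectrum; the scaling problems discussed next appear once the same stages have to run across many servers.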

Handling the Peaks and Valleys of Data Volume

Whether it’s the Super Bowl, Valentine’s Day or just breaking news, online companies all over the world have to deal with the problem of data peaks. When a business has to handle varying volumes of data (peaks and valleys), it has to make sure its ETL can scale. For example, online shopping sites see traffic surge during certain times of the year, and many sites have been known to crash because they did not plan adequately for such high loads. Some statistics reveal that 84% of online shoppers will not return to a site that has performed poorly because of these overloads. These are some of the actions that can be taken to prevent this from happening:

Make sure that the hardware and software used are scalable and that servers can be added easily and quickly. Also, the servers used by the ETL should be geographically dispersed.
It is advised that current systems be capable of handling at least tens of percent more than the regular volume of data.
Use databases that support sharding.
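Two of the points above can be sketched in a few lines of Python: estimating how many servers are needed once headroom is added to the expected peak, and routing records to shards by hashing a key. Both functions, the 30% default headroom and the sample numbers are illustrative assumptions rather than recommendations.

```python
import hashlib
import math


def servers_needed(peak_events_per_sec, capacity_per_server, headroom=0.3):
    """Servers required to absorb the peak plus a safety margin.

    headroom=0.3 means planning for 30% above the expected peak (assumed value).
    """
    return math.ceil(peak_events_per_sec * (1 + headroom) / capacity_per_server)


def shard_for(key, num_shards):
    """Map a key (for example a user ID) to a shard using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


print(servers_needed(10_000, 2_500))  # 13,000 events/s over 2,500 per server
print(shard_for("user-42", 8))        # the same key always maps to the same shard
```

Note that plain modulo hashing remaps most keys when the number of shards changes; databases with built-in sharding support usually handle that rebalancing for you, which is one reason to prefer them.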

The Challenge of Duplicate Data

Within the ETL process, duplicate records are often encountered in tables. Data cleaning, also known as data cleansing or scrubbing, deals with detecting and removing errors from data in order to improve its quality. Because of the wide range of possible data faults and the sheer volume of data, data cleaning is considered to be one of the biggest problems in data warehousing. A large number of tools are available to help with these issues, but often a significant portion of the cleaning and transformation work has to be done manually or by low-level programs that are difficult to write and maintain.
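One simple form of duplicate detection is to fingerprint each record and drop any record whose fingerprint has already been seen. The sketch below assumes that two events are duplicates only when all of their fields match exactly; real cleaning jobs often need fuzzier rules, and at scale the in-memory set would be replaced by a shared store.

```python
import hashlib
import json


def fingerprint(event):
    """Hash a canonical JSON form so that key order does not matter."""
    canonical = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(events):
    """Return events in their original order, keeping the first of each duplicate."""
    seen = set()
    unique = []
    for event in events:
        fp = fingerprint(event)
        if fp in seen:
            continue
        seen.add(fp)
        unique.append(event)
    return unique


events = [
    {"event": "share", "article_id": "a1"},
    {"article_id": "a1", "event": "share"},  # same record, different key order
    {"event": "click", "article_id": "a1"},
]
unique_events = deduplicate(events)
print(len(unique_events))
```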

Maintaining Data Integrity

It is very important that the integrity of the data is maintained. Issues that result in dropped records are extremely common. A full record count must be completed in order to verify that the right records went into the corresponding tables across the source landing area, staging area, and data warehouse. When the counts do not match, find the problems and figure out why they occurred.
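A basic version of this count check compares record counts between consecutive stages and reports where records went missing. The stage names and numbers below are made up for illustration, and a count match alone does not prove that the right records arrived, so this is only a first line of defense.

```python
def reconcile(stage_counts):
    """Compare record counts between consecutive stages.

    stage_counts maps stage name to record count, in pipeline order.
    Returns (from_stage, to_stage, records_lost) for each mismatch; a
    negative value means the later stage has more records than the earlier one.
    """
    stages = list(stage_counts.items())
    problems = []
    for (prev_name, prev_count), (name, count) in zip(stages, stages[1:]):
        if count != prev_count:
            problems.append((prev_name, name, prev_count - count))
    return problems


counts = {"source_landing": 1000, "staging": 1000, "warehouse": 997}
issues = reconcile(counts)
for src, dst, lost in issues:
    print(f"{lost} records lost between {src} and {dst}")
```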

Most importantly, all of these factors need to work at a fast pace so that any reactions and other adjustments can be made promptly.
