The variety of operational data generated in the utility industry is orders of magnitude more complex than what is typically found in ICT (information and communications technology) environments. Millions of connected devices, sensors, and network systems operating throughout smart grids generate billions of data points that feed utilities measurements of power quality, voltage, energy delivered, and more. Industrial organizations are struggling to integrate millions of unstructured data streams from disparate sources, trapped in siloed legacy systems and stored in incompatible formats.
This data can be a valuable resource for driving operational efficiency and increasing productivity, but only if it can be transformed into meaningful intelligence. Without the ability to integrate the raw data into a common model and gain a holistic, real-time view of the big picture and potential risks, utilities will continue to struggle to realize operational and business benefits from the sensor technology and the data integration and management architectures they implement.
Traditional data warehousing models and open-source alternatives such as Apache Hadoop and Apache Storm have been touted as solutions to these types of "big data" challenges. However, utilities have found that these approaches cannot handle the scale and complexity of data generated in industrial environments. They also fail to provide the real-time analysis and situational awareness that utilities need to improve decision making and respond to critical events as they unfold, such as optimizing crew deployment during outages and severe weather. The challenge is how to leverage tremendous volumes of data, place it in context with other data, and create real-time, intelligent information for operations.
Utilities have attempted to solve this challenge with a variety of approaches, and recent technologies involving data appliances, NoSQL, and data lakes have had some limited success in typical IT applications. On the industrial side, success has rarely been achieved due to the velocity, volume, and variety challenges associated with operational technology (OT) data. Industrial environments require new architectures and a new approach to data stores and data analytics that the aforementioned approaches fail to address individually. Hadoop data lakes are one of the recommended layers in this new architecture, but Hadoop alone cannot solve industrial data challenges.
Understanding Hadoop and its Limitations
Hadoop has proven that it can handle massive volumes of data and scale within certain use cases. It is a fundamental layer for handling large-scale data requests and is based on the MapReduce concept, in which data is parsed into small "chunks" whose processing load is then distributed across multiple systems and disks. Breaking the data down into small, easily managed chunks allows for effective distribution of the workload: the Hadoop platform farms out the chunks as MapReduce jobs and scales efficiently by improving the I/O capacity of the supporting systems and disks.
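The chunk-and-distribute pattern described above can be illustrated with a minimal, single-process sketch of the MapReduce phases (split, map, shuffle, reduce). The feeder names, voltage values, and helper functions below are illustrative assumptions, not part of any Hadoop API; a real Hadoop job would run the map and reduce functions on separate nodes against HDFS blocks.

```python
from collections import defaultdict

# Hypothetical smart-grid readings: (feeder_id, voltage) pairs.
readings = [
    ("feeder-A", 239.8), ("feeder-B", 241.2), ("feeder-A", 240.5),
    ("feeder-B", 238.9), ("feeder-A", 240.1), ("feeder-B", 240.7),
]

def split_into_chunks(data, chunk_size):
    """Split phase: partition the data set into small chunks,
    analogous to HDFS splitting a file into blocks."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def map_chunk(chunk):
    """Map phase: emit (key, value) pairs for one chunk.
    Each chunk can be processed independently, so chunks can be
    farmed out across many nodes."""
    return [(feeder, voltage) for feeder, voltage in chunk]

def shuffle(mapped_chunks):
    """Shuffle phase: group all emitted values by key across chunks."""
    grouped = defaultdict(list)
    for chunk in mapped_chunks:
        for key, value in chunk:
            grouped[key].append(value)
    return grouped

def reduce_group(key, values):
    """Reduce phase: aggregate each key's values (mean voltage here)."""
    return key, sum(values) / len(values)

chunks = split_into_chunks(readings, chunk_size=2)
mapped = [map_chunk(c) for c in chunks]
results = dict(reduce_group(k, v) for k, v in shuffle(mapped).items())
print(results)  # mean voltage per feeder
```

Because each map call touches only its own chunk, throughput scales by adding workers and disk I/O capacity rather than by moving the whole data set to one machine, which is the property the paragraph above attributes to Hadoop.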