Enabling Real-Time Analytics at Data Summit 2024

The combination of Apache Flink, Apache NiFi, and Apache Kafka is a powerful foundation for building real-time data processing pipelines, as demonstrated by Timothy Spann, principal developer advocate, streaming, at Cloudera, who is also affiliated with the Future of Data meetup, Startup Grind, and AI Camp, during his Data Summit 2024 session, “Building Real-Time Pipelines With FLaNK.”

The annual Data Summit conference returned to Boston, May 8-9, 2024, with pre-conference workshops on May 7.

Kafka offers a simple setup for many tables, provides metadata-augmented data, supports visual monitoring, is easy to combine with NiFi, and more. NiFi offers simple JDBC queries, transforms individual records, supports many different data sources, and more. And Flink provides strong control of tables and joins, high throughput and low latency, automatic records, and more.

“Flink is a nice way to generate your data between systems,” Spann said. “It can give you real-time analytics.”

He also recommended Apache Iceberg, an open-source high-performance format for huge analytic tables. Iceberg enables the use of SQL tables for big data while making it possible for engines like Spark, Trino, Flink, Presto, Hive, Impala, StarRocks, Doris, and Pig to safely work with the same tables, at the same time.

He provided a case study using the FLaNK-MTA project. The project leverages these technologies to process and analyze real-time data from the New York City Metropolitan Transportation Authority (MTA). Real-time alerts are now displayed and pushed to a dashboard so transit riders can know what to expect if buses or subways are delayed.

FLaNK-MTA demonstrated how to efficiently collect, transform, and analyze high-volume data streams, enabling timely insights and decision-making.

“Almost every major city has an open data API,” Spann said. “What they want is for it to show up on Google Maps.”

Many Data Summit 2024 presentations are available for review at