How to Configure and Scale Your Hadoop Data Lake (VIDEO)

Video produced by Steve Nathans-Kelly

A plethora of choices and tools is available for building an end-to-end modern data platform, but it is important to understand the components and when and why to use them.

In a presentation at Data Summit 2019, titled "Modern Data Platforms & Application Architecture," Florida Blue's Padmesh Kankipati covered key considerations for scaling a Hadoop data lake.

Hadoop has four node types: Master, Worker, Utility, and Edge. Master Nodes control core Hadoop services, Worker Nodes do the heavy lifting for processing, and Utility Nodes run other supporting Hadoop services. Edge Nodes are mainly used for data landing and serve as the contact point with the outside world; all interactions between the Hadoop system and external systems happen there.
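In a vanilla Apache Hadoop deployment, these roles are assigned simply by choosing which daemons run on which hosts. A minimal sketch of the `workers` file that designates the Worker Nodes might look like the following (the hostnames are hypothetical examples, not from the presentation):

```text
# etc/hadoop/workers -- hosts that run the DataNode and NodeManager
# daemons, i.e., the Worker Nodes that do the processing heavy lifting.
# Master, Utility, and Edge hosts are deliberately NOT listed here:
# they run control services (or, for Edge Nodes, only client tools).
worker01.example.com
worker02.example.com
worker03.example.com
```

Edge Nodes typically carry the same client configuration files but run no cluster daemons at all, which is what makes them a safe landing point for outside users and data feeds.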

If you look at the services that reside on these node types, the Master Node services include the Name Node. "If the Data Node stores the data, the Name Node stores the metadata and provides the file system tree. The Journal Node synchronizes the Name Node with a Standby Name Node."
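The Name Node/Journal Node relationship described here corresponds to what HDFS calls high availability: Journal Nodes persist the edit log so a Standby Name Node stays in sync with the active one. A minimal `hdfs-site.xml` sketch of that wiring follows; the nameservice name and hostnames are hypothetical examples, not a configuration from the presentation.

```xml
<!-- hdfs-site.xml fragment: HDFS HA with a hypothetical nameservice "datalake" -->
<property>
  <name>dfs.nameservices</name>
  <value>datalake</value>
</property>
<property>
  <name>dfs.ha.namenodes.datalake</name>
  <value>nn1,nn2</value>
</property>
<!-- Active and Standby Name Nodes on two Master hosts -->
<property>
  <name>dfs.namenode.rpc-address.datalake.nn1</name>
  <value>master01.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.datalake.nn2</name>
  <value>master02.example.com:8020</value>
</property>
<!-- Journal Node quorum that keeps the Standby Name Node synchronized -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://master01.example.com:8485;master02.example.com:8485;utility01.example.com:8485/datalake</value>
</property>
```

A quorum of three or more Journal Nodes is the usual choice, since the edit log can tolerate the loss of a minority of them.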

Kankipati explained the roles and responsibilities of various components in a Hadoop configuration.

Basically, if you understand these services and the demand for them, you can define a Hadoop configuration the way you want, said Kankipati, who in his presentation also defined three sample configurations along with the components each would require.

To access the full presentation of "Modern Data Platforms & Application Architecture," go to

DBTA’s next Data Summit conference will be held May 19-20, 2020, in Boston, with pre-conference workshops on Monday, May 18.

Many presenters have made their slide decks available on the Data Summit 2019 website at