Data Governance for Big Data at Rest and in Transit

Big data is not limited to unstructured or free-form data; it also includes well-defined, structured data that arrives in large volume or at high velocity, such as financial transactions or network activity for cybersecurity applications. Data governance for large organizations is complicated by the need to address masses of both structured and unstructured data, in flight and at rest. Governance policies must account for the capabilities of existing relational databases, as well as the less-developed data governance capabilities of technologies such as NoSQL and the Hadoop ecosystem.

As the excitement and opportunity provided by big data tools develop, many organizations find their big data initiatives originating outside existing data management policies. As a result, many concepts of formal data governance are omitted, either intentionally (because they are seen as too cumbersome) or unintentionally, as these enterprises race to ingest huge new data streams at a feverish pace in the hope of increased insight and new analytic value.

Unstructured data, as well as extremely large structured data, poses challenges to enforcing data quality. Relational database vendors offer well-established format constraints on data fields, in addition to access control, encryption, and logging capabilities that address data governance. The solutions around open source big data tools, by contrast, are still developing. Streaming data sources add yet another challenge. Complex streaming pipelines may include data that comes from thousands of separate devices and is processed in many stages across a variety of systems. Each stage may transform, filter, and enrich the data. Equally, each stage may expose the data to unauthorized users, to corruption, and to the inadvertent omission of fields or records. The challenge for data governance in a less structured table schema or flat file environment is to implement security measures comparable to those of a relational database (via UNIX permissions, Active Directory, etc.) without diminishing the business value of the data itself.
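One way to make the risk of silently dropped fields or records visible is to wrap each pipeline stage in a small audit step. The following is only a minimal sketch in Python, not a description of any particular streaming platform: the field names in REQUIRED_FIELDS, the StageAudit structure, and the example stage drop_empty_readings are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical schema: the fields every record is expected to keep.
REQUIRED_FIELDS = {"device_id", "timestamp", "reading"}


@dataclass
class StageAudit:
    """Counts of what one pipeline stage received, emitted, and lost."""
    stage: str
    records_in: int = 0
    records_out: int = 0
    missing_fields: dict = field(default_factory=dict)


def audit_stage(stage_name, stage_fn, records, required=REQUIRED_FIELDS):
    """Run one stage over the records and report omissions.

    stage_fn takes a record (a dict) and returns a new dict,
    or None if the stage filters the record out.
    """
    audit = StageAudit(stage=stage_name)
    output = []
    for record in records:
        audit.records_in += 1
        result = stage_fn(record)
        if result is None:
            continue  # record was filtered out by this stage
        audit.records_out += 1
        # Count every required field that the stage failed to pass on.
        for name in required - set(result):
            audit.missing_fields[name] = audit.missing_fields.get(name, 0) + 1
        output.append(result)
    return output, audit


def drop_empty_readings(record):
    """Example stage (hypothetical): filters empty readings, but also
    loses the timestamp by mistake."""
    if record.get("reading") is None:
        return None
    return {"device_id": record["device_id"], "reading": record["reading"]}


sample = [
    {"device_id": "a", "timestamp": 1, "reading": 2.0},
    {"device_id": "b", "timestamp": 2, "reading": None},
]
cleaned, report = audit_stage("drop_empty_readings", drop_empty_readings, sample)
print(report)
```

In this sketch the audit shows that one of two records was filtered out and that the surviving record lost its timestamp, which is the kind of inadvertent omission a governance policy would want logged at every stage.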

As big data implementations evolve in the enterprise, the data governance body needs to meet regularly to update governance requirements and adjust to technological advances. Modifications to existing data governance programs must consider the varied formats and prescribed usage (and possible misuse) of the entire big data framework. Big data governance must also consider who has access to the data and what decisions are made or conclusions drawn from the data pool. Misapplied predictive modeling or misinterpreted causal relationships pose financial and reputational risk to the enterprise. Preventing this misuse requires specific measures and policies, and the governing body needs to include data scientists as well as data architects, management, and data security teams, so that it can educate users on the proper use and management of data and on the associated monitoring.

Further complicating big data governance, as data volume grows, the cost of storing and maintaining big data and the increased risk of data misuse can quickly outpace the added value of the information. At the prototyping stage, when new data is being considered for inclusion in an organization’s big data store, it is critical to assess the data’s near-term anticipated value and its applicability to existing or prospective business needs. If the data is found to be of marginal value, there must be a policy and process in place for its destruction. Should the data prove valuable for predictive modeling or BI, the governance policy should prescribe the data life cycle and scheduled updates. The data’s value, usage, and quality require re-evaluation on a regular cadence to verify that the business benefit continues to exceed the cost of persisting the data and of the governance procedures that protect the organization from the risk of misuse.
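The periodic value-versus-cost check described above can be expressed as a simple rule. The sketch below is illustrative only: the DatasetReview fields, the 180-day review interval, and the three outcome labels are assumptions made for the example, and any real policy would set its own cost model and thresholds.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class DatasetReview:
    """Hypothetical per-dataset figures kept by the governance body."""
    name: str
    annual_value: float            # estimated business benefit
    annual_storage_cost: float     # cost of persisting the data
    annual_governance_cost: float  # cost of protecting and monitoring it
    last_reviewed: date


def review_decision(ds, today, review_interval_days=180):
    """Return 're-evaluate', 'schedule_destruction', or 'retain'."""
    # Stale estimates come first: an overdue review means the value and
    # cost figures can no longer be trusted.
    if today - ds.last_reviewed > timedelta(days=review_interval_days):
        return "re-evaluate"
    total_cost = ds.annual_storage_cost + ds.annual_governance_cost
    if ds.annual_value <= total_cost:
        return "schedule_destruction"
    return "retain"


today = date(2024, 6, 30)
clickstream = DatasetReview("clickstream", 50_000, 30_000, 25_000, date(2024, 5, 1))
fraud_events = DatasetReview("fraud_events", 400_000, 60_000, 40_000, date(2024, 5, 1))
sensor_archive = DatasetReview("sensor_archive", 90_000, 20_000, 10_000, date(2023, 9, 1))

for ds in (clickstream, fraud_events, sensor_archive):
    print(ds.name, review_decision(ds, today))
```

Checking for an overdue review before comparing value with cost is a deliberate choice in this sketch: destruction should rest on current figures rather than on an estimate made at the prototyping stage.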

Many organizations have data governance policies in place for existing structured data. These policies include controls on access and use of personally identifiable information (PII) as well as required encryption to ensure handling of customers’ personal information adheres to existing regulatory requirements. Big data repositories should (and will) be required to enable the same level of protection. When designing and implementing a big data infrastructure, the choice of platforms and vendors should include consideration of native governance capabilities and the need to create one’s own organizational big data governance procedures. Ignoring the importance of data governance puts the enterprise at risk for reputational and financial damage. Unapproved access, improper use of data for personal gain, or decisions based on poor data quality or misapplied analytics can expose companies to litigation or poor operational performance.

