Best Practices in Data Governance

By Scott Zoldi

Apr 8, 2015

Receiving and processing billions of data records per month from hundreds of clients around the globe requires a robust data governance process. Big data, data security regulations, and evolving national and regional privacy requirements necessitate that this process be comprehensive and automated.

Once proper transmission and receipt of the data have been confirmed, the second function of the automated processing is to prepare the data for use by analysts and data scientists. Implemented correctly, this preparation includes an expedited process that converts raw data from a variety of data specification versions into a well-defined format and transforms it into a readily usable state, addressing inconsistencies caused by differences in data quality and specification version across clients. To achieve this objective, the data in each client file is checked against the expected client-specific statistical distributions held as metadata and against client-specific data element remediation logic (where applicable), which analysts define based on historical data quality issues. The remediation logic is re-evaluated and automatically tested with each data contribution to ensure it is still valid. The original, untransformed data is also maintained for auditing purposes. Interactive reports delivered through a web interface provide analysts with details on the quality and consistency of the data. The result is a homogeneous consortium dataset that is quickly available for analysis and research.
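
The check-and-remediate step described above can be sketched roughly as follows. This is a minimal illustration, not the production process: the field name `amount`, the expected mean and tolerance, and the cents-to-dollars remediation rule are all hypothetical, chosen only to show a distribution check, a client-specific fix, and retention of the original records.

```python
from statistics import mean

# Hypothetical client-specific metadata: expected mean and relative tolerance.
EXPECTED = {"amount": {"mean": 50.0, "tolerance": 0.25}}

# Hypothetical remediation rule: this client reports amounts in cents.
REMEDIATION = {"amount": lambda value: value / 100.0}


def remediate(records):
    """Return transformed copies; the original records are left untouched."""
    transformed = []
    for record in records:
        fixed = dict(record)
        for field_name, rule in REMEDIATION.items():
            if field_name in fixed:
                fixed[field_name] = rule(fixed[field_name])
        transformed.append(fixed)
    return transformed


def distribution_alerts(records):
    """List fields whose observed mean drifts beyond the expected tolerance."""
    alerts = []
    for field_name, spec in EXPECTED.items():
        values = [r[field_name] for r in records if field_name in r]
        if not values:
            alerts.append(field_name)  # a missing field is also a quality issue
            continue
        drift = abs(mean(values) - spec["mean"]) / abs(spec["mean"])
        if drift > spec["tolerance"]:
            alerts.append(field_name)
    return alerts


raw = [{"amount": 4800}, {"amount": 5300}, {"amount": 5100}]
clean = remediate(raw)  # raw is kept as-is for auditing
```

Running `distribution_alerts` on both `raw` and `clean` with each contribution is one simple way to re-test that the remediation rule is still needed and still sufficient.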

Compliance and Security

One of the leading data concerns for all businesses is the proper security and use of customer data. For a company that processes and manages data for multiple international clients, the challenge is exacerbated by the varying implementations of privacy policies across the globe. Addressing the privacy concerns of clients often requires the encryption of any personally identifiable information (PII) as defined by law or regulation. It is important that the encryption algorithm be configured at the client site and by the client to eliminate any possibility of external reverse engineering of the encrypted fields. Upon receipt of data and before acceptance of the file, monitoring scripts must confirm the encryption of sensitive fields. Where there appear to be violations in the encryption of PII, such files should be deleted from the receiving system immediately and the client notified at once to resubmit.
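
A monitoring script of the kind described above might look like the following sketch. It assumes, purely for illustration, that encrypted fields arrive as 64-character lowercase hexadecimal tokens and that the sensitive fields are called `name` and `card_number`; a real deployment would use whatever format and field list the client's encryption configuration defines.

```python
import re

# Assumption: encrypted PII arrives as a 64-character lowercase hex token.
TOKEN = re.compile(r"[0-9a-f]{64}")
SENSITIVE_FIELDS = ("name", "card_number")  # hypothetical field names


def unencrypted_fields(record):
    """Return the sensitive fields in a record that do not look encrypted."""
    return [
        field_name
        for field_name in SENSITIVE_FIELDS
        if field_name in record and not TOKEN.fullmatch(str(record[field_name]))
    ]


def accept_file(records):
    """Return (accepted, violations); violations are (row index, field) pairs."""
    violations = []
    for index, record in enumerate(records):
        for field_name in unencrypted_fields(record):
            violations.append((index, field_name))
    return (not violations, violations)


good_file = [{"name": "a" * 64, "card_number": "0f" * 32}]
bad_file = [{"name": "Jane Doe", "card_number": "0f" * 32}]
```

A file for which `accept_file` returns `False` would be deleted and the client asked to resubmit. A format check like this can only show that a field looks encrypted, so it complements rather than replaces the client-side configuration.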

In addition to encryption of sensitive data, it is critical that an organization implement a comprehensive Data Access Policy. This policy ensures that access to read, update, or delete data is limited to those individuals with a legitimate business need. The policy needs to detail the owner(s) of the data, the individuals with authority to request and approve access, and those with the security clearance to grant access and to regularly monitor access to files and data.
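
The roles named in such a policy can be made concrete with a small data structure. The sketch below is illustrative only: the class `DatasetPolicy`, its fields, and the user names are hypothetical, and a real system would sit on top of the organization's identity and access management tooling.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetPolicy:
    """Hypothetical record of who owns a dataset and who may approve access."""
    owner: str
    approvers: set
    grants: dict = field(default_factory=dict)     # user -> set of permissions
    audit_log: list = field(default_factory=list)  # (user, permission, allowed)

    def grant(self, approver, user, permission):
        """Record a grant, but only when requested by an authorized approver."""
        if approver not in self.approvers:
            raise PermissionError(f"{approver} may not approve access")
        self.grants.setdefault(user, set()).add(permission)

    def is_allowed(self, user, permission):
        """Check a read/update/delete request and log it for later review."""
        allowed = permission in self.grants.get(user, set())
        self.audit_log.append((user, permission, allowed))
        return allowed


policy = DatasetPolicy(owner="data_owner", approvers={"security_lead"})
policy.grant("security_lead", "analyst_a", "read")
```

Logging every check, including denied ones, is what makes the regular monitoring of access described above possible.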

Data Quality Assurance

Evaluating the quality and usability of large datasets received from multiple sources requires sophisticated analytics to quantify the integrity of the data. Regular production of data quality reports allows for rapid analysis of contributed data to ensure that data element values align with specifications, that statistics align with historical contributions, and that client-specific, metadata-based remediation logic is revalidated. Beyond statistical analysis of data contributions and regular dialogue with clients, recent research indicates that more sophisticated auto-encoder technologies are viable and accurate diagnostic components for monitoring data in production and in data contributions. An auto-encoder is trained to learn a compressed, distributed representation (encoding) of data and, once trained, can be used as an indicator of how representative a new dataset is of the historical or expected data. The auto-encoder self-diagnostic components provide regular monitoring of incoming data to ensure it is similar in character to the client’s historical data. Data monitoring with encoding technologies applies deep learning approaches to data integrity monitoring and helps ensure that relationships between data inputs are consistent with expectations. Records for which the auto-encoder output deviates significantly from the input exhibit data patterns not seen in previous data gathering or during model training, and such patterns are unlikely to be detected using traditional global statistical analysis.
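
The reconstruction-error idea can be illustrated with a deliberately simplified sketch. Instead of a trained deep network, it uses a linear encoder and decoder obtained from a singular value decomposition (in effect, principal component analysis) as a stand-in; the class name, the synthetic data, and the 99th-percentile threshold are assumptions made for the example, not the method used in production.

```python
import numpy as np


class LinearAutoencoder:
    """Linear stand-in for a neural auto-encoder, fitted with an SVD."""

    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        # Rows of vt are the principal directions of the historical data.
        _, _, vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = vt[: self.n_components]
        # Assumed rule: flag errors above the 99th percentile seen in history.
        self.threshold_ = np.quantile(self.reconstruction_error(X), 0.99)
        return self

    def reconstruction_error(self, X):
        centered = X - self.mean_
        codes = centered @ self.components_.T   # encode (compress)
        decoded = codes @ self.components_      # decode (reconstruct)
        return np.mean((centered - decoded) ** 2, axis=1)

    def flag(self, X):
        """True for records whose reconstruction deviates from the input."""
        return self.reconstruction_error(X) > self.threshold_


# Synthetic history: three fields that always move together, plus small noise.
rng = np.random.default_rng(0)
a = rng.normal(size=(500, 1))
history = np.hstack([a, 2 * a, -a]) + 0.01 * rng.normal(size=(500, 3))
model = LinearAutoencoder(n_components=1).fit(history)

new_records = np.array([
    [1.0, 2.0, -1.0],   # follows the historical relationship
    [1.0, -2.0, 1.0],   # each value is plausible alone; the relationship is not
])
flags = model.flag(new_records)
```

The second record illustrates the point made above: every individual value falls within its field's historical range, so per-field statistics would pass it, yet the relationship between the fields is broken and the reconstruction error is large.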

