The volume of data now being stored by businesses is at a point where the term “big data” almost feels inadequate to describe it. The size of big data sets is a constantly moving target, ranging from a few dozen terabytes to many petabytes of data in a single data set. And it is estimated that, over the next 2 years, the total amount of big data stored by businesses will be four times today’s volumes.
As business continues its inexorable shift to the cloud, weblogs continue to fuel the big data fire. But there are plenty of other sources as well - RFID, sensor networks, social networks, Internet text and documents, Internet search indexing, call detail records, scientific research, military surveillance, medical records, photography archives, video archives, and large-scale e-commerce transaction records.
Examples of big data include:
- Wal-Mart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes of data - the equivalent of 167 times the information contained in all the books in the US Library of Congress.
- In total, the four main detectors at the Large Hadron Collider produced 13 petabytes of data in 2010 (13,000 terabytes).
- Facebook handles 40 billion photos from its user base.
- The FICO Falcon Credit Card Fraud Detection System protects 2.1 billion active accounts world-wide.
Big data is integral to business today, driving processes and profits. All data has value, individually or aggregated, and as such needs to be accessible to analysts for business intelligence purposes and at the same time protected from hacking and other unauthorized use to comply with data protection mandates. And therein lies the major challenge - how to enable the business to benefit from accurate, available, and plentiful data without risking the security of that data.
And yet, even recognizing both the value and the vulnerability of big data, 2012 has already seen the exposure of almost 14 million records through close to 200 major breaches, according to Network World magazine.
What happens when big data is inadequately protected?
Clearly, the following kinds of stories are not what CEOs want to be reading about their companies over breakfast in the morning:
- LinkedIn was hacked on June 6, 2012, resulting in the compromise of approximately 6.5 million passwords.
- At New York State Electric & Gas Co, 1.8 million files containing customer Social Security numbers, birthdates and bank account numbers were exposed due to unauthorized access by a contractor.
- At Global Payments, Inc. 1.5 million payment-card numbers were exposed, plus potentially hacked servers with names of merchant applicants.
- The California Dept. of Child Support Services lost 800,000 records of adults and children on storage devices when the devices fell from an unsecured container in transit.
- At Utah Dept. of Technology Services, 780,000 patient files related to Medicaid claims were stolen from a server by hackers believed to be operating out of Eastern Europe.
- At the University of Nebraska, 654,000 files of personal data relating to students, alumni, parents, and university employees were exposed due to unauthorized access.
And those are just the tip of the iceberg. The big breach stories continue to appear, resulting in an increasing number of lawsuits. We don’t seem to be making any progress. So clearly, current standards of data protection are simply not working.
Protecting big data: what works – and what doesn’t
The challenge lies in applying appropriate levels of protection to different types of data, depending on the need for access to and use of those different data types. The data sets are so large now that they cannot be replicated out to multiple sites, rendering useless those approaches that require replication, such as masking and obfuscation. And relying only on access controls will not prevent the proverbial disgruntled system administrator or DBA from accessing the data.
So what are the options?
Hashing algorithms are one-way transformation functions that turn a message into a fingerprint; they are used to secure data fields in situations in which the original data will never need to be recovered. Hashing is useful for passwords, but is not suitable for any environment in which data must be reusable – which covers most business uses.
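To make the one-way property concrete, here is a minimal Python sketch of salted field hashing using the standard library. The field value, salt handling, and iteration count are illustrative choices, not a prescription:

```python
import hashlib
import os

def hash_field(value: str, salt: bytes, iterations: int = 100_000) -> str:
    # One-way: the digest cannot be reversed to recover the original value,
    # which is exactly why hashed fields are unusable for later processing.
    digest = hashlib.pbkdf2_hmac("sha256", value.encode(), salt, iterations)
    return digest.hex()

# A fresh random salt per record defeats precomputed (rainbow-table) attacks.
salt = os.urandom(16)
stored = hash_field("s3cret-passw0rd", salt)

# Verification re-hashes the candidate with the same salt and compares digests.
assert hash_field("s3cret-passw0rd", salt) == stored
```

Because there is no inverse function, a hashed Social Security number can never be displayed, joined, or analyzed again – hence the article’s point that hashing suits passwords but little else.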
Masking and Obfuscation, as noted above, become infeasible for large data sets because they require replication, and the value replacement is irreversible. While masking does not interfere with business operations involving the secured data, anyone more skilled than a casual thief will get past it pretty quickly.
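A minimal masking sketch makes the trade-off visible: the transformation is trivial to apply, but the original digits are simply discarded, so a masked copy can never be restored for later analysis (the function name is ours):

```python
def mask_pan(pan: str) -> str:
    # Irreversible: every digit except the last four is replaced outright.
    # There is no key and no inverse function - the data is gone.
    return "*" * (len(pan) - 4) + pan[-4:]

print(mask_pan("4111111111111111"))  # ************1111
```

This is why masking typically lives on replicated copies for test or display purposes: the production system still needs an unmasked original somewhere, which is exactly the replication burden the article describes.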
Format-Preserving or Data Type Preserving Encryption generates cipher text of the same length and data type as the input, which can simplify retrofitting encryption into legacy application environments. It provides protection while the data fields are in use or in transit, but it shares with other encryption technologies the need for different keys for different stakeholders and different access needs.
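To show the shape of the idea rather than a standards-compliant implementation, the following toy Feistel-style sketch maps an even-length digit string to another digit string of the same length and back. A real deployment would use a vetted scheme such as NIST FF1 with an AES-based round function; the hash-based round function and key here are purely illustrative:

```python
import hashlib

def _round_value(key: bytes, data: str, r: int) -> int:
    # Hash-based round function - a stand-in for the AES-based PRF
    # that standardized FPE modes such as FF1 actually use.
    digest = hashlib.sha256(key + bytes([r]) + data.encode()).digest()
    return int.from_bytes(digest, "big")

def fpe_encrypt(key: bytes, digits: str, rounds: int = 10) -> str:
    half = len(digits) // 2          # assumes an even-length digit string
    left, right = digits[:half], digits[half:]
    for r in range(rounds):
        f = _round_value(key, right, r) % 10 ** half
        left, right = right, str((int(left) + f) % 10 ** half).zfill(half)
    return left + right

def fpe_decrypt(key: bytes, digits: str, rounds: int = 10) -> str:
    half = len(digits) // 2
    left, right = digits[:half], digits[half:]
    for r in reversed(range(rounds)):     # undo the rounds in reverse order
        f = _round_value(key, left, r) % 10 ** half
        left, right = str((int(right) - f) % 10 ** half).zfill(half), left
    return left + right
```

The payoff for legacy systems is visible in the output: a 16-digit card number encrypts to another 16-digit number, so existing column types and field validations keep working unchanged.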
Strong Encryption is more applicable to high-risk data than format-preserving encryption and is the “gold standard” for encryption, supported by NIST and other standards bodies. However, it gives the encrypted text a different data type and length, increasing database size requirements, and it cannot provide fully transparent protection while the data fields are in use or in transit.
Point-to-Point Encryption provides strong protection of individual data fields and is a great way to protect highly sensitive data that needs continuous protection in a data flow. As with any other type of encryption, it uses encryption keys based on a mathematical algorithm; while security may be stronger, the reliance on keys remains a significant vulnerability for the flow of data.
Vault-based Tokenization was originally intended to provide a more manageable, less intrusive solution than encryption while still meeting PCI requirements. It does this by substituting the credit card number with a token, using a look-up table. However, the reliance on look-up tables has a major impact on scalability, as the tables rapidly become unwieldy and high levels of transaction latency result.
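A stripped-down sketch of the vault approach shows where the scalability problem comes from: every tokenize and detokenize call must consult one central, ever-growing table, which every transaction in the enterprise contends for (class and method names are hypothetical):

```python
import secrets

class TokenVault:
    """Central look-up table mapping card numbers to tokens and back.

    Every record ever tokenized adds a row; the vault grows without bound
    and becomes the bottleneck (and replication headache) for all lookups.
    """

    def __init__(self):
        self._to_token: dict[str, str] = {}
        self._to_value: dict[str, str] = {}

    def tokenize(self, pan: str) -> str:
        if pan in self._to_token:                 # reuse the existing mapping
            return self._to_token[pan]
        while True:
            # Random digit token of the same length; retry on (rare) collision.
            token = "".join(secrets.choice("0123456789") for _ in range(len(pan)))
            if token not in self._to_value:
                break
        self._to_token[pan] = token
        self._to_value[token] = pan
        return token

    def detokenize(self, token: str) -> str:
        # Only possible by asking the central vault - hence the latency.
        return self._to_value[token]
```

Because the token is random rather than derived, the mapping exists nowhere except the vault, so every consuming system must round-trip to it, and the table itself must be replicated, backed up, and synchronized.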
Vaultless Tokenization is much more manageable than its vaulted cousin, particularly in the context of data that needs to be frequently accessed and manipulated, as it uses small, distributed, random, pre-generated look-up tables instead of a single large table. This reduces or eliminates latency, enabling data to be quickly tokenized and detokenized as needed, and can be infinitely scaled using commodity hardware. Vaultless tokenization enables the security to travel with the data – it’s tokenized throughout the workflow, whether at rest, in use, or in transit.
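The following deliberately simplified sketch illustrates the core vaultless idea: small, pre-generated tables derived from a shared key, which any authorized node can rebuild locally, so no central vault or synchronization is needed. A real product would use far more tables and mix in per-position and per-record inputs; this static digit substitution is a teaching aid, not a secure design:

```python
import hashlib
import random

def _tables(key: bytes, n_tables: int = 16) -> list[list[int]]:
    # Pre-generated, key-derived digit permutations. Any node holding the
    # key reconstructs identical tables in memory - no vault, no lookups
    # over the network, no table that grows with the data.
    seed = int.from_bytes(hashlib.sha256(key).digest(), "big")
    rng = random.Random(seed)
    tables = []
    for _ in range(n_tables):
        perm = list(range(10))
        rng.shuffle(perm)
        tables.append(perm)
    return tables

def tokenize(key: bytes, digits: str) -> str:
    tables = _tables(key)
    # Substitute each digit via the table chosen by its position.
    return "".join(str(tables[i % len(tables)][int(d)])
                   for i, d in enumerate(digits))

def detokenize(key: bytes, token: str) -> str:
    tables = _tables(key)
    # Invert each permutation to map token digits back to the originals.
    inverse = [[perm.index(v) for v in range(10)] for perm in tables]
    return "".join(str(inverse[i % len(inverse)][int(d)])
                   for i, d in enumerate(token))
```

Note the contrast with the vault: the tables are fixed in size regardless of how many records are tokenized, all work happens in local memory, and the same key yields the same tables on every node, which is what lets the protection travel with the data.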
Vaultless tokenization in practice
A leading healthcare informatics organization has recently adopted vaultless tokenization. The company provides SaaS-based clinical business intelligence solutions that connect patient information across multiple medical settings and time periods to generate targeted reports and analyses, such as trends in treatment protocols or drug usage. Prior to adopting tokenization, the company had relied solely on access control and authentication to protect patient data.
The company takes in data - including social security numbers and other personally-identifiable information - in multiple formats, converting everything to a standardized format. With a current database of more than 15 million patients, a number that is expected to treble within five years, any breach would have quickly become unmanageable with such a minimal level of protection. The company would not have been able to identify which data had been affected, nor would they have been able to identify which individual employees had had access to the data, causing major compliance issues and rendering the organization vulnerable to significant expense in fines and restitution costs.
By adopting vaultless tokenization, the company is now able to apply protection to the data as part of the format standardization process, ensuring that they will be able to run analytics on demand while keeping the data fully secured, an approach that, as we’ve seen, would not be possible with other options for data protection.
The lightweight nature of vaultless tokenization means all the heavy lifting can be done in memory within the data warehouse or database system; the latency issues of vault-based tokenization cease to be a factor. Because big data comprises both structured and unstructured data, neither encryption nor tokenization alone will serve the full spectrum of business needs. Applying vaultless tokenization to structured data, supported by and wrapped in strong encryption, ensures continuous protection and compliance while allowing businesses to take full advantage of big data analysis and manipulation.