Evolving Data Systems to Ensure the Right Tool for the Job

Sep 10, 2014

By Simon Metson

Data is a living thing. People have “databases” in all kinds of interesting places and formats, and your organization likely has data stored in many different systems and formats. At some point, the number of systems that process and store your data will become problematic.

This week, I met with a firm that had been created by the merger of two large firms. Just as when you move in with your partner and merge record collections, only to find you’ve got two copies of all your favorite albums, these two companies found they had multiple systems doing the same job. Among them: four ETL systems (four!), multiple CRMs, and different operational data stores working for different legacy pieces of the combined businesses.

There’s no such thing as a silver bullet, and they knew multiple systems were needed in some instances, but there was room to rationalize down to a simpler architecture. Ideally, they wanted a single system capable of managing both operational data and ETL-like workflows for the whole business.

The Right System for the Job

It’s not unusual to have different parts of your data system solving different problems. From partitioning in a relational database to user-facing in-memory caches to a Lambda architecture, it’s important that each piece of the system is the “right tool for the job”: appropriate for the task at hand, and not duplicating the functionality of another piece of the system.

While running a number of systems (for different tasks) is likely necessary, you want to avoid having redundant systems (as in four ETL systems, not as in two PSUs). Proliferation of components in your data processing system will cause issues with data duplication and with identifying the system of record (who is the master?), and can also cause problems with the freshness of cached or derived data.
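
One way to make that redundancy visible is to keep a simple inventory of systems and the role each one fills, then flag any role served by more than one system. The following is only a minimal sketch in Python; the system names, role labels, and the `redundant_roles` helper are all invented for illustration and do not describe any particular firm's setup.

```python
from collections import defaultdict

# Hypothetical inventory of (system name, role it fills).
# Every name and role label here is an assumption made up for this example.
INVENTORY = [
    ("etl_a", "etl"),
    ("etl_b", "etl"),
    ("etl_c", "etl"),
    ("etl_d", "etl"),
    ("crm_east", "crm"),
    ("crm_west", "crm"),
    ("session_cache", "cache"),
]


def redundant_roles(inventory):
    """Group systems by role and return only the roles served by more than one system."""
    by_role = defaultdict(list)
    for name, role in inventory:
        by_role[role].append(name)
    # Keep roles with two or more systems; sort names for stable output.
    return {role: sorted(names) for role, names in by_role.items() if len(names) > 1}


if __name__ == "__main__":
    for role, systems in redundant_roles(INVENTORY).items():
        print(f"{role}: {len(systems)} systems ({', '.join(systems)})")
```

A list like this will not tell you which system should become the system of record, but it gives the organization a shared starting point for that conversation.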

While there are technical challenges, the root of these problems is largely non-technical. It’s really a question of how you restructure, how the organization does business, and how you prioritize the pain and inefficiencies felt across different departments in the organization.

So, how do you go from an overlapping, tangled legacy system to a simplified, streamlined system where there’s one — and only one — tool for each task?

Identifying the Issue to Solve

Choosing the problem is an art in itself. It is important to identify a single issue and empower a team to experiment, evaluate a few technologies, and build a system (possibly a whole new one ... bear with me) to solve the problem. Identifying a single issue allows you to clearly define success.

You want to solve a problem that’s interesting and complex enough to challenge the team and encourage learning and experimentation. On the other hand, you don’t want to tie the team up for years or end in failure without providing insights. It’s OK for the project to fail, but you need to understand why, and you need to be able to learn from the mistakes.

Experiments should build on the shoulders of those who have gone before, and should take and interpret data. Measuring everything against your success criteria as your team builds out the system is vital; you need to know that the team can be productive in the new world, and that the system meets its performance and cost requirements.
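
To make "measuring against success criteria" concrete, the criteria can be written down as data and checked as measurements come in. What follows is a minimal sketch in Python, not a prescribed method; the criterion names, thresholds, and measured values are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One success criterion: a named metric and the limit it must satisfy."""
    name: str
    limit: float
    higher_is_better: bool = False

    def met(self, measured: float) -> bool:
        if self.higher_is_better:
            return measured >= self.limit
        return measured <= self.limit


def evaluate(criteria, measurements):
    """Return {criterion name: True, False, or None}; None means not yet measured."""
    results = {}
    for criterion in criteria:
        value = measurements.get(criterion.name)
        results[criterion.name] = None if value is None else criterion.met(value)
    return results


# Hypothetical criteria and measurements (assumptions for this example only).
CRITERIA = [
    Criterion("p95_latency_ms", 200),
    Criterion("monthly_cost_usd", 5000),
    Criterion("deploys_per_week", 3, higher_is_better=True),
]
MEASUREMENTS = {"p95_latency_ms": 180, "monthly_cost_usd": 6200}

if __name__ == "__main__":
    for name, outcome in evaluate(CRITERIA, MEASUREMENTS).items():
        status = {True: "met", False: "missed", None: "not measured"}[outcome]
        print(f"{name}: {status}")
```

Treating "not measured" as its own outcome is a deliberate choice here: a criterion nobody has measured yet is a gap in the experiment, not a pass.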

(Re)Building the System

The amount of high-quality open source software available will likely make building a whole new system from scratch unnecessary. Or, you may already have the right components in house, just not “wired up” in the way you need. Your project is likely to be more integration work, combined with specializations or extensions for your particular problem (ideally ones that get contributed back to the open source projects in use).

Similarly, low-risk, pay-as-you-go cloud resources (DBaaS, IaaS, PaaS) give you the ability to test out your solution at scale, without making large, long-term commitments like racking 500 servers or building a new data center.

Technical challenges can often be solved in a non-technical manner. At CERN we avoided a very complex ACL problem by involving humans in a key decision-making step. Building the GUI to make that approval process easy and quick was substantially simpler, and faster to deliver, than building an automated system to implement the necessary policy decisions. Ask yourself: are there changes to procedure that could be brought in to make the problem simpler, or make it vanish entirely? Computers don’t understand good-faith agreements.

Continued Development

Once you have a solution that is meeting your success criteria, it’s time to ask your team and wider organization a few additional questions. Can the solution for the first problem be reused for another? Did the team enjoy working with the system? Is there broader interest from other departments? Can the lessons and skills learnt developing this solution be brought to bear on another problem, even if the system isn’t directly portable?

These kinds of questions will inform your next steps and help identify the next problem to tackle in subsequent rounds of development.

Sound a lot like refactoring code? That’s intentional. If you find yourself in a situation where you’ve got numerous, overlapping systems, all fulfilling business-critical data processing roles, but in a painful or fragile manner, you’re going to need to gradually refactor both your data processing system and your organization. Taking the approach above, where each incremental step simplifies the broader problem and lines up the next step, will set you on a decent path to a clean, consolidated data processing system.

Simon Metson is the director of product at Cloudant, an IBM Company.