Every now and then, the IT industry - vendors and customers alike - takes a common problem, gives it a catchy name, and drives the buzz (and market) rapidly to create new business opportunities for everybody. It’s happening again with the "big data" phenomenon. It's here, it's real, and yes – it is probably going to cost you a lot of money over the next few years.
To understand it, we should first answer a simple question: what is "big data," and does my system fall into this category? The web is full of definitions for the term, but one clear definition is now taking hold:
Big data is a generic name for data sets so large that a conventional RDBMS can't really handle them.
In most cases, big data sets are created in places where raw data is collected and kept for months or even years. Raw data could be a list of phone calls made by all subscribers of a big cellular company (tens of millions of calls a day), lists of web activities by ISP subscribers (billions of operations a day), or even credit card transactions kept in their raw format.
Now, why do we really need to keep all this data? Frankly speaking – in many cases – you don't. You aggregate it and keep the aggregations for years, and in most cases, aggregations can give good answers to most of your questions. After all, the world kept spinning without big data solutions. But there are many cases where every single piece of data can be converted into revenue.
There are quite a few challenges when it comes to big data solutions, but in the end, it all boils down to the following two:
- Storing and aggregating billions of rows every day is almost an impossible mission for most conventional RDBMSs (unless you are willing to pay millions of dollars for high-end servers, storage arrays and software licenses)
- Aggregations limit the way you can consume your data. You may want to keep the option to process historical data sets in different ways to produce new insights for your business.
This article is going to address what are called "big data appliances" (also known as database appliances) that fit into the RDBMS category – appliances that store tables (column- or row-oriented) and can be accessed using standard SQL statements. Most users who already work with a conventional database (Oracle, SQL Server, DB2, MySQL, etc.) find it much easier to migrate to another RDBMS than to move to a different data structure.
Now the million-dollar question – how do you decide which one to choose? Here are a few guidelines and tips to keep in mind before you take the plunge.
Tip #1: Never believe the sales guy who says that it works like magic. Always run a benchmark/proof of concept (POC).
Many people have witnessed shiny appliances suffering bad moments. The bottom line is this - each of these appliances has its strengths and weaknesses. The one that fits your specific needs may, due to its inherent differences, not fit someone else's. The reason is that each appliance has a different way of doing things: loading, analyzing, storing and so on. This is why you have to run benchmarks. Don't let the vendor tempt you into skipping the test; they know why! You can narrow down your options by better understanding how these systems are built, but in the end, you will find yourself testing at least two to three appliances.
Here are the 10 steps for a successful database appliance benchmark/POC:
1. Set your goals and targets – define success ahead of time
Sounds easy, right? Think again. There is no way you can test your whole system on one of these appliances. The migration effort is too big, and there isn't enough time. So you must come up with a way to test some of the fundamental parts of your application on these appliances. You may want to choose one data flow (loading data → processing it → consuming it), or you may want to focus mainly on consuming (loading a huge data set and running many queries on top of it to get a better sense of the performance). Either way is fine.
Tip #2: Most appliances will handle the number of rows you want to push into them one way or another, so don't spend precious time testing their data loading mechanisms.
Put more attention on consuming the data using SQL statements, and look at the peripheral elements such as management, backup and restore, monitoring, etc.
Tip #3: Success is measured in numbers, not feelings. Set reasonable numeric goals, such as:
- I want to push 1 day of traffic in no more than 30 minutes
- I want this report to scan 2 years of data and finish in 30 seconds
- I want to run the same report on 4 years of data and I want to see linearity (60 seconds)
- Quality of Service (QoS) – I want to run the same query 100 times in parallel and still get all result sets in less than 10 seconds
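Targets like these can be captured as simple numeric pass/fail checks so that "success" is unambiguous once the numbers come in. A minimal sketch; the target names and thresholds below are hypothetical, not tied to any particular appliance:

```python
# Hypothetical benchmark targets expressed as pass/fail thresholds.
# All names and numbers are illustrative placeholders.
TARGETS = {
    "load_one_day_of_traffic": 30 * 60,  # seconds: 1 day of data in <= 30 min
    "report_2_years": 30,                # seconds
    "report_4_years": 60,                # seconds: ~linear vs. the 2-year run
    "qos_100_parallel": 10,              # seconds: slowest of 100 parallel runs
}

def evaluate(measured: dict) -> dict:
    """Compare measured runtimes (in seconds) against the targets."""
    return {name: measured[name] <= limit
            for name, limit in TARGETS.items() if name in measured}

results = evaluate({"report_2_years": 24.0, "report_4_years": 71.5})
# The 2-year report passes (24 <= 30) but the 4-year report fails
# (71.5 > 60), which also flags a break in linearity worth investigating.
```

Writing the targets down this way before the tests start keeps the vendor from redefining success after the fact.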
2. Build the benchmark/POC team
Don’t expect the vendor to do it all for you. If you count on the vendor to run the benchmark, you can bet they will make sure the numbers look great. No one will dare fake the results, but they will apply changes to the test bed that position their appliance better; such changes might not be applicable in real life.
Tip #4: Make sure someone from your DBA team is part of the benchmark team, and that he/she is in the lead.
Never let the vendor lead the process. Your representative will approve the benchmark plan and must witness the tests.
3. Choose your candidates based on features, price, limitations, manageability, mean time to recovery (MTTR), migration cost, third party costs, training and education costs, etc.
There are no rules of thumb here so you must understand the individual appliance architectures. Only in this way will you be able to narrow down your options to two or maybe three benchmark candidates.
For example, some architectures use purely column-based tables (which work best for some data warehouse environments), some use "shared nothing" architectures while others use "shared everything," some are MPP (massively parallel processing) and some are not, and some are more likely than others to hit the interconnect traffic limit for your database.
Tip #5: Contact someone who can give you some advice and is familiar with all of the appliances that you’re evaluating.
There are not many people out there who can do this, but on the other hand, you don't need them for the whole process.
4. Write a decision table (performance isn't everything)
It's very easy to make a decision based on performance metrics only, but performance isn't everything. Here are some other questions you must be able to answer, grouped into a few areas:
1. Manageability and monitoring
- Can I leverage my existing staff to manage the new environment?
- Do I get an informative monitoring tool that can help me on a daily basis?
- How do I monitor the traffic down to the SQL activity?
- How do I monitor the life signs of that appliance?
- Do I get any tools to help me out with loading data?
- Can I check to see if the system operates according to plan or just hope that it does?
And many more.
Tip #6: The better the management tools you get, the better and faster the migration will be, and the day-to-day management costs will be dramatically lower.
2. Backup, restore and MTTR
- How do I back it up?
- How do I restore it?
- What is the MTTR for that solution? Can we live with that?
3. Performance and scalability
- How do I scale with this solution? What are the limits? Is it linear?
- How can I guarantee QoS and service level agreements (SLAs) in high concurrency rate?
- How do I reactively and proactively manage the appliance performance?
4. Migration costs
- What does it take to migrate the data I already have?
- What does it take to migrate all the code (views, stored procedures etc)?
- Is this a new SQL standard I have to adopt?
5. Training and education: building the needed skill set
Always important, but this will not be the top priority here. All database appliance vendors will give you nice discounts when you are about to make up your mind, and they will all align more or less at the same price level.
5. Create a sample data set and define the transaction set
The best thing you can do is work with real data. But some vendors will not lend you an appliance and will ask you to bring the data set to their lab, and regulations may not allow you to do that. In that case you need to either simulate data or scramble real data; the latter is preferred.
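One common way to scramble real data is deterministic pseudonymization: hash each sensitive value with a secret salt, so that joins and value distributions still behave like the real data while the originals are unrecoverable. A minimal sketch; the column names and sample row are hypothetical:

```python
import hashlib

SALT = b"rotate-this-secret-per-poc"  # keep out of source control in practice

def pseudonymize(value: str) -> str:
    """Deterministically scramble a sensitive value.

    The same input always maps to the same token, so joins and
    cardinality on the scrambled data still behave like the real data,
    but the original value cannot be recovered without the salt.
    """
    return hashlib.sha256(SALT + value.encode()).hexdigest()[:16]

row = {"subscriber_id": "0501234567", "duration_sec": 42}
scrambled = {**row, "subscriber_id": pseudonymize(row["subscriber_id"])}
# duration_sec is left intact: it drives aggregation and compression
# behavior and is not identifying on its own.
```

Only scramble the identifying columns; measures and timestamps should stay intact so the benchmark still exercises realistic aggregation, distribution and compression behavior.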
Here are some guidelines in order to prepare data sets for the POC:
- Make sure you test the appliance with at least one year of data. All appliances work well on 1-2 months of data regardless of their architecture, because it all resides in the cache
- Make sure you load at least 10 TB of data. All appliances give good numbers below 10 TB
- Test at least 2-3 different data sets that reflect your system's behavior. Different data sets behave differently in terms of distribution among nodes, compression, performance scalability, etc.
- Make sure the vendor you are about to test supports the data set you want. Some vendors have restrictions on data types, column lengths, etc.
When it comes to the transaction set, don't settle for 5-10 queries (most POCs out there run no more than 10 transactions – which is really nothing). Make sure you test at least 30-50 queries and their variations to make a decision. You must include the following scenarios:
- Full scans of the whole data set without aggregation. Make sure runtime scales linearly as the data set grows (1 TB, 2 TB, 4 TB, 10 TB…)
- Same as above, but with aggregation (sum, average, min, max or any other aggregation you want)
- Aggregate a single column, and then aggregate several columns at once
- Join two or more large tables. Believe it or not, that's a weakness of some database appliance vendors
- Run queries while you load data. This creates locking issues on some appliances
- Run queries simultaneously. Some appliances do not keep up with high concurrency rates. Try 2, 5, 10, 20, 40 and 100 concurrent queries
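The concurrency scenario above can be driven by a small harness that fires N copies of a query at once and records each runtime. A sketch using only Python's standard library; `run_query` is a stand-in you would replace with the vendor's actual client driver:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_query(sql: str) -> float:
    """Stand-in for a real appliance client call; returns runtime in seconds.

    Replace the body with the vendor's driver (ODBC/JDBC/native client).
    """
    start = time.perf_counter()
    time.sleep(0.01)  # placeholder for the actual query round-trip
    return time.perf_counter() - start

def concurrency_test(sql: str, concurrency: int) -> dict:
    """Run the same query `concurrency` times in parallel and summarize."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        runtimes = list(pool.map(run_query, [sql] * concurrency))
    return {"concurrency": concurrency,
            "max_sec": max(runtimes),  # QoS: the slowest caller's wait
            "avg_sec": sum(runtimes) / len(runtimes)}

for n in (2, 5, 10, 20, 40, 100):
    stats = concurrency_test("SELECT ...", n)  # watch max_sec as n grows
```

Watching how `max_sec` grows relative to `avg_sec` as concurrency rises is exactly the QoS signal Tip #3 asks you to quantify.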
If your applications don't need any of these scenarios, don't waste too much time testing them; still, it is recommended that you test them on a smaller scale. Someday you will need to add new functionality to your system, so you should at least be aware of the limitations. While testing these appliances, your team also builds the know-how and skill set needed for the deployment and migration phases.
6. Assemble a comparison mechanism that can be used with all vendors’ database appliances
You are the deciding factor. You have to be able to compare appliance vs. appliance, down to query vs. query. So you must be able to collect all the traffic and performance metrics on the appliances you are about to test and compare them. To save time, you can use a tool that monitors database appliances, so you can make a decision based on numbers rather than feelings.
If the vendor says "We do not want you to install any tools during the benchmark," that is a good sign that you should insist on doing so. Don't worry; the vendor has too much to lose if you walk away, and will give you the green light once it knows you are serious about this benchmark.
7. Run the tests on all appliances
The shortest step to describe – but the longest to execute. You don't have to run all tests simultaneously on all appliances. You probably cannot afford the luxury of bringing all the appliances together into a single room on the same day. The data that you collect (using a tool or manually) remains your property and can easily be correlated and compared with performance metrics taken from another appliance you already tested on a different day.
8. Make modifications to the queries or schema structure to get better results
Sometimes the vendor will tell you “if you run this query a little bit differently (which suits that particular appliance architecture), you will get better results.”
Tip #7: This is a fair statement.
But how will you be able to compare the queries if they are different? Make sure you tag the SQLs in your lists so that you can compare the tags later on.
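Tagging can be as simple as embedding a stable identifier in a comment in each SQL text, so every vendor-specific rewrite of the same logical query carries the same tag. A minimal sketch; the tags and queries below are illustrative:

```python
import re

# Every variant of the same logical query carries the same tag comment,
# even when the SQL itself is rewritten per appliance.
QUERIES = [
    "/* tag:Q17 */ SELECT region, SUM(amount) FROM calls GROUP BY region",
    "/* tag:Q17 */ SELECT region, SUM(amount) FROM calls_by_region",  # vendor rewrite
    "/* tag:Q18 */ SELECT COUNT(*) FROM calls",
]

def tag_of(sql: str) -> str:
    """Extract the logical-query tag from a SQL text."""
    match = re.search(r"/\*\s*tag:(\w+)\s*\*/", sql)
    return match.group(1) if match else "untagged"

by_tag = {}
for sql in QUERIES:
    by_tag.setdefault(tag_of(sql), []).append(sql)
# by_tag groups both Q17 variants together, so their runtimes can be
# compared across appliances despite the different SQL text.
```

Because the tag travels inside the SQL comment, it also shows up in the appliances' own query logs, which makes correlating results in step 9 much easier.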
9. Compare transaction runtimes, resource consumption and database metrics results
You need to answer a few questions: Who performed better on each test? How many resources did each query consume? What was the execution time? Add any other metric you find valuable.
Tip #8: Don’t fall into the trap of comparing just the runtime of a single transaction
If a single transaction runs for 30 seconds on all appliances, but one appliance consumed 100% of its resources while another used just 20%, that tells you something about scalability under concurrency. Always look at and correlate all the metrics.
10. Declare your winner
Now that you have everything you need, and your decision matrix is filled with numbers and evidence – you are ready to make a decision.
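The decision table from step 4 can be reduced to a single weighted score per appliance. A deliberately simple sketch; the criteria, weights and scores below are all hypothetical placeholders for your own evidence:

```python
# Hypothetical decision matrix: weights reflect your priorities and must
# sum to 1.0; scores (1-10) come from the benchmark evidence.
WEIGHTS = {
    "performance": 0.35,
    "manageability": 0.20,
    "migration_cost": 0.20,
    "backup_mttr": 0.15,
    "training": 0.10,
}

SCORES = {
    "appliance_a": {"performance": 9, "manageability": 6, "migration_cost": 5,
                    "backup_mttr": 7, "training": 6},
    "appliance_b": {"performance": 7, "manageability": 8, "migration_cost": 8,
                    "backup_mttr": 8, "training": 7},
}

def weighted_score(scores: dict) -> float:
    """Combine per-criterion scores into one number using the weights."""
    return sum(WEIGHTS[c] * s for c, s in scores.items())

ranking = sorted(SCORES, key=lambda a: weighted_score(SCORES[a]), reverse=True)
# Note: the raw performance winner (appliance_a) is not necessarily the
# overall winner once the other criteria are weighed in.
```

The point of the exercise is not the arithmetic but the discipline: agreeing on the weights before the benchmark keeps the final decision anchored to the goals you set in step 1.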
Remember, the journey is just about to begin. There will be many obstacles along the way, and you will find yourself making many changes you didn't plan for. In many cases, though, the journey is as enjoyable as the end result.
About the author:
Irad Deutsch is CTO at Veracity group and CTO of MORE IT Resources (MORE). He also serves as the Israeli DBA Oracle user group manager.