The Evolving Open Source Database Landscape

Apr 19, 2017

By Peter Zaitsev, Co-Founder, Percona

While not the most media-hyped technology, databases are certainly one of the most crucial when it comes to our always-online, always-connected society. Databases power not just the applications and websites we use every day, but the businesses that generate revenue and fuel the economy. The internet relies on functioning and well-performing databases to operate.

The database landscape saw some interesting developments in 2016, and there are more to look forward to in 2017.

We’ve already seen a definite and positive move to open source database solutions, even in enterprise markets. This makes sense. While there can be objections to adopting open source database technology, the pluses greatly outweigh any perceived downsides (which can be mitigated). Those pluses include flexibility, an active community that is constantly providing new features and support, and (of course) reduced cost.

Generally speaking, the trend seems to be moving toward open source database environments that require less maintenance, and toward more database-as-a-service (DBaaS) deployments. Businesses need to be mobile and flexible when it comes to their application and web use, and this same mobility needs to be mirrored in their database environments.

The “pets-versus-cattle” model is an apt analogy here. Pets need to be constantly maintained, verified, and guided in order to function correctly. Cattle just need to be counted: they perform their function by just existing, with little oversight. One “cattle unit” failing doesn’t derail the whole operation. The mobility that businesses need to meet changing expectations can be derailed by complex or hard-to-implement databases. Businesses don’t want to maintain database “pets.” They want simplicity: to be able to put up and tear down database instances to quickly meet their needs, and then let them do their thing.

We can see this play out with the increase in DBaaS and platform-as-a-service (PaaS) engagements, using such vendors as Amazon Relational Database Service/Amazon Aurora, Google Cloud, and Microsoft Azure.

Another aspect of this desire for database simplicity is a move to “serverless architectures.” Serverless architectures refer to applications that depend largely on third-party services, sometimes referred to as backend-as-a-service (BaaS). They can also consist of code that’s run in containers. This is also referred to as function-as-a-service (FaaS). BaaS and FaaS environments can significantly reduce operational cost and complexity but also increase vendor dependencies.

Finally, the 12-factor application model is going to play a larger role in the database environment. As more and more DBaaS functions are delivered, the applications that provide them are going to require a certain standardization and performance guarantee. The 12-factor app methodology for building SaaS apps is uniquely suitable for deployment on modern cloud platforms by reducing the need for servers and systems administration.
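One concrete piece of the 12-factor methodology is that configuration, including database connection details, lives in the environment rather than in the code. The following is a minimal Python sketch of that idea. The variable name DATABASE_URL, the DbConfig type, and the load_db_config helper are illustrative assumptions, not part of any particular platform.

```python
import os
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class DbConfig:
    scheme: str
    host: str
    port: int
    user: str
    database: str


def load_db_config(env=None):
    """Build connection settings from the environment (hypothetical helper)."""
    env = os.environ if env is None else env
    # DATABASE_URL is an assumed variable name, chosen for illustration.
    url = urlparse(env["DATABASE_URL"])
    return DbConfig(
        scheme=url.scheme,
        host=url.hostname or "localhost",
        port=url.port or 3306,  # falls back to the usual MySQL port
        user=url.username or "",
        database=url.path.lstrip("/"),
    )


if __name__ == "__main__":
    # A made-up URL; in a real deployment the platform would set this variable.
    example_env = {"DATABASE_URL": "mysql://app@db.internal:3306/shop"}
    print(load_db_config(example_env))
```

Because nothing about the database is hard-coded, the same build can be pointed at a different instance simply by changing the environment.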

The ‘Cloud’

No set of predictions can ignore the cloud. Cloud-based solutions are going to continue to grow and encompass more and more of the database landscape. The flexibility this approach provides to enterprises should not be underestimated. To some extent, it also reduces the management and maintenance overhead for companies looking to control costs. However, there are some dangers as well, such as concerns about over/under provisioning, getting proper support, and ballooning costs. It is still important to understand what is going on with your database architecture and environment (or at least to have someone who does, using the appropriate monitoring and management tools).

The cloud is not just Amazon RDS; there are also Google Cloud SQL, Microsoft Azure, Rackspace Cloud Hosting, and other providers. This approach is going to put pressure on database vendors and their licensing models. Permissive licenses are good for the cloud but bad for the vendor. An AGPL-licensed piece of software is not as great for the cloud but is great for the vendor. As more and more businesses look to use the benefits provided by placing some or all of their data in the cloud, licensing will become a bigger issue.

Open source licensing provides greater benefits here.

Containerization

Along with cloud deployments, we will see more movement toward containerization. Containers are lightweight alternatives to full machine virtualization. An application runs “in a container,” with the container housing the operating environment. This provides many of the same benefits of virtual machines, and the application can be run on any physical machine without worrying about dependencies.

Open source Docker has gained recent prominence for containerization. Docker containers can run on everything from physical computers to virtual machines, bare-metal servers, OpenStack cloud clusters, public instances, and more.

While it was the first to bring attention to containerization, Docker isn’t the only container option. CoreOS’s Rocket, LXC, Project Atomic, and others exist, and each of them provides more or less functionality or features. Ubuntu has announced the LXD container engine for its version of Linux, and Windows Server will have Drawbridge and Spoon. Kubernetes is an open source container cluster manager that can automate deployment, scaling, and operations of application containers across clusters of hosts.

Companies that prefer private data centers to public cloud deployments will find containers a good option for their database environments.

Security and Encryption

As always, data security is a hot topic—and will remain so in the future. With more and more systems, applications, and processes going online for remote access, there is more data that is exposed to breaches. Many industries—healthcare, financial services, government, and insurance—have mandated compliance regulations. Many security officers across the globe still appear to equate compliance with security. However, with the near-weekly reports of data theft incidents at institutions that reportedly met compliance mandates, compliance doesn’t necessarily mean you won’t be breached and have sensitive data stolen. The rules for good security haven’t changed, but then neither have the problems: mainly lax procedures and not enforcing safeguards.

Companies must continue to ensure seamless encryption (both in-transit and at rest) with as little overhead as possible. Data in transit is data currently moving from one location to another: over the internet or through a network. Data at rest is data that is stored on a hard drive, flash drive, or in some other way.

Implement robust network security controls to help protect data in transit. Don’t rely on reactive security to protect your valuable company data.
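As a small illustration of protecting data in transit, the sketch below builds a client-side TLS configuration with Python’s standard ssl module. The helper name strict_client_context and the choice of TLS 1.2 as the minimum version are assumptions made for this example. How such a context is handed to a database driver varies by driver and is not shown.

```python
import ssl


def strict_client_context() -> ssl.SSLContext:
    """Return a TLS client context that verifies the server (hypothetical helper)."""
    # create_default_context() enables certificate and hostname verification.
    ctx = ssl.create_default_context()
    # Refusing older protocol versions is an assumed policy, not a universal rule.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


if __name__ == "__main__":
    ctx = strict_client_context()
    print(ctx.verify_mode, ctx.check_hostname, ctx.minimum_version)
```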

Efficiency

As with anything with a set of consumables, database efficiency will continue to be a focus in 2017 and ahead. Two of the ways to achieve efficiency are speed and compression. For speed, the continued price reductions in SSDs will make them the go-to hardware in the coming year. Currently, the per-gigabyte price for SSDs is dropping below that of spinning drives (depending on the size and quality).

For compression, we will continue to see quicker and better ways to compress data that don’t slow down access. The interplay between algorithm and block size will continue to improve, as we’ve seen this year with Snappy, Zstandard, and LZMA.
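The interplay between algorithm and block size can be seen with a short experiment. The sketch below compresses the same made-up, row-like data in independent blocks of different sizes and reports the overall ratio. It uses only Python’s standard library, so zlib stands in for Snappy and Zstandard, with LZMA alongside it. The helper names and sample data are assumptions for illustration.

```python
import lzma
import zlib


def compressed_size(data: bytes, block_size: int, compress) -> int:
    """Compress data in independent blocks and return the total output size."""
    total = 0
    for start in range(0, len(data), block_size):
        total += len(compress(data[start:start + block_size]))
    return total


def sample_rows(count: int) -> bytes:
    """Generate repetitive, row-like text (made-up data for illustration)."""
    rows = (f"{i},customer_{i % 50},status=shipped\n" for i in range(count))
    return "".join(rows).encode("ascii")


if __name__ == "__main__":
    data = sample_rows(20000)
    for name, fn in (("zlib", zlib.compress), ("lzma", lzma.compress)):
        for block_size in (4096, 16384, 65536):
            size = compressed_size(data, block_size, fn)
            print(f"{name:5} block={block_size:6} ratio={len(data) / size:.2f}")
```

Larger blocks give the compressor more context and fewer per-block headers, but a reader must decompress a whole block to reach a single row, which is the trade-off a storage engine has to balance.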

Various storage engines will also play a part, including MyRocks and MongoRocks in the MySQL and MongoDB spaces, respectively.

Automation

Organizations’ desire not to have to endlessly babysit the database environment brings with it the adoption of automation procedures. Databases are required to run applications; if the database is down, so is the application (from the user’s perspective).

Automation capabilities eliminate the manual administrative overhead and complexity typically associated with managing and protecting data warehouses for multi-structured big data.

The future will require failover solutions that ensure that if the master disappears, a slave can take over quickly, keeping downtime within an extremely tight service level agreement.
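The core decision in an automated failover can be sketched in a few lines. The example below is a simplified illustration, not how any specific tool works: the Node type, its fields, and the rule of promoting the healthy replica with the most advanced replication position are all assumptions. The code calls the master “primary” and the slaves “replicas.”

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    healthy: bool
    applied_position: int  # hypothetical replication position; higher is more current


def choose_new_primary(primary: Node, replicas: List[Node]) -> Optional[Node]:
    """Return the replica to promote, or None if failover is not needed or not possible."""
    if primary.healthy:
        return None  # the primary is fine, so nothing to do
    candidates = [r for r in replicas if r.healthy]
    if not candidates:
        return None  # no healthy replica is available to promote
    # Prefer the replica that has applied the most changes, to minimize data loss.
    return max(candidates, key=lambda r: r.applied_position)


if __name__ == "__main__":
    failed = Node("db1", healthy=False, applied_position=120)
    pool = [Node("db2", True, 118), Node("db3", True, 120), Node("db4", False, 125)]
    print(choose_new_primary(failed, pool))
```

A production tool also has to detect the failure reliably, fence off the old master, and repoint applications, none of which is shown here.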

Automation features also ensure that you can scale performance in the event of high user or query concurrency, optimize performance for dashboarding and reporting, implement data analytics, distribute data and manage metadata, and implement high availability and disaster recovery.

Many fully synchronous replication solutions are out there already (Percona XtraDB Cluster and MySQL Group Replication), but look for more to be done in the semi-synchronous replication area via tools such as Orchestrator, MHA, and others.

Schema Change Automation

Database change management is a critical component to control and optimize your database environment. You need both predictability and visibility regarding change, and any scripted automation that affects the database must be transparent and understandable (otherwise it will affect applications).

Releasing applications that rely on unknown or undocumented schema changes is a recipe for disaster. It leads to mad scrambling to fix responses to untested or unexpected workloads. What is needed is environmental intelligence to understand how changes will impact applications in production or any other environment and simple ways to fix the schema in production.

Many companies are creating methods of automatically altering the schema via scripts. These approaches include oak-online-alter-table, pt-online-schema-change, and now GitHub is pushing gh-ost. These script-based alterations will play a significant role in the MySQL world going forward.
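The general pattern behind online schema change tools is to build a copy of the table with the new structure, copy rows across in small chunks, and then swap the tables by renaming them. The sketch below illustrates that pattern in Python, using SQLite purely as a stand-in so the example is self-contained. The function and table names are hypothetical. It deliberately omits the hardest part, which is capturing writes that arrive during the copy (real tools use triggers or the binary log for this).

```python
import sqlite3


def shadow_table_migrate(conn, table, new_ddl, columns, key="id", chunk_size=2):
    """Copy rows into a restructured shadow table in chunks, then swap names."""
    shadow, retired = f"{table}_new", f"{table}_old"
    conn.execute(new_ddl.format(name=shadow))  # 1. create the new structure
    col_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    last_key = -1  # assumes non-negative integer keys
    while True:  # 2. copy in small chunks to keep each transaction short
        rows = conn.execute(
            f"SELECT {col_list} FROM {table} WHERE {key} > ? ORDER BY {key} LIMIT ?",
            (last_key, chunk_size),
        ).fetchall()
        if not rows:
            break
        conn.executemany(
            f"INSERT INTO {shadow} ({col_list}) VALUES ({placeholders})", rows
        )
        conn.commit()
        last_key = rows[-1][columns.index(key)]
    # 3. swap the tables and drop the old one
    conn.execute(f"ALTER TABLE {table} RENAME TO {retired}")
    conn.execute(f"ALTER TABLE {shadow} RENAME TO {table}")
    conn.execute(f"DROP TABLE {retired}")
    conn.commit()


def build_demo_db():
    """Create an in-memory database with a small, made-up users table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    conn.executemany(
        "INSERT INTO users (id, email) VALUES (?, ?)",
        [(i, f"user{i}@example.com") for i in range(1, 6)],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    db = build_demo_db()
    ddl = "CREATE TABLE {name} (id INTEGER PRIMARY KEY, email TEXT, status TEXT DEFAULT 'active')"
    shadow_table_migrate(db, "users", ddl, ["id", "email"])
    print(db.execute("SELECT * FROM users").fetchall())
```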

Polyglot Architectures

“Polyglot persistence” simply means using multiple data storage technologies working together. This is sometimes the optimal solution for managing data in your environment. Often, these technologies are chosen based upon the way different applications use data for different needs. In short, it means picking the right tool for the right use case.

For example, an ecommerce platform will deal with many types of data, such as shopping cart contents, completed orders, current inventory, and ordered stock. Rather than using one database to store all this data, which might require extra work to convert data to a useful form for specific applications, it can instead store the data in the database best suited for that type of data. Inappropriately forcing one database to handle a workload it wasn’t designed to handle can cause performance degradation.
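The ecommerce example can be sketched as two small stores behind one checkout function. In this Python illustration, short-lived cart contents go to a key-value style store and completed orders go to a relational one. A plain dictionary and an in-memory SQLite database stand in for real products, and every class and function name here is an assumption made for the example.

```python
import sqlite3


class CartStore:
    """Key-value style store for short-lived carts (a dict stands in for a real one)."""

    def __init__(self):
        self._carts = {}

    def add_item(self, user_id, sku, qty):
        cart = self._carts.setdefault(user_id, {})
        cart[sku] = cart.get(sku, 0) + qty

    def pop_cart(self, user_id):
        return self._carts.pop(user_id, {})


class OrderStore:
    """Relational store for completed orders (SQLite stands in for a real RDBMS)."""

    def __init__(self):
        self._db = sqlite3.connect(":memory:")
        self._db.execute(
            "CREATE TABLE order_lines (order_id INTEGER, user_id TEXT, sku TEXT, qty INTEGER)"
        )
        self._next_id = 1

    def record(self, user_id, items):
        order_id = self._next_id
        self._next_id += 1
        with self._db:  # commits on success, rolls back on error
            self._db.executemany(
                "INSERT INTO order_lines VALUES (?, ?, ?, ?)",
                [(order_id, user_id, sku, qty) for sku, qty in items.items()],
            )
        return order_id

    def total_units(self, sku):
        row = self._db.execute(
            "SELECT COALESCE(SUM(qty), 0) FROM order_lines WHERE sku = ?", (sku,)
        ).fetchone()
        return row[0]


def checkout(carts, orders, user_id):
    """Move a user's cart from the key-value store into the relational store."""
    items = carts.pop_cart(user_id)
    return orders.record(user_id, items) if items else None
```

Each store is used for what it does well: fast keyed access for carts, and aggregate queries such as total_units for orders. The price is that the application now has two systems to keep consistent.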

The cost, of course, is environmental complexity. Companies using multiple databases to address multiple applications using a variety of data require stringent compatibility oversight. Otherwise, performance and user experience can be degraded. But the benefits of flexibility, mobility, and adaptability can be worth it. And, having the ability to both quickly scale up and out by employing both NoSQL and relational databases can be advantageous.

Polyglot persistence will become much more common. Organizations can no longer be only a MySQL or MongoDB shop. And as organizations grow, the process of managing support, software fixes, upgrades, and dependencies becomes more complicated and important.

Access Convergence

With polyglot persistence and the use of multiple technologies in a database environment, we will see an increase of database access convergence. This means the ability of databases to serve and use data stored in a foreign format (for example, moving relational data into a NoSQL environment, and vice versa).

This year will bring a convergence of different access patterns. In the MySQL world, we’ll find an increase in NoSQL access patterns via mysqlsh (the MySQL Shell), and the ability to use formats such as GeoJSON. The new MySQL document store feature also looks promising. PostgreSQL will achieve this using JSON functions and foreign data wrappers. The Hadoop world is, of course, focused on getting more SQL on top, via Spark.

Monitoring

Finally, monitoring tools must evolve to meet the changes that are coming. Mainly, database customers will look for monitoring tools that can oversee and provide insight into multiple technologies. With polyglot persistence, the movement to the cloud, access convergence, and the overall increase in database options that can be used in a single environment, a useful monitoring tool will be one that can look at the status and performance of relational, NoSQL, and other databases simply and easily.

Changes Ahead

The open source database landscape is changing to adapt to the evolving needs of businesses and enterprises. The trend is toward more flexibility, with varied technologies all working together to achieve specific goals. With this agility comes a need to easily manage, monitor, and troubleshoot the database environment. These mandates, along with businesses’ requirements to change quickly and seamlessly to address business goals, will shape open source database features and usability moving forward.