Analytics (OLAP) and real-time (OLTP) workloads serve distinctly different purposes. OLAP (online analytical processing) is optimized for data analysis and reporting, while OLTP (online transaction processing) is optimized for real-time, low-latency traffic.
Most databases are designed to primarily benefit from either OLAP or OLTP, but not both. Worse, concurrently running both workloads under the same data store will frequently introduce resource contention. The workloads end up hurting each other, considerably dragging down the overall distributed system’s performance.
Let’s look at how this problem arises, then consider a few ways to address it.
Understanding OLTP vs. OLAP Databases
There are two fundamental approaches to how databases store data on disk. First, we have row-oriented databases, often used for real-time workloads. These store all the data pertaining to a single row contiguously on disk.
Row-oriented storage (ideal for OLTP)
Column-oriented storage (ideal for OLAP)
On the other side of the spectrum, we have column-oriented databases, which are often used for running analytics. These databases store the values of each column together, rather than storing complete rows side by side.
This single design decision makes it much easier and more efficient for the database to run aggregations, perform calculations and answer queries that retrieve insights such as "Top K" metrics.
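To make the distinction concrete, here's a minimal Python sketch contrasting the two layouts (the records are invented for illustration). A point lookup naturally fits the row layout, while an aggregation only needs to touch one column in the columnar layout.

```python
# Hypothetical records, stored two ways.

# Row-oriented layout: each record's fields live together.
rows = [
    {"id": 1, "merchant": "a", "total": 120.0},
    {"id": 2, "merchant": "b", "total": 75.5},
    {"id": 3, "merchant": "a", "total": 42.0},
]

# Column-oriented layout: each column's values live together.
columns = {
    "id": [1, 2, 3],
    "merchant": ["a", "b", "a"],
    "total": [120.0, 75.5, 42.0],
}

# OLTP-style point lookup: the row store returns the whole record at once.
def lookup(row_id):
    return next(r for r in rows if r["id"] == row_id)

# OLAP-style aggregation: the column store only reads the one column it needs.
def sum_totals():
    return sum(columns["total"])

print(lookup(2))     # → {'id': 2, 'merchant': 'b', 'total': 75.5}
print(sum_totals())  # → 237.5
```

In a real columnar engine, reading only the `total` column also means far less data pulled from disk, plus better compression, since values in one column tend to resemble each other.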
The Problem With Concurrent OLAP and OLTP Workloads
So the general consensus is that if you want to run OLTP workloads, you use a row-oriented database — and you use a columnar one for your analytics workloads.
However, contrary to popular belief, there are a variety of reasons why people might actually want to run an OLAP workload on top of their real-time databases. For example, this might be a good option when organizations want to avoid data duplication or the complexity and overhead associated with maintaining two data stores. Or maybe they don’t extract insights all that often.
The Latency Problem
But problems can arise when you try to bring OLAP to your real-time database. We’ve studied this a lot with ScyllaDB, a wide-column database that’s primarily meant for high-throughput and low-latency real-time workloads.
The following graphic from ScyllaDB monitoring demonstrates what happens to latency when you try to run OLAP and OLTP workloads alongside one another.
The green line represents a real-time workload, whereas the yellow one represents an analytics job that’s running at the same time.
While the OLTP workload is running on its own, latencies are great. But as soon as the OLAP workload starts, the real-time latencies dramatically rise to unacceptable levels.
The Throughput Problem
Throughput is also an issue in such scenarios. Looking at the throughput clarifies why latencies climbed: The analytics process is consuming much higher throughput than the OLTP one. You can even see that the real-time throughput drops, which is a sign that the database got overloaded.
Unsurprisingly, as soon as the OLAP job finishes, the real-time throughput increases and the database can then process its backlog of queued requests from that workload.
That’s how the contention plays out in the database when you have two totally different workloads competing for resources in an uncoordinated way. The database is naively trying to process requests as they come in.
When Things Get Contentious
But why does this contention happen in the first place? If you overwhelm your database with too many requests, it cannot keep up. Usually, that’s because your database lacks either the CPU or I/O capacity that’s required to fulfill your requests. As a result, requests queue up and latency climbs.
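A toy model makes the queueing effect visible: once arrivals per tick exceed the service capacity per tick, the backlog grows on every tick, and with it the time each new request waits. The numbers below are invented for illustration.

```python
# Toy queueing sketch: when the arrival rate exceeds service capacity,
# the backlog (and therefore latency) grows without bound.
def backlog_over_time(arrivals_per_tick, capacity_per_tick, ticks):
    backlog, history = 0, []
    for _ in range(ticks):
        # Each tick, new requests arrive and up to `capacity` are served.
        backlog = max(0, backlog + arrivals_per_tick - capacity_per_tick)
        history.append(backlog)
    return history

# Overloaded: 12 requests arrive per tick, but only 10 can be served.
print(backlog_over_time(12, 10, 5))  # → [2, 4, 6, 8, 10]

# Healthy: capacity exceeds arrivals, so no queue ever forms.
print(backlog_over_time(5, 10, 5))   # → [0, 0, 0, 0, 0]
```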
The shape of the workloads contributes to contention, too. OLTP applications typically process many small transactions and are very latency sensitive. OLAP jobs, by contrast, generally run fewer transactions, but each one scans and processes large amounts of data.
So hopefully that explains the problem. But how do we actually solve it?
Option A: Physical Isolation
One option is to physically isolate these resources. For example, in a Cassandra deployment, you would simply add a new data center and separate your real-time processing from your analytics. This saves you from having to stream data and work with a different database. However, it considerably elevates your costs.
Some specific examples of this strategy:
Instaclustr, a managed services provider, shared a benchmark after isolating its deployments (Apache Spark and Apache Cassandra).
GumGum shared the results of this approach (with multiregion Cassandra) at Cassandra Summit 2015.
There are definitely use cases and organizations running OLAP on top of real-time databases. But are there any other alternatives to resolve the problem altogether?
Option B: Scheduled Isolation
Other teams take a different approach: They avoid running OLAP during peak periods, instead running their analytics pipelines during off-peak hours to mitigate the impact on latencies.
For example, consider a food delivery company. Answering a question like, "How much did this merchant sell within the past week?" is simple in OLTP. However, offering discounts to the 10 top-selling restaurants within a given region is much more complicated. In a wide-column database like Cassandra or ScyllaDB, it inevitably requires a full table scan.
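Here's an illustrative Python sketch of why that "Top K" query is expensive: with no server-side aggregation for this access pattern, the client has to scan every order row and aggregate the totals itself. The order data and merchant names are invented for the example.

```python
import heapq
from collections import defaultdict

# Hypothetical order rows: (region, merchant, amount).
orders = [
    ("downtown", "pizza_palace", 30.0),
    ("downtown", "sushi_spot", 60.0),
    ("downtown", "pizza_palace", 25.0),
    ("uptown", "taco_town", 40.0),
]

def top_k_sellers(region, k):
    totals = defaultdict(float)
    for row_region, merchant, amount in orders:  # the full scan
        if row_region == region:
            totals[merchant] += amount
    # Keep only the k largest totals.
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])

print(top_k_sellers("downtown", 2))
# → [('sushi_spot', 60.0), ('pizza_palace', 55.0)]
```

The scan touches every row regardless of region, which is exactly the kind of work that competes with latency-sensitive point queries for I/O and CPU.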
Therefore, it would make sense for such a company to run these analytics from after midnight until around 10 a.m. — before its peak traffic hours.
This is a doable strategy, but it still doesn't solve the problem. For example, what if your dataset doubles or triples? Your pipeline might overrun your time window. And you have to consider that your business is still running at that time (people will still order food at 2 a.m.). If you take this approach, you still need to tune your analytics job and ensure it doesn't overwhelm your database.
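One common safeguard is to make the pipeline itself window-aware. In this hypothetical Python sketch, each batch checks the clock before starting, so the job stops cleanly once the off-peak window closes instead of spilling into peak hours.

```python
from datetime import datetime, time

# Assumed off-peak window: analytics may only run between 00:00 and 10:00.
WINDOW_START = time(0, 0)
WINDOW_END = time(10, 0)

def in_off_peak(now):
    return WINDOW_START <= now.time() < WINDOW_END

def run_batches(batches, clock):
    """Run batches in order, stopping once the window closes."""
    completed = []
    for batch in batches:
        if not in_off_peak(clock()):  # re-check the clock before each batch
            break
        completed.append(batch)
    return completed

# Simulate a pipeline whose clock crosses the window boundary mid-run.
fake_times = iter([datetime(2024, 1, 1, 8), datetime(2024, 1, 1, 9),
                   datetime(2024, 1, 1, 10)])
print(run_batches(["orders", "revenue", "top_k"], lambda: next(fake_times)))
# → ['orders', 'revenue']
```

Note that this only bounds *when* the job runs, not *how hard* it hits the database, so concurrency and rate limits still matter.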
Option C: Workload Prioritization
ScyllaDB has developed an approach called Workload Prioritization to address this problem.
It lets users define separate workloads and assign different resource shares to them. For example, you might define two service levels: The main one has 600 shares, and the secondary one has 200 shares.
```sql
CREATE SERVICE LEVEL main WITH shares = 600;
CREATE SERVICE LEVEL secondary WITH shares = 200;
```
ScyllaDB's internal scheduler will then process three times more tasks from the main workload than from the secondary one. Whenever the system is under contention, it prioritizes resource allocation accordingly.
Why does this kick in only during contention? Because if there’s no contention, it means there is no bottleneck, so there is effectively nothing to prioritize.
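To build intuition for shares-based scheduling, here's a minimal stride-scheduling sketch in Python. This is an illustrative model, not ScyllaDB's actual implementation: each group advances a "pass" value inversely proportional to its shares, the group with the lowest pass runs next, and a 600:200 split naturally yields a 3:1 task ratio under contention.

```python
# Shares per scheduling group (mirroring the service levels above).
SHARES = {"main": 600, "secondary": 200}
STRIDE_SCALE = 600  # any common multiple of the share values

def schedule(n_tasks):
    """Return how many of n_tasks each group gets under contention."""
    passes = {g: 0 for g in SHARES}
    executed = {g: 0 for g in SHARES}
    for _ in range(n_tasks):
        g = min(passes, key=passes.get)         # lowest pass runs next
        executed[g] += 1
        passes[g] += STRIDE_SCALE // SHARES[g]  # bigger shares, smaller stride
    return executed

print(schedule(80))  # → {'main': 60, 'secondary': 20}
```

When only one group has pending work (i.e., no contention), it would simply run every time, which matches the behavior described above: prioritization only matters when there's a bottleneck to arbitrate.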
https://fee-mendes.github.io/workload-prioritization/
Workload Prioritization Under the Hood
Under the hood, ScyllaDB’s Workload Prioritization relies on Seastar scheduling groups.
Seastar is a C++ framework for data-intensive applications. ScyllaDB, Redpanda, Ceph’s SeaStore and other technologies are built on top of it.
Scheduling groups are effectively the way Seastar allows background operations to have little impact on foreground activities.
In database-specific terms, ScyllaDB maintains several distinct scheduling groups internally: one for compactions, one for streaming, one for memtable flushes, and so on. With Cassandra, you might end up in a situation where compactions affect your workload's performance. But in ScyllaDB, all compaction work is scheduled by Seastar, which grants the background activity (compaction, in this case) only its assigned share of resources, ensuring that the latency of the primary user-facing workload doesn't suffer.
Using scheduling groups in this way also helps the database auto-tune. If the user workload is running during off-peak hours, then the system will automatically have more spare computing and I/O cycles to spend. The database will simply speed up its background activities.
Here’s a guided tour of how Workload Prioritization actually plays out:
OLTP and OLAP Can Coexist
Running OLAP alongside OLTP inevitably involves anticipating and managing contention. You can control it in a few ways: Isolate analytics to its own cluster, run it in off-peak windows, or enforce workload prioritization. And workload prioritization isn't just for running OLAP alongside OLTP; the same approach can also assign different priorities to reads vs. writes, for example.
If you’d like to learn more, take a look at my recent tech talk on this topic: “How to Balance Multiple Workloads in a Cluster.”
The post How to Run OLAP and OLTP Together Without Resource Contention appeared first on The New Stack.