Blogs > Introduction to GaussDB for MySQL

Introduction to GaussDB for MySQL

Manyi Lu Aug 23, 2021
Share
GaussDB for MySQL is a Cloud native database based on MySQL, available as a fully managed service on Huawei Cloud, targeting both the internet and enterprise customers. In this article, Manyi Lu explains common customer workloads and how we leverage the unique capabilities of the Huawei cloud Cloud computing stack to handle this workload.
 

GaussDB for MySQL – unleashing the power of cloud stack vertical integration

 

Cloud computing is experiencing increasing popularity, and one key component in the cloud stack is transactional database services. Applications rely on a scalable, high performant managed database services to fully benefit from the cloud platform. And cloud databases need to efficiently utilize underlying cloud infrastructure to realize the potential of cloud scale operations.

 

GaussDB for MySQL is a Cloud native database based on MySQL, available as a fully managed service on Huawei Cloud, targeting both the internet and enterprise customers. In this article, I will explain common customer workloads and how we leverage the unique capabilities of the Huawei Cloud computing stack to handle this workload.

 

Who are the customers on cloud and what are their workloads?

 

The perception, particularly in China, is that only internet startups adopt the cloud platform, and MySQL is popular as a cloud database due to its popularity among internet companies. But in reality, enterprises already started to embrace the cloud concept years ago, and it is also the current trend in China. MySQL, as the most popular open source database in the world, is widely adopted in all industries as well as internet companies.

 

So what is the typical workload of cloud database customers? Two characteristics we observe are 1) increasingly larger data volume -- single to double digit terabytes of data to begin with and it keeps growing over time, and 2) a mix of simple insert/delete/update/point select and complex analytics queries. In addition, there are occasional DDL operations.

 

The challenge is how to make the database performant when the data volume is large. Customers want to run analytics queries while maintaining the throughput of the core transactional workload. Queries are complex due to the nature of business logic in enterprises. Luckily, MySQL 8.0 added long awaited SQL support for windowing functions and recursive CTE. For unstructured data, MySQL's JSON support is already hugely popular.

 

Overview of the GaussDB for MySQL architecture

 

At a high level, GaussDB for MySQL is similar to AWS Aurora. It is built on top of a shared distributed storage system, and the maximum data volume for one database is currently 128 TB. One master node serves the read-write load, and up to fifteen read-only replicas serve the read load. The SQL engine is a heavily modified MySQL server version 8.0, and therefore 100% compatible with MySQL both in terms of syntax and semantics. There is an RDMA network between compute nodes and storage.

 

The storage system used by GaussDB for MySQL service is a highly reliable, cross AZ cloud storage. On public cloud, storage systems may contain a large cluster with hundreds of nodes. Its scaling in terms of storage nodes can be orders of magnitude larger than what you find on single tenant on-prem solutions. SQL nodes flush redo logs to storage, and pages are materialized in the storage layer, and this design significantly reduces network communication for update intensive workloads. Pages belonging to a single database are organized in slices, and slices are distributed over multiple storage nodes.

 

 

GaussDB for MySQL Architecture
GaussDB (for MySQL) Architecture

 

 

Huawei's unique strength - vertical integration

 

Cloud databases, unlike traditional on-prem databases, allow for vertical integration of all layers in the cloud stack. Huawei, as a leading provider of all layers, is in a unique position to become an industrial leader in cloud.  

 

What comes closest to the database in the cloud stack is the storage. On-prem software-only databases need to work with commodity storage over standard filesystem interfaces, which leaves little room for optimization. Oracle Exadata, on the other hand, already demonstrated the power of integration between storage and database in the form of an appliance.

 

On cloud, this integration plays an even bigger role, as cloud storage has much higher scalability in terms of storage nodes than what you find inside Oracle Exadata, and allows customers to dynamically scale the system based on data volume and load. Since the storage is shared between many tenants, and not all tenants will be running large scans all the time, we can achieve a much higher resource utilization rate by offloading parts of query processing to the storage layer.

 

Improving performance through parallelization

 

A generic approach of improving performance is through parallelization, and this can be done on multiple layers. The community version of MySQL 8.0 only supports single threaded query execution, and cannot fully leverage all cores available in the hardware for complex queries. We have modified MySQL execution engine, to allow a single query to be executed using multiple threads in parallel. Cloud infrastructure, unlike on-prem solutions, allows us to utilize vertical scaling on the compute node. The largest VM shape has 64 cores at the moment, and this gives an indication of the maximum parallelization we can achieve through parallel query execution. This optimization is best suited when data mostly fits into the buffer pool. This will be explained in detail in another article.

 

The customer workload does not only contain DMLs, it also contains DDLs such as index creation or changing data type of a column. Although most DDLs are online in MySQL, some operations can be blocking, and this problem is enlarged by the use of logical replication. GaussDB avoids this issue as it uses physical replication. When a table is large, DDL operations can take many hours to complete. In order to support the data volume we commonly see on the cloud, the need for optimizing DDLs is obvious. We have found an innovative approach to deal with DDL, and this innovation will be explored in a later article.

 

Another layer which allows for an even higher degree of parallelization is storage, as the storage system may potentially have hundreds of nodes and thousands of cores! This massive cloud-scale storage system used in GaussDB for MySQL is one key component we leverage in order to improve query performance. Combined with parallel query execution, we can potentially achieve query performance improvements of more than 100x!

 

 

Single Threaded Query Execution at Compute Layer
Single threaded query execution at compute layer

 

 

 

Parallel processing at compute and storage layers
Parallel processing at compute and storage layers

 

 

Leveraging cloud storage for query processing (NDP)

 

Data in GaussDB for MySQL is organized in slices, and distributed over multiple storage nodes. We leverage this data distribution and the available compute resources to do query processing where the data is located, without fetching data into the compute nodes. In database terminology, we call it near data processing (NDP) or query push down. Basically, we push _parts of the query processing _down to the distributed storage system near the data. Operations we push down are data intensive operations such as table scan and index scan. Projection and evaluation of some WHERE conditions as well as aggregations are executed in the storage layer, so only the matching rows and columns needed for a particular query are returned to the server instead of full pages. In addition to parallelization, this approach also reduces network IO, as the data volume fetched into the compute nodes is significantly reduced. In addition, NDP also allows for full utilization of local bandwidth to caches and storage media.

 

Offloading to storage works best when a query needs to scan a large amount of data and the data is not in already in the InnoDB buffer pool. As an example, the chart below shows NDP and parallel query execution improves the execution time of TCP-H Q12 by an impressive 34x! A dedicated article on NDP will go more into technical details and provide a comprehensive performance analysis.

 

 

Speedup Compared to Baseline

 

 

Future direction

 

GaussDB for MySQL is designed as a cloud native database, and this architecture supports an extremely powerful and flexible framework for vertical integration. Compute and storage resources are decoupled and scaled independently, but still tightly integrated in terms of functionality, and database operations can be executed in multiple layers. In the future, database operations can also be offloaded to network cards and other cloud components, and not limited to compute nodes and storage.

 

We believe deep integration of the cloud computing stack is the key to unleash the power of cloud database, and Huawei is in a unique position to achieve just that, as demonstrated here in GaussDB for MySQL.

 

Please stay tuned, there is more news to come!