17 Huawei Papers Selected by ICDE 2024, the Most of Any Vendor
May 21, 2024
ICDE 2024, a top international academic conference, was recently held in Utrecht, the Netherlands. At the event, 17 papers from Huawei Cloud's GaussDB, GeminiDB, and data domains were selected, more than from any other vendor. Nikolaos Ntarmos, director of the Database Lab at Huawei's Edinburgh Research Center, delivered a speech entitled "Huawei Cloud GaussDB, a Better Way to Database", introducing GaussDB's techniques and business achievements to academic institutions and industry representatives from around the world.
The IEEE International Conference on Data Engineering (ICDE), together with SIGMOD and VLDB, is one of the three top international academic conferences in the database field, and it carries significant academic influence around the world.
ICDE gathers cutting-edge database research from major research institutes and technology enterprises. ICDE 2024, the 40th IEEE International Conference on Data Engineering, accepted 17 papers from Huawei. All of these achievements come from the joint efforts of Huawei's research teams and their partners' teams and organizations. All of the papers will be examined in detail at a later date, but here are some highlights.
GaussML: An End-to-End In-database Machine Learning System
In-database machine learning (in-DB ML) is appealing to database users with security and privacy concerns, as data is never copied out of the database to a separate machine learning system.
One common way to implement in-DB ML is the ML-as-UDF approach, which uses User-Defined Functions (UDFs) within SQL to implement ML training and prediction. However, UDFs may introduce security risks through vulnerable code, and they are prone to performance problems because they are constrained by the data access and execution patterns of SQL query operators.
To address these limitations, we propose GaussML, a new in-database machine learning system that provides end-to-end machine learning capabilities with a native SQL interface.
To support ML training and inference within SQL queries, GaussML integrates typical ML operators directly into the query engine without UDFs. GaussML also introduces an ML-aware cardinality and cost estimator to optimize the SQL+ML query plan.
Moreover, GaussML leverages Single Instruction Multiple Data (SIMD) and data prefetching techniques to accelerate the ML operators for training.
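GaussML's engine internals are not public, but the benefit of vectorizing an ML training operator, rather than looping row by row as a UDF typically does, can be illustrated with a minimal NumPy sketch. All names below are illustrative assumptions, not GaussML's actual API; batch-at-a-time array operations are what let the runtime exploit SIMD.

```python
import numpy as np

def logreg_grad_batch(X, y, w):
    """One vectorized gradient step over a whole tuple batch.

    Array operations over the batch map naturally onto SIMD
    instructions, unlike a row-at-a-time UDF loop.
    """
    z = X @ w                       # scores for all rows at once
    p = 1.0 / (1.0 + np.exp(-z))    # elementwise sigmoid
    return X.T @ (p - y) / len(y)   # averaged gradient

def train(batches, n_features, lr=0.1):
    # Hypothetical loop over batches streamed from a table scan.
    w = np.zeros(n_features)
    for X, y in batches:            # each batch: (features, labels)
        w -= lr * logreg_grad_batch(X, y, w)
    return w
```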
We have implemented a series of algorithms inside GaussML in the openGauss database. In extensive experiments, GaussML runs 2-6x faster than state-of-the-art in-DB ML systems such as Apache MADlib.
GaussDB-Global: A Geographically Distributed Database System
Geographically distributed database systems use remote replication to protect against regional failures. These systems are sensitive to severe latency penalties caused by centralized transaction management, remote access to sharded data, and log shipping over long distances.
To tackle these issues, we present GaussDB-Global, a sharded, geographically distributed database system with asynchronous replication, designed for OLTP applications.
To remove the transaction management bottleneck, we take a decentralized approach using synchronized clocks. Our system can seamlessly transition between centralized and decentralized transaction management, providing efficient fault tolerance and streamlined deployment.
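The paper's exact protocol is not reproduced here, but the general pattern behind decentralized timestamping with synchronized clocks can be sketched as follows: each node stamps commits from its local clock and then waits out the clock-uncertainty bound, so that timestamp order matches real-time order without a central sequencer. The uncertainty value and function below are assumptions for illustration only.

```python
import time

CLOCK_UNCERTAINTY_MS = 5  # assumed bound on inter-node clock skew

def commit_timestamp():
    """Decentralized commit timestamping (illustrative pattern only).

    Stamp the transaction from the local synchronized clock, then
    wait out the uncertainty window so any transaction that starts
    later, on any node, is guaranteed a larger timestamp.
    """
    ts = time.time_ns() // 1_000_000          # local clock, in ms
    time.sleep(CLOCK_UNCERTAINTY_MS / 1000)   # commit-wait
    return ts
```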
To alleviate the remote read and log shipping issues, we support reads on asynchronous replicas with strong consistency, tunable freshness guarantees, and dynamic load balancing.
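Freshness-bounded reads on asynchronous replicas generally reduce to comparing the replica's replay progress against the staleness the client will tolerate. The sketch below shows that check under assumed names; the paper's actual mechanism may differ.

```python
def read_with_freshness(replica, key, max_staleness_ms, primary_ts_ms):
    """Serve a read from an async replica only if it is fresh enough.

    `replica.applied_ts_ms` (an assumed attribute) is the commit
    timestamp of the last replicated transaction the replica applied.
    """
    lag = primary_ts_ms - replica.applied_ts_ms
    if lag <= max_staleness_ms:
        return replica.get(key)   # fresh enough: serve locally
    # Otherwise, a load balancer could retry on a fresher replica
    # or route the read to the primary.
    raise RuntimeError(f"replica lags {lag} ms > {max_staleness_ms} ms bound")
```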
Our experimental results on a geographically distributed cluster show that our approach provides up to 14x higher read throughput and 50% more TPC-C throughput than our baseline.
QCFE: An Efficient Feature Engineering for Query Cost Estimation
Query cost estimation is a classic task in database management. Recently, researchers have applied AI-driven models to deliver more accurate cost estimates. However, two defects in these designs lead to poor accuracy-time efficiency.
First, existing works only encode the query plan and data statistics, ignoring other important variables such as the storage structure, hardware, and database knobs, which also have a significant impact on query cost. Second, due to their straightforward encoding designs, existing works suffer a heavy representation-learning burden on ineffective dimensions of the input.
To address these two problems, we propose QCFE, an efficient feature engineering method for query cost estimation. Specifically, we design a novel feature, called the feature snapshot, to efficiently integrate the influence of the ignored variables. Furthermore, we propose a difference-propagation feature reduction method for query cost estimation to filter out useless features. The experimental results demonstrate that QCFE largely improves the time-accuracy efficiency on extensive benchmarks.
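The paper's exact construction of the feature snapshot is not detailed here, but the underlying idea, folding environment effects such as knobs, hardware, and storage into one compact feature, can be sketched: run a small fixed probe workload under the current configuration and use its measured costs as extra model inputs alongside the plan encoding. The probe queries and names below are illustrative assumptions.

```python
import time

PROBE_QUERIES = [  # tiny fixed probe workload (illustrative)
    "SELECT count(*) FROM lineitem",
    "SELECT * FROM orders ORDER BY o_totalprice LIMIT 100",
]

def feature_snapshot(conn):
    """Measure probe-query latencies under the current environment.

    The resulting vector summarizes knob/hardware/storage effects
    that a plain plan encoding ignores; one snapshot can be reused
    for many estimates under the same configuration.
    """
    snapshot = []
    for sql in PROBE_QUERIES:
        start = time.perf_counter()
        conn.execute(sql)
        snapshot.append(time.perf_counter() - start)
    return snapshot

def encode_query(plan_features, snapshot):
    # Final model input: plan encoding plus environment snapshot.
    return list(plan_features) + list(snapshot)
```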
TRAP: Tailored Robustness Assessment for Index Advisors via Adversarial Perturbation
Many index advisors have recently been proposed to build indexes automatically to improve query performance. However, they mainly consider performance improvement in static scenarios. Their robustness, i.e., their ability to maintain stable performance in dynamic scenarios (e.g., with minor workload changes), has not been well investigated.
This paper addresses the challenges of assessing the robustness of index advisors in the following ways.
First, we introduce perturbation-based workloads for robustness assessment and identify three typical perturbation constraints that occur in real scenarios.
Second, with the perturbation constraints, we formulate the generation of perturbed queries as a sequence-to-sequence problem and propose Tailored Robustness assessment via Adversarial Perturbation (TRAP) to pinpoint the performance loopholes of index advisors (see the sketch after this list).
Third, to generalize to various index advisors, we place TRAP in a black-box setting (i.e., with little knowledge of the index advisors' internal design) and propose a two-phase training paradigm to efficiently train TRAP without elaborately annotated data.
Fourth, we conduct comprehensive robustness assessments on standard benchmarks and real workloads for ten existing index advisors. Our findings reveal that these index advisors are vulnerable to the workloads generated by TRAP.
Finally, the assessment sheds light on how the robustness of different index advisors can be enhanced. For example, learning-based index advisors can benefit from fine-grained state representations and a candidate pruning strategy.
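TRAP itself is a trained sequence-to-sequence model, but the black-box evaluation loop it plugs into is simple to outline: apply a constrained perturbation to each workload query, keep the advisor's recommended indexes fixed, and measure how much performance degrades. The interfaces below are placeholders standing in for TRAP's actual components.

```python
def assess_robustness(advisor, workload, perturb, benchmark):
    """Black-box robustness check for an index advisor (illustrative).

    `perturb(query)` stands in for TRAP's seq2seq generator: it returns
    a slightly changed query that respects the perturbation constraints.
    `benchmark(workload, indexes)` returns the total workload cost.
    """
    indexes = advisor.recommend(workload)
    base_cost = benchmark(workload, indexes)

    shifted = [perturb(q) for q in workload]   # minor workload change
    shifted_cost = benchmark(shifted, indexes)

    # A large ratio exposes a performance loophole: the advisor's
    # indexes degrade under a small, realistic workload shift.
    return shifted_cost / base_cost
```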
Temporal-Frequency Masked Autoencoders for Time Series Anomaly Detection
In today's era of observability, massive amounts of time series data must be collected to monitor the status of a target system, and anomaly detection serves to identify observations that differ significantly from the remaining ones. The ability to extract value from such data is of the utmost importance. While existing reconstruction-based methods have demonstrated favorable detection capabilities in the absence of labeled data, they still tend to suffer from training bias on abnormal time points and from distribution shifts within time series.
To address these issues, we propose a simple yet effective Temporal-Frequency Masked AutoEncoder (TFMAE) to detect anomalies in time series data through a contrastive criterion. Specifically, TFMAE uses two Transformer-based autoencoders that incorporate, respectively, a window-based temporal masking strategy and an amplitude-based frequency masking strategy to learn knowledge without abnormal bias and to reconstruct anomalies based on the extracted normal information.
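The two masking strategies translate directly into array operations. The sketch below gives plausible NumPy versions of window-based temporal masking and amplitude-based frequency masking; the mask ratios and selection rules are assumptions, not the paper's exact settings.

```python
import numpy as np

def temporal_mask(x, window=16, ratio=0.25, rng=np.random):
    """Zero out randomly chosen contiguous windows of the series."""
    x = x.copy()
    n_windows = len(x) // window
    masked = rng.choice(n_windows, int(n_windows * ratio), replace=False)
    for w in masked:
        x[w * window:(w + 1) * window] = 0.0
    return x

def frequency_mask(x, ratio=0.1):
    """Suppress the highest-amplitude frequency components.

    Large spectral spikes tend to carry abrupt, anomaly-like content,
    so masking them steers reconstruction toward normal patterns.
    """
    spec = np.fft.rfft(x)
    k = max(1, int(len(spec) * ratio))
    top = np.argsort(np.abs(spec))[-k:]   # largest-amplitude bins
    spec[top] = 0.0
    return np.fft.irfft(spec, n=len(x))
```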
Moreover, the dual autoencoders are trained using a contrastive objective function that minimizes the discrepancy between the representations from the temporal and frequency masked autoencoders to highlight anomalies, as this helps alleviate the negative impact of distribution shifts.
Finally, to prevent overfitting, TFMAE uses adversarial training during the training phase. Extensive experiments conducted on seven datasets provide evidence that our model surpasses the state of the art in anomaly detection accuracy.
The Huawei database papers selected for ICDE 2024 cover a wide range of technologies, including AI4DB, time series databases, query optimization, and machine learning model training and inference for databases. Huawei has devoted many years to cutting-edge database technologies and has worked with world-leading academic institutions to resolve major database challenges. Through collaboration across industry, academia, research, and applications, Huawei continuously integrates innovative research outputs into its product technologies, putting great effort into building a robust ecosystem and providing innovative, competitive products and services for customers.
Huawei is dedicated to innovation and exploration in the database field and to demonstrating its influence on industry development.