Master's Thesis Talk: Replicated Training in Self-Driving Database Management Systems
Self-driving database management systems (DBMSs) are a new family of DBMSs that optimize themselves for better performance without human intervention. Self-driving DBMSs use machine learning (ML) models that predict system behavior and make planning decisions based on the workload the system observes. These models are trained on metrics produced by the components running inside the system. Self-driving DBMSs are a challenging environment for such models: they require a significant amount of training data, and that data must be representative of the specific database on which the system runs. To obtain such data, a self-driving DBMS must generate the training data itself in an online setting. This data generation, however, imposes a performance overhead during query execution.
To address this performance overhead, we propose a novel technique named Replicated Training that leverages the existing distributed master-replica architecture of a self-driving DBMS to generate training data for its models. Instead of generating training data solely on the master node, Replicated Training load-balances this resource-intensive task across the distributed replica nodes. Under Replicated Training, each replica dynamically throttles training data collection when it needs more resources to keep up with the master node. To demonstrate the effectiveness of our technique, we implement it in NoisePage, a self-driving DBMS, and evaluate it in a distributed environment. Our experiments show that training data collection in a DBMS incurs a noticeable 11% performance overhead on the master node, and that Replicated Training eliminates this overhead on the master node while still ensuring that replicas keep up with the master with low delay. Finally, we show that Replicated Training produces ML models with accuracies comparable to those trained solely on the master node.
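The per-replica control loop described above can be sketched as follows. This is a minimal, hypothetical illustration of the idea, not NoisePage's actual implementation: the class name, the lag thresholds, and the hysteresis policy are all assumptions chosen for clarity.

```python
class ReplicaCollector:
    """Toggles training-data collection on a replica based on its
    replication lag behind the master (illustrative sketch only)."""

    def __init__(self, max_lag_ms=500, resume_lag_ms=100):
        # Pause collection when lag exceeds max_lag_ms; resume only once
        # lag falls back below resume_lag_ms. The gap between the two
        # thresholds (hysteresis) avoids rapid on/off flapping.
        self.max_lag_ms = max_lag_ms
        self.resume_lag_ms = resume_lag_ms
        self.collecting = True
        self.samples = []

    def on_replication_batch(self, lag_ms, metrics):
        """Called as the replica applies a batch of replicated changes.
        Returns whether collection is currently enabled."""
        if self.collecting and lag_ms > self.max_lag_ms:
            # Falling behind: shed the collection overhead to catch up.
            self.collecting = False
        elif not self.collecting and lag_ms < self.resume_lag_ms:
            # Caught up again: resume gathering training data.
            self.collecting = True
        if self.collecting:
            self.samples.append(metrics)
        return self.collecting
```

For example, a replica that observes lag of 50 ms collects metrics, stops collecting when lag spikes to 800 ms, and resumes only after lag drops back under 100 ms. Because the master never runs this collector under Replicated Training, its query execution path carries none of the collection overhead.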
Thesis committee members – Andy Pavlo, David Andersen