DRL for DB optimizations

Overview

I participated in the TiDB Hackathon. It was a good experience developing a working tool in a 24-hour period. Although we participated remotely due to the pandemic, we were still able to interact and present over Zoom and other online services. The idea made it to the final round, but we came up short of winning the whole competition. I'll walk through the idea and its implementation. This post mostly covers it from a theoretical standpoint, with pseudocode to help explain how the whole thing works. You can see the actual code on GitHub; it has comments, but due to the time constraints of the competition it isn't documented very well.

Theme

The theme of the hackathon was [∞], by which participants were expected to innovate, challenge, and unlock the infinite power of TiDB. This was a pretty open-ended theme: basically you just had to work within the TiDB ecosystem, which includes TiDB, TiKV, and all their related tooling such as the Placement Driver and Chaos Mesh.

Idea

Develop a machine learning model using Deep Reinforcement Learning (DRL) to automatically tune TiDB to decrease latency and increase throughput.

Background

Database tuning is a hard problem, and even experienced DBAs can have trouble getting optimal performance. The difficulty increases with a relatively new database like TiDB, since DBAs don't have decades of experience to build on as they do with more traditional SQL databases.

Large companies like Huawei, Oracle, Google, Microsoft, etc. have implemented AI systems to help automatically tune database deployments. TiDB has also done some experimenting in this area with AutoTiKV, which is based on OtterTune and uses ML techniques like Gaussian Processes and clustering to learn tuning settings.

The large companies' techniques are based on DRL, which has demonstrated better performance compared with other machine learning techniques or custom tuning done by DBAs. This idea aims to bring those same cutting-edge performance benefits to TiDB's auto-tuning offerings.

Rationale

There are a number of studies on automatic knob tuning, ranging from heuristics to machine learning to, more recently, deep learning, for example BestConfig, OtterTune, and CDBTune.

Heuristic-based approaches like BestConfig search for the optimal configuration in historical data, but come with a big disadvantage: they may not find good knob values if there are no similar configurations in that history. These perform better than default settings but often can't compete with experienced DBAs.

OtterTune (which AutoTiKV is based on) uses machine learning techniques to analyze configurations. This relies on a large number of high-quality training examples drawn from DBAs' experience, which can be hard to obtain.

CDBTune uses deep reinforcement learning (DRL) with an actor-critic network, similar to this proposal, but only takes metrics into account and ignores query information, which can have a large effect on the tuning environment.

The drawback is that each one takes more development effort than the previous. Heuristic-based approaches are relatively straightforward to implement -> OtterTune adds more complexity and steps into more modern machine learning techniques -> CDBTune takes it a step further and is based on leading-edge machine learning research -> QTune (what this proposal is based on) goes further still and adds database-specific domain knowledge on top of cutting-edge DRL techniques.

Implementation

Use Deep Reinforcement Learning based on QTune’s query-aware approach to develop a system that can automatically tune TiDB’s configuration settings.

First, we featurize the SQL queries to gather information including query type, tables, and query cost. Then we feed the query features into the DRL model to dynamically choose suitable configurations. For the DRL model we implement a Double-State Deep Deterministic Policy Gradient (DS-DDPG) using actor-critic networks. The DS-DDPG model can automatically solve the tuning problem by learning the actor-critic policy according to both the database state and query information.

Vectorize the Query

Vectorize the SQL request information so our models can be trained with it. A minimal sketch follows the list below.

  1. Vectorize query information
    Based on the SQL query, capture the operation (insert/select/update/delete) and the table(s) to be queried, and store them in a (4+t)-dimensional vector, where t is the number of tables. Each value is 0 if not used or 1 if in use.
  2. Vectorize cost information
    To get query costs we can use the query plan generated by the db engine, sum the cost estimates of each operation, then normalize the cost by subtracting the mean and dividing by the standard deviation. We maintain a p-dimensional vector, where p is the number of plan operations available.
  3. Create overall query vector
    To do this we simply have to concatenate the query info vector and the cost vector.
  4. Support multiple query vectors
    To support a workload, we take the union of its query vectors. For each table slot that contains a 1 (the table is used), we replace the 1 with the table's row count. For the cost vector, we sum up all the costs.
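
To make the steps above concrete, here is a rough Python sketch of the vectorization. It assumes a hypothetical schema with three tables and a small fixed set of plan operators; the names (vectorize_query, merge_query_vectors, TABLES, PLAN_OPS) are illustrative and not from the actual repo, and cost_mean/cost_std are assumed to be precomputed over the training workload.

import numpy as np

# Hypothetical example values; the real system derives these from the schema
# and from the operators that appear in TiDB's query plans.
TABLES = ["orders", "customers", "items"]              # t = 3 tables
OPERATIONS = ["insert", "select", "update", "delete"]  # the 4 query types
PLAN_OPS = ["TableScan", "IndexScan", "HashJoin", "Sort", "Agg"]  # p plan operators

def vectorize_query(op, tables_used, plan_costs, cost_mean, cost_std):
    # 1. Query information: one slot per operation type plus one per table.
    info = np.zeros(len(OPERATIONS) + len(TABLES))
    info[OPERATIONS.index(op)] = 1.0
    for tbl in tables_used:
        info[len(OPERATIONS) + TABLES.index(tbl)] = 1.0

    # 2. Cost information: sum the estimated cost of each plan operator, then
    #    normalize by subtracting the mean and dividing by the std deviation.
    cost = np.zeros(len(PLAN_OPS))
    for op_name, estimate in plan_costs.items():
        cost[PLAN_OPS.index(op_name)] += estimate
    cost = (cost - cost_mean) / cost_std

    # 3. Overall query vector: concatenation of the two parts.
    return np.concatenate([info, cost])

def merge_query_vectors(vectors, table_rows):
    # Union several query vectors into a single workload vector. Used-table
    # slots are replaced by the table's row count (as described above) and the
    # cost parts are summed; table_rows maps table name -> row count.
    vectors = np.stack(vectors)
    n_info = len(OPERATIONS) + len(TABLES)
    merged = np.zeros(vectors.shape[1])
    merged[:n_info] = vectors[:, :n_info].max(axis=0)   # union of operations/tables
    for i, tbl in enumerate(TABLES):                    # 1 -> row count for used tables
        if merged[len(OPERATIONS) + i] > 0:
            merged[len(OPERATIONS) + i] = table_rows[tbl]
    merged[n_info:] = vectors[:, n_info:].sum(axis=0)   # sum the costs
    return merged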

Setup Environment

Get metrics and stats from the database to feed into our models.

  1. Collect inner-state metrics. These are the configuration settings which can be tuned, like cache size, memory, etc. They can be obtained from TiDB's internal metric collection features.
  2. Collect outer-state metrics. These are the performance indicators like number of transactions, deadlocks, etc. They can be obtained from a benchmarking framework such as sysbench.
  3. Execute a workload and compute a reward based on performance (a sketch of this loop follows below).
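
Here is a hedged Python sketch of this environment loop, assuming pymysql for the TiDB connection and sysbench's standard text report. The connection settings, knob list, sysbench flags, and reward shaping are assumptions for illustration, not the exact code from our repo.

import re
import subprocess

import pymysql

# TiDB speaks the MySQL protocol; 4000 is its default port.
conn = pymysql.connect(host="127.0.0.1", port=4000, user="root")

def get_inner_state(conn, knobs):
    # Read the current values of the tunable system variables (inner state).
    state = {}
    with conn.cursor() as cur:
        for knob in knobs:
            cur.execute("SHOW VARIABLES LIKE %s", (knob,))
            row = cur.fetchone()
            if row:
                state[row[0]] = row[1]
    return state

def run_workload_and_get_outer_state():
    # Run sysbench and parse throughput (tps) and average latency (ms) from
    # its report; the flags and regexes here are illustrative.
    out = subprocess.run(
        ["sysbench", "oltp_read_write", "--mysql-host=127.0.0.1",
         "--mysql-port=4000", "--mysql-user=root", "--time=60", "run"],
        capture_output=True, text=True).stdout
    tps = float(re.search(r"transactions:\s+\d+\s+\(([\d.]+) per sec", out).group(1))
    latency = float(re.search(r"avg:\s+([\d.]+)", out).group(1))
    return {"tps": tps, "latency": latency}

def reward(curr, prev, initial):
    # Reward higher throughput and lower latency, measured against the
    # previous step and normalized by the initial (default-config) baseline.
    d_tps = (curr["tps"] - prev["tps"]) / initial["tps"]
    d_lat = (prev["latency"] - curr["latency"]) / initial["latency"]
    return d_tps + d_lat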

Train the Predictor

The Predictor aims to predict how the performance metrics change when processing a query in the database.

  1. Get training data
    • A set of tuples T={〈v,S,I,∆S〉}
    • v is a vector of a query
    • S is the outer metrics
    • I is the inner state
    • ∆S is the change in the outer metrics when processing v. For each〈v, S, I〉, we train the Predictor to output a value close to ∆S.
    • Training data is collected for each query q by:
      1. Using Query2Vector to generate v and obtaining the metrics S, I from the db.
      2. Running q in the database and recording the metrics change ∆S.
  2. Train the Predictor, which is composed of four fully connected layers:
    • Input layer: accepts the feature vector and outputs the mapped tensor to the hidden layers
    • Two hidden layers: a series of non-linear data transformations.
    • Output layer: restricts the tensor to the scale of the database state and generates a vector representing the predicted db metric changes.
    • Use ReLU for the hidden layers; the weights in the network are initialized from the standard normal distribution.
    • Use the Adam algorithm to train the Predictor

Pseudocode:

TrainPredictor(W, TP)
// W: the weights of the neural network
// TP: the training set
while (!converged) do
    E = 0
    foreach (v, S, I, ∆S) in TP do
        Generate the output G for〈v, S, I〉;
        Accumulate the backpropagation error: E = E + 0.5 * |G − ∆S|^2;
    Compute gradient(E) and update the weights W
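
As a more concrete companion to the pseudocode, here is a minimal TensorFlow/Keras sketch of the Predictor described above. The hidden-layer width, learning rate, and training hyperparameters are assumptions, and the "random_normal" initializer only approximates the standard-normal initialization; none of this is copied from the actual repo.

import tensorflow as tf

def build_predictor(feature_dim, state_dim, hidden=64):
    # Input layer -> two ReLU hidden layers -> output layer predicting the
    # change in the outer metrics (∆S).
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(feature_dim,)),
        tf.keras.layers.Dense(hidden, activation="relu",
                              kernel_initializer="random_normal"),
        tf.keras.layers.Dense(hidden, activation="relu",
                              kernel_initializer="random_normal"),
        tf.keras.layers.Dense(state_dim),
    ])
    # 0.5 * |G − ∆S|^2 from the pseudocode is, up to a constant factor, the
    # mean squared error, so we train with MSE and the Adam optimizer.
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
    return model

# Usage: the features X are the concatenation of <v, S, I>, the targets are ∆S.
# predictor = build_predictor(feature_dim=X.shape[1], state_dim=dS.shape[1])
# predictor.fit(X, dS, epochs=100, batch_size=32)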

Training the Actor-Critic Module

The Actor-Critic module aims to tune the database configurations.

  1. Get training data
    1. Given a query workload, randomly select a subset of queries to generate a sample workload.
    2. Generate its feature vector via Query2Vector.
    3. Predict a database metric state S via the Predictor.
    4. Get an action A via the Actor.
    5. Deploy the action's settings in the database.
    6. Run the database to get a reward R.
    7. Get a new database metric state S2 by updating S with the new metrics.
    8. Repeat the above steps to get A2 and R2.
    9. Iterating this way, we get a set of triples TA1 =〈(S′1, A1, R1), (S′2, A2, R2), …, (S′t, At, Rt)〉until the average reward value is good enough (e.g., the average reward of ten runs is larger than 10).
  2. Train the Actor-Critic model
    1. Use the experience replay method to train the actor and critic with the training data set TA1 =〈(S′1, A1, R1), (S′2, A2, R2), …, (S′t, At, Rt)〉from above.
    2. Use each (S′i, Ai, Ri) to update the Actor and Critic as follows.
      (1) Update the actor policy πA using the gradient ∇θπA ≈ ∇A Q(S′i, Ai | πC) · ∇θπA πA(S′i | θπA), where θπA denotes the parameters of the Actor and Q(S′i, Ai | πC) is the Q-value computed by the Critic.
      (2) Estimate the real action-value Yi using the Bellman equation, Yi = Ri + τ · Q(S′i+1, πA(S′i+1 | θπA) | πC), where τ is a tuning factor that trades off the Q-value against the reward value.
      (3) Calculate the loss from Q and Yi; the Critic updates the weights in C by minimizing the loss L = (Q(S′i, Ai | πC) − Yi)^2.
    3. We run these three steps from i = t−1 down to i = 1. The algorithm terminates once the model has converged (e.g., the performance improvement is smaller than a threshold); otherwise we select the next training data TA2, TA3, ···.
    4. To improve the stability of training, we can introduce two extra target actor and critic networks (whose policies mirror A and C respectively). These target networks are updated at every step, but their weights (the parameters of the policy) are updated more slowly than the normal networks. The weights in the normal critic network are then updated by minimizing the loss against the target.

Pseudocode:

TrainAgent(A, C, TA)
// A: the actor's policy
// C: the critic's policy
// TA: the training data
while (!converged) do
    Get a training data set〈(S′1, A1, R1), …, (S′t, At, Rt)〉
    for (i = t−1 down to 1) do
        Update the weights in A using the policy gradient;
        Estimate the action-value Yi with the Bellman equation;
        Update the weights in C by minimizing the loss (Q(S′i, Ai | πC) − Yi)^2
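
The update steps (1)-(3) above map directly onto a standard DDPG-style update. Below is a hedged TensorFlow sketch of one such update on a replay batch, including the target-network trick from the last step; the network sizes, the sigmoid action range, and the Polyak rate (rho is our name, not from the paper) are assumptions rather than the hackathon code.

import tensorflow as tf

def build_actor(state_dim, action_dim, hidden=64):
    # Maps a database metric state to an action (knob values scaled to [0, 1]).
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(state_dim,)),
        tf.keras.layers.Dense(hidden, activation="relu"),
        tf.keras.layers.Dense(hidden, activation="relu"),
        tf.keras.layers.Dense(action_dim, activation="sigmoid"),
    ])

def build_critic(state_dim, action_dim, hidden=64):
    # Maps a (state, action) pair to a scalar Q-value.
    s = tf.keras.layers.Input(shape=(state_dim,))
    a = tf.keras.layers.Input(shape=(action_dim,))
    x = tf.keras.layers.Concatenate()([s, a])
    x = tf.keras.layers.Dense(hidden, activation="relu")(x)
    x = tf.keras.layers.Dense(hidden, activation="relu")(x)
    return tf.keras.Model([s, a], tf.keras.layers.Dense(1)(x))

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, tau=0.99, rho=0.005):
    # One update on a replay batch (S, A, R, S_next).
    S, A, R, S_next = batch
    R = tf.reshape(tf.cast(R, tf.float32), (-1, 1))

    # (2)+(3) Critic: estimate the real action-value with the Bellman target
    # Y = R + tau * Q(S_next, actor(S_next)) and minimize (Q(S, A) - Y)^2.
    Y = R + tau * target_critic([S_next, target_actor(S_next)])
    with tf.GradientTape() as tape:
        critic_loss = tf.reduce_mean(tf.square(critic([S, A]) - Y))
    grads = tape.gradient(critic_loss, critic.trainable_variables)
    critic_opt.apply_gradients(zip(grads, critic.trainable_variables))

    # (1) Actor: follow the policy gradient through the critic, i.e. adjust
    # the actor's weights to maximize Q(S, actor(S)).
    with tf.GradientTape() as tape:
        actor_loss = -tf.reduce_mean(critic([S, actor(S)]))
    grads = tape.gradient(actor_loss, actor.trainable_variables)
    actor_opt.apply_gradients(zip(grads, actor.trainable_variables))

    # Target networks: slow (Polyak) updates for training stability.
    for target, online in ((target_actor, actor), (target_critic, critic)):
        for tv, ov in zip(target.weights, online.weights):
            tv.assign((1.0 - rho) * tv + rho * ov)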

Overall

Given a request (a query or a query workload), Query2Vector generates a vector based on the SQL. The Predictor takes the feature vector and generates a predicted state change ∆S. We apply the state change to the current database state, which gives the metrics S. The Actor takes those metrics as input and outputs an action (a vector of suggested knob values). We deploy the new configuration in the db and execute the query.

This allows tuning to optimize either latency (query-level) or throughput (workload-level). Although we optimize for one at the cost of the other, it has been shown that the metric not being optimized for still outperforms other ML models like CDBTune and OtterTune. There is a method to optimize for both by adding clustering to the queries. This adds a couple of extra steps: we take the query vector, pass it through another model that recognizes patterns, then through clustering algorithms, and train the model based on the clusters. This would be useful for things like analytical queries. This extra cluster-level model can be added as future work.

TrainDS-DDPG()
    Generate training data TP;
    TrainPredictor(W, TP);
    Generate training data TA;
    TrainAgent(A, C, TA);
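
Tying the pieces together, here is a rough sketch of how a single tuning request could flow through the trained system. It reuses the illustrative helpers from the earlier sketches (vectorize_query, merge_query_vectors, get_inner_state) and is an outline of the overall flow, not the hackathon code itself.

import numpy as np

def tune_for_workload(conn, queries, predictor, actor, knob_names, table_rows,
                      outer_metrics):
    # 1. Query2Vector: one vector per query, merged into a workload vector.
    v = merge_query_vectors([vectorize_query(*q) for q in queries], table_rows)

    # 2. Predictor: estimate the outer-metric change ∆S for this workload and
    #    apply it to the current metrics S. The feature vector is <v, S, I>;
    #    here we assume the inner state I is already numeric.
    S = np.array([outer_metrics["tps"], outer_metrics["latency"]])
    I = np.array([float(x) for x in get_inner_state(conn, knob_names).values()])
    delta_S = predictor.predict(np.concatenate([v, S, I])[None, :])[0]

    # 3. Actor: map the predicted metrics to knob values and deploy them with
    #    SET GLOBAL before executing the workload. Real code must also map the
    #    actor's [0, 1] outputs back to each knob's actual value range.
    action = actor((S + delta_S)[None, :]).numpy()[0]
    with conn.cursor() as cur:
        for knob, value in zip(knob_names, action):
            cur.execute(f"SET GLOBAL {knob} = {value}")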

Results

We were able to construct a working TensorFlow model! We set up the environment, and currently it is configured to pull data from TiDB's internal metrics tables and the results of the sysbench test. It uses this data to update the Actor-Critic model described above, which can output a set of values for TiDB system variables.

Future plans

This task was quite big to fit into a single hackathon, so we weren't able to get everything done that we wanted.

  1. Currently it only supports sysbench for training, and the commands are hardcoded. While this is a start, we should support other benchmarking frameworks and custom workloads.
  2. Also, we didn't have enough time to fully train a model and compare benchmarks against default and manually tuned settings. While we are able to generate the model, it still takes time to train it well enough to get good performance. This depends on the resources available and could be shortened with more processing power or GPU support.
  3. Lastly, we currently only change 15 TiDB system variables. We chose these because they can be changed dynamically and don't require any restarts. While this is a good start, TiDB makes use of TiKV, which in turn uses RocksDB for its storage layer. If we included a method to tune those settings too, we could see even further improvements over what this model can accomplish.

Final Thoughts

It was a fun competition and I would definitely participate again next year. I think next time I would lean towards a different idea. While I'm interested in AI applications and do quite a bit of work with them, I think a 24-hour time frame was a little too ambitious. While getting the model working is an accomplishment, we ran out of time to actually train it and put together compelling benchmarks.