# 摘要

Relational databases store a significant amount of the worlds data. However, accessing this data currently requires users to understand a query language such as SQL. We propose Seq2SQL, a deep neural network for translating natural language questions to corresponding SQL queries. Our model uses rewards from in the-loop query execution over the database to learn a policy to generate the query, which contains unordered parts that are less suitable for optimization via cross entropy loss. Moreover, Seq2SQL leverages the structure of SQL to prune the space of generated queries and significantly simplify the generation problem. In addition to the model, we release WikiSQL, a dataset of 80654 hand-annotated examples of questions and SQL queries distributed across 24241 tables from Wikipedia that is an order of magnitude larger than comparable datasets. By applying policybased reinforcement learning with a query execution environment to WikiSQL, Seq2SQL outperforms a state-of-the-art semantic parser, improving execution accuracy from 35.9% to 59.4% and logical form accuracy from 23.4% to 48.3%.

# 1 介绍

Relational databases store a vast amount of today’s information and provide the foundation of applications such as medical records (Hillestad et al., 2005), financial markets (Beck et al., 2000), and customer relations management (Ngai et al., 2009). However, accessing relational databases requires an understanding of query languages such as SQL, which, while powerful, is difficult to master. Natural language interfaces (NLI), a research area at the intersection of natural language processing and human-computer interactions, seeks to provide means for humans to interact with computers through the use of natural language (Androutsopoulos et al., 1995). We investigate one particular aspect of NLI applied to relational databases: translating natural language questions to SQL queries.

Our main contributions in this work are two-fold. First, we introduce Seq2SQL, a deep neural network for translating natural language questions to corresponding SQL queries. Seq2SQL, shown in Figure 1, consists of three components that leverage the structure of SQL to prune the output space of generated queries. Moreover, it uses policy-based reinforcement learning (RL) to generate the conditions of the query, which are unsuitable for optimization using cross entropy loss due to their unordered nature. We train Seq2SQL using a mixed objective, combining cross entropy losses and RL rewards from in-the-loop query execution on a database. These characteristics allow Seq2SQL to achieve state-of-the-art results on query generation.

Figure 1: Seq2SQL takes as input a question and the columns of a table. It generates the corresponding SQL query, which, during training, is executed against a database. The result of the execution is utilized as the reward to train the reinforcement learning algorithm.

Figure 2: An example in WikiSQL. The inputs consist of a table and a question. The outputs consist of a ground truth SQL query and the corresponding result from execution.

Next, we release WikiSQL, a corpus of 80654 hand-annotated instances of natural language questions, SQL queries, and SQL tables extracted from 24241 HTML tables from Wikipedia. WikiSQL is an order of magnitude larger than previous semantic parsing datasets that provide logical forms along with natural language utterances. We release the tables used in WikiSQL both in raw JSON format as well as in the form of a SQL database. Along with WikiSQL, we release a query execution engine for the database used for in-the-loop query execution to learn the policy. On WikiSQL, Seq2SQL outperforms a previously state-of-the-art semantic parsing model by Dong & Lapata (2016), which obtains 35.9% execution accuracy, as well as an augmented pointer network baseline, which obtains 53.3% execution accuracy. By leveraging the inherent structure of SQL queries and applying policy gradient methods using reward signals from live query execution, Seq2SQL achieves state-of-the-art performance on WikiSQL, obtaining 59.4% execution accuracy.

# 2 模型

The WikiSQL task is to generate a SQL query from a natural language question and table . Our baseline model is the attentional sequence to sequence neural semantic parser proposed by Dong & Lapata (2016) that achieves state-of-the-art performance on a host of semantic parsing datasets without using hand-engineered grammar. However, the output space of the softmax in their Seq2Seq model is unnecessarily large for this task. In particular, we can limit the output space of the generated sequence to the union of the table schema, question utterance, and SQL key words. The resulting model is similar to a pointer network (Vinyals et al., 2015) with augmented inputs. We first describe the augmented pointer network model, then address its limitations in our definition of Seq2SQL, particularly with respect to generating unordered query conditions.

WikiSQL的任务是根据自然语言问题和表模式生成SQL查询。我们的基线模型是Dong＆Lapata（2016）提出的基于注意力的序列到序列语意解析模型，该模型在不使用人工语法的情况下，在大量语义解析数据集上实现了优良的性能。 但是，在此Seq2Seq模型中，softmax的输出空间对于此任务而言不必要地大。 特别是，我们可以将生成序列的输出空间限制为表模式，问题话语和SQL关键字的并集。 生成的模型类似于带有扩充输入的指针网络（Vinyals等，2015）。 我们首先描述增强指针网络模型，然后在我们对Seq2SQL的定义中解决其局限性，特别是在生成无序查询条件方面。

## 2.1 增强型指针网络

The augmented pointer network generates the SQL query token-by-token by selecting from an input sequence. In our case, the input sequence is the concatenation of the column names, required for the selection column and the condition columns of the query, the question, required for the conditions of the query, and the limited vocabulary of the SQL language such as SELECT, COUNT etc. In the example shown in Figure 2, the column name tokens consist of “Pick”, “#”, “CFL”, “Team” etc.; the question tokens consist of “How”, “many”, “CFL”, “teams” etc.; the SQL tokens consist of SELECT, WHERE, COUNT, MIN, MAX etc. With this augmented input sequence, the pointer network can produce the SQL query by selecting exclusively from the input.

Suppose we have a list of N table columns and a question such as in Figure 2, and want to produce the corresponding SQL query. Let $x_j^c = [x_{j,1}^c,x_{j,2}^c,..x_{j,T_j}^c]$ denote the sequence of words in the name of the jth column, where $x_{j,i}^c$ represents the ith word in the jth column and $T_j$ represents the total number of words in the jth column. Similarly, let $x^q$ and $x^s$ respectively denote the sequence of words in the question and the set of unique words in the SQL vocabulary.

We define the input sequence x as the concatenation of all the column names, the question, and the
SQL vocabulary:

$x = [;x_1^c;x_2^c;...;;x_N^c;;x^s;;x^q]$

where [a;b] denotes the concatenation between the sequences a and b and we add sentinel tokens between neighbouring sequences to demarcate the boundaries.

$x = [;x_1^c;x_2^c;...;;x_N^c;;x^s;;x^q]$

The network first encodes x using a two-layer, bidirectional Long Short-Term Memory network (Hochreiter & Schmidhuber, 1997). The input to the encoder are the embeddings corresponding to words in the input sequence. We denote the output of the encoder by $h^{enc}$, where $h_t^{enc}$ is the state of the encoder corresponding to the tth word in the input sequence. For brevity, we do not write out the LSTM equations, which are described by Hochreiter & Schmidhuber (1997). We then apply a pointer network similar to that proposed by Vinyals et al. (2015) to the input encodings $h^{enc}$.

The decoder network uses a two layer, unidirectional LSTM. During each decoder step s, the decoder LSTM takes as input $y_{s-1}$, the query token generated during the previous decoding step, and outputs the state $g_s$. Next, the decoder produces a scalar attention score $α_{s,t}^{ptr}$ for each position t of the input sequence:

$\alpha_{s, t}^{\mathrm{ptr}}=W^{\mathrm{ptr}} \tanh \left(U^{\mathrm{ptr}} g_{s}+V^{\mathrm{ptr}} h_{t}\right)$

We choose the input token with the highest score as the next token of the generated SQL query, $y_s = argmax(α_s^{ptr})$.

$\alpha_{s, t}^{\mathrm{ptr}}=W^{\mathrm{ptr}} \tanh \left(U^{\mathrm{ptr}} g_{s}+V^{\mathrm{ptr}} h_{t}\right)$

## 2.2 SEQ2SQL

While the augmented pointer network can solve the SQL generation problem, it does not leverage the structure inherent in SQL. Typically, a SQL query such as that shown in Figure 3 consists of three components. The first component is the aggregation operator, in this case COUNT, which produces a summary of the rows selected by the query. Alternatively the query may request no summary statistics, in which case an aggregation operator is not provided. The second component is the SELECT column(s), in this case Engine, which identifies the column(s) that are to be included in the returned results. The third component is the WHERE clause of the query, in this case WHERE Driver = Val Musetti, which contains conditions by which to filter the rows. Here, we keep rows in which the driver is “Val Musetti”.

Figure 3: The Seq2SQL model has three components, corresponding to the three parts of a SQL query (right). The input to the model are the question (top left) and the table column names (bottom left).

Seq2SQL, as shown in Figure 3, has three parts that correspond to the aggregation operator, the SELECT column, and the WHERE clause. First, the network classifies an aggregation operation for the query, with the addition of a null operation that corresponds to no aggregation. Next, the network points to a column in the input table corresponding to the SELECT column. Finally, the network generates the conditions for the query using a pointer network. The first two components are supervised using cross entropy loss, whereas the third generation component is trained using policy gradient to address the unordered nature of query conditions (we explain this in the subsequent WHERE Clause section). Utilizing the structure of SQL allows Seq2SQL to further prune the output space of queries, which leads to higher performance than Seq2Seq and the augmented pointer network.

Aggregation Operation. The aggregation operation depends on the question. For the example shown in Figure $3,$ the correct operator is COUNT because the question asks for “How many”. To compute the aggregation operation, we first compute the scalar attention score, $\alpha_{t}^{\mathrm{inp}}=W^{\mathrm{inp}} h_{t}^{\mathrm{enc}}$ for each $t$ th token in the input sequence. We normalize the vector of scores $\alpha^{\operatorname{inp}}=\left[\alpha_{1}^{\operatorname{inp}}, \alpha_{2}^{\mathrm{inp}}, \ldots\right]$ to
produce a distribution over the input encodings, $\beta^{\mathrm{inp}}=\operatorname{softmax}\left(\alpha^{\mathrm{inp}}\right) .$ The input representation $\kappa^{\mathrm{agg}}$ is the sum over the input encodings $h^{\mathrm{enc}}$ weighted by the normalized scores $\beta^{\mathrm{inp}}:$

$\kappa^{\mathrm{agg}}=\sum_{t} \beta_{t}^{\mathrm{inp}} h_{t}^{\mathrm{enc}}$

Let $\alpha^{\text {agg }}$ denote the scores over the aggregation operations such as COUNT, MIN, MAX, and the no-aggregation operation NULL. We compute $\alpha^{\text {agg }}$ by applying a multi-layer perceptron to the input representation $\kappa^{\mathrm{agg}}:$

$\quad \alpha^{\mathrm{agg}}=W^{\mathrm{agg}} \tanh \left(V^{\mathrm{agg}} \kappa^{\mathrm{agg}}+b^{\mathrm{agg}}\right)+c^{\mathrm{agg}}$

We apply the softmax function to obtain the distribution over the set of possible aggregation operations $\beta^{\text {agg }}=\operatorname{softmax}\left(\alpha^{\text {agg }}\right) .$ We use cross entropy loss $L^{\text {agg}}$ for the aggregation operation.

$\kappa^{\mathrm{agg}}=\sum_{t} \beta_{t}^{\mathrm{inp}} h_{t}^{\mathrm{enc}}$

$\alpha^{\text {agg }}$表示聚合操作的得分，例如COUNT，MIN，MAX和无聚合操作为NULL。 我们通过一个多层感知机输入$\kappa^{\mathrm{agg}}$ 来计算$\alpha^{\text {agg }}$

$\quad \alpha^{\mathrm{agg}}=W^{\mathrm{agg}} \tanh \left(V^{\mathrm{agg}} \kappa^{\mathrm{agg}}+b^{\mathrm{agg}}\right)+c^{\mathrm{agg}}$

SELECT Column. The selection column depends on the table columns as well as the question. Namely, for the example in Figure $3,$ “How many engine types” indicates that we need to retrieve the “Engine” column. SELECT column prediction is then a matching problem, solvable using a pointer: given the list of column representations and a question representation, we select the column that best matches the question.

In order to produce the representations for the columns, we first encode each column name with a LSTM. The representation of a particular column $j, e_{j}^{c},$ is given by:

$h_{j, t}^{c}=\operatorname{LSTM}\left(\operatorname{emb}\left(x_{j, t}^{c}\right), h_{j, t-1}^{c}\right) \quad e_{j}^{c}=h_{j, T_{j}}^{c}$

Here, $h_{j, t}^{c}$ denotes the the encoder state of the $j$ th column. We take the last encoder state to be $e_{j}^{c}$ column $j$ 's representation.

To construct a representation for the question, we compute another input representation $\kappa^{sel}$ using the same architecture as for $\kappa^{\text {agg }}$ (Equation 3 ) but with untied weights. Finally, we apply a multi-layer perceptron over the column representations, conditioned on the input representation, to compute the a score for each column $j:$

$\alpha_{j}^{\mathrm{sel}}=W^{\mathrm{sel}} \tanh \left(V^{\mathrm{sel}} \kappa^{\mathrm{sel}}+V^{\mathrm{c}} e_{j}^{\mathrm{c}}\right)$

We normalize the scores with a softmax function to produce a distribution over the possible SELECT columns $\beta^{\mathrm{sel}}=\operatorname{softmax}\left(\alpha^{\mathrm{sel}}\right) .$ For the example shown in Figure $3,$ the distribution is over the columns “Entrant”, “Constructor”, “Chassis”, “Engine”, “No”, and the ground truth SELECT column “Driver”. We train the SELECT network using cross entropy loss $L^{\text {sel }}$

$h_{j, t}^{c}=\operatorname{LSTM}\left(\operatorname{emb}\left(x_{j, t}^{c}\right), h_{j, t-1}^{c}\right) \quad e_{j}^{c}=h_{j, T_{j}}^{c}$

$\alpha_{j}^{\mathrm{sel}}=W^{\mathrm{sel}} \tanh \left(V^{\mathrm{sel}} \kappa^{\mathrm{sel}}+V^{\mathrm{c}} e_{j}^{\mathrm{c}}\right)$

“Driver”上。 我们使用交叉熵损失$L^{\text {sel }}$训练SELECT网络。

WHERE Clause. We can train the WHERE clause using a pointer decoder similar to that described in Section $2.1 .$ However, there is a limitation in using the cross entropy loss to optimize the network: the WHERE conditions of a query can be swapped and the query yield the same result. Suppose we have the question "which males are older than $18 "$ and the queries SELECT name FROM insurance WHERE age > 18 AND gender = “male” and SELECT name FROM insurance WHERE gender = “male” AND age, $>18 .$ Both queries obtain the correct execution result despite not having exact string match. If the former is provided as the ground truth, using cross entropy loss to supervise the generation would then wrongly penalize the latter. To address this problem, we apply reinforcement learning to learn a policy to directly optimize the expected correctness of the execution result (Equation).

Instead of teacher forcing at each step of query generation, we sample from the output distribution to obtain the next token. At the end of the generation procedure, we execute the generated SQL query against the database to obtain a reward. Let $y=\left[y^{1}, y^{2}, \ldots, y^{T}\right]$ denote the sequence of generated tokens in the WHERE clause. Let $q(y)$ denote the query generated by the model and $q_{g}$ denote the ground truth query corresponding to the question. We define the reward $R\left(q(y), q_{g}\right)$ as

$R\left(q(y), q_{g}\right)=\left\{\begin{array}{ll}{-2,} & {\text { if } q(y) \text { is not a valid SQL query }} \\ {-1,} & {\text { if } q(y) \text { is a valid } \operatorname{SQL} \text { query and executes to an incorrect result }} \\ {+1,} & {\text { if } q(y) \text { is a valid SQL query and executes to the correct result }}\end{array}\right.$

The loss, $L^{whe}=-\mathbb{E}_y\left[R\left(q(y), q_{g}\right)\right]$ , is the negative expected reward over possible WHERE clauses. We derive the policy gradient for $L^{\text {whe}}$ as shown by Sutton et al. (2000) and Schulman et al. (2015) .

\begin{aligned} \nabla L_{\Theta}^{\text {whe }} &=-\nabla_{\Theta}\left(\mathbb{E}_{y \sim p_{y}}\left[R\left(q(y), q_{g}\right)\right]\right) \\ &=-\mathbb{E}_{y \sim p_{y}}\left[R\left(q(y), q_{g}\right) \nabla_{\Theta} \sum_{t}\left(\log p_{y}\left(y_{t} ; \Theta\right)\right)\right] \\ & \approx-R\left(q(y), q_{g}\right) \nabla_{\Theta} \sum_{t}\left(\log p_{y}\left(y_{t} ; \Theta\right)\right) \end{aligned}

Here, $p_{y}\left(y_{t}\right)$ denotes the probability of choosing token $y_{t}$ during time step $t .$ In equation 10 , we approximate the expected gradient using a single Monte-Carlo sample $y$.

where子句。 我们可以使用类似于2.1节中描述的指针解码器来训练获得WHERE子句。 但是，使用使用交叉熵损失来优化这个网络存在局限：一个查询的where限制可以在交换后依然产生相同的结果。 假设我们有一个问题“哪些男性年龄大于18岁”，然后查询语句为“SELECT name FROM insurance WHERE age > 18 AND gender =“male””和“SELECT name FROM insurance WHERE gender = “male” AND age > 18”。尽管没有完全匹配的字符串，但两个查询都获得了正确的执行结果。 如果

$R\left(q(y), q_{g}\right)=\left\{\begin{array}{ll}{-2,} & {\text { if } q(y) \text { is not a valid SQL query }} \\ {-1,} & {\text { if } q(y) \text { is a valid } \operatorname{SQL} \text { query and executes to an incorrect result }} \\ {+1,} & {\text { if } q(y) \text { is a valid SQL query and executes to the correct result }}\end{array}\right.$

\begin{aligned} \nabla L_{\Theta}^{\text {whe }} &=-\nabla_{\Theta}\left(\mathbb{E}_{y \sim p_{y}}\left[R\left(q(y), q_{g}\right)\right]\right) \\ &=-\mathbb{E}_{y \sim p_{y}}\left[R\left(q(y), q_{g}\right) \nabla_{\Theta} \sum_{t}\left(\log p_{y}\left(y_{t} ; \Theta\right)\right)\right] \\ & \approx-R\left(q(y), q_{g}\right) \nabla_{\Theta} \sum_{t}\left(\log p_{y}\left(y_{t} ; \Theta\right)\right) \end{aligned}

Mixed Objective Function. We train the model using gradient descent to minimize the objective function $L = L^{agg} + L^{sel} + L^{whe}$. Consequently, the total gradient is the equally weighted sum of the gradients from the cross entropy loss in predicting the SELECT column, from the cross entropy loss in predicting the aggregation operation, and from policy learning.

# 3 WikiSQL

WikiSQL is a collection of questions, corresponding SQL queries, and SQL tables. A single example in WikiSQL, shown in Figure 2, contains a table, a SQL query, and the natural language question corresponding to the SQL query. Table 1 shows how WikiSQL compares to related datasets. Namely, WikiSQL is the largest hand-annotated semantic parsing dataset to date - it is an order of magnitude larger than other datasets that have logical forms, either in terms of the number of examples or the number of tables. The queries in WikiSQL span over Figure 4: Distribution of questions in WikiSQL. a large number of tables and hence presents an unique challenge: the model must be able to not only generalize to new queries, but to new table schema. Finally, WikiSQL contains realistic data extracted from the web. This is evident in the distributions of the number of columns, the lengths of questions, and the length of queries, respectively shown in Figure 5. Another indicator of the variety of questions in the dataset is the distribution of question types, shown in Figure 4

WikiSQL是问题，相应的SQL查询和SQL表的集合。 WikiSQL中的一个示例（如图2所示）包含一个表，一个SQL查询以及与该SQL查询相对应的自然语言问题。表1显示了WikiSQL如何与相关数据集进行比较。就是说，WikiSQL是迄今为止最大的人工注释语义分析数据集-无论从示例数量还是在表数量上，它都比其他具有逻辑形式的数据集大一个数量级。 WikiSQL中的查询跨越图4：WikiSQL中的问题分布。拥有大量的表提出了一个独特的挑战：该模型不仅必须能够推广到新的查询，而且还能够推广到新的表架构。最后，WikiSQL包含从网上提取的真实数据。如图5所示，这在列数，问题的长度和查询的长度的方面得到了体现。在数据集中另一个问题多样性的指标是问题种类的分布，如图4所示。

Figure 4: Distribution of questions in WikiSQL.

Figure 5: Distribution of table, question, query sizes in WikiSQL.

Table 1: Comparison between WikiSQL and existing datasets. The datasets are GeoQuery880 (Tang & Mooney, 2001), ATIS (Price, 1990), Free917 (Cai & Yates, 2013), Overnight (Wang et al., 2015), WebQuestions (Berant et al., 2013), and WikiTableQuestions (Pasupat & Liang, 2015). “Size” denotes the number of examples in the dataset. “LF” indicates whether it has annotated logical forms. “Schema” denotes the number of tables. ATIS is presented as a slot filling task. Each Freebase API page is counted as a separate domain.

We collect WikiSQL by crowd-sourcing on Amazon Mechanical Turk in two phases. First, a worker paraphrases a generated question for a table. We form the generated question using a template, filled using a randomly generated SQL query. We ensure the validity and complexity of the tables by keeping only those that are legitimate database tables and sufficiently large in the number of rows and columns. Next, two other workers verify that the paraphrase has the same meaning as the generated question. We discard paraphrases that do not show enough variation, as measured by the character edit distance from the generated question, as well as those both workers deemed incorrect during verification. Section A of the Appendix contains more details on the collection of WikiSQL. We make available examples of the interface used during the paraphrase phase and during the verification phase in the supplementary materials. The dataset is available for download at https://github.com/salesforce/WikiSQL.

The tables, their paraphrases, and SQL queries are randomly slotted into train, dev, and test splits, such that each table is present in exactly one split. In addition to the raw tables, queries, results, and natural utterances, we also release a corresponding SQL database and query execution engine.

## 3.1 评估

Let N denote the total number of examples in the dataset, $N_{ex}$ the number of queries that, when executed, result in the correct result, and $N_{lf}$ the number of queries has exact string match with the
ground truth query used to collect the paraphrase. We evaluate using the execution
accuracy metric $Acc_{ex} = \frac{N_{ex}}{N}$. One downside of $Acc_{ex}$ is that it is possible to construct a SQL query that does not correspond to the question but nevertheless obtains the same result. For example, the two queries SELECT COUNT(name) WHERE SSN = 123 and SELECT COUNT(SSN) WHERE SSN = 123 produce the same result if no two people with different names share the SSN 123. Hence, we also use the logical form
accuracy $Acc_{lf} = \frac{N_{lf}}{N}$. However, as we showed in Section 2.2, $Acc_{lf}$ incorrectly penalizes queries that achieve the correct result but do not have exact string match with the ground truth query. Due to these observations, we use both metrics to evaluate the models.

# 4 实验

We tokenize the dataset using Stanford CoreNLP (Manning et al., 2014). We use the normalized tokens for training and revert into original gloss before outputting the query so that generated queries are executable on the database. We use fixed GloVe word embeddings (Pennington et al., 2014) and character n-gram embeddings (Hashimoto et al., 2016). Let $w_x^g$ denote the GloVe embedding and $w_x^c$ the character embedding for word x. Here, $w_x^c$ is the mean of the embeddings of all the character n-grams in x. For words that have neither word nor character embeddings, we assign the zero vector. All networks are run for a maximum of 300 epochs with early stopping on dev split execution accuracy. We train using ADAM (Kingma & Ba, 2014) and regularize using dropout (Srivastava et al., 2014). All recurrent layers have a hidden size of 200 units and are followed by a dropout of 0.3. We implement all models using PyTorch. To train Seq2SQL, we first train a version in which the WHERE clause is supervised via teacher forcing (i.e. the policy is not learned from scratch) and then continue training using reinforcement learning. In order to obtain the rewards described in Section 2.2, we use the query execution engine described in Section 3.

## 4.1 结果

We compare results against the attentional sequence to sequence neural semantic parser proposed by Dong & Lapata (2016). This model achieves state of the art results on a variety of semantic parsing datasets, outperforming a host of non-neural semantic parsers despite not using hand-engineered grammars. To make this baseline even more competitive on our new dataset, we augment their input Reducing the output space by utilizing the augmented pointer network improves upon the baseline by 17.4%. Leveraging the structure of SQL queries leads to another improvement of 3.8%, as is shown by the performance of Seq2SQL without RL compared to the augmented pointer network. Finally, training using reinforcement learning based on rewards from in-the-loop query executions on a database leads to another performance increase of 2.3%, as is shown by the performance of the full Seq2SQL model.

Table 2: Performance on WikiSQL. Both metrics are defined in Section 3.1. For Seq2SQL (no RL), the WHERE clause is supervised via teacher forcing as opposed to reinforcement learning. with the table schema such that the model can generalize to new tables. We describe this baseline in detail in Section 2 of the Appendix. Table 2 compares the performance of the three models.

## 4.2 分析

Limiting the output space via pointer network leads to more accurate conditions. Compared to the baseline, the augmented pointer network generates higher quality WHERE clause. For example, for “in how many districts was a successor seated on march 4, 1850?”, the baseline generates the condition successor seated = seated march 4 whereas Seq2SQL generates successor seated = seated march 4 1850. Similarly, for “what’s doug battaglia’s pick number?”, the baseline generates Player = doug whereas Seq2SQL generates Player = doug battaglia. The conditions tend to contain rare words (e.g. “1850”), but the baseline is inclined to produce common words in the training corpus, such as “march” and “4” for date, or “doug” for name. The pointer is less affected since it selects exclusively from the input.

Incorporating structure reduces invalid queries. Seq2SQL without RL directly predicts selection and aggregation and reduces invalid SQL queries generated from 7.9% to 4.8%. A large quantity of invalid queries result from column names – the generated query refers to selection columns that are not present in the table. This is particularly helpful when the column name contain many tokens, such as “Miles (km)”, which has 4 tokens. Introducing a classifier for the aggregation also reduces the error rate. Table 3 shows that adding the aggregation classifier improves the precision, recall, and F1 for predicting the COUNT operator. For more queries produced by the different models, please see Section 3 of the Appendix.

Model Precision Recall F1
Aug Ptr Network 66.3% 64.4% 65.4%
Seq2SQL 72.6% 66.2% 69.2%

Table 3: Performance on the COUNT operator.

RL generates higher quality WHERE clause that are ordered differently than ground truth. Training with policy-based RL obtains correct results in which the order of conditions is differs from the ground truth query. For example, for “in what district was the democratic candidate first elected in 1992?”, the ground truth conditions are First elected = 1992 AND Party = Democratic whereas Seq2SQL generates Party = Democratic AND First elected = 1992. When Seq2SQL is correct and Seq2SQL without RL is not, the latter tends to produce an incorrect WHERE clause. For example, for the rather complex question “what is the race name of the 12th round trenton, new jersey race where a.j. foyt had the pole position?”, Seq2SQL trained without RL generates WHERE rnd = 12 and track = a.j. foyt AND pole position = a.j. foyt whereas Seq2SQL trained with RL correctly generates WHERE rnd = 12 AND pole position = a.j. foyt.

RL生成更高质量的WHERE子句，其顺序与事实不同。 使用基于策略的RL进行训练可获得正确的结果，其中条件的顺序与正确查询不同。 例如，对于“ in what district was the democratic candidate first elected in 1992？”，真实条件为：First elected = 1992 AND Party = Democratic，而Party = Democratic AND First elected = 1992。当Seq2SQL是正确的且Seq2SQL 如果没有RL，则RL往往会产生不正确的WHERE子句。 例如，对于一个相当复杂的问题，“what is the race name of the 12th round trenton, new jersey race where a.j. foyt had the pole position?”，未经RL训练的Seq2SQL生成WHERE rnd = 12 and track = a.j. foyt AND pole position = a.j. foyt，而经过RL训练的Seq2SQL正确生成WHERE rnd = 12 AND pole position = a.j. foyt。

# 相关工作

Semantic Parsing. In semantic parsing for question answering (QA), natural language questions are parsed into logical forms that are then executed on a knowledge graph (Zelle & Mooney, 1996; Wong & Mooney, 2007; Zettlemoyer & Collins, 2005; 2007). Other works in semantic parsing focus on learning parsers without relying on annotated logical forms by leveraging conversational logs (Artzi & Zettlemoyer, 2011), demonstrations (Artzi & Zettlemoyer, 2013), distant supervision (Cai & Yates, 2013; Reddy et al., 2014), and question-answer pairs (Liang et al., 2011). Semantic parsing systems are typically constrained to a single schema and require hand-curated grammars to perform well. Pasupat & Liang (2015) addresses the single-schema limitation by proposing the floating parser, which generalizes to unseen web tables on the WikiTableQuestions task. Our approach is similar in that it generalizes to new table schema. However, we do not require access to table content, conversion of table to an additional graph, nor hand-engineered features/grammar.

Semantic parsing datasets. Previous semantic parsing systems were designed to answer complex and compositional questions over closed-domain, fixed-schema datasets such as GeoQuery (Tang & Mooney, 2001) and ATIS (Price, 1990). Researchers also investigated QA over subsets of largescale knowledge graphs such as DBPedia (Starc & Mladenic, 2017) and Freebase (Cai & Yates, 2013; Berant et al., 2013). The dataset “Overnight” (Wang et al., 2015) uses a similar crowdsourcing process to build a dataset of natural language question, logical form pairs, but has only 8 domains. WikiTableQuestions (Pasupat & Liang, 2015) is a collection of question and answers, also over a large quantity of tables extracted from Wikipedia. However, it does not provide logical forms whereas WikiSQL does. In addition, WikiSQL focuses on generating SQL queries for questions over relational database tables and only uses table content during evaluation.

Representation learning for sequence generation. Dong & Lapata (2016)’s attentional sequence to sequence neural semantic parser, which we use as the baseline, achieves state-of-the-art results on a variety of semantic parsing datasets despite not utilizing hand-engineered grammar. Unlike their model, Seq2SQL uses pointer based generation akin to Vinyals et al. (2015) to achieve higher performance, especially in generating queries with rare words and column names. Pointer models have also been successfully applied to tasks such as language modeling (Merity et al., 2017), summarization (Gu et al., 2016), combinatorial optimization (Bello et al., 2017), and question answering (Seo et al., 2017; Xiong et al., 2017). Other interesting neural semantic parsing models are the Neural Programmer (Neelakantan et al., 2017) and the Neural Enquirer (Yin et al., 2016). Mou et al. (2017) proposed a distributed neural executor based on the Neural Enquirer, which efficiently executes queries and incorporates execution rewards in reinforcement learning. Our approach is different in that we do not access table content, which may be unavailable due to privacy concerns. We also do not hand-engineer model architecture for query execution and instead leverage existing database engines to perform efficient query execution. Furthermore, in contrast to Dong & Lapata (2016) and Neelakantan et al. (2017), we use policy-based RL in a fashion similar to Liang et al. (2017), Mou et al. (2017), and Guu et al. (2017), which helps Seq2SQL achieve state-of-the-art performance. Unlike Mou et al. (2017) and Yin et al. (2016), we generalize across natural language questions and table schemas instead of across synthetic questions on a single schema.

Natural language interface for databases. One prominent works in natural language interfaces is PRECISE (Popescu et al., 2003), which translates questions to SQL queries and identifies questions that it is not confident about. Giordani & Moschitti (2012) translate questions to SQL by first generating candidate queries from a grammar then ranking them using tree kernels. Both of these approaches rely on high quality grammar and are not suitable for tasks that require generalization to new schema. Iyer et al. (2017) also translate to SQL, but with a Seq2Seq model that is further improved with human feedback. Seq2SQL outperforms Seq2Seq and uses reinforcement learning instead of human feedback during training.

# 6 总结

We proposed Seq2SQL, a deep neural network for translating questions to SQL queries. Our model leverages the structure of SQL queries to reduce the output space of the model. To train Seq2SQL, we applied in-the-loop query execution to learn a policy for generating the conditions of the SQL query, which is unordered and unsuitable for optimization via cross entropy loss. We also introduced WikiSQL, a dataset of questions and SQL queries that is an order of magnitude larger than comparable datasets. Finally, we showed that Seq2SQL outperforms a state-of-the-art semantic parser on WikiSQL, improving execution accuracy from 35.9% to 59.4% and logical form accuracy from 23.4% to 48.3%.