RASAT: Integrating Relational Structures into Pretrained Seq2Seq Model for Text-to-SQL

1Shanghai Jiao Tong University, 2NetMind.AI and ProtagoLabs, 3Microsoft Research, Redmond, 4IGSNRR, Chinese Academy of Sciences

RASAT is a Transformer seq2seq architecture augmented with relation-aware self-attention that can leverage a variety of relational structures while effectively inheriting the pretrained parameters of the T5 model. It handles both single-turn and multi-turn text-to-SQL semantic parsing and achieves state-of-the-art performance on the Spider, SParC, and CoSQL datasets.

Abstract

Relational structures such as schema linking and schema encoding have been validated as a key component to qualitatively translating natural language into SQL queries. However, introducing these structural relations comes with prices: they often result in a specialized model structure, which largely prohibits using large pretrained models in text-to-SQL. To address this problem, we propose RASAT: a Transformer seq2seq architecture augmented with relation-aware self-attention that could leverage a variety of relational structures while inheriting the pretrained parameters from the T5 model effectively. Our model can incorporate almost all types of existing relations in the literature, and in addition, we propose introducing co-reference relations for the multi-turn scenario. Experimental results on three widely used text-to-SQL datasets, covering both single-turn and multi-turn scenarios, have shown that RASAT could achieve state-of-the-art results across all three benchmarks (75.5% EX on Spider, 52.6% IEX on SParC, and 37.4% IEX on CoSQL).

Two Types of Mainstream Method for Text-to-SQL

Graph-structure-based Method

  • Uses graph structures to encode the various complex relationships among the inputs.
  • Pros:
    Better representation of structural information.
  • Cons:
    The powerful representation capability of PLMs cannot be fully utilized.
    Most AST-based decoders cannot generate the actual values in the SQL query.

Seq2seq Method by Fine-tuning a PLM

  • Fine-tuning a pretrained language model (e.g., T5-3B) can yield performance competitive with the state of the art.
  • Pros:
    Makes full use of the powerful representation capabilities of PLMs.
    The predicted SQL query contains actual values.
  • Cons:
    Simple serialization loses structural information.

Can we combine the two approaches?

RASAT: Relation-Aware Self-Attention-augmented T5

  • The T5 model is taken as the base model, with the self-attention modules in its encoder replaced by relation-aware self-attention.
  • In the single-turn setting, the input is the question concatenated with the serialized database schema.
  • In the multi-turn setting, the preceding questions in the interaction are additionally prepended to the input.
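To make the mechanism concrete, here is a minimal single-head sketch of relation-aware self-attention in the style of Shaw et al. (2018), which RASAT builds on: relation embeddings are added to the keys and values for each token pair. The NumPy formulation and array names are ours for illustration, not the paper's implementation:

```python
import numpy as np

def relation_aware_attention(x, rel_k, rel_v, wq, wk, wv):
    """Single-head relation-aware self-attention (illustrative sketch).

    x:     (n, d)    token representations
    rel_k: (n, n, d) relation embeddings added to the keys
    rel_v: (n, n, d) relation embeddings added to the values
    """
    n, d = x.shape
    q, k, v = x @ wq, x @ wk, x @ wv          # each (n, d)
    # scores[i, j] = q_i . (k_j + rel_k[i, j]) / sqrt(d)
    scores = np.einsum("id,ijd->ij", q, k[None, :, :] + rel_k) / np.sqrt(d)
    # row-wise softmax
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    # output_i = sum_j attn[i, j] * (v_j + rel_v[i, j])
    return np.einsum("ij,ijd->id", attn, v[None, :, :] + rel_v)
```

With all relation embeddings set to zero, this reduces exactly to standard scaled dot-product self-attention, which is why the pretrained T5 attention weights can be inherited directly.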

Relations in the Interaction Graph

There are five types of relations in the interaction graph (for more details, please refer to our paper):
  • schema encoding
  • schema linking
  • question dependency structure
  • coreference between questions
  • database content mentions
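As an illustration of how schema-linking relations can arise, the toy function below marks exact and partial name matches between question tokens and column names. The matching rules and relation labels here are simplified assumptions for illustration, not the paper's exact relation definitions:

```python
def link_schema(question_tokens, schema_columns):
    """Toy schema linking: map (question token, column) pairs to a relation
    label when the token matches the column name exactly or partially."""
    relations = {}
    for i, tok in enumerate(question_tokens):
        for j, col in enumerate(schema_columns):
            parts = col.lower().split("_")
            if tok.lower() == col.lower():
                relations[(i, j)] = "exact_match"
            elif tok.lower() in parts:
                relations[(i, j)] = "partial_match"
    return relations
```

For the question "show singer name" over columns `singer_id` and `name`, this links "singer" to `singer_id` as a partial match and "name" to `name` as an exact match; these labeled edges become the relation types fed into the relation-aware attention.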

Data Statistics and Examples

DS-1000 contains 1000 problems originating from 451 unique StackOverflow problems. To defend against potential memorization, more than half of the DS-1000 problems are modified from the original StackOverflow problems; they include 152 surface perturbations, 235 semantic perturbations, and 162 difficult rewrites.

Below are more DS-1000 examples. For each example, the model needs to fill in code at the "[insert]" marker in the prompt on the left; the code is then executed against the multi-criteria automatic evaluation, which includes test cases and surface-form constraints; a reference solution is provided at the bottom left.

A NumPy example problem involving randomness, which requires specialist knowledge to test.

Perturbation and Prompt

For questions from StackOverflow, we equipped them with prompts, test cases, and evaluation functions, and call these Origin. To prevent models from simply recalling solutions seen during pretraining, we perturbed the questions in two ways: surface perturbations and semantic perturbations. To make DS-1000 more challenging, we additionally introduced Difficult Rewrites.

  • Origin:   The original question, equipped with a prompt, test cases, and an evaluation function.
  • Surface Perturbation:   We paraphrased the question or modified its code context, but the reference solution stays the same after the perturbation.
  • Semantic Perturbation:   We changed the semantics of the reference solution without changing its difficulty.
  • Difficult Rewrite:   We rewrote the question to be more complex and more difficult.

We also provide an Insertion-style prompt and a Completion-style prompt for each question (including perturbations).

  • Insertion:   Models predict what should be written in place of "[insert]".
  • Completion:   Models predict what should be written after the prompt.
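The two prompt styles can be sketched as simple string templates over the same underlying problem; the function below is an illustrative construction following the example prompt shown later on this page, not the dataset's official prompt builder:

```python
def make_prompts(problem, context_code, suffix_code):
    """Build toy Insertion- and Completion-style prompts for one problem."""
    # Insertion: the model fills in the [insert] slot between context and suffix.
    insertion = (
        f"Problem:\n{problem}\nA:\n<code>\n{context_code}\n</code>\n"
        "BEGIN SOLUTION\n<code>\n[insert]\n</code>\nEND SOLUTION\n"
        f"<code>\n{suffix_code}\n</code>"
    )
    # Completion: the model continues directly after the context code.
    completion = f"Problem:\n{problem}\nA:\n<code>\n{context_code}\n"
    return insertion, completion
```

The insertion form lets left-to-right models with infilling ability (e.g., Codex insertion mode, InCoder) condition on the code after the blank, which the completion form cannot provide.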

Here is the performance of Codex-davinci-002 on DS-1000.

Here are more prompts; you can copy them and run models in the OpenAI Playground.

Insertion:

Problem:
Let's say I have a 1d numpy positive integer array like this:
a = array([1,0,3])
I would like to encode this as a 2D one-hot array(for natural number)
b = array([[0,1,0,0], [1,0,0,0], [0,0,0,1]])
The leftmost element corresponds to 0 in `a`(NO MATTER whether 0 appears in `a` or not.), and the rightmost vice versa.
Is there a quick way to do this only using numpy? Quicker than just looping over a to set elements of b, that is.
A:
<code>
import numpy as np
a = np.array([1, 0, 3])
</code>
BEGIN SOLUTION
<code>
[insert]
</code>
END SOLUTION
<code>
print(b)
</code>
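One plausible NumPy solution for the `[insert]` slot above (not necessarily the dataset's reference solution) builds a zero matrix and scatters ones via integer array indexing, avoiding any Python-level loop:

```python
import numpy as np

a = np.array([1, 0, 3])
# width covers every natural number 0..a.max(), whether or not it appears in a
b = np.zeros((a.size, a.max() + 1), dtype=int)
b[np.arange(a.size), a] = 1   # scatter a single 1 into each row
print(b)
# prints:
# [[0 1 0 0]
#  [1 0 0 0]
#  [0 0 0 1]]
```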

Comparison

Table 4 compares DS-1000 with other datasets. Notably, the average number of words per problem in DS-1000 is much larger than in other data-science-related datasets (e.g., DSP and CoNaLa).

More importantly, the problems in DS-1000 represent more diverse and naturalistic intent and context formats that are not seen in any other dataset.

Unlike generic Python code generation benchmarks (MBPP and HumanEval), data science code generation benchmarks have fewer test cases, since the annotators need to define program inputs with complex objects such as square matrices, classifiers, or dataframes rather than simple primitives such as floats or lists.

Nevertheless, even a few test cases suffice for DS-1000: only 1.8% of the Codex-002-predicted solutions accepted by our evaluation are incorrect.
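The multi-criteria evaluation described above can be sketched as a toy checker that combines a surface-form constraint with a functional test; the function signature and the substring-based constraint format are illustrative assumptions, not the benchmark's actual harness:

```python
def evaluate(candidate_code, test_fn, banned_substrings):
    """Toy multi-criteria check: the candidate must satisfy surface-form
    constraints (e.g., must not use a banned construct) AND pass the
    functional test on the namespace it produces."""
    # surface-form constraint: reject code containing any banned substring
    if any(s in candidate_code for s in banned_substrings):
        return False
    namespace = {}
    try:
        exec(candidate_code, namespace)   # run candidate in a fresh namespace
    except Exception:
        return False                      # crashing code fails the check
    return bool(test_fn(namespace))       # functional test on the results
```

For instance, a question that asks for a loop-free solution could ban the substring `"for "`, so a looping answer is rejected even if its output is correct.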

Baselines

Notice: t = temperature, top-p = top-p cutoff, len = max generation length.

| Rank | Date | Model | Organization | Reference | Style | Settings | Score |
|------|------|-------|--------------|-----------|-------|----------|-------|
| 1 | Nov 08, 2022 | codex-davinci-002 | OpenAI | Chen et al., '21 | Insertion | t=0.2, top-p=0.95, len=1024 | 43.3 |
| 2 | Nov 08, 2022 | codex-davinci-002 | OpenAI | Chen et al., '21 | Completion | t=0.2, top-p=0.95, len=1024 | 39.2 |
| 3 | Nov 08, 2022 | codex-davinci-001 | OpenAI | Chen et al., '21 | Completion | t=0.2, top-p=0.95, len=1024 | 20.2 |
| 4 | Nov 08, 2022 | codex-cushman-001 | OpenAI | Chen et al., '21 | Completion | t=0.2, top-p=0.95, len=1024 | 18.1 |
| 5 | Nov 08, 2022 | CodeGen-6B | Salesforce Research | Nijkamp et al., '22 | Completion | t=0.2, top-p=0.95, len=1024 | 8.4 |
| 6 | Nov 08, 2022 | InCoder-6B | Facebook AI | Fried et al., '22 | Insertion | t=0.2, top-p=0.95, len=1024 | 7.5 |
| 7 | Nov 08, 2022 | InCoder-6B | Facebook AI | Fried et al., '22 | Completion | t=0.2, top-p=0.95, len=1024 | 7.4 |

Acknowledgement

We thank Noah A. Smith, Tianbao Xie, Shuyang Jiang for their helpful feedback on this work.

BibTeX

@article{Lai2022DS1000,
  title={DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation},
  author={Yuhang Lai and Chengxi Li and Yiming Wang and Tianyi Zhang and Ruiqi Zhong and Luke Zettlemoyer and Scott Wen-tau Yih and Daniel Fried and Sida Wang and Tao Yu},
  journal={ArXiv},
  year={2022},
  volume={abs/2211.11501}
}