salesforce / wikisql

A large annotated semantic parsing corpus for developing natural language interfaces.

License: BSD 3-Clause "New" or "Revised" License

WikiSQL


A large crowd-sourced dataset for developing natural language interfaces for relational databases. WikiSQL is the dataset released along with our work Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning.

Citation

If you use WikiSQL, please cite the following work:

Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning.

@article{zhongSeq2SQL2017,
  author    = {Victor Zhong and
               Caiming Xiong and
               Richard Socher},
  title     = {Seq2SQL: Generating Structured Queries from Natural Language using
               Reinforcement Learning},
  journal   = {CoRR},
  volume    = {abs/1709.00103},
  year      = {2017}
}

Notes

Regarding tokenization and Stanza: when WikiSQL was written three years ago, it relied on Stanza, a Python wrapper for Stanford CoreNLP that has since been deprecated. If you would still like to use the tokenizer, please use the Docker image. We do not anticipate switching to the current Stanza, as changes to the tokenizer would make the previous results unreproducible.

Leaderboard

If you submit papers on WikiSQL, please consider sending a pull request to merge your results onto the leaderboard. By submitting, you acknowledge that your results were obtained purely by training on the training split and tuning on the dev split (e.g. you evaluated on the test set only once). Moreover, you acknowledge that your model uses only the table schema and the question during inference; that is, it does not use the table content. Update (May 12, 2019): We now have a separate leaderboard for weakly supervised models that do not use logical forms during training.

Weakly supervised without logical forms

| Model | Dev execution accuracy | Test execution accuracy |
| --- | --- | --- |
| TAPEX (Liu 2022) | 89.2 | 89.5 |
| HardEM (Min 2019) | 84.4 | 83.9 |
| LatentAlignment (Wang 2019) | 79.4 | 79.3 |
| MeRL (Agarwal 2019) | 74.9 +/- 0.1 | 74.8 +/- 0.2 |
| MAPO (Liang 2018) | 72.2 +/- 0.2 | 72.1 +/- 0.3 |
| Rule-SQL (Guo 2019) | 61.1 +/- 0.2 | 61.0 +/- 0.3 |

Supervised via logical forms

| Model | Dev logical form accuracy | Dev execution accuracy | Test logical form accuracy | Test execution accuracy | Uses execution |
| --- | --- | --- | --- | --- | --- |
| SeaD + Execution-Guided Decoding (Xu 2021) (Ant Group, Ada & ZhiXiaoBao) | 87.6 | 92.9 | 87.5 | 93.0 | Inference |
| SDSQL + Execution-Guided Decoding (Hui 2020) (Alibaba Group) | 87.1 | 92.6 | 87.0 | 92.7 | Inference |
| IE-SQL + Execution-Guided Decoding (Ma 2020) (Ping An Life, AI Team) | 87.9 | 92.6 | 87.8 | 92.5 | Inference |
| HydraNet + Execution-Guided Decoding (Lyu 2020) (Microsoft Dynamics 365 AI) (code) | 86.6 | 92.4 | 86.5 | 92.2 | Inference |
| BRIDGE^ + Execution-Guided Decoding (Lin 2020) (Salesforce Research) | 86.8 | 92.6 | 86.3 | 91.9 | Inference |
| X-SQL + Execution-Guided Decoding (He 2019) | 86.2 | 92.3 | 86.0 | 91.8 | Inference |
| SDSQL (Hui 2020) (Alibaba Group) | 86.0 | 91.8 | 85.6 | 91.4 | |
| BRIDGE^ (Lin 2020) (Salesforce Research) | 86.2 | 91.7 | 85.7 | 91.1 | |
| Text2SQLGen + EG (Mellah 2021) (Novelis.io Research) | - | 91.2 | - | 91.0 | Inference |
| SeqGenSQL+EG (Li 2020) | - | 90.8 | - | 90.5 | Inference |
| SeqGenSQL (Li 2020) | - | 90.6 | - | 90.3 | Inference |
| SeaD (Xu 2021) (Ant Group, Ada & ZhiXiaoBao) | 84.9 | 90.2 | 84.7 | 90.1 | Inference |
| (Guo 2019) + Execution-Guided Decoding with BERT-Base-Uncased ^ | 85.4 | 91.1 | 84.5 | 90.1 | Inference |
| SQLova + Execution-Guided Decoding (Hwang 2019) | 84.2 | 90.2 | 83.6 | 89.6 | Inference |
| IncSQL + Execution-Guided Decoding (Shi 2018) | 51.3 | 87.2 | 51.1 | 87.1 | Inference |
| HydraNet (Lyu 2020) (Microsoft Dynamics 365 AI) (code) | 83.6 | 89.1 | 83.8 | 89.2 | |
| (Guo 2019) with BERT-Base-Uncased ^ | 84.3 | 90.3 | 83.7 | 89.2 | |
| IE-SQL (Ma 2020) (Ping An Life, AI Team) | 84.6 | 88.7 | 84.6 | 88.8 | |
| X-SQL (He 2019) | 83.8 | 89.5 | 83.3 | 88.7 | |
| SQLova (Hwang 2019) | 81.6 | 87.2 | 80.7 | 86.2 | |
| Execution-Guided Decoding (Wang 2018) | 76.0 | 84.0 | 75.4 | 83.8 | Inference |
| IncSQL (Shi 2018) | 49.9 | 84.0 | 49.9 | 83.7 | |
| Auxiliary Mapping Task (Chang 2019) | 76.0 | 82.3 | 75.0 | 81.7 | |
| MQAN (unordered) (McCann 2018) | 76.1 | 82.0 | 75.4 | 81.4 | |
| MQAN (ordered) (McCann 2018) | 73.5 | 82.0 | 73.2 | 81.4 | |
| Coarse2Fine (Dong 2018) | 72.5 | 79.0 | 71.7 | 78.5 | |
| TypeSQL (Yu 2018) | - | 74.5 | - | 73.5 | |
| PT-MAML (Huang 2018) | 63.1 | 68.3 | 62.8 | 68.0 | |
| (Guo 2018) | 64.1 | 71.1 | 62.5 | 69.0 | |
| SQLNet (Xu 2017) | - | 69.8 | - | 68.0 | |
| Wang 2017^ | 62.0 | 67.1 | 61.5 | 66.8 | |
| Seq2SQL (Zhong 2017) | 49.5 | 60.8 | 48.3 | 59.4 | Training |
| Baseline (Zhong 2017) | 23.3 | 37.0 | 23.4 | 35.9 | |

^ indicates that table content is used directly by the model during training.
* indicates that the order in where conditions is ignored.

Installation

Both the evaluation script and the dataset are stored within the repo. Only Python 3 is supported at the moment; I would very much welcome a pull request that ports the code to work with Python 2. The installation steps are as follows:

git clone https://github.com/salesforce/WikiSQL
cd WikiSQL
pip install -r requirements.txt
tar xvjf data.tar.bz2

This will unpack the data files into a directory called data.

Content and format

Inside the data folder you will find the files in jsonl and db format. The former can be read line by line, where each line is a serialized JSON object. The latter is a SQLite3 database.

Question, query and table ID

These files are contained in the *.jsonl files. A line looks like the following:

{
   "phase":1,
   "question":"who is the manufacturer for the order year 1998?",
   "sql":{
      "conds":[
         [
            0,
            0,
            "1998"
         ]
      ],
      "sel":1,
      "agg":0
   },
   "table_id":"1-10007452-3"
}

The fields represent the following:

  • phase: the phase in which the dataset was collected. We collected WikiSQL in two phases.
  • question: the natural language question written by the worker.
  • table_id: the ID of the table to which this question is addressed.
  • sql: the SQL query corresponding to the question. This has the following subfields:
    • sel: the numerical index of the column that is being selected. You can find the actual column from the table.
    • agg: the numerical index of the aggregation operator that is being used. You can find the actual operator from Query.agg_ops in lib/query.py.
    • conds: a list of triplets (column_index, operator_index, condition) where:
      • column_index: the numerical index of the condition column that is being used. You can find the actual column from the table.
      • operator_index: the numerical index of the condition operator that is being used. You can find the actual operator from Query.cond_ops in lib/query.py.
      • condition: the comparison value for the condition, in either string or float type.
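The index-based format above can be decoded back into a readable query. The following sketch mirrors the operator lists in Query.agg_ops and Query.cond_ops from lib/query.py; the header shown is a hypothetical illustration, not the real table's:

```python
import json

# Operator vocabularies, mirroring Query.agg_ops / Query.cond_ops in lib/query.py.
AGG_OPS = ['', 'MAX', 'MIN', 'COUNT', 'SUM', 'AVG']
COND_OPS = ['=', '>', '<', 'OP']

def sql_to_string(sql, header):
    """Render the index-based "sql" field as readable SQL (for inspection only)."""
    sel = header[sql['sel']]
    agg = AGG_OPS[sql['agg']]
    select = '{}({})'.format(agg, sel) if agg else sel
    conds = ['{} {} {!r}'.format(header[col], COND_OPS[op], val)
             for col, op, val in sql['conds']]
    where = ' WHERE ' + ' AND '.join(conds) if conds else ''
    return 'SELECT {} FROM table{}'.format(select, where)

line = ('{"phase":1,"question":"who is the manufacturer for the order year 1998?",'
        '"sql":{"conds":[[0,0,"1998"]],"sel":1,"agg":0},"table_id":"1-10007452-3"}')
example = json.loads(line)
header = ['Order Year', 'Manufacturer']  # hypothetical header for illustration
print(sql_to_string(example['sql'], header))
# SELECT Manufacturer FROM table WHERE Order Year = '1998'
```

Note that this rendering is only for human inspection; the evaluation script executes queries against the symbolized SQLite tables instead.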

Tables

These files are contained in the *.tables.jsonl files. A line looks like the following:

{
   "id":"1-1000181-1",
   "header":[
      "State/territory",
      "Text/background colour",
      "Format",
      "Current slogan",
      "Current series",
      "Notes"
   ],
   "types":[
      "text",
      "text",
      "text",
      "text",
      "text",
      "text"
   ],
   "rows":[
      [
         "Australian Capital Territory",
         "blue/white",
         "Yaa\u00b7nna",
         "ACT \u00b7 CELEBRATION OF A CENTURY 2013",
         "YIL\u00b700A",
         "Slogan screenprinted on plate"
      ],
      [
         "New South Wales",
         "black/yellow",
         "aa\u00b7nn\u00b7aa",
         "NEW SOUTH WALES",
         "BX\u00b799\u00b7HI",
         "No slogan on current series"
      ],
      [
         "New South Wales",
         "black/white",
         "aaa\u00b7nna",
         "NSW",
         "CPX\u00b712A",
         "Optional white slimline series"
      ],
      [
         "Northern Territory",
         "ochre/white",
         "Ca\u00b7nn\u00b7aa",
         "NT \u00b7 OUTBACK AUSTRALIA",
         "CB\u00b706\u00b7ZZ",
         "New series began in June 2011"
      ],
      [
         "Queensland",
         "maroon/white",
         "nnn\u00b7aaa",
         "QUEENSLAND \u00b7 SUNSHINE STATE",
         "999\u00b7TLG",
         "Slogan embossed on plate"
      ],
      [
         "South Australia",
         "black/white",
         "Snnn\u00b7aaa",
         "SOUTH AUSTRALIA",
         "S000\u00b7AZD",
         "No slogan on current series"
      ],
      [
         "Victoria",
         "blue/white",
         "aaa\u00b7nnn",
         "VICTORIA - THE PLACE TO BE",
         "ZZZ\u00b7562",
         "Current series will be exhausted this year"
      ]
   ]
}

The fields represent the following:

  • id: the table ID.
  • header: a list of column names in the table.
  • types: a list of column types, one per column (e.g. text or real).
  • rows: a list of rows. Each row is a list of row entries.

Tables are also contained in a corresponding *.db file. This is a SQL database with the same information. Note that due to the flexible format of HTML tables, the column names of tables in the database have been symbolized. For example, for a table with the columns ['foo', 'bar'], the columns in the database are actually col0 and col1.
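The symbolized layout can be illustrated with an in-memory database standing in for the shipped .db files (the table name, columns, and row below are hand-built; with the real data you would connect to e.g. data/dev.db and use the table's ID, with dashes replaced, as the table name):

```python
import sqlite3

# Mimic the symbolized .db layout: column names become col0, col1, ...
# in table order, regardless of the original HTML column names.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE table_1_1000181_1 (col0 TEXT, col1 TEXT, col5 TEXT)')
conn.execute("INSERT INTO table_1_1000181_1 VALUES "
             "('New South Wales', 'black/yellow', 'No slogan on current series')")

# header[0] is "State/territory" and header[5] is "Notes", so the query
# addresses them as col0 and col5:
row = conn.execute('SELECT col5 FROM table_1_1000181_1 WHERE col0 = ?',
                   ('New South Wales',)).fetchone()
print(row[0])  # No slogan on current series
```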

Scripts

evaluate.py contains the evaluation script, whose options are:

usage: evaluate.py [-h] source_file db_file pred_file

positional arguments:
  source_file  source file for the prediction
  db_file      source database for the prediction
  pred_file    predictions by the model

optional arguments:
  -h, --help   show this help message and exit

The pred_file, which is supplied by the user, should contain lines of serialized JSON objects. Each JSON object should contain a query field which corresponds to the query predicted for a line in the input *.jsonl file and should be similar to the sql field of the input. In particular, it should contain:

  • sel: the numerical index of the column that is being selected. You can find the actual column from the table.
  • agg: the numerical index of the aggregation operator that is being used. You can find the actual operator from Query.agg_ops in lib/query.py.
  • conds: a list of triplets (column_index, operator_index, condition) where:
    • column_index: the numerical index of the condition column that is being used. You can find the actual column from the table.
    • operator_index: the numerical index of the condition operator that is being used. You can find the actual operator from Query.cond_ops in lib/query.py.
    • condition: the comparison value for the condition, in either string or float type.

An example predictions file can be found in test/example.pred.dev.jsonl. The lib directory contains dependencies of evaluate.py.
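Producing such a predictions file can be sketched as follows (the output filename and the prediction values here are made up):

```python
import json

# One JSON object per line, in the same order as the input *.jsonl file,
# each with a "query" field shaped like the input's "sql" field.
predictions = [
    {'query': {'sel': 1, 'agg': 0, 'conds': [[0, 0, '1998']]}},
    {'query': {'sel': 5, 'agg': 0, 'conds': [[3, 0, 'SOUTH AUSTRALIA']]}},
]
with open('my.pred.dev.jsonl', 'w') as f:
    for pred in predictions:
        f.write(json.dumps(pred) + '\n')
```

The resulting file would then be scored with something like python evaluate.py data/dev.jsonl data/dev.db my.pred.dev.jsonl.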

Integration Test

We supply a sample predictions file for the dev set in test/example.pred.dev.jsonl.bz2. You can decompress this file using bunzip2 -k test/example.pred.dev.jsonl.bz2 to see what a real predictions file should look like. We distribute a Dockerfile which installs the necessary dependencies of this library and runs the evaluation script on this file. The Dockerfile also serves as an example of how to use the evaluation script.

To run the test, first build the image from the root directory:

docker build -t wikisqltest -f test/Dockerfile .

Next, run the image:

docker run --rm --name wikisqltest wikisqltest

If everything works correctly, the output should be:

{
  "ex_accuracy": 0.5380596128725804,
  "lf_accuracy": 0.35375846099038116
}

Annotation

In addition to the raw data dump, we also release an optional annotation script that annotates WikiSQL using Stanford CoreNLP. The annotate.py script annotates the query, question, and SQL table, as well as a sequence-to-sequence construction of the input and output for convenient use with seq2seq models. To use annotate.py, you must set up the CoreNLP Python client using Stanford Stanza. One Docker image of the CoreNLP server that this works with is here:

docker run --name corenlp -d -p 9000:9000 vzhong/corenlp-server

Note that the sequence output contains symbols to delineate the boundaries of fields. In lib/query.py you will also find accompanying functions to reconstruct a query given a sequence output in the annotated format.

FAQ

I will update this list with frequently asked questions.

How do you convert HTML table columns to SQL table columns?

Web tables are noisy and not directly transferable into a database. One problem is that SQL column names need to be symbolic, whereas web table columns usually contain unicode characters, whitespace, etc. To handle this problem, we convert table columns to symbols (e.g. Player Name to col1) just before executing the query. For the implementation details, please see evaluate.py.
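The mapping can be sketched minimally as below (the header and the symbolize helper are illustrative; the real logic lives in evaluate.py and lib/dbengine.py):

```python
# Map each raw HTML column name to a positional identifier, in table order.
def symbolize(header):
    return {name: 'col{}'.format(i) for i, name in enumerate(header)}

mapping = symbolize(['Rank', 'Player Name', 'Goals'])
print(mapping)  # {'Rank': 'col0', 'Player Name': 'col1', 'Goals': 'col2'}
```

A query over "Player Name" would then be rewritten to use col1 before execution against the SQLite database.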

Changelog

  • 1.1: Removed examples from each split that have gloss mismatch between the logical form conditions and the annotated question utterance.

Contributors

alexpolozov, bmccann, devanshudesai, donglixp, guotong1988, huybery, kedartatwawadi, louis-li, lyuqin, nl2sql, posenhuang, ryanguest, shuaichenchang, siviltaram, sk210892, svc-scm, todpole3, tzshi, vzhong, whksmo, yi-mao

wikisql's Issues

How to get the original SQL query from the "sql"/"query" field which is in json format

How can I convert the following code in "sql" field into the original SQL query

{"phase": 1, "table_id": "1-1000181-1", "question": "Tell me what the notes are for South Australia ", "sql": {"sel": 5, "conds": [[3, 0, "SOUTH AUSTRALIA"]], "agg": 0}}

I tried using the lib.query.Query.from_dict method but get SELECT col5 FROM table WHERE col3 = SOUTH AUSTRALIA,
and tried using the lib.dbengine.DBEngine.execute_query method but get SELECT col5 AS result FROM table_1_1000181_1 WHERE col3 = :col3.
Neither method yields the original SQL query, so how can I get it? Can anybody help?

Using this method on my own SQL database

Hi @vzhong,

I would like to get inferences for my own SQL table. For a similar question asked in Jan '18, you replied:

"You would have to train a model on this data, then perform inference on your data. Xiaojun and Chang from Berkeley has kindly made their model available here: https://github.com/xiaojunxu/SQLNet".

I am a little confused. Won't I need to train on my own SQL table? My column names could be very different. Won't I need to create .jsonl files like you have in the "data" directory? Could you please help me understand your comment above?

Thank you,
Shruti

In Google Colab - stanza.server.client.PermanentlyFailedException: Timed out waiting for service to come alive.

Hello Team,

While running the setup in Google Colab, I am facing the error below.
Please advise further.

Traceback (most recent call last):
File "annotate.py", line 113, in
a = annotate_example(d, tables[d['table_id']])
File "annotate.py", line 38, in annotate_example
ann['question'] = annotate(example['question'])
File "annotate.py", line 22, in annotate
for s in client.annotate(sentence):
File "/usr/local/lib/python3.6/dist-packages/stanza/server/client.py", line 470, in annotate
r = self._request(text.encode('utf-8'), request_properties, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/stanza/server/client.py", line 379, in _request
self.ensure_alive()
File "/usr/local/lib/python3.6/dist-packages/stanza/server/client.py", line 203, in ensure_alive
raise PermanentlyFailedException("Timed out waiting for service to come alive.")
stanza.server.client.PermanentlyFailedException: Timed out waiting for service to come alive.
0% 0/56355 [02:00<?, ?it/s]

dev.sql and test.sql files

Hi all, does anyone have the dev.sql and test.sql files for the WikiSQL dataset? Somehow I have train.sql, but I don't recall where I got it. I need schema.sql files for dev and test.
Thank you so much!
Anshu

[None] gold questions

I've instrumented evaluate.py to count the number of gold questions that have a gold answer of [None]. In the dev set, evaluated via the Docker instructions on the front page, there are 657 / 8421 = 8% questions with a [None] result. I've manually looked at a few such questions, and they seem to have various errors, most notably a missing space before commas, and > instead of >=.

Is this a known issue?

{'phase': 2, 'table_id': '2-12955969-1', 'question': 'What is the year of the tournament played at Melbourne, Australia?', 'sql': {'sel': 0, 'conds': [[2, 0, 'melbourne, australia']], 'agg': 5}}
SELECT AVG(col0) AS result FROM table_2_12955969_1 WHERE col2 = :col2
{'col2': 'melbourne, australia'}
Manual fix: SELECT AVG(col0) AS result FROM table_2_12955969_1 WHERE col2 = 'melbourne , australia';

{'phase': 2, 'table_id': '2-12312050-1', 'question': "What's the sum of points for the 1963 season when there are more than 30 games?", 'sql': {'sel': 4, 'conds': [[0, 0, '1963'], [2, 1, 30]], 'agg': 4}}
SELECT SUM(col4) AS result FROM table_2_12312050_1 WHERE col0 = :col0 AND col2 > :col2
{'col0': '1963', 'col2': 30}
Manual fix: SELECT SUM(col4) AS result FROM table_2_12312050_1 WHERE col0 = '1963' AND col2 >= 30;

Incorrect Expected Queries

It seems many of the expected queries are incorrect, based on the question posed. Here is a preliminary list of just some questions I noticed have incorrect expected queries/answers:

Question: How many games was Damien Wilkins (27) the high scorer?
Expected query: SELECT MIN(Game) FROM 1-11964154-2 WHERE High points = 'damien wilkins (27)'
Expected result: ['6.0']

Question: What is the name of the integrated where allied-related is shared?
Expected query: SELECT (Component) FROM 1-11944282-1 WHERE Allied-Related = 'shared'
Expected result: ['customers']

Question: what is the integrated in which the holding allied-unrelated is many?
Expected query: SELECT (Holding) FROM 1-11944282-1 WHERE Allied-Unrelated = 'many'
Expected result: ['many']

Question: How many integrated allied-related are there?
Expected query: SELECT (Integrated) FROM 1-11944282-1 WHERE Allied-Related = 'many'
Expected result: ['one']

Question: Which authority has a rocket launch called rehbar-5?
Expected query: SELECT COUNT(Derivatives) FROM 1-11869952-1 WHERE Rocket launch = 'rehbar-5'
Expected result: ['1']

Question: Who had an evening gown score of 9.78?
Expected query: SELECT (Interview) FROM 1-11690135-1 WHERE Evening Gown = '9.78'
Expected result: ['8.91']

What is the input to annotate.py

Hi,

Could you please share an input format/example file for annotate.py? I would like to create SQL queries on a new separate SQL database

Thank you,
Sandy

[BUG] Unable to clone WikiSQL in a system

🐛🐛 Bug Report

The installation of WikiSQL given in README is not working.

[screenshot]

It is showing a warning.

[screenshot]

The folder WikiSQL is also empty.

[screenshot]

⚙️ Environment

  • Python version(s): [3.8.5]
  • OS: [Windows 10]

Requirements.txt versions

For those having trouble running the evaluation script in 2022 due to version updates in the dependencies: use the following requirements.txt to pin the specific package versions from the time (circa 2017-2018).

tqdm
sqlalchemy==1.2
records==0.5.3
babel==2.5.1
tabulate==0.8.1

Error in the first train record

In train.jsonl, the first record's sql->conds attribute says 3, but it should say 0 instead.

For reference following is the object
{'phase': 1, 'table_id': '1-1000181-1', 'question': 'Tell me what the notes are for South Australia ', 'sql': {'sel': 5, 'conds': [[3, 0, 'SOUTH AUSTRALIA']], 'agg': 0}}

and following is the header for the particular table
["State/territory", "Text/background colour", "Format", "Current slogan", "Current series", "Notes"]

Intuitively it should choose State/territory column in the where clause

Following is the translated query
SELECT Notes AS result FROM table_1_1000181_1 WHERE Current slogan = 'south australia';
tell me what the notes are for south australia

How to evaluate a Seq2Seq model ?

Hi,

I have built a seq2seq model that takes the question and generates the SQL query directly.

My question is how to evaluate my model since the predictions are in sequence format ("select .... from ... where...")?

It's very urgent, please.

Thanks

unable to load library CoreNLPClient

Hello Team,

While executing the git code for annotate.py i am getting below error for CoreNLPClient

annotating data/train.jsonl
loading tables
100% 18585/18585 [00:00<00:00, 19025.85it/s]
loading examples
0% 0/56355 [00:00<?, ?it/s]
Traceback (most recent call last):
File "annotate.py", line 113, in
a = annotate_example(d, tables[d['table_id']])
File "annotate.py", line 38, in annotate_example
ann['question'] = annotate(example['question'])
File "annotate.py", line 20, in annotate
client = CoreNLPClient(default_annotators='ssplit,tokenize'.split(','))
NameError: name 'CoreNLPClient' is not defined

Please advise

Not sure about NSM

The paper of neural-symbolic-machines says:

[screenshot]

So I am not sure whether NSM uses the logical form for training.

[screenshot]

Question Generation using a Template

Hi Team,

Could you please let me know about the question generation template that was used to generate questions before the paraphrasing phase? Is the generation template/code part of this codebase?

How do I begin to use this on my own database?

Hi,

I'm no NLP expert, but I would like to understand how I can apply this library and its techniques to my own data. Is there a hello-world example of how to modify the data files so I can use NL queries over my own data?

Cannot operate on a closed database

When running evaluate.py as in the Dockerfile, there will be an error:

Traceback (most recent call last):
  File "bug1.py", line 5, in <module>
    print(db.query('SELECT sql from sqlite_master WHERE tbl_name = :name', name='table_1_10015132_11').first())
  File "/home/zgzhen/projects/cse517-project/eval/seq2sql/WikiSQL/.venv/lib/python3.6/site-packages/records.py", line 214, in first
    record = self[0]
  File "/home/zgzhen/projects/cse517-project/eval/seq2sql/WikiSQL/.venv/lib/python3.6/site-packages/records.py", line 152, in __getitem__
    next(self)
  File "/home/zgzhen/projects/cse517-project/eval/seq2sql/WikiSQL/.venv/lib/python3.6/site-packages/records.py", line 136, in __next__
    nextrow = next(self._rows)
  File "/home/zgzhen/projects/cse517-project/eval/seq2sql/WikiSQL/.venv/lib/python3.6/site-packages/records.py", line 365, in <genexpr>
    row_gen = (Record(cursor.keys(), row) for row in cursor)
  File "/home/zgzhen/projects/cse517-project/eval/seq2sql/WikiSQL/.venv/lib/python3.6/site-packages/sqlalchemy/engine/result.py", line 946, in __iter__
    row = self.fetchone()
  File "/home/zgzhen/projects/cse517-project/eval/seq2sql/WikiSQL/.venv/lib/python3.6/site-packages/sqlalchemy/engine/result.py", line 1276, in fetchone
    e, None, None, self.cursor, self.context
  File "/home/zgzhen/projects/cse517-project/eval/seq2sql/WikiSQL/.venv/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1458, in _handle_dbapi_exception
    util.raise_from_cause(sqlalchemy_exception, exc_info)
  File "/home/zgzhen/projects/cse517-project/eval/seq2sql/WikiSQL/.venv/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 296, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/home/zgzhen/projects/cse517-project/eval/seq2sql/WikiSQL/.venv/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 276, in reraise
    raise value.with_traceback(tb)
  File "/home/zgzhen/projects/cse517-project/eval/seq2sql/WikiSQL/.venv/lib/python3.6/site-packages/sqlalchemy/engine/result.py", line 1268, in fetchone
    row = self._fetchone_impl()
  File "/home/zgzhen/projects/cse517-project/eval/seq2sql/WikiSQL/.venv/lib/python3.6/site-packages/sqlalchemy/engine/result.py", line 1148, in _fetchone_impl
    return self.cursor.fetchone()
sqlalchemy.exc.ProgrammingError: (sqlite3.ProgrammingError) Cannot operate on a closed database. (Background on this error at: http://sqlalche.me/e/f405)

Looks like this issue is related: kennethreitz/records#128

\mathrm in question

Hi @vzhong

I think I found some unusual questions in train.jsonl (see the screenshot below) which contain lots of \mathrm.

[screenshot]

In my humble opinion, as there are only a few of them (6 questions), could they be removed or converted to normal text form (rather than LaTeX form)?

Thanks!

Data collection template

Hi @vzhong ,
I've been testing some models on your data, and now I would like to create my own data following your format. In your paper you said that you made available examples of the interface used during the paraphrase phase. Where could I find the template you used for your data collection?

Thanks

How to get exact sql query for the natural language question ?

HI All,

I want to generate a custom JSON file which has the SQL query for each natural language question.
I am unable to install Docker to execute annotate.py and query.py to get the SQL queries, as I have Windows 10 Home and Docker installation needs Windows 10 Pro.
Can you please suggest how I can get it without Docker?
Or can you share the file of SQL queries for WikiSQL if you have already generated it?

Thanks
Anshu

AttributeError: 'NoneType' object has no attribute 'number_symbols'

When I run evaluate.py on the example pred.dev.jsonl, I get this error. What should I do?
C:\Users\Admin\anaconda3\python.exe C:/Users/Admin/Desktop/WikiSQL-master/evaluate.py
30%|██▉ | 2517/8421 [00:04<00:10, 562.58it/s]
Traceback (most recent call last):
File "C:/Users/Admin/Desktop/WikiSQL-master/evaluate.py", line 29, in
gold = engine.execute_query(eg['table_id'], qg, lower=True)
File "C:\Users\Admin\Desktop\WikiSQL-master\lib\dbengine.py", line 18, in execute_query
return self.execute(table_id, query.sel_index, query.agg_index, query.conditions, *args, **kwargs)
File "C:\Users\Admin\Desktop\WikiSQL-master\lib\dbengine.py", line 40, in execute
val = float(parse_decimal(val))
File "C:\Users\Admin\anaconda3\lib\site-packages\babel\numbers.py", line 707, in parse_decimal
group_symbol = get_group_symbol(locale)
File "C:\Users\Admin\anaconda3\lib\site-packages\babel\numbers.py", line 332, in get_group_symbol
return Locale.parse(locale).number_symbols.get('group', u',')
AttributeError: 'NoneType' object has no attribute 'number_symbols'

.travis.yml missing

While submitting a pull request, the Travis CI bot failed, saying that the ".travis.yml file could not be found". It seems to be a configuration file for the Travis continuous integration service. I tried to locate it in the repo but couldn't find it, because of which the pull request doesn't pass the built-in checks.

Leaderboard for weakly supervised WikiSQL task

Since the results on the supervised learning benchmark are quite close to saturated, I think a leaderboard for models trained using only weak supervision would be a more relevant benchmark (for example, Memory Augmented Program Synthesis from Liang et al. beats some of the old entries in the strongly supervised leaderboard using only weak supervision).

Phase 1 vs. phase 2

Hi,
I am confused by phase 1 and phase 2 annotations in the dataset files.
The paper says phase 1 is a paraphrasing phase while phase 2 is a verification phase. As far as I understand, phase 2 is just about discarding wrong paraphrases. So what do you mean by a given example was collected in phase 1 vs. phase 2?
Thanks,

--Ahmed

Invalid File Names while cloning the GitHub repo

On cloning this repository using the command git clone https://github.com/salesforce/WikiSQL
Git is able to download the repository but is not able to extract all the files. An error is encountered as follows:

Cloning into 'WikiSQL'...
remote: Enumerating objects: 386, done.
remote: Counting objects: 100% (192/192), done.
remote: Compressing objects: 100% (38/38), done.
remote: Total 386 (delta 185), reused 154 (delta 154), pack-reused 194
Receiving objects: 100% (386/386), 50.72 MiB | 19.88 MiB/s, done.
Resolving deltas: 100% (212/212), done.
error: unable to create file collection/paraphrase/Icon?: Invalid argument
error: unable to create file collection/paraphrase/paraphrase_files/Icon?: Invalid argument
error: unable to create file collection/verify/Icon?: Invalid argument
error: unable to create file collection/verify/verify_files/Icon?: Invalid argument
fatal: unable to checkout working tree
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'

This is the output of running git status

On branch master
Your branch is up to date with 'origin/master'.

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
	deleted:    .dockerignore
	deleted:    .gitattributes
	deleted:    .gitignore
	deleted:    .travis.yml
	deleted:    CODEOWNERS
	deleted:    LICENSE
	deleted:    README.md
	deleted:    annotate.py
	deleted:    collection/README.md
	deleted:    "collection/paraphrase/Icon\r"
	deleted:    collection/paraphrase/index.html
	deleted:    "collection/paraphrase/paraphrase_files/Icon\r"
	deleted:    collection/paraphrase/paraphrase_files/bootstrap.min.css
	deleted:    collection/paraphrase/paraphrase_files/bootstrap.min.js
	deleted:    collection/paraphrase/paraphrase_files/jquery-3.2.1.min.js
	deleted:    collection/paraphrase/paraphrase_files/toastr.min.css
	deleted:    collection/paraphrase/paraphrase_files/toastr.min.js
	deleted:    "collection/verify/Icon\r"
	deleted:    collection/verify/verify.html
	deleted:    "collection/verify/verify_files/Icon\r"
	deleted:    collection/verify/verify_files/bootstrap.min.css
	deleted:    collection/verify/verify_files/bootstrap.min.js
	deleted:    collection/verify/verify_files/jquery-3.2.1.min.js
	deleted:    collection/verify/verify_files/toastr.min.css
	deleted:    collection/verify/verify_files/toastr.min.js
	deleted:    data.tar.bz2
	deleted:    evaluate.py
	deleted:    lib/__init__.py
	deleted:    lib/common.py
	deleted:    lib/dbengine.py
	deleted:    lib/query.py
	deleted:    lib/table.py
	deleted:    requirements.txt
	deleted:    test/Dockerfile
	deleted:    test/check.py
	deleted:    test/example.pred.dev.jsonl.bz2
	deleted:    version.txt

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	.dockerignore
	.gitattributes
	.gitignore
	.travis.yml
	CODEOWNERS
	LICENSE
	README.md
	annotate.py
	collection/
	data.tar.bz2
	evaluate.py
	lib/
	requirements.txt
	test/
	version.txt

And the output of running git restore --source=HEAD :/ as suggested:

error: unable to create file collection/paraphrase/Icon?: Invalid argument
error: unable to create file collection/paraphrase/paraphrase_files/Icon?: Invalid argument
error: unable to create file collection/verify/Icon?: Invalid argument
error: unable to create file collection/verify/verify_files/Icon?: Invalid argument

The issue seems to be with the filenames: according to the git status output above, they end in a carriage return ("Icon\r"), a control character that git renders as "Icon?" in the error messages and that the file system here apparently refuses to create (hence "Invalid argument").
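For anyone hitting the same wall, here is a small illustrative check (my own, not part of the repo) for path names containing control characters, such as the trailing carriage return in these Icon files:

```python
# Flag path names containing ASCII control characters (such as the
# trailing "\r" in "collection/paraphrase/Icon\r") that some file
# systems refuse to create.
def has_control_chars(path):
    return any(ord(c) < 32 for c in path)

paths = ['collection/paraphrase/Icon\r', 'collection/paraphrase/index.html']
bad = [p for p in paths if has_control_chars(p)]  # only the Icon file
```

Running this over a listing of tracked paths would identify which files to exclude before attempting a checkout.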

I attempted to resolve the issue by downloading the missing files directly from GitHub into the directories where they are supposed to be. But these files cannot be downloaded as-is through the web interface, and using wget fails as well.

As an alternative, I downloaded the master branch as a ZIP file and extracted it with the unzip WikiSQL-master.zip command. This works fine; in fact, even the offending files (such as collection/paraphrase/Icon) were extracted successfully, with no illegal characters in their names. So the problem seems to lie in how Git creates these files during checkout.

Can you provide the original data of WikiSQL?

At present, we are conducting a multimodal task evaluation. We would like to include WikiSQL, but the released data contains no image or layout information. Could you provide the original source data?

Thanks

[BUG] Building test/Dockerfile fails.

OS is Ubuntu x64

yummyyyy@yummyyyy-virtual-machine:~/公共的/WikiSQL$

sudo docker build -t wikisqltest -f test/Dockerfile .

Sending build context to Docker daemon 80.38MB
Step 1/8 : FROM python:3.6.2-alpine
---> 294201c0731f
Step 2/8 : RUN mkdir -p /eval
---> Using cache
---> 998bcbd64c08
Step 3/8 : WORKDIR /eval
---> Using cache
---> 94acbfabbb4f
Step 4/8 : ADD . /eval/
---> Using cache
---> 942024fe38cb
Step 5/8 : RUN pip install -r requirements.txt
---> Running in fbdf67763a18
Collecting tqdm (from -r requirements.txt (line 1))
Downloading https://files.pythonhosted.org/packages/9c/05/cf212f57daa0eb6106fa668a04d74d932e9881fd4a22f322ea1dadb5aba0/tqdm-4.62.2-py2.py3-none-any.whl (76kB)
Collecting records (from -r requirements.txt (line 2))
Downloading https://files.pythonhosted.org/packages/ef/93/2467c761ea3729713ab97842a46cc125ad09d14a0a174cb637bee4983911/records-0.5.3-py2.py3-none-any.whl
Collecting babel (from -r requirements.txt (line 3))
Downloading https://files.pythonhosted.org/packages/aa/96/4ba93c5f40459dc850d25f9ba93f869a623e77aaecc7a9344e19c01942cf/Babel-2.9.1-py2.py3-none-any.whl (8.8MB)
Collecting tabulate (from -r requirements.txt (line 4))
Downloading https://files.pythonhosted.org/packages/ca/80/7c0cad11bd99985cfe7c09427ee0b4f9bd6b048bd13d4ffb32c6db237dfb/tabulate-0.8.9-py3-none-any.whl
Collecting openpyxl<2.5.0 (from records->-r requirements.txt (line 2))
Downloading https://files.pythonhosted.org/packages/77/26/0bd1a39776f53b4f28e5bb1d26b3fcd99068584a7e1ddca4e09c0d5fd592/openpyxl-2.4.11.tar.gz (158kB)
Collecting SQLAlchemy; python_version >= "3.0" (from records->-r requirements.txt (line 2))
Downloading https://files.pythonhosted.org/packages/ad/c7/61ff52be84f5ac86c72672ceac941981f1685b4ef90391d405a1f89677d0/SQLAlchemy-1.4.23.tar.gz (7.7MB)
Collecting docopt (from records->-r requirements.txt (line 2))
Downloading https://files.pythonhosted.org/packages/a2/55/8f8cab2afd404cf578136ef2cc5dfb50baa1761b68c9da1fb1e4eed343c9/docopt-0.6.2.tar.gz
Collecting tablib>=0.11.4 (from records->-r requirements.txt (line 2))
Downloading https://files.pythonhosted.org/packages/16/85/078fc037b15aa1120d6a0287ec9d092d93d632ab01a0e7a3e69b4733da5e/tablib-3.0.0-py3-none-any.whl (47kB)
Collecting pytz>=2015.7 (from babel->-r requirements.txt (line 3))
Downloading https://files.pythonhosted.org/packages/70/94/784178ca5dd892a98f113cdd923372024dc04b8d40abe77ca76b5fb90ca6/pytz-2021.1-py2.py3-none-any.whl (510kB)
Collecting jdcal (from openpyxl<2.5.0->records->-r requirements.txt (line 2))
Downloading https://files.pythonhosted.org/packages/f0/da/572cbc0bc582390480bbd7c4e93d14dc46079778ed915b505dc494b37c57/jdcal-1.4.1-py2.py3-none-any.whl
Collecting et_xmlfile (from openpyxl<2.5.0->records->-r requirements.txt (line 2))
Downloading https://files.pythonhosted.org/packages/96/c2/3dd434b0108730014f1b96fd286040dc3bcb70066346f7e01ec2ac95865f/et_xmlfile-1.1.0-py3-none-any.whl
Collecting importlib-metadata (from SQLAlchemy; python_version >= "3.0"->records->-r requirements.txt (line 2))
Downloading https://files.pythonhosted.org/packages/71/c2/cb1855f0b2a0ae9ccc9b69f150a7aebd4a8d815bd951e74621c4154c52a8/importlib_metadata-4.8.1-py3-none-any.whl
Collecting greenlet!=0.4.17 (from SQLAlchemy; python_version >= "3.0"->records->-r requirements.txt (line 2))
Downloading https://files.pythonhosted.org/packages/72/7e/d8586068d47adba73afc085249712bd266cd7ffbf27d8bc260c33e9d6133/greenlet-1.1.1.tar.gz (85kB)
Collecting zipp>=0.5 (from importlib-metadata->SQLAlchemy; python_version >= "3.0"->records->-r requirements.txt (line 2))
Downloading https://files.pythonhosted.org/packages/92/d9/89f433969fb8dc5b9cbdd4b4deb587720ec1aeb59a020cf15002b9593eef/zipp-3.5.0-py3-none-any.whl
Collecting typing-extensions>=3.6.4; python_version < "3.8" (from importlib-metadata->SQLAlchemy; python_version >= "3.0"->records->-r requirements.txt (line 2))
Downloading https://files.pythonhosted.org/packages/74/60/18783336cc7fcdd95dae91d73477830aa53f5d3181ae4fe20491d7fc3199/typing_extensions-3.10.0.2-py3-none-any.whl
Building wheels for collected packages: openpyxl, SQLAlchemy, docopt, greenlet
Running setup.py bdist_wheel for openpyxl: started
Running setup.py bdist_wheel for openpyxl: finished with status 'done'
Stored in directory: /root/.cache/pip/wheels/59/44/27/63b211425501ad51d197ff8ed00e9e469e38b9e516cb69b1c2
Running setup.py bdist_wheel for SQLAlchemy: started
Running setup.py bdist_wheel for SQLAlchemy: finished with status 'done'
Stored in directory: /root/.cache/pip/wheels/7d/52/1c/117179bb38418ab4e06deb5c8288acd8ee1e0b418f5e59608f
Running setup.py bdist_wheel for docopt: started
Running setup.py bdist_wheel for docopt: finished with status 'done'
Stored in directory: /root/.cache/pip/wheels/9b/04/dd/7daf4150b6d9b12949298737de9431a324d4b797ffd63f526e
Running setup.py bdist_wheel for greenlet: started
Running setup.py bdist_wheel for greenlet: finished with status 'error'
Complete output from command /usr/local/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-e0uqasn_/greenlet/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/tmpug655ghzpip-wheel- --python-tag cp36:
/usr/local/lib/python3.6/distutils/dist.py:261: UserWarning: Unknown distribution option: 'project_urls'
warnings.warn(msg)
running bdist_wheel
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.6
creating build/lib.linux-x86_64-3.6/greenlet
copying src/greenlet/__init__.py -> build/lib.linux-x86_64-3.6/greenlet
creating build/lib.linux-x86_64-3.6/greenlet/tests
copying src/greenlet/tests/test_generator_nested.py -> build/lib.linux-x86_64-3.6/greenlet/tests
copying src/greenlet/tests/test_stack_saved.py -> build/lib.linux-x86_64-3.6/greenlet/tests
copying src/greenlet/tests/test_leaks.py -> build/lib.linux-x86_64-3.6/greenlet/tests
copying src/greenlet/tests/test_version.py -> build/lib.linux-x86_64-3.6/greenlet/tests
copying src/greenlet/tests/test_gc.py -> build/lib.linux-x86_64-3.6/greenlet/tests
copying src/greenlet/tests/test_cpp.py -> build/lib.linux-x86_64-3.6/greenlet/tests
copying src/greenlet/tests/test_tracing.py -> build/lib.linux-x86_64-3.6/greenlet/tests
copying src/greenlet/tests/test_weakref.py -> build/lib.linux-x86_64-3.6/greenlet/tests
copying src/greenlet/tests/test_extension_interface.py -> build/lib.linux-x86_64-3.6/greenlet/tests
copying src/greenlet/tests/test_greenlet.py -> build/lib.linux-x86_64-3.6/greenlet/tests
copying src/greenlet/tests/test_contextvars.py -> build/lib.linux-x86_64-3.6/greenlet/tests
copying src/greenlet/tests/test_throw.py -> build/lib.linux-x86_64-3.6/greenlet/tests
copying src/greenlet/tests/test_generator.py -> build/lib.linux-x86_64-3.6/greenlet/tests
copying src/greenlet/tests/__init__.py -> build/lib.linux-x86_64-3.6/greenlet/tests
running egg_info
writing src/greenlet.egg-info/PKG-INFO
writing dependency_links to src/greenlet.egg-info/dependency_links.txt
writing requirements to src/greenlet.egg-info/requires.txt
writing top-level names to src/greenlet.egg-info/top_level.txt
reading manifest file 'src/greenlet.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
no previously-included directories found matching 'docs/_build'
warning: no files found matching '*.py' under directory 'appveyor'
warning: no previously-included files matching '*.pyc' found anywhere in distribution
warning: no previously-included files matching '*.pyd' found anywhere in distribution
warning: no previously-included files matching '*.so' found anywhere in distribution
warning: no previously-included files matching '.coverage' found anywhere in distribution
writing manifest file 'src/greenlet.egg-info/SOURCES.txt'
copying src/greenlet/greenlet.c -> build/lib.linux-x86_64-3.6/greenlet
copying src/greenlet/greenlet.h -> build/lib.linux-x86_64-3.6/greenlet
copying src/greenlet/slp_platformselect.h -> build/lib.linux-x86_64-3.6/greenlet
creating build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/setup_switch_x64_masm.cmd -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_aarch64_gcc.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_alpha_unix.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_amd64_unix.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_arm32_gcc.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_arm32_ios.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_csky_gcc.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_m68k_gcc.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_mips_unix.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_ppc64_aix.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_ppc64_linux.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_ppc_aix.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_ppc_linux.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_ppc_macosx.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_ppc_unix.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_riscv_unix.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_s390_unix.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_sparc_sun_gcc.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_x32_unix.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_x64_masm.asm -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_x64_masm.obj -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_x64_msvc.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_x86_msvc.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_x86_unix.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/tests/_test_extension.c -> build/lib.linux-x86_64-3.6/greenlet/tests
copying src/greenlet/tests/_test_extension_cpp.cpp -> build/lib.linux-x86_64-3.6/greenlet/tests
running build_ext
building 'greenlet._greenlet' extension
creating build/temp.linux-x86_64-3.6
creating build/temp.linux-x86_64-3.6/src
creating build/temp.linux-x86_64-3.6/src/greenlet
gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/usr/local/include/python3.6m -c src/greenlet/greenlet.c -o build/temp.linux-x86_64-3.6/src/greenlet/greenlet.o
unable to execute 'gcc': No such file or directory
error: command 'gcc' failed with exit status 1


Failed building wheel for greenlet
Running setup.py clean for greenlet
Successfully built openpyxl SQLAlchemy docopt
Failed to build greenlet
Installing collected packages: tqdm, jdcal, et-xmlfile, openpyxl, zipp, typing-extensions, importlib-metadata, greenlet, SQLAlchemy, docopt, tablib, records, pytz, babel, tabulate
Running setup.py install for greenlet: started
Running setup.py install for greenlet: finished with status 'error'
Complete output from command /usr/local/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-e0uqasn_/greenlet/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-tmdjgba9-record/install-record.txt --single-version-externally-managed --compile:
/usr/local/lib/python3.6/distutils/dist.py:261: UserWarning: Unknown distribution option: 'project_urls'
warnings.warn(msg)
running install
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.6
creating build/lib.linux-x86_64-3.6/greenlet
copying src/greenlet/__init__.py -> build/lib.linux-x86_64-3.6/greenlet
creating build/lib.linux-x86_64-3.6/greenlet/tests
copying src/greenlet/tests/test_generator_nested.py -> build/lib.linux-x86_64-3.6/greenlet/tests
copying src/greenlet/tests/test_stack_saved.py -> build/lib.linux-x86_64-3.6/greenlet/tests
copying src/greenlet/tests/test_leaks.py -> build/lib.linux-x86_64-3.6/greenlet/tests
copying src/greenlet/tests/test_version.py -> build/lib.linux-x86_64-3.6/greenlet/tests
copying src/greenlet/tests/test_gc.py -> build/lib.linux-x86_64-3.6/greenlet/tests
copying src/greenlet/tests/test_cpp.py -> build/lib.linux-x86_64-3.6/greenlet/tests
copying src/greenlet/tests/test_tracing.py -> build/lib.linux-x86_64-3.6/greenlet/tests
copying src/greenlet/tests/test_weakref.py -> build/lib.linux-x86_64-3.6/greenlet/tests
copying src/greenlet/tests/test_extension_interface.py -> build/lib.linux-x86_64-3.6/greenlet/tests
copying src/greenlet/tests/test_greenlet.py -> build/lib.linux-x86_64-3.6/greenlet/tests
copying src/greenlet/tests/test_contextvars.py -> build/lib.linux-x86_64-3.6/greenlet/tests
copying src/greenlet/tests/test_throw.py -> build/lib.linux-x86_64-3.6/greenlet/tests
copying src/greenlet/tests/test_generator.py -> build/lib.linux-x86_64-3.6/greenlet/tests
copying src/greenlet/tests/__init__.py -> build/lib.linux-x86_64-3.6/greenlet/tests
running egg_info
writing src/greenlet.egg-info/PKG-INFO
writing dependency_links to src/greenlet.egg-info/dependency_links.txt
writing requirements to src/greenlet.egg-info/requires.txt
writing top-level names to src/greenlet.egg-info/top_level.txt
reading manifest file 'src/greenlet.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
no previously-included directories found matching 'docs/_build'
warning: no files found matching '*.py' under directory 'appveyor'
warning: no previously-included files matching '*.pyc' found anywhere in distribution
warning: no previously-included files matching '*.pyd' found anywhere in distribution
warning: no previously-included files matching '*.so' found anywhere in distribution
warning: no previously-included files matching '.coverage' found anywhere in distribution
writing manifest file 'src/greenlet.egg-info/SOURCES.txt'
copying src/greenlet/greenlet.c -> build/lib.linux-x86_64-3.6/greenlet
copying src/greenlet/greenlet.h -> build/lib.linux-x86_64-3.6/greenlet
copying src/greenlet/slp_platformselect.h -> build/lib.linux-x86_64-3.6/greenlet
creating build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/setup_switch_x64_masm.cmd -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_aarch64_gcc.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_alpha_unix.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_amd64_unix.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_arm32_gcc.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_arm32_ios.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_csky_gcc.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_m68k_gcc.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_mips_unix.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_ppc64_aix.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_ppc64_linux.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_ppc_aix.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_ppc_linux.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_ppc_macosx.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_ppc_unix.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_riscv_unix.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_s390_unix.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_sparc_sun_gcc.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_x32_unix.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_x64_masm.asm -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_x64_masm.obj -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_x64_msvc.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_x86_msvc.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/platform/switch_x86_unix.h -> build/lib.linux-x86_64-3.6/greenlet/platform
copying src/greenlet/tests/_test_extension.c -> build/lib.linux-x86_64-3.6/greenlet/tests
copying src/greenlet/tests/_test_extension_cpp.cpp -> build/lib.linux-x86_64-3.6/greenlet/tests
running build_ext
building 'greenlet._greenlet' extension
creating build/temp.linux-x86_64-3.6
creating build/temp.linux-x86_64-3.6/src
creating build/temp.linux-x86_64-3.6/src/greenlet
gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/usr/local/include/python3.6m -c src/greenlet/greenlet.c -o build/temp.linux-x86_64-3.6/src/greenlet/greenlet.o
unable to execute 'gcc': No such file or directory
error: command 'gcc' failed with exit status 1

----------------------------------------

Command "/usr/local/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-e0uqasn_/greenlet/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-tmdjgba9-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-e0uqasn_/greenlet/
You are using pip version 9.0.1, however version 21.2.4 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
The command '/bin/sh -c pip install -r requirements.txt' returned a non-zero code: 1
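The root cause is visible near the bottom of the log: the python:3.6.2-alpine base image ships without a C compiler, so pip cannot build the greenlet extension ("unable to execute 'gcc': No such file or directory"). One possible fix, which I have not tested against this repo, is to install the build tools before the pip step in test/Dockerfile:

```dockerfile
# Untested suggestion: Alpine images need gcc and the musl headers
# to compile C extensions such as greenlet.
RUN apk add --no-cache gcc musl-dev
```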

How to differentiate 'AND' vs 'OR'

  • In the JSON, the "conds" field of a query is just an array of conditions separated by commas.
    So what signals whether those conditions should be combined with "AND" or "OR"?
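As far as I can tell from lib/query.py in this repository, the WikiSQL query language has no OR at all: every condition in "conds" is implicitly joined with AND, which is why no connective is stored. Here is a minimal sketch (mine, not the repo's exact code) of how a query dict expands to SQL; the operator tables mirror Query.agg_ops and Query.cond_ops in lib/query.py:

```python
# Sketch: expand a WikiSQL query dict into SQL. All conditions are
# joined with AND; the dataset's grammar has no OR.
AGG_OPS = ['', 'MAX', 'MIN', 'COUNT', 'SUM', 'AVG']  # mirrors Query.agg_ops
COND_OPS = ['=', '>', '<', 'OP']                     # mirrors Query.cond_ops

def query_to_sql(query, headers):
    sel = headers[query['sel']]
    agg = AGG_OPS[query['agg']]
    select = '{}({})'.format(agg, sel) if agg else sel
    where = ' AND '.join(
        '{} {} {!r}'.format(headers[col], COND_OPS[op], val)
        for col, op, val in query['conds'])
    sql = 'SELECT {} FROM table'.format(select)
    if where:
        sql += ' WHERE ' + where
    return sql
```

So a query like {'sel': 0, 'agg': 3, 'conds': [[1, 0, 'x'], [2, 1, 'y']]} over headers ['a', 'b', 'c'] expands to counting column a where b = 'x' AND c > 'y'.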

annotate.py throws exception: query word '.' is not in input vocabulary.

query word "." is not in input vocabulary.
['symsyms', 'symselect', 'symwhere', 'symand', 'symcol', 'symtable', 'symcaption', 'sympage', 'symsection', 'symop', 'symcond', 'symquestion', 'symagg', 'symaggops', 'symcondops', 'symaggops', 'max', 'min', 'count', 'sum', 'avg', 'symcondops', '=', '>', '<', 'op', 'symtable', 'symcol', 'species', 'symcol', 'indole', 'symcol', 'methyl', 'red', 'symcol', 'voges-proskauer', 'symcol', 'citrate', 'symquestion', 'what', 'is', 'the', 'result', 'for', 'salmonella', 'spp.', 'if', 'you', 'use', 'citrate', '?', 'symend']
Traceback (most recent call last):
File "annotate.py", line 119, in
raise Exception(str(a))
Exception: {'table_id': '1-16083989-1', 'question': {'gloss': ['What', 'is', 'the', 'result', 'for', 'salmonella', 'spp.', 'if', 'you', 'use', 'citrate', '?'], 'words': ['what', 'is', 'the', 'result', 'for', 'salmonella', 'spp.', 'if', 'you', 'use', 'citrate', '?'], 'after': [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '', '']}, 'table': {'header': [{'gloss': ['Species'], 'words': ['species'], 'after': ['']}, {'gloss': ['Indole'], 'words': ['indole'], 'after': ['']}, {'gloss': ['Methyl', 'Red'], 'words': ['methyl', 'red'], 'after': [' ', '']}, {'gloss': ['Voges-Proskauer'], 'words': ['voges-proskauer'], 'after': ['']}, {'gloss': ['Citrate'], 'words': ['citrate'], 'after': ['']}]}, 'query': {'sel': 4, 'conds': [[0, 0, {'gloss': ['Salmonella', 'spp', '.'], 'words': ['salmonella', 'spp.', '.'], 'after': [' ', '', '']}]], 'agg': 3}, 'seq_input': {'gloss': ['SYMSYMS', 'SYMSELECT', 'SYMWHERE', 'SYMAND', 'SYMCOL', 'SYMTABLE', 'SYMCAPTION', 'SYMPAGE', 'SYMSECTION', 'SYMOP', 'SYMCOND', 'SYMQUESTION', 'SYMAGG', 'SYMAGGOPS', 'SYMCONDOPS', 'SYMAGGOPS', 'MAX', 'MIN', 'COUNT', 'SUM', 'AVG', 'SYMCONDOPS', '=', '>', '<', 'OP', 'SYMTABLE', 'SYMCOL', 'Species', 'SYMCOL', 'Indole', 'SYMCOL', 'Methyl', 'Red', 'SYMCOL', 'Voges-Proskauer', 'SYMCOL', 'Citrate', 'SYMQUESTION', 'What', 'is', 'the', 'result', 'for', 'salmonella', 'spp.', 'if', 'you', 'use', 'citrate', '?', 'SYMEND'], 'words': ['symsyms', 'symselect', 'symwhere', 'symand', 'symcol', 'symtable', 'symcaption', 'sympage', 'symsection', 'symop', 'symcond', 'symquestion', 'symagg', 'symaggops', 'symcondops', 'symaggops', 'max', 'min', 'count', 'sum', 'avg', 'symcondops', '=', '>', '<', 'op', 'symtable', 'symcol', 'species', 'symcol', 'indole', 'symcol', 'methyl', 'red', 'symcol', 'voges-proskauer', 'symcol', 'citrate', 'symquestion', 'what', 'is', 'the', 'result', 'for', 'salmonella', 'spp.', 'if', 'you', 'use', 'citrate', '?', 'symend'], 'after': [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' 
', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '', ' ', '']}, 'seq_output': {'gloss': ['SYMSELECT', 'SYMAGG', 'COUNT', 'SYMCOL', 'Citrate', 'SYMWHERE', 'SYMCOL', 'Species', 'SYMOP', '=', 'SYMCOND', 'Salmonella', 'spp', '.', 'SYMEND'], 'words': ['symselect', 'symagg', 'count', 'symcol', 'citrate', 'symwhere', 'symcol', 'species', 'symop', '=', 'symcond', 'salmonella', 'spp.', '.', 'symend'], 'after': [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '', ' ', '']}, 'where_output': {'gloss': ['SYMWHERE', 'SYMCOL', 'Species', 'SYMOP', '=', 'SYMCOND', 'Salmonella', 'spp', '.', 'SYMEND'], 'words': ['symwhere', 'symcol', 'species', 'symop', '=', 'symcond', 'salmonella', 'spp.', '.', 'symend'], 'after': [' ', ' ', ' ', ' ', ' ', ' ', ' ', '', ' ', '']}}
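For context, the check that raises here appears to be annotate.py's is_valid_example: because Seq2SQL produces the query by copying tokens from the input sequence, every output token must occur somewhere in the input. In this example the condition value "Salmonella spp." is tokenized into "spp." plus a stray ".", and "." never appears in the input, so the example is rejected. A minimal sketch of the invariant (not the repo's exact code):

```python
# Sketch of the copy-mechanism invariant annotate.py enforces:
# every token of the output sequence must appear in the input.
def is_valid_example(input_words, output_words):
    vocab = set(input_words)
    return all(w in vocab for w in output_words)

# The stray '.' produced by tokenizing "spp." makes this fail:
ok = is_valid_example(['spp.', 'citrate', '?'], ['citrate', 'spp.', '.'])  # False
```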

Error while running evaluate.py

I tried running evaluate.py with dev data.
The command I have given in cmd:
python evaluate.py data\dev.jsonl data\dev.db test\example.pred.dev.jsonl

I'm getting this error-

[screenshot of the error message, not reproduced here]

Can someone help where I'm going wrong?

Regards,

Slight improvement:

In the installation notes, the instruction for cloning the repository says to use the git clone url command. The URL, however, lacks the .git extension, because of which the git-lfs checkout hook does not get executed. It took me some time to realise why I was not able to extract the data: I kept getting the error "(stdin) is not a bzip2 file".
Let me know if you need any more details about this problem.

query word "symend" is not in input vocabulary

Thanks for sharing the dataset and preprocessing scripts.

I tried to run the following command to generate examples:

python annotate.py 

But the error message indicates that is_valid_example() returns False because of "symend":

annotating data/train.jsonl
loading tables
100%|██████████████████████████████████████████████████████████| 18585/18585 [00:01<00:00, 11703.27it/s]
loading examples
  0%|                                                                         | 0/61297 [00:00<?, ?it/s]query word "symend" is not in input vocabulary.
[u'symsyms', u'symselect', u'symwhere', u'symand', u'symcol', u'symtable', u'symcaption', u'sympage', u'symsection', u'symop', u'symcond', u'symquestion', u'symagg', u'symaggops', u'symcondops', u'symaggops', u'max', u'min', u'count', u'sum', u'avg', u'symcondops', u'=', u'>', u'<', u'op', u'symtable', u'symcol', u'state/territory', u'symcol', u'text/background', u'colour', u'symcol', u'format', u'symcol', u'current', u'slogan', u'symcol', u'current', u'series', u'symcol', u'notes', u'symquestion', u'tell', u'me', u'what', u'the', u'notes', u'are', u'for', u'south', u'australia']

Traceback (most recent call last):
  File "annotate.py", line 114, in <module>
    raise Exception(str(a))
Exception: {'seq_output': {'gloss': [u'SYMSELECT', u'SYMAGG', u'SYMCOL', u'Notes', u'SYMWHERE', u'SYMCOL', u'Current', u'slogan', u'SYMOP', u'=', u'SYMCOND', u'SOUTH', u'AUSTRALIA', u'SYMEND'], 'words': [u'symselect', u'symagg', u'symcol', u'notes', u'symwhere', u'symcol', u'current', u'slogan', u'symop', u'=', u'symcond', u'south', u'australia', u'symend'], 'after': [u' ', u'  ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u'']}, 'where_output': {'gloss': [u'SYMWHERE', u'SYMCOL', u'Current', u'slogan', u'SYMOP', u'=', u'SYMCOND', u'SOUTH', u'AUSTRALIA', u'SYMEND'], 'words': [u'symwhere', u'symcol', u'current', u'slogan', u'symop', u'=', u'symcond', u'south', u'australia', u'symend'], 'after': [u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u'']}, 'question': {'gloss': [u'Tell', u'me', u'what', u'the', u'notes', u'are', u'for', u'South', u'Australia'], 'words': [u'tell', u'me', u'what', u'the', u'notes', u'are', u'for', u'south', u'australia'], 'after': [u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u'']}, 'table_id': u'1-1000181-1', 'table': {'header': [{'gloss': [u'State/territory'], 'words': [u'state/territory'], 'after': [u'']}, {'gloss': [u'Text/background', u'colour'], 'words': [u'text/background', u'colour'], 'after': [u' ', u'']}, {'gloss': [u'Format'], 'words': [u'format'], 'after': [u'']}, {'gloss': [u'Current', u'slogan'], 'words': [u'current', u'slogan'], 'after': [u' ', u'']}, {'gloss': [u'Current', u'series'], 'words': [u'current', u'series'], 'after': [u' ', u'']}, {'gloss': [u'Notes'], 'words': [u'notes'], 'after': [u'']}]}, 'query': {u'agg': 0, u'sel': 5, u'conds': [[3, 0, {'gloss': [u'SOUTH', u'AUSTRALIA'], 'words': [u'south', u'australia'], 'after': [u' ', u'']}]]}, 'seq_input': {'gloss': [u'SYMSYMS', u'SYMSELECT', u'SYMWHERE', u'SYMAND', u'SYMCOL', u'SYMTABLE', u'SYMCAPTION', u'SYMPAGE', u'SYMSECTION', u'SYMOP', u'SYMCOND', u'SYMQUESTION', u'SYMAGG', u'SYMAGGOPS', u'SYMCONDOPS', u'SYMAGGOPS', u'MAX', u'MIN', 
u'COUNT', u'SUM', u'AVG', u'SYMCONDOPS', u'=', u'>', u'<', u'OP', u'SYMTABLE', u'SYMCOL', u'State/territory', u'SYMCOL', u'Text/background', u'colour', u'SYMCOL', u'Format', u'SYMCOL', u'Current', u'slogan', u'SYMCOL', u'Current', u'series', u'SYMCOL', u'Notes', u'SYMQUESTION', u'Tell', u'me', u'what', u'the', u'notes', u'are', u'for', u'South', u'Australia'], 'words': [u'symsyms', u'symselect', u'symwhere', u'symand', u'symcol', u'symtable', u'symcaption', u'sympage', u'symsection', u'symop', u'symcond', u'symquestion', u'symagg', u'symaggops', u'symcondops', u'symaggops', u'max', u'min', u'count', u'sum', u'avg', u'symcondops', u'=', u'>', u'<', u'op', u'symtable', u'symcol', u'state/territory', u'symcol', u'text/background', u'colour', u'symcol', u'format', u'symcol', u'current', u'slogan', u'symcol', u'current', u'series', u'symcol', u'notes', u'symquestion', u'tell', u'me', u'what', u'the', u'notes', u'are', u'for', u'south', u'australia'], 'after': [u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u'  ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u'']}}

Higher LF accuracy on sample output.

Hi, I'm new to this dataset. I followed the README and ran the evaluation on example.pred.dev.jsonl. I got the following result:
{
"ex_accuracy": 0.5380596128725804,
"lf_accuracy": 0.45208407552547203
}

I'm not sure what is wrong. Can you give me some hint?
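For orientation, here is a sketch (mine, not the repo's exact code) of what the two numbers mean: lf_accuracy counts exact matches between predicted and gold logical forms, while ex_accuracy counts predictions whose execution against the table returns the same result as the gold query. Since distinct queries can execute to the same answer, ex_accuracy is typically the higher of the two:

```python
# Sketch of the two metrics evaluate.py reports. Each example pairs a
# (predicted query, gold query) with the result of executing each.
def accuracies(examples):
    n = len(examples)
    lf = sum(pred_q == gold_q for pred_q, gold_q, _, _ in examples) / n
    ex = sum(pred_r == gold_r for _, _, pred_r, gold_r in examples) / n
    return {'ex_accuracy': ex, 'lf_accuracy': lf}

# Two different queries can still return the same rows, so ex >= lf here.
toy = [('q1', 'q1', 'r1', 'r1'), ('q2', 'q2b', 'r2', 'r2')]
```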
