Coder Social home page Coder Social logo

chia-hsuan-lee / kaggledbqa Goto Github PK

View Code? Open in Web Editor NEW
25.0 4.0 4.0 301 KB

Introduction page of a challenging text-to-SQL dataset: KaggleDBQA

License: Other

dataset kaggle kaggle-dataset natural-language-processing question-answering semantic-parsing spider text-to-sql

kaggledbqa's Introduction

KaggleDBQA

Introduction and data download page of a challenging text-to-SQL dataset: KaggleDBQA.

Data | Evaluation | Paper | Citation | Leaderboard

KaggleDBQA is a challenging cross-domain and complex evaluation dataset of real Web databases, with domain-specific data types, original formatting, and unrestricted questions. It expands upon contemporary cross-domain text-to-SQL datasets in three key aspects:
(1) Its databases are pulled from real-world data sources and not normalized.
(2) Its questions are authored in environments that mimic natural question answering.
(3) It also provides database documentation which contain rich in-domain knowledge.

task_overview

Data Format

Natural Language questions and corresponding SQL queries

Each file in examples/ contains the following fields:

  • question: the natural language question
  • question_toks: the natural language question tokens
  • db_id: the database id to which this question is addressed.
  • query: the SQL query corresponding to the question.
  • query_toks: the SQL query tokens corresponding to the question.
  • sql: parsed results of this SQL query using process_sql.py from Spider
   {
        "db_id": "StudentMathScore",
        "query": "SELECT T1.school_district FROM FINREV_FED_17 as T1 JOIN FINREV_FED_KEY_17 as T2 ON T1.state_code = T2.state_code WHERE T2.state = \"Wisconsin\" ORDER BY T1.t_fed_rev DESC LIMIT 1",
        "query_toks": ["SELECT", "T1.school_district", "FROM", "FINREV_FED_17", ...],
        "question": "Which school district receive the most of federal revenue through state in Wisconsin?",
        "question_toks": ["Which", "school", "district", ...],
        "sql": {
            "from": {
                "table_units": [
                    [
                        "table_unit",
                        0
                    ],
                    [
                        "table_unit",
                        2
                    ]
                ],
        ...
        }
    },

Tables

KaggleDBQA_tables.json contains the following information for each database:

  • db_id: database id
  • table_names_original: original table names.
  • table_names: table names. The only preprocess steps are (1) tokenize names based on snake case and camel case. (2) lowercase.
  • column_names_original: original column names. Each column entry is given as [tableindex, name], where tableindex is the index of the table this column belongs to (in table_names), and name is the name of the column.
  • column_names: column names. The only preprocess steps are (1) tokenize names based on snake case and camel case. (2) lowercase.
  • column_names_manually_normalized_alternative: normalized column names for an alternative evaluation setting. This was done manually using column descriptions to help clarify each column and without introducing any extra knowledge into the column names except for the expansion of abbreviations (e.g. t_fed_rev -> total federal revenue). Please evaluate parsers in non-normalized setting by default, and use normalized column names for research exploration of model capabilities (as in our paper).
  • column_descriptions: textual descriptions that explain the meaning of the columns.
  • db_overview: textual descriptions that describe the general content in each DB.
  • value_enums: the set of possible values in the categorial columns and their meaning explanations.
  • column_types: data type of each column.
  • foreign_keys: foreign keys in the database. [3, 8] means column indices in the column_names. These two columns are foreign keys of two different tables.
  • primary_keys: primary keys in the database. Each number is the index of column_names.
    {
        "column_descriptions": [
        ...,
            "Total federal revenue through the state to each school district.",
            "Federal revenue through the state- Title 1 (no child left behind act).",
            "Federal revenue through the state- Child Nutrition A",
        ...
        ],
        "column_names": [
        ...,
            [
                0,
                "t fed rev"
            ],
            [
                0,
                "c14"
            ],
            [
                0,
                "c25"
            ],
        ...
        ],
        "column_names_normalized": [
        ...,
            [
                0,
                "total federal revenue"
            ],
            [
                0,
                "child 14"
            ],
            [
                0,
                "child 25"
            ],
        ...
        ],
        "column_names_original": [
        ...,
            [
                0,
                "t_fed_rev"
            ],
            [
                0,
                "c14"
            ],
            [
                0,
                "c25"
            ],
        ...
        ],
        "column_types": [
        ...,
            "number",
            "number",
            "number",
        ...
        ],
        "db_id": "StudentMathScore",
        "foreign_keys": [],
        "primary_keys": [],
        "table_names": [
            "finrev fed 17",
            "ndecoreexcel math grade8",
            "finrev fed key 17"
        ],
        "table_names_original": [
            "FINREV_FED_17",
            "NDECoreExcel_Math_Grade8",
            "FINREV_FED_KEY_17"
        ]
    },

Database Content

All the database content stored in SQLite database file under databases/. Please download the databases here.

Evaluation

KaggleDBQA is primarily used as a testing suite for realistic semantic parsing of text-to-SQL. We introduce two evaluation settings here (1) Plain-Testing (2) Few-Shot Annotation/Finetuning. We randomly split examples in each database into two subsets(30/70 ratio): finetune and test.

(1) Plain-Testing

Use the files examples/*_test to evaluate parsers.

(2) Few-Shot Annotation/Finetuning

You can use the in-domain examples contained in examples/*_fewshot to improve your model (e.g., by fine-tuning or other few-shot training methods) before testing on the examples in examples/*_test.

Citation and Contact

If you use the dataset in your work, please cite our paper.

@inproceedings{lee-2021-kaggle-dbqa,
    title = "{KaggleDBQA}: Realistic Evaluation of Text-to-{SQL} Parsers",
    author = "Lee, Chia-Hsuan  and
      Polozov, Oleksandr  and
      Richardson, Matthew",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.176",
    pages = "2261--2273"
}

Please contact Chia-Hsuan Lee (chiahlee[at]uw.edu) for questions and suggestions.

kaggledbqa's People

Contributors

chia-hsuan-lee avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

kaggledbqa's Issues

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.