ferrerojeremy / ln2sql Goto Github PK

View Code? Open in Web Editor NEW

508.0 508.0 200.0 11.1 MB

A tool to query a database in natural language

License: GNU General Public License v3.0

Python 99.93% Makefile 0.07%

database natural-language-processing natural-language-to-sql sql-statement

ln2sql's People

Contributors

Stargazers

Watchers

Forkers

jamsheer-thottathil mdlenin colinsongf yes mfarkhann alancheg sbairishal guptam sumeetpri sainideepak woniuhu maverickjoy wonkday jonike gurmukh90 sagarhowal vishaldhateria dkorolev puqiu liuhyzhy0909 sambitdixit shashank88 arfu2016 saady ddreamz prabhatcs asaforon123 singhvijay theparadoxer02 dashrath-chauhan zenghsh3 ramakrishnamamidi anand514 craftdata prabhjotsl sekhar14 satadru5 rhnvrm battbeach likey3 arvind-cp maxigu nighthaawk shreedev96 zouchl 0xharib prabhupant karthikapinipe zxsted revoltbi ramdharam michaelmsinclair lokeshlal yatwql kamalkraj anisayari meenakshiparyani ranidu shubhampachori12110095 yossivainshtein techkarthiks fendaq neerajaanil laura-ham manifold-labs enix12enix we1l1n lumosnysm shu2015626 pleiadex alapini sfahad1414 rijicp gsogard ashokkumarsundar uraniluib sashini-hettiarachchi itsshnik imohitagrwal shresthagrawal cuulee kr-amitsinha sajalcody rajhrikshit wanying abmitra84 vlevieux karthikrab aniketdsw salchem reiisky fubincom akhilkishore oszhugc pedrosiau idhruvc rohithrangaraju alaparthishivaji eddings vinayasathyanarayana

ln2sql's Issues

Limitations of this projects + Some example tests

If you've just arrived and don't want to spend too much time setting up and testing this project for your specific use case, take a look at the following example cases where it didn't work as expected.

Question	SQL Query
What is the id of the client with the username jdoe?	SELECT client.id FROM client WHERE client.username = 'OOV';
How many clients have made a reservation with a rating of 5?	SELECT COUNT(*) FROM client;
What is the name of the client who made the most reservations?	SELECT client.name FROM client;
How many clients have made a reservation at more than 1 hotel?	SELECT COUNT(*) FROM client;
How many reservations were made by clients whose mobile number contains the digits 555?	SELECT COUNT(*) FROM reservation;
What is the highest-rated hotel in the database?	SELECT * FROM hotel;
What is the lowest-rated hotel in the database?	SELECT * FROM hotel;
How many clients are over the age of 30?	SELECT COUNT(*) FROM client WHERE client.age = 'the';
What is the username of the oldest client in the database?	SELECT client.username FROM client;
What is the username of the youngest client in the database?	SELECT client.username FROM client;
How many reservations were made by clients under the age of 25?	SELECT COUNT(*) FROM reservation INNER JOIN client ON reservation.clientId = client.id WHERE client.age = OOV;

Few more with relations:

Question	SQL Query
What are the name of all the hotel where the client with id is 123 made reservation?	SELECT hotel.name FROM hotel WHERE hotel.id = ‘123’;
How many client have the same name as their username?	SELECT COUNT(*) FROM client WHERE client.name = OOV AND client.username = OOV;
What is the username of the client with the most reservation?	SELECT client.username FROM client;
What is the name of the hotel with the most reservation?	SELECT hotel.name FROM hotel;
How many client have made reservation on the same date as another client?	SELECT COUNT() FROM client; SELECT COUNT() FROM client;
What is the total number of reservation made at hotel with id is 777?	SELECT COUNT(*) FROM reservation WHERE reservation.id = ‘777’;
What are the names of all the hotel where the client with id is 999 made reservation with a rating of 4 or higher?	SELECT * FROM hotel WHERE hotel.id = ‘999’;
What is the age distribution of client who made reservation at hotel with id is 333?	SELECT client.age FROM client WHERE client.id = ‘333’;
What is the name of the client who made the highest-rated reservation?	SELECT client.name FROM client;
How many client have a username that contains the word “hotel”?	SELECT COUNT(*) FROM client WHERE client.username = OOV;
What is the id of the client with the highest-rated reservation at hotel with id is 222?	SELECT client.id FROM client WHERE MAX(client.id) = ‘222’;
How many reservation were made by client whose usernames contain the letter “a”?	SELECT COUNT(*) FROM reservation;
What is the average rating for reservation made by client over the age of 40?	SELECT AVG(reservation.rating) FROM reservation INNER JOIN client ON reservation.clientId = client.id WHERE client.age = ‘the’;

Credits: @harshadk-sourcefuse for testing these out.

Default column for table

Hello everyone,

Is there a way to fix a default column for a table when no column is specified.

For example, if I have a table "Bus", its default colum its "plate", so I can ask "Number of routes of bus 'BR1234'" instead of "Number of routes of bus with plate 'BR1234'".

Thanks to all people that worked on this amazin project.

Support for non MySQL common databases (eg. PostgreSQL)

Currently it works for MySQL dbs. It would be nice to make it work for postgres / Redshift / BigQuery.

Understanding parser

Hi everyone,
I'm facing difficulty in understanding how exactly key word parsing to map Natural language(NL) query words and database keywords is done.It would be greate if you can breifly explain me how exactly In2sql modue searches and maps words from NL query to database values like column names and cell values. Meanwhile I wish to know whether we can use a data structure like hash table for this problem as defined below

To create a map between columns and table
Columns | Table_names
Column1 | Table_name1
Column2 | Table_name2

So to check if any word from NLP querry is a column name just check whether word is a key of hash table defined above

And for column value or cell value create a hash table like
Values | Column_names
Value1 | Column1
Value2 | Column1
.
.
Valuen | Column_m
where there will be m columns and n values in total in database. So to check whether a word from NLP query is a cell value, check if the key exists in above hash table. Furthermore, in case of cell value repeating in more than one columns then use chaining concept for hash table's key.

Steps to use ln2sql

I am sorry, I am not getting how to use this tool.
Can anybody tell me the uses or steps from scratch so that I can understand and use this feature.

Use timestamp column in where clause

Unable to use timestamp, for example:
"what is minimum monoxide in pollution_data when time_stamp equals 2017-03-23"

my timestamp has both time and date.

Also tried using:
"what is minimum monoxide in pollution_data when extract(month from time_stamp) equals 3"

datatypes are:-
time_stamp timestamp with time zone,
monoxide double precision,

P.S.: not using real names of columns and table for privacy reasons

Start Up Error

There is an error while starting the project in python3

On running the file ln2sql.py

Primary key and Foreign key undparsed in ALTER TABLE

In tables.sql.zip, the following lines are ignored during the parsing.

ALTER TABLE `city`
  ADD PRIMARY KEY (`id`),
  ADD KEY `id` (`id`);

ALTER TABLE `emp`
  ADD PRIMARY KEY (`id`),
  ADD KEY `cityId` (`cityId`);

ALTER TABLE `emp`
  MODIFY `id` int(11) NOT NULL AUTO_INCREMENT, AUTO_INCREMENT=31;

ALTER TABLE `emp`
  ADD CONSTRAINT `emp_ibfk_1` FOREIGN KEY (`cityId`) REFERENCES `city` (`id`);

Thesaurus Support

I am having trouble getting the thesaurus support working. Here's a command I'm running in order to test the pupil -> student rule in th_english.dat:

python2.7 ln2sql.py -d database/school.sql -l lang/english.csv -j output.json -t thesaurus/th_english.dat -i "Count how many pupil are there with firstname is John"

Although it works for the word "student", it outputs this error with the word "pupil" (as shown above):

ParsingException.ParsingException: Error: No table name found in sentence!

Am I invoking it incorrectly? I'm guessing this is simply a configuration mistake on my part, but I can't figure out what's wrong with the above command.

Nonetype object Not Iterable on using thesaurus

Hi, I tried to use thesaurus, but I'm getting an issue saying "argument of type 'NoneType' is not iterable".
Could anyone help on this?
Thanks!

No table name found in sentence

Hi Team,

While running sample, "What is the average age of students whose name is Doe or age over 25?" it gives error "No table name found in sentence".

python -m ln2sql.main -d database_store/school.sql -l lang_store/english.csv -j output.json -i "What is the average age of students whose name is Doe or age over 25?"

It seems keyword "students" is not matching in database "school.sql".

Regards,
Dhaval

Parent module '' not loaded

When running gui this error came.
Traceback (most recent call last):
File "ln2sql/ln2sql_gui.py", line 6, in
from .ln2sql import Ln2sql
SystemError: Parent module '' not loaded, cannot perform relative import

Any Solution?

Multiple Group by

Unable to form query containing multiple Group by. Is it supported?
If yes, what should be the input sentence syntax?

Caveat Empetor: Postgresql

I think a huge sign should be placed in the readme file to inform users that this doesn't work with Postgres databases.
I spent hours trying to figure out why I was getting this traceback:
"AttributeError: 'NoneType' object has no attribute 'group'"

before going through the issues to discover that Postgres isn't supported.

COUNT keyword undetected

The following entry python ln2sql.py -d database/tal.sql -l lang/french.csv -i "combien y a t'il d'élève dont le nom est EMMAR ?" returns :

SELECT COUNT(*)
FROM eleve
WHERE eleve.nom = OOV;

but python ln2sql.py -d database/tal.sql -l lang/french.csv -i "Combien y a t'il d'élève dont le nom est EMMAR ?" returns :

SELECT *
FROM eleve
WHERE eleve.nom = OOV;

Provide options to use alias

Hi Team,

In normal world, we usually use business names while writing.
These names may or may not be same as technical name of columns in databases.

For e.g. "idStudent" is usually written as "student id"

Need to have feature to use alias for such colums.

Regards,
Dhaval

pip3 package doesn't work

When installing the package for python3 it installs successfully

$ pip3 install ln2sql
Collecting ln2sql
Installing collected packages: ln2sql
Successfully installed ln2sql-0.1

but module can't be loaded

$ python -m ln2sql
venv/bin/python: No module named ln2sq

and the package is empty

$ ls -a venv/lib/python3.7/site-packages/ln2sql-0.1.dist-info/
.		INSTALLER	RECORD		top_level.txt
..		METADATA	WHEEL

Is this expected to be working? If so how is it meant to be used?

The school.sql is not in the database

could you please put the school.sql in datasetbase ?

French paper

Is there an English version of the french scientific paper? I don't speak french but I want to read that paper.
thank you

JOIN statements

As mentioned in the documentation joins are supported but I can't seem to generate a SQL with JOIN. I have tried with a sql dump with 2 tables with proper foreign key and primary keys.

SQL dump : tables.sql.zip

Please provide an example natural language query for JOINs

need help

I have problems dealing with the between command. And also there is a problem when you use spaces in the equivalences of a word can some one help me

Convert to a package to make it pip installable

Currently the only way to run is to clone and execute. Will convert to a python package
https://python-packaging.readthedocs.io/en/latest/

The last punctuation character is kept in value extraction

A punctuation sign in end of sentence (like in english) can distort the detection of the last value of the WHERE.

This sentence:
Count how many city there are where the score is greater than 2?

give:

SELECT COUNT(*)
FROM city
INNER JOIN emp
ON emp.cityId = city.id
WHERE emp.score > '2?';

case sensitive issue how to solve it

#58 i am facing some problem to solve this issue when i run this code it accepts only exact table name for example we have table whose name is 'class' and when i write table name like (Class,CLASS,classes) it is not giving me the output although i am using [flags=re.IGNORRECASE] how to solve this issue please help me.

def create_table(self, table_string):
    lines = table_string.split("\n")
    table = Table()
    for line in lines:
        if 'TABLE' in line:
            table_name = re.search("`(\w+)`", line,flags=re.IGNORECASE)
            table.name = table_name.group(1)
            if self.thesaurus_object is not None:
                table.equivalences = self.thesaurus_object.get_synonyms_of_a_word(table.name)
        elif 'PRIMARY KEY' in line:
            primary_key_columns = re.findall("`(\w+)`", line,flags=re.IGNORECASE)
            for primary_key_column in primary_key_columns:
                table.add_primary_key(primary_key_column)
        else:
            column_name = re.search("`(\w+)`", line,flags=re.IGNORECASE)
            if column_name is not None:
                column_type = self.predict_type(line)
                if self.thesaurus_object is not None:
                    equivalences = self.thesaurus_object.get_synonyms_of_a_word(column_name.group(1))
                else:
                    equivalences = []
                table.add_column(column_name.group(1), column_type, equivalences)
    return table

def alter_table(self, alter_string):
    lines = alter_string.replace('\n', ' ').split(';')
    for line in lines:
        if 'PRIMARY KEY' in line:
            table_name = re.search("TABLE `(\w+)`", line,flags=re.IGNORECASE).group(1)
            table = self.get_table_by_name(table_name)
            primary_key_columns = re.findall("PRIMARY KEY \(`(\w+)`\)", line,flags=re.IGNORECASE)
            for primary_key_column in primary_key_columns:
                table.add_primary_key(primary_key_column)
        elif 'FOREIGN KEY' in line:
            table_name = re.search("TABLE `(\w+)`", line,flags=re.IGNORECASE).group(1)
            table = self.get_table_by_name(table_name)
            foreign_keys_list = re.findall("FOREIGN KEY \(`(\w+)`\) REFERENCES `(\w+)` \(`(\w+)`\)", line,flags=re.IGNORECASE)
            for column, foreign_table, foreign_column in foreign_keys_list:
                table.add_foreign_key(column, foreign_table, **foreign_column)**

Group tests into chunks

It's not a good practice to loop all different test cases in one tests. Grouping into categories like:
test_select()
test_select_where()

gives more context and allows adding tests more easily and detecting issues more easier. As I have already moved to pytest, just need to break it to chunks

Follow pep8 style guide

Break tests into sensible groups

It's not a good practice to loop all different test cases in one tests. Grouping into categories like:
test_select()
test_select_where()

gives more context and allows adding tests more easily and detecting issues more easier. As I have already moved to pytest, just need to break it to chunks

Make package more usable

This has multiple parts:

a) It should be easy to pip install to use.

i.e I just want to setup the sql file to use once (generally you don't have to change that once setup) and just query as:

query = Ln2sql.to_query(sentence)

b) Thinking about reasonable defaults. eg. english as default, specifying the db file once when initialising the class and then doing as in a)

how to use Thesaurus

Hi there
I did not understand where and how you are using Thesaurus in ln2sql . Plz suggest
Thanks in advance

Unable to use thesaurus file

Issue

I am unable to replicate the example from README but with -t thesaurus_store/th_english.dat flag (works fine without it):

python -m ln2sql.main -d database_store/city.sql -l lang_store/english.csv -t thesaurus_store/th_english.dat -i "Count how many city there are with the name blob?"

Error Message

Traceback (most recent call last):
  File "c:\users\user\appdata\local\continuum\anaconda3\Lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\user\appdata\local\continuum\anaconda3\Lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "c:\users\user\desktop\projects\natural_language_sql\ln2sql-master\ln2sql\main.py", line 26, in <module>
    main()
  File "c:\users\user\desktop\projects\natural_language_sql\ln2sql-master\ln2sql\main.py", line 23, in main
    ).get_query(args.sentence)
  File "c:\users\user\desktop\projects\natural_language_sql\ln2sql-master\ln2sql\ln2sql.py", line 44, in get_query
    queries = self.parser.parse_sentence(input_sentence, self.stopwordsFilter)
  File "c:\users\user\desktop\projects\natural_language_sql\ln2sql-master\ln2sql\parser.py", line 711, in parse_sentence
    input_word_list[i] in self.database_object.get_table_by_name(table_name).equivalences):
TypeError: argument of type 'NoneType' is not iterable

System info and versions

Python 3.6.1
Windows 10
ln2sql version: 0.2

Inconsistant generated SQL Statement

My question was : "what is name in emp with the highest score?"
and the answer was :

SELECT emp.name
FROM emp
WHERE MAX(emp.score) = OOV;

Seems that subqueries are not taking into account. Am I wrong ?

Python2 and 3 compatibility

When I moved the project to python3, I missed adding some backwards compatibility checks. Need to convert so it supports both python2.7+ and python3.5+

Join on pk and fk columns with different name unsupported

Exemple of correct declarations of pk and fk but unsupported by ln2sql :

ALTER TABLE `city`
  ADD PRIMARY KEY (`id`),
  ADD KEY `id` (`id`);

ALTER TABLE `emp`
  ADD CONSTRAINT `emp_ibfk_1` FOREIGN KEY (`cityId`) REFERENCES `city` (`id`);

Descending order in ORDER BY

Add keywords in lang/*.csv
Identify DESC keyword(s) in order_by_phrase during parsing
Set to 1 the order attribut of the order_by object that is identified in descending order

Join between tables

Hi everyone,
I am facing difficulty in forming a question based on joins. How to write a question to access multiple tables using joins?Is there any chance?

Can you please provide an example with school.sql file available in the database_store

Not working to perform joins

The english statement 'What is the name of the professor who is teaching the field Algorithm ?' does not work. There comes an error saying 'there is at least one column 'field' which is not reachable from 'professor'.

Parser not defined error

Hi Need one help 👍
I am getting parser not defined after submitting the GUI form . PFA

find a bug

query.orderby when write to json has a bug like this

Query churned out seems to be wrong for the example in README.md

The out put should be a simple query which should not be using a join. The example uses the city data model. Please help me out with this.

Command:

python3 -m ln2sql.main -d database_store/city.sql -l lang_store/english.csv -j output.json -i "Count how many city there are with the name blob?"

Generated sql:

SELECT COUNT(*)
FROM city
INNER JOIN emp
ON city.id = emp.cityId
WHERE emp.name = 'with';

Value detection

All the values in NLP is detected as OOV in sql query. Is there any technique to identify the values in SQL generation

'NoneType' object has no attribute 'group'

The following issue raises when I try to load the sql dump file.

if 'TABLE' in line:
111 table_name = re.search("(\w+)", line)
--> 112 table.name = table_name.group(1)
113 if self.thesaurus_object is not None:
114 table.equivalences = self.thesaurus_object.get_synonyms_of_a_word(table.name)

AttributeError: 'NoneType' object has no attribute 'group'

Kindly suggest a fix to this issue.

Unit tests

@rupinder1133 if you feel like it 😊

Stopword filtering doesn't work.

The query given in the Readme works fine, but if stopwords path is provided the query output is wrong?
The English stopword file has words like 'name' and 'eighty' which will definitely remove essential words from the sentence.
But using the nltk.corpus.stopwords also makes no difference and gives the same wrong query output.

convert classes to more Pythonic style

Currently the classes (eg. column.py) are quite Java like with setters and getters. Need to check if all values are mutable and if not we can just use:
@Property
and
@.setter instead.

Module not found Error

Git cloned the package
Ran setup.py install
Tried python ln2sql_gui.py

Error:-

D:\work_parts\j\ln2sql\ln2sql>python ln2sql_gui.py
Traceback (most recent call last):
  File "ln2sql_gui.py", line 6, in <module>
    from .ln2sql import Ln2sql
ModuleNotFoundError: No module named '__main__.ln2sql'; '__main__' is not a package

Question about accuracy/efficiency

Hi,

I have read the paper and checked out the library, and I am considering using a similar method to the one used here to enable users to query a database with a natural language.

I saw in the other issue that you are no longer continuing this project, could you tell me what is the reason?
Is it due to performance or accuracy reason?
Or maybe you have find some specific limitations that make this "non machine learning" approach not useable in real life?

One of the limitation I could see would be that some words that are not a noun actually dictate the table/colmun values should be selected from (for example, "who takes the biology class?" -> "SELECT * FROM student WHERE ...").

Based on your experience with this approach, would you advice for or against using a similar technique in production?

Thank you in advance for the reply!

Spanish language.

I have not completely inspected this library, it looks very good and interesting, I would like to know some very simple instructions on how to add the Spanish language.
I guess it's the same way you add other languages, how could you do it.