harelba / q
q - Run SQL directly on delimited files and multi-file sqlite databases
Home Page: http://harelba.github.io/q/
License: GNU General Public License v3.0
Let's create a simple CSV file where the delimiter is whitespace, e.g. " ".
$ printf 'a 1 0\nb 2 0\nc  0\n' > test.csv
$ cat test.csv
a 1 0
b 2 0
c  0
Note that in the last row there are two spaces between the c and the 0.
Now, using the latest version of q:
$ cat test.csv| ./q "select * from -" -D ';'
a;1;0;;;;;;;
b;2;0;;;;;;;
c;0;;;;;;;;
As we can see, apart from the problem of the extra ; characters (described in separate issue #36), the last row is incorrect. The expected output (without the extra ; characters) would be:
$ cat test.csv| ./q "select * from -" -D ';'
a;1;0;;;;;;;
b;2;0;;;;;;;
c;;0;;;;;;;;
Note the two ; between the c and the 0 in the last row, meaning that 0 was indeed a value for the third column, since the delimiter is whitespace and there were two of them.
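A plain-Python illustration of the likely failure mode (a sketch, independent of q's actual code): splitting on runs of whitespace collapses the empty field, while splitting on a single-space delimiter preserves it.

```python
line = "c  0"  # two spaces between "c" and "0"

# Splitting on runs of whitespace collapses the empty field,
# so 0 lands in the second column:
print(line.split())     # ['c', '0']

# Splitting on a single-space delimiter preserves the empty field,
# keeping 0 in the third column:
print(line.split(' '))  # ['c', '', '0']
```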
q -d'' --mode=strict -c 1 'SELECT * FROM a LIMIT 1,1'
The first line of the file, possibly trimmed unless -k
Delimiter must be one character only
$ cat dailytasks.csv
"typeid","limit","apcost","date","checkpointId"
"int","int","int","string","string"
"1","2","5","1,2,3,4,5,6,7","3000,3001,3002"
"2","2","5","1,2,3,4,5,6,7","3003,3004,3005"
This is the original content of the file:
$ q -H -O -d , 'select * from ./dailytasks.csv where "limit" = 2'
"typeid",limit,apcost,date,checkpointId
1,2,5,1,2,3,4,5,6,7,3000,3001,3002
2,2,5,1,2,3,4,5,6,7,3003,3004,3005
When I use "limit" in the where clause, it works fine.
$ q -H -O -d , 'select * from ./dailytasks.csv where typeid = 1'
query error: no such column: typeid
Warning - There seems to be a "no such column" error, and -H (header line) exists. Please make sure that you are using the column names from the header line and not the default (cXX) column names
But when I try the first column of the header, typeid, q returns a warning and no result.
I also tried ...
$ q -H -O -d , 'select * from ./dailytasks.csv where "typeid" = 1'
"typeid",limit,apcost,date,checkpointId
$ q -H -O -d , 'select * from ./dailytasks.csv where `"typeid"` = 1'
query error: no such column: "typeid"
Warning - There seems to be a "no such column" error, and -H (header line) exists. Please make sure that you are using the column names from the header line and not the default (cXX) column names
but they all failed.
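The output header above offers a clue: the first column name apparently kept its literal quote characters ("typeid") while the others did not. Assuming q passes header names to SQLite verbatim (an assumption, not verified against q's code), such a column is only reachable with doubled-quote escaping, which this plain sqlite3 sketch demonstrates:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema mimicking the report: the first column name kept its
# literal quote characters; "" escapes a quote inside a quoted identifier.
conn.execute('CREATE TABLE t ("""typeid""" INTEGER, "limit" INTEGER)')
conn.execute("INSERT INTO t VALUES (1, 2)")

# The bare name fails, just like in the report:
try:
    conn.execute("SELECT * FROM t WHERE typeid = 1")
except sqlite3.OperationalError as e:
    print(e)  # no such column: typeid

# "limit" is a reserved word, so it must be quoted, and quoting works:
print(conn.execute('SELECT * FROM t WHERE "limit" = 2').fetchall())  # [(1, 2)]

# The literally-quoted column needs doubled quotes:
print(conn.execute('SELECT * FROM t WHERE """typeid""" = 1').fetchall())  # [(1, 2)]
```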
RPM changelog is for history of changes to the spec file, not the project. This is way too much:
According to (hmm…) folk practice ( http://en.wikipedia.org/wiki/Comma-separated_values#Technical_background ), quotes in CSV can be escaped either by doubling them ("") or with a backslash (\"). q does seem to do something else with quotes (but what, exactly?):
$ cat doublequotes.csv ; q -t "select * from doublequotes.csv"
A B
"a ""quote"" is escaped so" isn't it
"yeah" """"""
A B
a "quote"" is escaped so" isn't it
yeah """"
$ cat backquotes.csv; q -t "select * from backquotes.csv"
A B
"a \"quote\" is escaped so" isn't it
"yeah" "\"\""
A B
a \quote\" is escaped so" isn't it
yeah \\""
Please correct me if I'm wrong, but as far as I know, when q outputs CSV, it does not quote fields that contain a field separator or a record separator. This likely implies that it also does not escape quote characters, but I haven't checked this explicitly. Any reason/objection to doing that?
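For reference, Python's stdlib csv module (mentioned here only as a point of comparison, not as what q actually uses) supports both escaping conventions when reading, and quotes delimiter-bearing fields when writing:

```python
import csv
import io

# Parsing doubled quotes (the default dialect):
row1 = next(csv.reader(io.StringIO('"a ""quote"" is escaped so","isn\'t it"')))
print(row1)  # ['a "quote" is escaped so', "isn't it"]

# Parsing backslash-escaped quotes:
row2 = next(csv.reader(io.StringIO('"a \\"quote\\" is escaped so",x'),
                       doublequote=False, escapechar="\\"))
print(row2)  # ['a "quote" is escaped so', 'x']

# Writing: fields containing the delimiter or a quote are quoted
# automatically, with inner quotes doubled.
out = io.StringIO()
csv.writer(out).writerow(['a,b', 'he said "hi"'])
print(out.getvalue().strip())  # "a,b","he said ""hi"""
```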
It looks like https://pypi.python.org/pypi/q is already taken, so an expanded name would have to be thought up.
I hope I did not miss something in the documentation. It would be nice to be able to use the header with -H to give names to the columns (or any other new option).
For example:
$ printf 'name,value\na,1\nb,2\na,5\n' > test.csv
$ cat test.csv
name,value
a,1
b,2
a,5
This is already possible:
$ cat test.csv | ./q -d',' "select c1,avg(c2) from - group by c1" -H 1
a,3.0
b,2.0
This would be nice (automatic naming):
$ cat test.csv | ./q -d',' "select name,avg(value) from - group by name" -H 1
a,3.0
b,2.0
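A minimal sketch of what the automatic naming could look like internally (plain stdlib, assuming nothing about q's architecture): read the header row, create a table whose columns carry those names, and the SQL can then use them directly.

```python
import csv
import io
import sqlite3

data = "name,value\na,1\nb,2\na,5\n"
rows = list(csv.reader(io.StringIO(data)))
header, body = rows[0], rows[1:]

conn = sqlite3.connect(":memory:")
# Column names come straight from the header line:
conn.execute("CREATE TABLE t (%s)" % ", ".join('"%s"' % c for c in header))
conn.executemany("INSERT INTO t VALUES (?, ?)", body)

result = conn.execute(
    "SELECT name, avg(value) FROM t GROUP BY name ORDER BY name").fetchall()
print(result)  # [('a', 3.0), ('b', 2.0)]
```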
try this exampledatafile:
total 368
-rw-rw-r--. 1 zhaokunyao zhaokunyao 1621 8月 8 16:08 add_fkey_idx.sql
-rw-rw-r--. 1 zhaokunyao zhaokunyao 317 8月 8 16:08 count.sql
-rw-rw-r--. 1 zhaokunyao zhaokunyao 3105 8月 8 16:08 create_table.sql
-rw-rw-r--. 1 zhaokunyao zhaokunyao 763 8月 8 16:08 drop_cons.sql
-rw-rw-r--. 1 zhaokunyao zhaokunyao 0 8月 8 20:25 exampledatafile
q -H1 "select * from exampledatafile"
Traceback (most recent call last):
File "/bin/q", line 484, in
table_creator.populate()
File "/bin/q", line 374, in populate
self._insert_row(line)
File "/bin/q", line 393, in _insert_row
self._insert_row_i(line)
File "/bin/q", line 403, in _insert_row_i
raise Exception('Encountered a line in an invalid format %s:%s - %s columns instead of %s. Did you make sure to set the correct delimiter?' % (self.current_filename,self.lines_read,len(col_vals),len(self.column_inferer.column_names)))
Exception: Encountered a line in an invalid format exampledatafile:2 - 13 columns instead of 7. Did you make sure to set the correct delimiter?
Project folder structure needs to be fixed to be more orderly. Folders such as src, examples, etc.
GPL should be GPLv3 or GPLv3+ (per LICENSE file).
Something like q "SELECT * FROM <(find . -ls)"
It should behave like:
mkfifo tmptable
find . -ls > tmptable&
q "SELECT * FROM tmptable"
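The named-pipe pattern can be exercised end to end with any consumer; here cat stands in for q (a sketch assuming a POSIX shell with mktemp and mkfifo):

```shell
# Create a FIFO, feed it from a background command, then read it like a file.
tmpdir=$(mktemp -d)
mkfifo "$tmpdir/tmptable"
find . -maxdepth 1 -ls > "$tmpdir/tmptable" &
# q would read the same path: q "SELECT * FROM $tmpdir/tmptable"
lines=$(cat "$tmpdir/tmptable" | wc -l | tr -d ' ')
wait
rm -r "$tmpdir"
echo "$lines lines read from the pipe"
```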
$ ls -ltr * | q "select c1,count(1) from - group by c1"
Traceback (most recent call last):
File "/usr/local/bin/q", line 485, in
table_creator.populate()
File "/usr/local/bin/q", line 379, in populate
self._flush_inserts()
File "/usr/local/bin/q", line 415, in _flush_inserts
self.db.execute_and_fetch(insert_row_stmt)
File "/usr/local/bin/q", line 111, in execute_and_fetch
self.cursor.execute(q)
sqlite3.OperationalError: near ")": syntax error
If the input is a pipe, it is not noticeable:
$ cat | q 'SELECT * FROM -'
1 2 3
^D
1 2 3
$
But if reading input directly from the terminal, it stops only on double EOF, ignoring single ones:
$ q 'SELECT * FROM -'
2 3 4
^D
5 4 3
^D^D
2 3 4
5 4 3
$
$ echo -e "a\tb c\n1\t2" > test.tsv
$ q -H -t 'select a from -' < test.tsv
query error: near "c": syntax error
Spaces in column names should be accepted when the separator is not a space.
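SQLite itself handles such names fine once they are quoted, so the fix plausibly lies in how the generated SQL quotes identifiers (a plain sqlite3 sketch, not q's code):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A column whose name contains a space, as in the tab-separated header above:
conn.execute('CREATE TABLE t ("a" INTEGER, "b c" INTEGER)')
conn.execute("INSERT INTO t VALUES (1, 2)")

# Quoting makes the space-bearing column addressable:
print(conn.execute('SELECT "b c" FROM t').fetchone())  # (2,)
```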
Let's create a simple CSV file.
$ printf 'a,1,0\nb,2,0\nc,,0\n' > test.csv
$ cat test.csv
a,1,0
b,2,0
c,,0
Now, using the latest version of q:
$ cat test.csv| ./q -d',' "select avg(c2) from -"
1.0
This output is unexpected to me.
Here we are computing the average of 1, 2, and an empty value. The result given by q is 1, because I think q treats the empty value as 0, which gives (1 + 2 + 0) / 3 = 1. From a statistical point of view, it makes more sense to treat the empty value as unknown, and to compute the average as follows: (1 + 2) / 2 = 1.5.
In other SQL engines, NULL values are excluded from the average and not counted as zeros, so the average of 1, 2, and NULL is indeed 1.5.
Converting empty values to NULL could solve the problem here.
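SQLite's AVG already follows that rule, so mapping empty strings to NULL at load time would produce the expected 1.5 (a minimal sqlite3 sketch):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (c1 TEXT, c2 INTEGER)")
# The empty value is stored as NULL instead of 0:
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [("a", 1), ("b", 2), ("c", None)])

# AVG skips NULLs: (1 + 2) / 2, not (1 + 2 + 0) / 3.
print(conn.execute("SELECT avg(c2) FROM t").fetchone()[0])  # 1.5
```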
This will allow a simple way to combine fields in q's output.
The option for ignoring header lines is there, however it doesn't actually get used at any point. All lines make it into the results.
Use if __name__ == '__main__' for standard command line usage in the script.
This would also allow testing the code itself.
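The standard pattern looks like this (a sketch; the function name and body are illustrative, not q's actual code):

```python
def main():
    # Argument parsing and the current top-level logic would move here.
    print("running the CLI")

if __name__ == "__main__":
    main()  # executes when run as a script, but not when imported by tests
```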
Source0 should link to a direct download of the source tarball (if at all possible).
If a column doesn't affect the result, why not just ignore the difference between that specific line and the other, standard lines?
For example, let's say I have a directory whose name contains a space. Then, when you use
ls -l | q "select c5 from - order by c5 desc"
it will throw an error caused by the irrelevant directory name.
Wow, "q", the tool I've always been waiting for! I recall RedHat's own "squeal", but it is not configurable for arbitrary text files such as CSV.
Your provided Fedora RPM and SRPM both work absolutely fine, but I have managed to modify the SPEC file to build q from the proper, unmodified tarball, and to also create the docs while building the RPM, without preprocessing from the create-rpm script.
Are you interested in this SPEC file? I can branch your project, commit it, and create a pull request.
By the way, I have integrated the RPMs using this SPEC file into my Fedora copr repository:
http://copr.fedoraproject.org/coprs/barsnick/non-fed/
Build report:
http://copr.fedoraproject.org/coprs/barsnick/non-fed/build/27090/
Packages to be found e.g. here (as well as for various other Fedora flavors):
http://copr-be.cloud.fedoraproject.org/results/barsnick/non-fed/fedora-20-x86_64/q-text-as-data-1.4.0-1.fc20.1sunshine/
or via "yum install q-text-as-data" if the repo is enabled.
Couldn't find a good SQL parser in Python, so there are some ugly hacks there. If anyone knows a good SQL parser for Python, I'll be glad to integrate it.
Currently, some errors return tracebacks - a proper error message should be shown instead.
Using this simple file as input:
a.id a.name
1 bill
2 bob
The query q -H "select a.id from test_file"
results in the following error:
query error: no such column: a.id
Warning - There seems to be a "no such column" error, and -H (header line) exists. Please make sure that you are using the column names from the header line and not the default (cXX) column names
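Assuming q maps header names verbatim to SQLite column names (an assumption, not verified against q's code), the dot is the culprit: unquoted, SQLite parses a.id as column id of table a. Quoting the full name sidesteps this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE f ("a.id" INTEGER, "a.name" TEXT)')
conn.execute("INSERT INTO f VALUES (1, 'bill')")

# Unquoted, a.id means column id of table a, which does not exist:
try:
    conn.execute("SELECT a.id FROM f")
except sqlite3.OperationalError as e:
    print(e)  # no such column: a.id

# Quoting the whole dotted name works:
print(conn.execute('SELECT "a.id" FROM f').fetchone())  # (1,)
```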
Currently, the temp tables created by the tool are named using a random number (one billion options). Although very unlikely, it is possible that temp table names would collide and cause the tool to fail.
The tool should verify that the generated temp table name is unique, and retry the name generation if it is not, until a unique name is found.
Harel
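A minimal sketch of the verify-and-retry loop (illustrative names; collisions are detected by querying sqlite_master):

```python
import random
import sqlite3

def unique_temp_name(conn, prefix="temp_"):
    """Draw random names until one does not collide with an existing table."""
    while True:
        name = "%s%09d" % (prefix, random.randrange(10 ** 9))
        taken = conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
            (name,)).fetchone()
        if taken is None:
            return name

conn = sqlite3.connect(":memory:")
name = unique_temp_name(conn)
conn.execute('CREATE TABLE "%s" (c1)' % name)
```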
Currently, only an in-memory sqlite database is used. The tool could be used in more situations if data could be written to an external database.
Currently, the encoding is done properly for data, but not for the query itself.
Need to fix that.
Thanks @PreXident
https://github.com/harelba/q/search?q=copyright
In 1988 there wasn't even Python... How is it possible for q to be dated back to that year?
Also, is this software really FSF's property? Does FSF even know about it? (GNU GPL license doesn't imply that software licensed under it are part of GNU project).
//EDIT changed the issue title.
Customer's use case:
Example:
File a.csv:
Name;Val
A;1
B;2
C;3
File b.csv:
Name;Val
A;10
B;11
select f1.name , ifnull(f2.val,0) as val from e:\a.csv f1 left join e:\b.csv f2 on f1.name = f2.name
In the example above, the result is:
A;10
B;11
C;0
But if the b.csv file is empty (only the header), the query doesn't produce any result at all, instead of:
A;0
B;0
C;0
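SQLite itself returns the expected rows when the right-hand table is empty, so the problem is plausibly in how q loads a header-only file rather than in the join (a sqlite3 sketch):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (name TEXT, val INTEGER)")
conn.execute("CREATE TABLE b (name TEXT, val INTEGER)")  # header only: no rows
conn.executemany("INSERT INTO a VALUES (?, ?)", [("A", 1), ("B", 2), ("C", 3)])

# A LEFT JOIN against an empty table still yields every left-hand row:
rows = conn.execute("""
    SELECT f1.name, ifnull(f2.val, 0) AS val
    FROM a f1 LEFT JOIN b f2 ON f1.name = f2.name
    ORDER BY f1.name
""").fetchall()
print(rows)  # [('A', 0), ('B', 0), ('C', 0)]
```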
Column types are autodetected and handled properly. However, there are some cases where it is possible that a numeric column will be treated as text, depending on the actual sample data used for autodetection of the column type.
This can be easily resolved by automatically expanding numeric types int->long->float as part of the autodetection.
Harel
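The widening chain could be sketched like this (illustrative only; in Python 3, int already covers arbitrary precision, so the chain reduces to int -> float -> text):

```python
def widen(current, value):
    """Return the narrowest type in int -> float -> str that fits every sample so far."""
    order = [int, float, str]
    start = 0 if current is None else order.index(current)
    for t in order[start:]:
        try:
            t(value)  # does the sample parse as this type?
            return t
        except ValueError:
            continue  # too narrow; try the next, wider type
    return str

col_type = None
for sample in ["12", "34", "5.5", "7"]:
    col_type = widen(col_type, sample)
print(col_type)  # <class 'float'>: "5.5" forced a widening and "7" cannot undo it
```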
I love the idea of being able to query the output of ls, but there are some practical constraints that I haven't seen addressed. Maybe you have some ideas for the following?
https://gist.github.com/canadaduane/9197079
Piping the above output to q
will fail with the following error message:
Encountered a line in an invalid format -:1 - 11 columns instead of 9. Did you make sure to set the correct delimiter?
This is due to the symlink "Projects -> /Users/duane/Dropbox/Projects". Perhaps we could add a flag that allows the last column to scoop up spaces? I realize this wouldn't work if the first line/file happens to have spaces. So, yeah... not a perfect solution yet.
This is not a real issue; I just want to share that the "not so clean casting" can be made simpler.
From the readme sample:
q "SELECT c5,c9 FROM mydatafile WHERE CAST(c5 AS INT) > 1000"
Can be made easier to type by:
q "SELECT c5,c9 FROM mydatafile WHERE 0+c5 > 1000"
As far as I know, this is quite common SQL practice in real databases.
It goes further with ''||cX for casting to a string (if needed).
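Both tricks rely on SQLite's implicit coercion in operator contexts, which is easy to verify directly (a sqlite3 sketch):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# 0 + 'text' coerces the text to a number before the comparison:
print(conn.execute("SELECT 0 + '1500' > 1000").fetchone()[0])  # 1 (true)

# '' || x coerces the other way, producing text:
print(conn.execute("SELECT typeof('' || 1500)").fetchone()[0])  # text
```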
$ q 'SELECT 5'
5
Traceback (most recent call last):
File "/usr/local/bin/q", line 1098, in <module>
table_creator.drop_table()
NameError: name 'table_creator' is not defined
(It's the first query I ever tried with q.)
The tool should support input with varying column count per row. This will allow more uses of the tool for semi-structured input.
Reusing a file that has already been loaded in the past should be faster. This can be achieved by some form of caching of the loaded data.
Here is a simple example where q seems to fail to recognize the number of columns. Let's create a simple CSV file.
$ printf 'a,1,0\nb,2,0\nc,,0\n' > test.csv
$ cat test.csv
a,1,0
b,2,0
c,,0
Now, using the latest version of q:
$ cat test.csv| ./q -d',' "select * from -"
a,1,0,,,,,,,
b,2,0,,,,,,,
c,,0,,,,,,,
My guess is that q is having problems with EOL characters here, but I did not dig into the code. For me, the expected output would be:
$ cat test.csv| ./q -d',' "select * from -"
a,1,0
b,2,0
c,,0
There doesn't seem to be a simple way to specify tab-delimited output. E.g., I've tried q -D '\t' and q -D "\t", but both fail to produce tabs. q -D "$(echo -e '\t')" works (in bash, anyway), but is needlessly complicated.
Since -t is shorthand for -d <tab>, perhaps -T can be shorthand for -D <tab>.
Currently, column types are not inferred, and the sql type treatment is dependent on the expression being used. For example when using sum(c2), c2 will be treated as a number. This sometimes requires converting types as part of the sql statement - For example: ls -dltr * | q "select * from - where cast(c5 as int) > 10000". Automatic type inference can fix that.
The issue is they won't include it unless it comes for the official repo. See here: Homebrew/legacy-homebrew#23192
Vendor, Packager are not used anymore; kill those fields.
%clean section isn't needed anymore.
'%{__install}' can be replaced with just 'install'. No need to use a macro.
Currently, throws the following error:
query error: user-defined function raised exception
Workaround is to add a where clause which filters out null values for the relevant column.
Null values can also occur in relaxed mode (the default) when there are rows with fewer columns than expected. Use -m strict if needed, in order to get an error when the column count is not as expected.
Will be fixed in the coming release.
hi,
a very nice project. 👍
By the way, I have a question: is this project the same as https://github.com/dinedal/textql ??
Hello,
q looks nice but there's no mention of what license it is under. Can you add a LICENSE file?
Cheers!
I haven't found a way to use files with spaces in their name: quoting with ", ', or ` doesn't work; escaping spaces with a backslash doesn't work either.
Use one or the other, but not both :).
Required, in order to provide for a more testable and more maintainable code base.