harelba / q
q - Run SQL directly on delimited files and multi-file sqlite databases
Home Page: http://harelba.github.io/q/
License: GNU General Public License v3.0
Let's create a simple CSV file where the delimiter is whitespace, e.g. " ".
$ printf 'a 1 0\nb 2 0\nc  0\n' > test.csv
$ cat test.csv
a 1 0
b 2 0
c  0
Note that in the last row there are two spaces between the c and the 0.
Now, using the latest version of q:
$ cat test.csv| ./q "select * from -" -D ';'
a;1;0;;;;;;;
b;2;0;;;;;;;
c;0;;;;;;;;
As we can see, apart from the problem of the extra ; characters (described in separate issue #36), the last row is incorrect. The expected output (without the extra ; characters) would be:
$ cat test.csv| ./q "select * from -" -D ';'
a;1;0;;;;;;;
b;2;0;;;;;;;
c;;0;;;;;;;;
Note the two ; between the c and the 0 in the last row, meaning that 0 was indeed a value for the third column, since the delimiter is whitespace and there were two of them.
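A plain-Python illustration of the likely failure mode (a sketch, independent of q's actual code): splitting on runs of whitespace collapses the empty field, while splitting on a single-space delimiter preserves it.

```python
line = "c  0"  # two spaces between "c" and "0"

# Splitting on runs of whitespace collapses the empty field,
# so 0 lands in the second column:
print(line.split())     # ['c', '0']

# Splitting on a single-space delimiter preserves the empty field,
# keeping 0 in the third column:
print(line.split(' '))  # ['c', '', '0']
```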
q -d'' --mode=strict -c 1 'SELECT * FROM a LIMIT 1,1'
The first line of the file, possibly trimmed unless -k
Delimiter must be one character only
$ cat dailytasks.csv
"typeid","limit","apcost","date","checkpointId"
"int","int","int","string","string"
"1","2","5","1,2,3,4,5,6,7","3000,3001,3002"
"2","2","5","1,2,3,4,5,6,7","3003,3004,3005"
This is the original content of the file:
$ q -H -O -d , 'select * from ./dailytasks.csv where "limit" = 2'
"typeid",limit,apcost,date,checkpointId
1,2,5,1,2,3,4,5,6,7,3000,3001,3002
2,2,5,1,2,3,4,5,6,7,3003,3004,3005
When I use "limit" in the where clause, it works fine.
$ q -H -O -d , 'select * from ./dailytasks.csv where typeid = 1'
query error: no such column: typeid
Warning - There seems to be a "no such column" error, and -H (header line) exists. Please make sure that you are using the column names from the header line and not the default (cXX) column names
But when I try the first column of the header, typeid, q returns a warning and no result.
I also tried ...
$ q -H -O -d , 'select * from ./dailytasks.csv where "typeid" = 1'
"typeid",limit,apcost,date,checkpointId
$ q -H -O -d , 'select * from ./dailytasks.csv where `"typeid"` = 1'
query error: no such column: "typeid"
Warning - There seems to be a "no such column" error, and -H (header line) exists. Please make sure that you are using the column names from the header line and not the default (cXX) column names
but they all failed.
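The output header above offers a clue: the first column name apparently kept its literal quote characters ("typeid") while the others did not. Assuming q passes header names to SQLite verbatim (an assumption, not verified against q's code), such a column is only reachable with doubled-quote escaping, which this plain sqlite3 sketch demonstrates:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema mimicking the report: the first column name kept its
# literal quote characters; "" escapes a quote inside a quoted identifier.
conn.execute('CREATE TABLE t ("""typeid""" INTEGER, "limit" INTEGER)')
conn.execute("INSERT INTO t VALUES (1, 2)")

# The bare name fails, just like in the report:
try:
    conn.execute("SELECT * FROM t WHERE typeid = 1")
except sqlite3.OperationalError as e:
    print(e)  # no such column: typeid

# "limit" is a reserved word, so it must be quoted, and quoting works:
print(conn.execute('SELECT * FROM t WHERE "limit" = 2').fetchall())  # [(1, 2)]

# The literally-quoted column needs doubled quotes:
print(conn.execute('SELECT * FROM t WHERE """typeid""" = 1').fetchall())  # [(1, 2)]
```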
RPM changelog is for history of changes to the spec file, not the project. This is way too much:
According to (hmm…) folk practice ( http://en.wikipedia.org/wiki/Comma-separated_values#Technical_background ), quotes in CSV can be escaped either by doubling them ("") or with a backslash (\"). q does seem to do something else with quotes (but what, exactly?):
$ cat doublequotes.csv ; q -t "select * from doublequotes.csv"
A B
"a ""quote"" is escaped so" isn't it
"yeah" """"""
A B
a "quote"" is escaped so" isn't it
yeah """"
$ cat backquotes.csv; q -t "select * from backquotes.csv"
A B
"a \"quote\" is escaped so" isn't it
"yeah" "\"\""
A B
a \quote\" is escaped so" isn't it
yeah \\""
Please correct me if I'm wrong, but as far as I know, when q outputs CSV, it does not quote fields that contain a field separator or a record separator. This likely implies that it also does not escape quote characters, but I haven't checked this explicitly. Any reason/objection to doing that?
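For reference, Python's stdlib csv module (mentioned here only as a point of comparison, not as what q actually uses) supports both escaping conventions when reading, and quotes delimiter-bearing fields when writing:

```python
import csv
import io

# Parsing doubled quotes (the default dialect):
row1 = next(csv.reader(io.StringIO('"a ""quote"" is escaped so","isn\'t it"')))
print(row1)  # ['a "quote" is escaped so', "isn't it"]

# Parsing backslash-escaped quotes:
row2 = next(csv.reader(io.StringIO('"a \\"quote\\" is escaped so",x'),
                       doublequote=False, escapechar="\\"))
print(row2)  # ['a "quote" is escaped so', 'x']

# Writing: fields containing the delimiter or a quote are quoted
# automatically, with inner quotes doubled.
out = io.StringIO()
csv.writer(out).writerow(['a,b', 'he said "hi"'])
print(out.getvalue().strip())  # "a,b","he said ""hi"""
```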
It looks like https://pypi.python.org/pypi/q is already taken, so an expanded name would have to be thought up.
I hope I did not miss something in the documentation. It would be nice to be able to use the header with -H to give names to the columns (or any other new option).
For example:
$ printf 'name,value\na,1\nb,2\na,5\n' > test.csv
$ cat test.csv
name,value
a,1
b,2
a,5
This is already possible:
$ cat test.csv | ./q -d',' "select c1,avg(c2) from - group by c1" -H 1
a,3.0
b,2.0
This would be nice (automatic naming):
$ cat test.csv | ./q -d',' "select name,avg(value) from - group by name" -H 1
a,3.0
b,2.0
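A minimal sketch of what the automatic naming could look like internally (plain stdlib, assuming nothing about q's architecture): read the header row, create a table whose columns carry those names, and the SQL can then use them directly.

```python
import csv
import io
import sqlite3

data = "name,value\na,1\nb,2\na,5\n"
rows = list(csv.reader(io.StringIO(data)))
header, body = rows[0], rows[1:]

conn = sqlite3.connect(":memory:")
# Column names come straight from the header line:
conn.execute("CREATE TABLE t (%s)" % ", ".join('"%s"' % c for c in header))
conn.executemany("INSERT INTO t VALUES (?, ?)", body)

result = conn.execute(
    "SELECT name, avg(value) FROM t GROUP BY name ORDER BY name").fetchall()
print(result)  # [('a', 3.0), ('b', 2.0)]
```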
try this exampledatafile:
total 368
-rw-rw-r--. 1 zhaokunyao zhaokunyao 1621 8月 8 16:08 add_fkey_idx.sql
-rw-rw-r--. 1 zhaokunyao zhaokunyao 317 8月 8 16:08 count.sql
-rw-rw-r--. 1 zhaokunyao zhaokunyao 3105 8月 8 16:08 create_table.sql
-rw-rw-r--. 1 zhaokunyao zhaokunyao 763 8月 8 16:08 drop_cons.sql
-rw-rw-r--. 1 zhaokunyao zhaokunyao 0 8月 8 20:25 exampledatafile
q -H1 "select * from exampledatafile"
Traceback (most recent call last):
File "/bin/q", line 484, in
table_creator.populate()
File "/bin/q", line 374, in populate
self._insert_row(line)
File "/bin/q", line 393, in _insert_row
self._insert_row_i(line)
File "/bin/q", line 403, in _insert_row_i
raise Exception('Encountered a line in an invalid format %s:%s - %s columns instead of %s. Did you make sure to set the correct delimiter?' % (self.current_filename,self.lines_read,len(col_vals),len(self.column_inferer.column_names)))
Exception: Encountered a line in an invalid format exampledatafile:2 - 13 columns instead of 7. Did you make sure to set the correct delimiter?
Project folder structure needs to be fixed to be more orderly. Folders such as src, examples, etc.
GPL should be GPLv3 or GPLv3+ (per LICENSE file).
Something like q "SELECT * FROM <(find . -ls)"
It should behave like:
mkfifo tmptable
find . -ls > tmptable&
q "SELECT * FROM tmptable"
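The named-pipe pattern can be exercised end to end with any consumer; here cat stands in for q (a sketch assuming a POSIX shell with mktemp and mkfifo):

```shell
# Create a FIFO, feed it from a background command, then read it like a file.
tmpdir=$(mktemp -d)
mkfifo "$tmpdir/tmptable"
find . -maxdepth 1 -ls > "$tmpdir/tmptable" &
# q would read the same path: q "SELECT * FROM $tmpdir/tmptable"
lines=$(cat "$tmpdir/tmptable" | wc -l | tr -d ' ')
wait
rm -r "$tmpdir"
echo "$lines lines read from the pipe"
```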
$ ls -ltr * | q "select c1,count(1) from - group by c1"
Traceback (most recent call last):
File "/usr/local/bin/q", line 485, in
table_creator.populate()
File "/usr/local/bin/q", line 379, in populate
self._flush_inserts()
File "/usr/local/bin/q", line 415, in _flush_inserts
self.db.execute_and_fetch(insert_row_stmt)
File "/usr/local/bin/q", line 111, in execute_and_fetch
self.cursor.execute(q)
sqlite3.OperationalError: near ")": syntax error
If the input is a pipe, it is not noticeable:
$ cat | q 'SELECT * FROM -'
1 2 3
^D
1 2 3
$
But if reading input directly from the terminal, it stops only on double EOF, ignoring single ones:
$ q 'SELECT * FROM -'
2 3 4
^D
5 4 3
^D^D
2 3 4
5 4 3
$
$ echo -e "a\tb c\n1\t2" > test.tsv
$ q -H -t 'select a from -' < test.tsv
query error: near "c": syntax error
Spaces in column names should be accepted when the separator is not a space.
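SQLite itself handles such names fine once they are quoted, so the fix plausibly lies in how the generated SQL quotes identifiers (a plain sqlite3 sketch, not q's code):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A column whose name contains a space, as in the tab-separated header above:
conn.execute('CREATE TABLE t ("a" INTEGER, "b c" INTEGER)')
conn.execute("INSERT INTO t VALUES (1, 2)")

# Quoting makes the space-bearing column addressable:
print(conn.execute('SELECT "b c" FROM t').fetchone())  # (2,)
```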
Let's create a simple CSV file.
$ printf 'a,1,0\nb,2,0\nc,,0\n' > test.csv
$ cat test.csv
a,1,0
b,2,0
c,,0
Now, using the latest version of q:
$ cat test.csv| ./q -d',' "select avg(c2) from -"
1.0
This output is unexpected to me.
Here we are computing the average of 1, 2, and an empty value. The result given by q is 1, because I think q treats the empty value as 0, which gives (1 + 2 + 0) / 3 = 1. From a statistical point of view, it makes more sense to treat the empty value as unknown, and to compute the average as follows: (1 + 2) / 2 = 1.5.
In other SQL engines, NULL values are excluded from the average and not counted as zeros, so the average of 1, 2, and NULL is indeed 1.5.
Converting empty values to NULL could solve the problem here.
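SQLite's AVG already follows that rule, so mapping empty strings to NULL at load time would produce the expected 1.5 (a minimal sqlite3 sketch):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (c1 TEXT, c2 INTEGER)")
# The empty value is stored as NULL instead of 0:
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [("a", 1), ("b", 2), ("c", None)])

# AVG skips NULLs: (1 + 2) / 2, not (1 + 2 + 0) / 3.
print(conn.execute("SELECT avg(c2) FROM t").fetchone()[0])  # 1.5
```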
This will allow a simple way to combine fields in q's output.
The option for ignoring header lines is there, however it doesn't actually get used at any point. All lines make it into the results.
Use if __name__ == '__main__' for standard command line usage in the script.
This would also allow testing the code itself.
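The standard pattern looks like this (a sketch; the function name and body are illustrative, not q's actual code):

```python
def main():
    # Argument parsing and the current top-level logic would move here.
    print("running the CLI")

if __name__ == "__main__":
    main()  # executes when run as a script, but not when imported by tests
```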
Source0 should link to a direct download of the source tarball (if at all possible).
If a column doesn't affect the result, why not just ignore the difference between that specific line and the other, standard lines?
For example, let's say I have a directory whose name contains a space. Then, when you use
ls -l | q "select c5 from - order by c5 desc"
it will throw an error caused by the irrelevant directory name.
Wow, "q", the tool I've always been waiting for! I recall RedHat's own "squeal", but it is not configurable for arbitrary text files such as CSV.
Your provided Fedora RPM and SRPM both work absolutely fine, but I have managed to modify the SPEC file to build q from the proper, unmodified tarball, and to also create the docs while building the RPM, without preprocessing from the create-rpm script.
Are you interested in this SPEC file? I can branch your project, commit it, and create a pull request.
By the way, I have integrated the RPMs using this SPEC file into my Fedora copr repository:
http://copr.fedoraproject.org/coprs/barsnick/non-fed/
Build report:
http://copr.fedoraproject.org/coprs/barsnick/non-fed/build/27090/
Packages to be found e.g. here (as well as for various other Fedora flavors):
http://copr-be.cloud.fedoraproject.org/results/barsnick/non-fed/fedora-20-x86_64/q-text-as-data-1.4.0-1.fc20.1sunshine/
or via "yum install q-text-as-data" if the repo is enabled.
Couldn't find a good SQL parser in Python, so there are some ugly hacks there. If anyone knows a good SQL parser for Python, I'll be glad to integrate it.
Currently, some errors return tracebacks - a proper error message should be shown instead.
Using this simple file as input:
a.id a.name
1 bill
2 bob
The query q -H "select a.id from test_file"
results in the following error:
query error: no such column: a.id
Warning - There seems to be a "no such column" error, and -H (header line) exists. Please make sure that you are using the column names from the header line and not the default (cXX) column names
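Assuming q maps header names verbatim to SQLite column names (an assumption, not verified against q's code), the dot is the culprit: unquoted, SQLite parses a.id as column id of table a. Quoting the full name sidesteps this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE f ("a.id" INTEGER, "a.name" TEXT)')
conn.execute("INSERT INTO f VALUES (1, 'bill')")

# Unquoted, a.id means column id of table a, which does not exist:
try:
    conn.execute("SELECT a.id FROM f")
except sqlite3.OperationalError as e:
    print(e)  # no such column: a.id

# Quoting the whole dotted name works:
print(conn.execute('SELECT "a.id" FROM f').fetchone())  # (1,)
```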
Currently, the temp tables created by the tool are named using a random number (one billion options). Although very unlikely, it is possible that temp table names would collide and cause the tool to fail.
The tool should verify that the generated temp table name is unique, and retry the name generation if it is not, until a unique name is found.
Harel
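A minimal sketch of the verify-and-retry loop (illustrative names; collisions are detected by querying sqlite_master):

```python
import random
import sqlite3

def unique_temp_name(conn, prefix="temp_"):
    """Draw random names until one does not collide with an existing table."""
    while True:
        name = "%s%09d" % (prefix, random.randrange(10 ** 9))
        taken = conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
            (name,)).fetchone()
        if taken is None:
            return name

conn = sqlite3.connect(":memory:")
name = unique_temp_name(conn)
conn.execute('CREATE TABLE "%s" (c1)' % name)
```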
Currently, only an in-memory sqlite database is used. The tool could be used in more situations if data could be written to an external database.
Currently, the encoding is done properly for data, but not for the query itself.
Need to fix that.
Thanks @PreXident
https://github.com/harelba/q/search?q=copyright
In 1988 there wasn't even Python... How is it possible for q to be dated back to that year?
Also, is this software really FSF's property? Does FSF even know about it? (GNU GPL license doesn't imply that software licensed under it are part of GNU project).
//EDIT changed the issue title.
Customer's use case:
Example:
File a.csv:
Name;Val
A;1
B;2
C;3
File b.csv:
Name;Val
A;10
B;11
select f1.name , ifnull(f2.val,0) as val from e:\a.csv f1 left join e:\b.csv f2 on f1.name = f2.name
In the example above, the result is:
A;10
B;11
C;0
But if the b.csv file is empty (only the header), the query doesn't produce any result at all, instead of:
A;0
B;0
C;0
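SQLite itself returns the expected rows when the right-hand table is empty, so the problem is plausibly in how q loads a header-only file rather than in the join (a sqlite3 sketch):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (name TEXT, val INTEGER)")
conn.execute("CREATE TABLE b (name TEXT, val INTEGER)")  # header only: no rows
conn.executemany("INSERT INTO a VALUES (?, ?)", [("A", 1), ("B", 2), ("C", 3)])

# A LEFT JOIN against an empty table still yields every left-hand row:
rows = conn.execute("""
    SELECT f1.name, ifnull(f2.val, 0) AS val
    FROM a f1 LEFT JOIN b f2 ON f1.name = f2.name
    ORDER BY f1.name
""").fetchall()
print(rows)  # [('A', 0), ('B', 0), ('C', 0)]
```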
Column types are autodetected and handled properly. However, there are some cases where it is possible that a numeric column will be treated as text, depending on the actual sample data used for autodetection of the column type.
This can be easily resolved by automatically expanding numeric types int->long->float as part of the autodetection.
Harel
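The widening chain could be sketched like this (illustrative only; in Python 3, int already covers arbitrary precision, so the chain reduces to int -> float -> text):

```python
def widen(current, value):
    """Return the narrowest type in int -> float -> str that fits every sample so far."""
    order = [int, float, str]
    start = 0 if current is None else order.index(current)
    for t in order[start:]:
        try:
            t(value)  # does the sample parse as this type?
            return t
        except ValueError:
            continue  # too narrow; try the next, wider type
    return str

col_type = None
for sample in ["12", "34", "5.5", "7"]:
    col_type = widen(col_type, sample)
print(col_type)  # <class 'float'>: "5.5" forced a widening and "7" cannot undo it
```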
I love the idea of being able to query the output of ls, but there are some practical constraints that I haven't seen addressed. Maybe you have some ideas for the following?
https://gist.github.com/canadaduane/9197079
Piping the above output to q
will fail with the following error message:
Encountered a line in an invalid format -:1 - 11 columns instead of 9. Did you make sure to set the correct delimiter?
This is due to the symlink "Projects -> /Users/duane/Dropbox/Projects". Perhaps we could add a flag that allows the last column to scoop up spaces? I realize this wouldn't work if the first line/file happens to have spaces. So, yeah... not a perfect solution yet.
This is not a real issue; I just want to share that the "not so clean casting" can be made simpler.
From the readme sample:
q "SELECT c5,c9 FROM mydatafile WHERE CAST(c5 AS INT) > 1000"
Can be made easier to type by:
q "SELECT c5,c9 FROM mydatafile WHERE 0+c5 > 1000"
As far as I know, this is quite common SQL practice in real databases.
It goes further with ''||cX for casting to a string (if needed).
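Both tricks rely on SQLite's implicit coercion in operator contexts, which is easy to verify directly (a sqlite3 sketch):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# 0 + 'text' coerces the text to a number before the comparison:
print(conn.execute("SELECT 0 + '1500' > 1000").fetchone()[0])  # 1 (true)

# '' || x coerces the other way, producing text:
print(conn.execute("SELECT typeof('' || 1500)").fetchone()[0])  # text
```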
$ q 'SELECT 5'
5
Traceback (most recent call last):
File "/usr/local/bin/q", line 1098, in <module>
table_creator.drop_table()
NameError: name 'table_creator' is not defined
(It's the first query I ever tried with q.)
The tool should support input with varying column count per row. This will allow more uses of the tool for semi-structured input.
Reusing a file that has already been loaded in the past should be faster. This can be achieved by some form of caching of the loaded data.
Here is a simple example where q seems to fail to recognize the number of columns. Let's create a simple CSV file.
$ printf 'a,1,0\nb,2,0\nc,,0\n' > test.csv
$ cat test.csv
a,1,0
b,2,0
c,,0
Now, using the latest version of q:
$ cat test.csv| ./q -d',' "select * from -"
a,1,0,,,,,,,
b,2,0,,,,,,,
c,,0,,,,,,,
My guess is that q is having problems with EOL characters here, but I did not dig into the code. For me, the expected output would be:
$ cat test.csv| ./q -d',' "select * from -"
a,1,0
b,2,0
c,,0
There doesn't seem to be a simple way to specify tab-delimited output. E.g., I've tried q -D '\t' and q -D "\t", but both fail to produce tabs. q -D "$(echo -e '\t')" works (in bash, anyway), but is needlessly complicated.
Since -t is shorthand for -d <tab>, perhaps -T can be shorthand for -D <tab>.
Currently, column types are not inferred, and the sql type treatment is dependent on the expression being used. For example when using sum(c2), c2 will be treated as a number. This sometimes requires converting types as part of the sql statement - For example: ls -dltr * | q "select * from - where cast(c5 as int) > 10000". Automatic type inference can fix that.
The issue is they won't include it unless it comes for the official repo. See here: Homebrew/legacy-homebrew#23192
Vendor, Packager are not used anymore; kill those fields.
%clean section isn't needed anymore.
'%{__install}' can be replaced with just 'install'. No need to use a macro.
Currently, throws the following error:
query error: user-defined function raised exception
Workaround is to add a where clause which filters out null values for the relevant column.
Null values can also occur in relaxed mode (the default) when there are rows with fewer columns than expected. Use -m strict if needed, in order to get an error when the column count is not as expected.
Will be fixed in the coming release.
hi,
a very nice project. 👍
By the way, I have a question: is this project the same as https://github.com/dinedal/textql ??
Hello,
q looks nice but there's no mention of what license it is under. Can you add a LICENSE file?
Cheers!
I haven't found a way to use files with spaces in their name: quoting with ", ', or ` doesn't work; escaping spaces with a backslash doesn't work either.
Use one or the other, but not both :).
Required, in order to provide for a more testable and more maintainable code base.