turicas / rows
A common, beautiful interface to tabular data, no matter the format
License: GNU Lesser General Public License v3.0
Since an ODS is just a zip file with an XML file and other metadata files inside (the spreadsheet data actually lives in the XML), we can use lxml (as we're already using it in the HTML plugin) to deal with it.
There are two approaches, actually:
1- Use lxml (maybe slower, easier to maintain and more accurate)
2- Use regular expressions (maybe faster, less accurate and not so easy to maintain)
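A minimal sketch of the first approach, using the stdlib zipfile and ElementTree (lxml's API for this is nearly identical); the helper name and the tiny in-memory ODS built below are illustrative only:

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# ODS files are zip archives; the spreadsheet data lives in content.xml.
TABLE_NS = "{urn:oasis:names:tc:opendocument:xmlns:table:1.0}"
TEXT_NS = "{urn:oasis:names:tc:opendocument:xmlns:text:1.0}"

def read_ods_rows(fileobj):
    """Yield rows (lists of cell strings) from the sheets of an ODS file."""
    with zipfile.ZipFile(fileobj) as zf:
        tree = ET.fromstring(zf.read("content.xml"))
    for row in tree.iter(TABLE_NS + "table-row"):
        cells = []
        for cell in row.iter(TABLE_NS + "table-cell"):
            # Cell text lives in one or more <text:p> children.
            cells.append("".join(p.text or "" for p in cell.iter(TEXT_NS + "p")))
        yield cells

# Build a minimal content.xml for demonstration purposes only.
content = (
    '<office:document-content '
    'xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" '
    'xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0" '
    'xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">'
    '<table:table><table:table-row>'
    '<table:table-cell><text:p>id</text:p></table:table-cell>'
    '<table:table-cell><text:p>username</text:p></table:table-cell>'
    '</table:table-row></table:table></office:document-content>'
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("content.xml", content)
buffer.seek(0)
rows_found = list(read_ods_rows(buffer))
```

The regex approach would instead scan the raw XML string for `<table:table-cell>` patterns, which breaks on nested or attribute-heavy markup.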
The row class returned by iterating over rows.Table could be a dict, a collections.namedtuple or even a customized class representing that data. We need to provide a way for the user to specify this class. Preferably, the user should pass a class factory (think of collections.namedtuple: when you call it, the object returned is a Python class).
We may create an interface for it (instead of just passing dict, for example, we may need to create a RowDict class that does some extra things).
Another option: use the attrs library.
Possible API: add a row_class parameter to import functions, like in:

for book in rows.import_from_csv(csv_path, row_class=dict):
    print(book["title"])
Related to #304.
Note: check if we can integrate this feature with scrapy, so it'll be easier to parse data using rows in a scrapy project.
Requirement: #133.
Create a better way to express data types and converters, and provide many useful, already-implemented, full-featured fields (we already did this one! just need more tests).
The current implementation only looks at numbers (thousands and decimal separators), not at date and datetime fields, weekday names etc.
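The separator handling mentioned above can be sketched as a small helper; the function name and default separators (pt_BR style) are assumptions for illustration, not rows' actual detection code:

```python
def parse_localized_number(text, decimal_sep=",", thousands_sep="."):
    """Convert a localized number string such as '1.234.567,89' to float.

    The defaults assume a pt_BR-style locale; callers pass the separators
    detected (or configured) for their data.
    """
    cleaned = text.strip().replace(thousands_sep, "").replace(decimal_sep, ".")
    return float(cleaned)
```

Extending detection to dates and weekday names would need per-language tables (month and weekday names), which is why a locale-aware layer is proposed.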
The command-line interface should expose an option to query the table the user is accessing (exposed under the name table). Usage example:
rows --from examples/data.csv --query 'SELECT * FROM table WHERE username == "turicas"'
It should print:
+----+----------+------------+
| id | username | birthday |
+----+----------+------------+
| 1 | turicas | 1987-04-29 |
+----+----------+------------+
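One way to get the behaviour above is to load the rows into an in-memory SQLite database and run the SQL there; the function name, the table name `table1` and the sample data below are illustrative, not rows' actual implementation:

```python
import sqlite3

def query_rows(header, data, sql):
    """Load rows into an in-memory SQLite table named 'table1' and run sql."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 ({})".format(", ".join(header)))
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(
        "INSERT INTO table1 VALUES ({})".format(placeholders), data
    )
    return list(conn.execute(sql))

result = query_rows(
    ["id", "username", "birthday"],
    [(1, "turicas", "1987-04-29"), (2, "another-user", "2000-01-01")],
    "SELECT * FROM table1 WHERE username = 'turicas'",
)
```

The CLI would then only need to pretty-print the result set in the tabular format shown above.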
metakit is an efficient embedded database library with a small footprint. It has some built-in operations (like groupby).
There is an official Python package on PyPI.
I have written many tests for all available plugins on the outputty library, but we need to migrate them to support the new API (rows).
The idea is to export an array of objects, where each row is a (JS) object. For example, the file examples/data.csv would be encoded like this:
[
    {
        "username": "turicas",
        "birthday": "1987-04-29",
        "id": 1
    },
    {
        "username": "another-user",
        "birthday": "2000-01-01",
        "id": 2
    }
]
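A sketch of the encoding step, assuming field names and row tuples are already available (the function name is illustrative, not the plugin's final API):

```python
import json
from collections import OrderedDict

def export_to_json(field_names, table_rows):
    """Serialize rows as an array of JS objects, one object per row."""
    data = [OrderedDict(zip(field_names, row)) for row in table_rows]
    return json.dumps(data, indent=4)

encoded = export_to_json(
    ["username", "birthday", "id"],
    [("turicas", "1987-04-29", 1), ("another-user", "2000-01-01", 2)],
)
```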
Like the JSON plugin, a Message Pack plugin should be pretty easy to implement.
The old branch has a text plugin (only for exporting). We may use it or texttable.
Create a more stable API for the rows.Table class, regarding access to its rows and utility methods.
User should be able to export only some fields of a Table. The option may be added to serialize (or actually prepare_to_export -- see #54) so all export_to_* functions will benefit from it.
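A sketch of how such a field filter could work inside the preparation step; `export_fields` is the hypothetical parameter discussed above and the function shape is an assumption, not the real prepare_to_export:

```python
def prepare_to_export(field_names, table_rows, export_fields=None):
    """Yield the header and then each row, keeping only export_fields."""
    if export_fields is None:
        export_fields = list(field_names)
    # Resolve each requested field to its column index once, up front.
    indexes = [field_names.index(name) for name in export_fields]
    yield export_fields
    for row in table_rows:
        yield [row[i] for i in indexes]

out = list(prepare_to_export(
    ["id", "username", "birthday"],
    [[1, "turicas", "1987-04-29"]],
    export_fields=["username"],
))
```

Because every export_to_* plugin consumes this generator, they would all gain the option without plugin-level changes.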
Create an automated way to discover installed plugins and give users the ability to create their own plugins and upload them to PyPI without our intervention (like nose does, using setuptools' entry points).
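With entry points, third-party packages declare their plugin under a shared group name and the library discovers them at runtime. A sketch using the stdlib importlib.metadata; the group name "rows.plugins" is a hypothetical choice:

```python
try:
    from importlib.metadata import entry_points  # Python 3.8+
except ImportError:  # pragma: no cover
    from importlib_metadata import entry_points  # backport for older Pythons

def discover_plugins(group="rows.plugins"):
    """Return {name: loaded_object} for every installed plugin in the group."""
    try:
        eps = entry_points(group=group)      # Python 3.10+ keyword API
    except TypeError:
        eps = entry_points().get(group, ())  # Python 3.8/3.9 dict API
    return {ep.name: ep.load() for ep in eps}

plugins = discover_plugins()
```

A plugin package would then only need `entry_points={"rows.plugins": ["xlsx = myplugin:import_from_xlsx"]}` in its setup.py.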
List of possible types:
- geopoint, like in JSON Table Schema

Create a standard set of parameters for importing functions, like converters (present in the CSV and HTML plugins).
Create an algorithm to automatically extract tables from PDFs (when available in text format).
We could use pdftables, but the code is not up to date, does not work with Python 3 etc.
User should be able to import only some fields into a Table. The option should be added to create_table so all import_from_* functions will benefit from it.
Yeah, rows will be available on Debian! @kretcheu is going to do it o/
We need to open some files as wb instead of w so the \r\n line endings are written correctly.
@sxslex will work on this.
Convert a string made of a number and a currency symbol to float or to a Currency Type.
I think that we should write a simple version of this converter supporting a few frequently used currencies, such as:
And then, eventually, build it seriously as another project.
What do you think?
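A simple first version could map a handful of symbols to currency codes; the symbol table and function name below are illustrative assumptions, and a serious implementation would also handle localized separators and suffix symbols:

```python
from decimal import Decimal

# An illustrative starting set, not an exhaustive mapping.
CURRENCY_SYMBOLS = {"R$": "BRL", "US$": "USD", "$": "USD", "€": "EUR", "£": "GBP"}

def parse_currency(text):
    """Split a string like 'R$ 1234.56' into (Decimal('1234.56'), 'BRL')."""
    text = text.strip()
    # Try longer symbols first so 'US$' wins over the bare '$'.
    for symbol in sorted(CURRENCY_SYMBOLS, key=len, reverse=True):
        if text.startswith(symbol):
            number = text[len(symbol):].strip()
            return Decimal(number), CURRENCY_SYMBOLS[symbol]
    raise ValueError("No known currency symbol in {!r}".format(text))
```

Using Decimal instead of float avoids rounding surprises with money values.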
It's something like serialize but does not actually serialize: it only filters the rows which will be exported, returning high-level Python objects. It can take some code from export_to_xls.
It can be easily implemented based on MySQL plugin.
@fccoelho wrote:
"Add the option to add an auto-incremented PK when outputting to relational databases.
This is relevant since certain highly used ORMs, such as SQLAlchemy require tables to have a primary key"
at https://github.com/turicas/outputty/issues/16
Tasks:
- Update docs/plugins.md to include all import_from_* and export_to_* functions.

We could merge some code from csvstudio into rows, as @mdipierro suggested. sqlet.py may also be an inspiration.
Create a more stable API regarding calling plugins (rows.import_from_X, for example).
Parameters like lazy, callback etc. would help here. Currently some operations assume all rows are in memory (such as order_by). We may move all the code to something lazy. For order_by specifically we could sort on-disk instead of in-memory, like csvsort does.
Tasks:
- Create a LazyTable class
- Add a lazy parameter to rows.plugins.utils.create_table (lazy=False by default; in CSV, lazy=True by default)

Similar to a "workbook" from xlrd: a collection of Table objects, each one with its own properties (like name and a link to the parent TableList; "list" is better than "set" here because order matters).
Could be used in the plugins: XLS, JSON, HTML and maybe others.
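The lazy behaviour discussed above amounts to yielding rows one at a time instead of materializing the whole table; a sketch with a generator-based CSV importer (the function name is illustrative, not rows' real API):

```python
import csv
import io

def import_from_csv_lazy(fileobj):
    """Yield one row dict at a time instead of loading the whole table."""
    reader = csv.reader(fileobj)
    header = next(reader)
    for row in reader:
        yield dict(zip(header, row))

source = io.StringIO("id,username\n1,turicas\n2,another-user\n")
lazy_rows = import_from_csv_lazy(source)
first = next(lazy_rows)  # only one data row has been read from the file so far
```

An export function consuming this generator would stream one table into another without ever holding all rows in memory.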
It would be pretty simple since the CLI just imports/exports from the available plugins.
Should have:
- Installation via pip, setup.py and apt-get

Create a standard set of parameters for exporting functions.
Currently converters work only for converting input (raw) data to native Python types; we need to add support for custom converters to export native types (for example: datetime.date objects will always be exported using the %Y-%m-%d format, but it should be possible to provide an "output converter" that receives the object and returns the converted raw value).
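A sketch of the idea: the default rendering stays as described, and an optional output converter overrides it (the function and parameter names are assumptions for illustration):

```python
import datetime

def serialize_value(value, output_converter=None):
    """Render a native Python value for export.

    With no output_converter, dates use the default %Y-%m-%d rendering;
    otherwise the converter receives the object and returns the raw value.
    """
    if output_converter is not None:
        return output_converter(value)
    if isinstance(value, datetime.date):
        return value.strftime("%Y-%m-%d")
    return str(value)

birthday = datetime.date(1987, 4, 29)
default = serialize_value(birthday)
custom = serialize_value(birthday, lambda d: d.strftime("%d/%m/%Y"))
```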
Currently we use two data types to represent something that could be represented in one class. The first is the fields parameter received by import_from_* (which is passed to utils.create_table), like:
UWSGI_FIELDS = OrderedDict([('pid', rows.fields.IntegerField),
('ip', rows.fields.UnicodeField),
('datetime', rows.fields.DatetimeField),
('http_verb', rows.fields.UnicodeField),
('http_path', rows.fields.UnicodeField),
('generation_time', rows.fields.FloatField),
('http_version', rows.fields.FloatField),
('http_status', rows.fields.IntegerField)])
The second is Table.Row (created in Table.__init__), which is a named tuple containing row data.
We could use an approach similar to ORMs and use a class to define the fields, like Django does. We could start with something like this:
class UwsgiLog(rows.Row):
pid = rows.fields.IntegerField()
ip = rows.fields.UnicodeField()
datetime = rows.fields.DatetimeField()
http_verb = rows.fields.UnicodeField()
http_path = rows.fields.UnicodeField()
generation_time = rows.fields.FloatField()
http_version = rows.fields.FloatField()
http_status = rows.fields.IntegerField()
And the Table rows (returned when we iterate over it) will be instances of UwsgiLog.
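Deriving the fields OrderedDict from such a declarative class could be done by scanning the class body, as Django-style ORMs do. A self-contained sketch with stand-in Field classes (everything here is illustrative, not rows' actual code):

```python
from collections import OrderedDict

class Field:
    """Stand-in for rows.fields.* classes; a counter records declaration order."""
    _counter = 0
    def __init__(self):
        Field._counter += 1
        self._order = Field._counter

class IntegerField(Field): pass
class UnicodeField(Field): pass

def extract_fields(row_class):
    """Build the OrderedDict expected by create_table from a declarative class."""
    found = [(name, value) for name, value in vars(row_class).items()
             if isinstance(value, Field)]
    found.sort(key=lambda pair: pair[1]._order)
    # Map each attribute name to its field *class*, matching the current API.
    return OrderedDict((name, type(value)) for name, value in found)

class UwsgiLog:
    pid = IntegerField()
    ip = UnicodeField()

FIELDS = extract_fields(UwsgiLog)
```

This keeps the declarative class as the single source of truth while still producing the structure the existing import functions expect.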
Pros:
Cons:
- namedtuple is probably faster than any other customized class

Note: check if we can integrate this feature with scrapy, so it'll be easier to parse data using rows in a scrapy project.
Hi @turicas, I was surprised this is not on PyPI yet. I know you put a note in the README, but do you think it's still not "good enough" to go to PyPI?
Thanks for the library!
Create a better API for writing plugins, taking into consideration that the library will try to do as much as it can, so the plugin's only job will be to import/export data (rows will automatically deal with importing data in a lazy way or not, for example -- the plugin should only provide a generator in the import function).
It can be easily implemented based on SQLite plugin.
It would be nice to have parameters like batch_size, callback, callback_every and commit_every, as in the old export_to_mysql.
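A sketch of how those parameters could interact; the parameter names follow the old export_to_mysql, but this example targets sqlite3 and uses a hypothetical fixed schema for brevity:

```python
import sqlite3

def export_in_batches(conn, data, batch_size=2, callback=None, commit_every=2):
    """Insert rows in batches, reporting progress and committing periodically."""
    conn.execute("CREATE TABLE IF NOT EXISTS data (id INTEGER, username TEXT)")
    total = 0
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        conn.executemany("INSERT INTO data VALUES (?, ?)", batch)
        total += len(batch)
        if callback is not None:
            callback(total)          # e.g. update a progress bar
        if total % commit_every == 0:
            conn.commit()
    conn.commit()                    # flush any remaining uncommitted rows
    return total

conn = sqlite3.connect(":memory:")
progress = []
inserted = export_in_batches(conn, [(1, "a"), (2, "b"), (3, "c")],
                             batch_size=2, callback=progress.append)
```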
It'll be very useful to have import from/export to pandas dataframes, as @mdipierro suggested. We may add this feature as a plugin.
Some decisions need to be made before we declare the API as stable. We can put
here all the questions for discussing (we should answer these questions as soon
as possible since it impacts the current implementation and would cause rework
if delayed).
Open questions:
- Should rows.Table always be lazy?
- What if the user has a rows.Table with many rows but wants to filter some of them? Should we use filter? Using Python's built-in filter would be the more Pythonic way, but it does not return a Table.
- Related to question A.2: a filter to be applied to a Table. The user can specify a custom function that receives a Table.Row object and returns a new one (to be returned when iterating over the Table). This way we can deal with the addition of new fields to the collections.namedtuple.
- What is the best row representation for a Table? See sqlite3.Row and other Python DBAPI implementations.
- rows' current architecture is good for importing and exporting, but what about editing? For example, importing a Table from a CSV, changing some rows' values and exporting it again (the problem is not in the Table itself but in the plugins, since we'll need to deal with laziness).
- If the user only wants rows to import-and-export data, a shortcut could be useful; if Table is lazy we may not need this shortcut, because we can iterate over one Table (in a lazy way) at the same time we're saving it into another.
- Implement __add__ (so, for example, sum([table1, table2, ..., tableN]) will return another Table with all the rows) -- see rows.fields.
- Should users be able to subclass rows.Table and implement only the needed methods (instead of the full rows.Table)? This includes __len__, __reverse__ and others.
- Which plugins should be built into rows.Table and which should not? Probably text, json, csv and sqlite built in; xls, html and ods as extras. See graphlab's connectors and tablib's supported extensions.
- What is Table.__rows? What can plugins do (and not do) with it? What is the expected behaviour?
- Add Table.meta with metadata about that Table, for example whether the Table was generated by a plugin (example: if imported from csv, it could have the actual CSV filename, encoding and so on).
- Add --query (to query using SQL -- same as on rows.Table itself). Also, an HTML file could contain more than one <table> -- see how tablib deals with it. Related: detect_types.