Comments (22)
Fix now available in Datasette 0.64.6: https://docs.datasette.io/en/stable/changelog.html#v0-64-6
from datasette.
If I uncheck "expand labels" in the Advanced CSV export dialog, the error does not occur. Re-checking that box and re-running the export causes the error to occur again.
I tested a bunch of the text-indexed columns and found two that (in combination) cause the export to fail in a fairly minimal case. They are RAMP_INTERSECTION and CHP_VEHTYPE_AT_FAULT, both of which are usually an int but sometimes a "-".
The table definitions are showing up in Datasette as:
[RAMP_INTERSECTION] TEXT REFERENCES [RAMP_INTERSECTION]([key]),
[CHP_VEHTYPE_AT_FAULT] TEXT REFERENCES [CHP_VEHTYPE_AT_FAULT]([key]),
Those definitions are what I would expect them to be. However, if I look at the CHP_VEHTYPE_AT_FAULT lookup table, single-digit numbers are zero-padded in its definition, but they are not in the RAMP_INTERSECTION definition. Hmm. I'll also check whether the source data zero-pads either of them in use.
I'm going to look at the source data for these cases and then try to produce a much more minimal test case, too.
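One quick way to sanity-check the zero-padding theory, using hypothetical sample values in place of the real distinct column values and lookup keys:

```python
# Hypothetical sample data: distinct values seen in the collisions column
# vs. the keys in its lookup table (real values would come from the DB)
source_values = {"1", "2", "48"}
lookup_keys = {"01", "02", "48"}

# Which values fail to resolve as-is, and which still fail after zero-padding?
missing_raw = source_values - lookup_keys
missing_padded = {v.zfill(2) for v in source_values} - lookup_keys

print(sorted(missing_raw))  # ['1', '2'] -- single digits miss without padding
print(missing_padded)       # set() -- everything resolves once zero-padded
```

If the second set is empty while the first is not, padding alone explains the mismatch for that column.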
Demo here: https://rbl-collisions.fly.dev/records/collisions.csv?_labels=on&PRIMARY_RD=ASHBY+AV - compare with https://rbl-collisions.fly.dev/records/collisions?PRIMARY_RD=ASHBY+AV&_size=max
This could be because of "-" vs. "- " or similar.
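The trailing space is invisible in a rendered table; repr() is the quickest way to tell the two values apart:

```python
a, b = "-", "- "
print(repr(a), repr(b))  # '-' '- '
print(a == b)            # False
print(a == b.strip())    # True -- they differ only by trailing whitespace
```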
Schema of that table:
CREATE TABLE "collisions" (
    [CASE_ID] INTEGER PRIMARY KEY,
    [COLLISION_DATE] TEXT,
    [COLLISION_TIME] INTEGER,
    [OFFICER_ID] TEXT,
    [REPORTING_DISTRICT] TEXT,
    [DAY_OF_WEEK] INTEGER REFERENCES [DAY_OF_WEEK]([id]),
    [CNTY_CITY_LOC] INTEGER,
    [BEAT_NUMBER] TEXT,
    [PRIMARY_RD] TEXT,
    [SECONDARY_RD] TEXT,
    [DISTANCE] FLOAT,
    [DIRECTION] TEXT,
    [INTERSECTION] TEXT,
    [WEATHER_1] TEXT REFERENCES [WEATHER_1]([key]),
    [WEATHER_2] TEXT REFERENCES [WEATHER_2]([key]),
    [STATE_HWY_IND] TEXT,
    [CALTRANS_COUNTY] TEXT,
    [CALTRANS_DISTRICT] INTEGER,
    [STATE_ROUTE] INTEGER,
    [POSTMILE] FLOAT,
    [LOCATION_TYPE] TEXT REFERENCES [LOCATION_TYPE]([key]),
    [RAMP_INTERSECTION] TEXT REFERENCES [RAMP_INTERSECTION]([key]),
    [SIDE_OF_HWY] TEXT REFERENCES [SIDE_OF_HWY]([key]),
    [TOW_AWAY] TEXT,
    [COLLISION_SEVERITY] INTEGER REFERENCES [COLLISION_SEVERITY]([id]),
    [NUMBER_KILLED] INTEGER,
    [NUMBER_INJURED] INTEGER,
    [PARTY_COUNT] INTEGER,
    [PRIMARY_COLL_FACTOR] TEXT REFERENCES [PRIMARY_COLL_FACTOR]([key]),
    [PCF_VIOL_CATEGORY] TEXT REFERENCES [PCF_VIOL_CATEGORY]([key]),
    [PCF_VIOLATION] INTEGER,
    [PCF_VIOL_SUBSECTION] TEXT,
    [HIT_AND_RUN] TEXT,
    [TYPE_OF_COLLISION] TEXT REFERENCES [TYPE_OF_COLLISION]([key]),
    [MVIW] TEXT REFERENCES [MVIW]([key]),
    [PED_ACTION] TEXT REFERENCES [PED_ACTION]([key]),
    [ROAD_SURFACE] TEXT REFERENCES [ROAD_SURFACE]([key]),
    [ROAD_COND_1] TEXT REFERENCES [ROAD_COND_1]([key]),
    [ROAD_COND_2] TEXT REFERENCES [ROAD_COND_2]([key]),
    [LIGHTING] TEXT REFERENCES [LIGHTING]([key]),
    [CONTROL_DEVICE] TEXT REFERENCES [CONTROL_DEVICE]([key]),
    [PEDESTRIAN_ACCIDENT] TEXT,
    [BICYCLE_ACCIDENT] TEXT,
    [MOTORCYCLE_ACCIDENT] TEXT,
    [TRUCK_ACCIDENT] TEXT,
    [NOT_PRIVATE_PROPERTY] TEXT,
    [ALCOHOL_INVOLVED] TEXT,
    [STWD_VEHTYPE_AT_FAULT] TEXT REFERENCES [STWD_VEHTYPE_AT_FAULT]([key]),
    [CHP_VEHTYPE_AT_FAULT] TEXT REFERENCES [CHP_VEHTYPE_AT_FAULT]([key]),
    [COUNT_SEVERE_INJ] INTEGER,
    [COUNT_VISIBLE_INJ] INTEGER,
    [COUNT_COMPLAINT_PAIN] INTEGER,
    [COUNT_PED_KILLED] INTEGER,
    [COUNT_PED_INJURED] INTEGER,
    [COUNT_BICYCLIST_KILLED] INTEGER,
    [COUNT_BICYCLIST_INJURED] INTEGER,
    [COUNT_MC_KILLED] INTEGER,
    [COUNT_MC_INJURED] INTEGER,
    [PRIMARY_RAMP] TEXT REFERENCES [PRIMARY_RAMP]([key]),
    [SECONDARY_RAMP] TEXT REFERENCES [SECONDARY_RAMP]([key]),
    [LATITUDE] FLOAT,
    [LONGITUDE] FLOAT,
    [ADDRESS] TEXT,
    [SEVERITY_INDEX] TEXT
);
Most keys are text, but some are integers:
[DAY_OF_WEEK] INTEGER REFERENCES [DAY_OF_WEEK]([id]),
[COLLISION_SEVERITY] INTEGER REFERENCES [COLLISION_SEVERITY]([id]),
Those look OK to me:
CREATE TABLE [DAY_OF_WEEK] (
    [id] INTEGER PRIMARY KEY,
    [name] TEXT
);
CREATE TABLE [COLLISION_SEVERITY] (
    [id] INTEGER PRIMARY KEY,
    [name] TEXT
);
Both of those tables have only integer values:
- https://rbl-collisions.fly.dev/records?sql=select+distinct+DAY_OF_WEEK+from+collisions
- https://rbl-collisions.fly.dev/records?sql=select+distinct+COLLISION_SEVERITY+from+collisions
If it's useful, you can see the integer versus text key split here:
https://github.com/radical-bike-lobby/switrs-db/blob/main/import-bicycle-crashes.sh#L129
Generated this query to try to spot columns that have missing key references:
select 'DAY_OF_WEEK' as column, count(*) from collisions where DAY_OF_WEEK not in (select id from DAY_OF_WEEK) union all
select 'WEATHER_1' as column, count(*) from collisions where WEATHER_1 not in (select key from WEATHER_1) union all
select 'WEATHER_2' as column, count(*) from collisions where WEATHER_2 not in (select key from WEATHER_2) union all
select 'LOCATION_TYPE' as column, count(*) from collisions where LOCATION_TYPE not in (select key from LOCATION_TYPE) union all
select 'RAMP_INTERSECTION' as column, count(*) from collisions where RAMP_INTERSECTION not in (select key from RAMP_INTERSECTION) union all
select 'SIDE_OF_HWY' as column, count(*) from collisions where SIDE_OF_HWY not in (select key from SIDE_OF_HWY) union all
select 'COLLISION_SEVERITY' as column, count(*) from collisions where COLLISION_SEVERITY not in (select id from COLLISION_SEVERITY) union all
select 'PRIMARY_COLL_FACTOR' as column, count(*) from collisions where PRIMARY_COLL_FACTOR not in (select key from PRIMARY_COLL_FACTOR) union all
select 'PCF_VIOL_CATEGORY' as column, count(*) from collisions where PCF_VIOL_CATEGORY not in (select key from PCF_VIOL_CATEGORY) union all
select 'TYPE_OF_COLLISION' as column, count(*) from collisions where TYPE_OF_COLLISION not in (select key from TYPE_OF_COLLISION) union all
select 'MVIW' as column, count(*) from collisions where MVIW not in (select key from MVIW) union all
select 'PED_ACTION' as column, count(*) from collisions where PED_ACTION not in (select key from PED_ACTION) union all
select 'ROAD_SURFACE' as column, count(*) from collisions where ROAD_SURFACE not in (select key from ROAD_SURFACE) union all
select 'ROAD_COND_1' as column, count(*) from collisions where ROAD_COND_1 not in (select key from ROAD_COND_1) union all
select 'ROAD_COND_2' as column, count(*) from collisions where ROAD_COND_2 not in (select key from ROAD_COND_2) union all
select 'LIGHTING' as column, count(*) from collisions where LIGHTING not in (select key from LIGHTING) union all
select 'CONTROL_DEVICE' as column, count(*) from collisions where CONTROL_DEVICE not in (select key from CONTROL_DEVICE) union all
select 'STWD_VEHTYPE_AT_FAULT' as column, count(*) from collisions where STWD_VEHTYPE_AT_FAULT not in (select key from STWD_VEHTYPE_AT_FAULT) union all
select 'CHP_VEHTYPE_AT_FAULT' as column, count(*) from collisions where CHP_VEHTYPE_AT_FAULT not in (select key from CHP_VEHTYPE_AT_FAULT) union all
select 'PRIMARY_RAMP' as column, count(*) from collisions where PRIMARY_RAMP not in (select key from PRIMARY_RAMP) union all
select 'SECONDARY_RAMP' as column, count(*) from collisions where SECONDARY_RAMP not in (select key from SECONDARY_RAMP)
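A query like that is tedious to write by hand; it can be generated from the two lists of foreign-key columns (column names taken from the schema above; the helper itself is my own sketch, and the branch order differs from the query as pasted):

```python
# Columns keyed by integer [id] vs. text [key], per the table schema
ID_COLUMNS = ["DAY_OF_WEEK", "COLLISION_SEVERITY"]
KEY_COLUMNS = [
    "WEATHER_1", "WEATHER_2", "LOCATION_TYPE", "RAMP_INTERSECTION",
    "SIDE_OF_HWY", "PRIMARY_COLL_FACTOR", "PCF_VIOL_CATEGORY",
    "TYPE_OF_COLLISION", "MVIW", "PED_ACTION", "ROAD_SURFACE",
    "ROAD_COND_1", "ROAD_COND_2", "LIGHTING", "CONTROL_DEVICE",
    "STWD_VEHTYPE_AT_FAULT", "CHP_VEHTYPE_AT_FAULT",
    "PRIMARY_RAMP", "SECONDARY_RAMP",
]

def audit_sql():
    # One branch per FK column, counting rows whose value has no matching key
    branches = [
        f"select '{col}' as column, count(*) from collisions "
        f"where {col} not in (select {pk} from {col})"
        for col, pk in [(c, "id") for c in ID_COLUMNS]
        + [(c, "key") for c in KEY_COLUMNS]
    ]
    return " union all\n".join(branches)

print(audit_sql().count("union all"))  # 20 -- i.e. 21 branches joined
```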
Running that here produces:
column | count(*)
--- | ---
DAY_OF_WEEK | 0
WEATHER_1 | 0
WEATHER_2 | 0
LOCATION_TYPE | 0
RAMP_INTERSECTION | 14236
SIDE_OF_HWY | 0
COLLISION_SEVERITY | 0
PRIMARY_COLL_FACTOR | 0
PCF_VIOL_CATEGORY | 287
TYPE_OF_COLLISION | 665
MVIW | 17
PED_ACTION | 0
ROAD_SURFACE | 0
ROAD_COND_1 | 0
ROAD_COND_2 | 0
LIGHTING | 0
CONTROL_DEVICE | 0
STWD_VEHTYPE_AT_FAULT | 0
CHP_VEHTYPE_AT_FAULT | 1629
PRIMARY_RAMP | 0
SECONDARY_RAMP | 5
Looking at RAMP_INTERSECTION as an example - that table has empty strings instead of nulls for rows that are missing a value. Maybe that's the source of the Datasette bug?
https://rbl-collisions.fly.dev/records/collisions/4750401.json?_shape=array
[
    {
        "CASE_ID": 4750401,
        "COLLISION_DATE": "2012-04-10",
        "COLLISION_TIME": 1347,
        "OFFICER_ID": "23",
        "REPORTING_DISTRICT": "21",
        "DAY_OF_WEEK": 2,
        "CNTY_CITY_LOC": 103,
        "BEAT_NUMBER": "016",
        "PRIMARY_RD": "UNIVERSITY AV",
        "SECONDARY_RD": "6TH ST",
        "DISTANCE": 0,
        "DIRECTION": "",
        "INTERSECTION": "Y",
        "WEATHER_1": "B",
        "WEATHER_2": "C",
        "STATE_HWY_IND": "N",
        "CALTRANS_COUNTY": "",
        "CALTRANS_DISTRICT": "",
        "STATE_ROUTE": "",
        "POSTMILE": "",
        "LOCATION_TYPE": "",
        "RAMP_INTERSECTION": "",
        "SIDE_OF_HWY": "",
        "TOW_AWAY": "Y",
        "COLLISION_SEVERITY": 4,
        "NUMBER_KILLED": 0,
        "NUMBER_INJURED": 2,
        "PARTY_COUNT": 2,
        "PRIMARY_COLL_FACTOR": "A",
        "PCF_VIOL_CATEGORY": "12",
        "PCF_VIOLATION": 21453,
        "PCF_VIOL_SUBSECTION": "A",
        "HIT_AND_RUN": "N",
        "TYPE_OF_COLLISION": "D",
        "MVIW": "C",
        "PED_ACTION": "A",
        "ROAD_SURFACE": "B",
        "ROAD_COND_1": "H",
        "ROAD_COND_2": "-",
        "LIGHTING": "A",
        "CONTROL_DEVICE": "A",
        "PEDESTRIAN_ACCIDENT": "",
        "BICYCLE_ACCIDENT": "",
        "MOTORCYCLE_ACCIDENT": "",
        "TRUCK_ACCIDENT": "",
        "NOT_PRIVATE_PROPERTY": "Y",
        "ALCOHOL_INVOLVED": "",
        "STWD_VEHTYPE_AT_FAULT": "J",
        "CHP_VEHTYPE_AT_FAULT": "48",
        "COUNT_SEVERE_INJ": 0,
        "COUNT_VISIBLE_INJ": 0,
        "COUNT_COMPLAINT_PAIN": 2,
        "COUNT_PED_KILLED": 0,
        "COUNT_PED_INJURED": 0,
        "COUNT_BICYCLIST_KILLED": 0,
        "COUNT_BICYCLIST_INJURED": 0,
        "COUNT_MC_KILLED": 0,
        "COUNT_MC_INJURED": 0,
        "PRIMARY_RAMP": "-",
        "SECONDARY_RAMP": "-",
        "LATITUDE": "",
        "LONGITUDE": "",
        "ADDRESS": "UNIVERSITY AV and 6TH ST, Berkeley, CA",
        "SEVERITY_INDEX": "4"
    }
]
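Assuming the empty strings (and possibly the "-" placeholders) really mean "no value", they could be normalized to real NULLs at import time, which would route those cells through the clean `cell is None` branch of the CSV writer. A hypothetical cleanup sketch, with an in-memory stand-in for records.db (whether "-" should count as missing is a data-modelling judgment call):

```python
import sqlite3

# In-memory stand-in for the real collisions table
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table collisions (CASE_ID integer primary key, RAMP_INTERSECTION text);
    insert into collisions values (1, ''), (2, '3'), (3, '- ');
""")

# Replace empty-string / dash placeholders with real NULLs
for column in ("RAMP_INTERSECTION",):
    conn.execute(
        f"update collisions set [{column}] = null "
        f"where trim([{column}]) in ('', '-')"
    )

print(conn.execute("select CASE_ID, RAMP_INTERSECTION from collisions").fetchall())
# [(1, None), (2, '3'), (3, None)]
```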
I grabbed a copy of the https://rbl-collisions.fly.dev/records.db DB and ran it locally. Here's where the error message comes from:
datasette/datasette/views/base.py
Lines 553 to 568 in 4284c74
I added a breakpoint() to get a debugger at that point, and found that the error is an assertion error:
560                         else:
561                             new_row.append(cell)
562                     await writer.writerow(new_row)
563             except Exception as ex:
564                 breakpoint()
565  ->             sys.stderr.write("Caught this error: {}\n".format(ex))
566                 sys.stderr.flush()
567                 await r.write(str(e))
568                 return
569             await limited_writer.write(postamble)
570
(Pdb) ex
AssertionError()
(Pdb) type(ex)
<class 'AssertionError'>
On a hunch I applied this diff:
diff --git a/datasette/views/base.py b/datasette/views/base.py
index 0080b33c..000a2e75 100644
--- a/datasette/views/base.py
+++ b/datasette/views/base.py
@@ -554,16 +554,20 @@ async def stream_csv(datasette, fetch_data, request, database):
                         if cell is None:
                             new_row.extend(("", ""))
                         else:
-                            assert isinstance(cell, dict)
+                            assert isinstance(
+                                cell, dict
+                            ), "cell {} should have been a dict".format(
+                                repr(cell)
+                            )
                             new_row.append(cell["value"])
                             new_row.append(cell["label"])
                     else:
                         new_row.append(cell)
                 await writer.writerow(new_row)
-        except Exception as e:
-            sys.stderr.write("Caught this error: {}\n".format(e))
+        except Exception as ex:
+            sys.stderr.write("Caught this error: {}\n".format(ex))
             sys.stderr.flush()
-            await r.write(str(e))
+            await r.write(str(ex))
             return
         await limited_writer.write(postamble)
And now the error message that is displayed says:
Caught this error: cell '- ' should have been a dict
Using assertions here is bad in the first place, because they would be ignored if anyone ran Datasette with the `python -O` option.
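This is easy to demonstrate: the same failing assert is compiled away entirely under -O:

```python
import subprocess
import sys

code = "assert False, 'never even evaluated under -O'"
normal = subprocess.run([sys.executable, "-c", code], capture_output=True)
optimized = subprocess.run([sys.executable, "-O", "-c", code], capture_output=True)
print(normal.returncode, optimized.returncode)  # 1 0
```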
So the core bug is here:
datasette/datasette/views/base.py
Lines 546 to 562 in 4284c74
But I should clean up the way assertions are used here too.
The row that's breaking looks like this in the debugger:
{'ADDRESS': 'ASHBY AV and CLAIREMONT AV, Berkeley, CA',
'ALCOHOL_INVOLVED': '',
'BEAT_NUMBER': '008',
'BICYCLE_ACCIDENT': '',
'CALTRANS_COUNTY': 'ALA',
'CALTRANS_DISTRICT': 4,
'CASE_ID': 5101393,
'CHP_VEHTYPE_AT_FAULT': '- ',
'CNTY_CITY_LOC': 103,
Note CHP_VEHTYPE_AT_FAULT is '- ' - but that's a column which should be expanded as a foreign key.
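In other words, label expansion only wraps a cell in a {"value": ..., "label": ...} dict when the foreign key actually resolves; an unresolved value like '- ' comes through as a bare string, which is what trips the assert. A minimal sketch of that behavior (my own simplification, not Datasette's actual code; the lookup row is hypothetical):

```python
def expand_cell(cell, labels):
    """Mimic FK label expansion: wrap resolvable keys, pass others through."""
    if cell is None:
        return None
    if cell in labels:
        return {"value": cell, "label": labels[cell]}
    return cell  # unresolved foreign key: stays a bare string

labels = {"48": "some label"}  # hypothetical CHP_VEHTYPE_AT_FAULT lookup row
print(expand_cell("48", labels))  # {'value': '48', 'label': 'some label'}
print(expand_cell("- ", labels))  # '- ' -- this is what the assert chokes on
```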
SQL query:
select
    CHP_VEHTYPE_AT_FAULT, count(*)
from
    collisions
where
    CHP_VEHTYPE_AT_FAULT not in (
        select
            key
        from
            CHP_VEHTYPE_AT_FAULT
    )
group by CHP_VEHTYPE_AT_FAULT
Run here as JSON returns:
[
    {
        "CHP_VEHTYPE_AT_FAULT": "",
        "count(*)": 561
    },
    {
        "CHP_VEHTYPE_AT_FAULT": "- ",
        "count(*)": 1065
    },
    {
        "CHP_VEHTYPE_AT_FAULT": "33",
        "count(*)": 1
    },
    {
        "CHP_VEHTYPE_AT_FAULT": "91",
        "count(*)": 1
    },
    {
        "CHP_VEHTYPE_AT_FAULT": "93",
        "count(*)": 1
    }
]
So I think the root bug here is that Datasette CSV export, with expand labels enabled, fails if any of the rows have a foreign key that doesn't correctly resolve to the other table.
I think the correct behavior here is to output the value column with the value from the table, and leave the label column blank.
I prototyped that up and here's what that export looks like with the CSV opened in Numbers:
Code:
diff --git a/datasette/views/base.py b/datasette/views/base.py
index 0080b33c..9ef403e0 100644
--- a/datasette/views/base.py
+++ b/datasette/views/base.py
@@ -554,16 +554,20 @@ async def stream_csv(datasette, fetch_data, request, database):
                         if cell is None:
                             new_row.extend(("", ""))
                         else:
-                            assert isinstance(cell, dict)
-                            new_row.append(cell["value"])
-                            new_row.append(cell["label"])
+                            if not isinstance(
+                                cell, dict
+                            ):
+                                new_row.extend((cell, ""))
+                            else:
+                                new_row.append(cell["value"])
+                                new_row.append(cell["label"])
                     else:
                         new_row.append(cell)
                 await writer.writerow(new_row)
Need test coverage and I'll land this fix.
Wrote this test but it fails when I expect it to pass:
@pytest.mark.asyncio
async def test_table_csv_with_invalid_labels():
    # https://github.com/simonw/datasette/issues/2214
    ds = Datasette()
    await ds.invoke_startup()
    db = ds.add_memory_database("db_2214")
    await db.execute_write_script(
        """
        create table t1 (id integer primary key, name text);
        insert into t1 (id, name) values (1, 'one');
        insert into t1 (id, name) values (2, 'two');
        create table t2 (textid text primary key, value text);
        insert into t2 (textid, value) values ('a', 'alpha');
        insert into t2 (textid, value) values ('b', 'beta');
        create table if not exists maintable (
            id integer primary key,
            fk_integer integer references t2(id),
            fk_text text references t3(textid)
        );
        insert into maintable (id, fk_integer, fk_text) values (1, 1, 'a');
        insert into maintable (id, fk_integer, fk_text) values (2, 3, 'b'); -- invalid fk_integer
        insert into maintable (id, fk_integer, fk_text) values (3, 2, 'c'); -- invalid fk_text
        """
    )
    response = await ds.client.get("/db_2214/maintable.csv?_labels=1")
    assert response.status_code == 200
    assert response.text == (
        "id,fk_integer,fk_integer_label,fk_text,fk_text_label\r\n"
        "1,1,one,a,alpha\r\n"
        "2,3,,b,beta\r\n"
        "3,2,two,c,\r\n"
    )
Might be that the original bug only triggers in some other circumstance - e.g. perhaps if NONE of the columns resolve correctly?
I got it to print cell, and saw this:
{'value': 1, 'label': '1'}
{'value': 'a', 'label': 'a'}
{'value': 3, 'label': '3'}
{'value': 'b', 'label': 'b'}
{'value': 2, 'label': '2'}
{'value': 'c', 'label': 'c'}
It was a bug in my test, now fixed.
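The bug, presumably: the test schema pointed fk_integer at t2(id) and fk_text at a nonexistent t3, instead of t1(id) and t2(textid). A standalone sketch (my reconstruction, using plain sqlite3 rather than Datasette) showing that the corrected references produce exactly the value/label pairs the test expects:

```python
import sqlite3

# Corrected schema: fk_integer -> t1(id), fk_text -> t2(textid)
# (the test as pasted referenced t2(id) and a nonexistent t3)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table t1 (id integer primary key, name text);
    insert into t1 (id, name) values (1, 'one'), (2, 'two');
    create table t2 (textid text primary key, value text);
    insert into t2 (textid, value) values ('a', 'alpha'), ('b', 'beta');
    create table maintable (
        id integer primary key,
        fk_integer integer references t1(id),
        fk_text text references t2(textid)
    );
    insert into maintable values (1, 1, 'a'), (2, 3, 'b'), (3, 2, 'c');
""")
rows = conn.execute("""
    select m.id, m.fk_integer, t1.name, m.fk_text, t2.value
    from maintable m
    left join t1 on t1.id = m.fk_integer
    left join t2 on t2.textid = m.fk_text
    order by m.id
""").fetchall()
print(rows)
# [(1, 1, 'one', 'a', 'alpha'), (2, 3, None, 'b', 'beta'), (3, 2, 'two', 'c', None)]
```

The None labels correspond to the blank label cells in the expected CSV ("2,3,,b,beta" and "3,2,two,c,").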
I'm going to back-port this to the 0.64.x branch and release the fix.
I installed datasette-0.64.6 and re-ran my test and was able to download the full export with no errors. Thank you so much for your help, @simonw!