
Comments (22)

simonw commented on May 24, 2024

Fix now available in Datasette 0.64.6: https://docs.datasette.io/en/stable/changelog.html#v0-64-6

from datasette.

precipice commented on May 24, 2024

If I uncheck expand labels in the Advanced CSV export dialog, the error does not occur. Re-checking that box and re-running the export does cause the error to occur.

(screenshot: Advanced CSV export dialog, CleanShot 2023-12-06 at 23 34 58@2x)

precipice commented on May 24, 2024

I tested a bunch of the text-indexed columns and found two that (in combination) cause the export to fail in a fairly minimal case. They are RAMP_INTERSECTION and CHP_VEHTYPE_AT_FAULT, both of which are usually an int but sometimes a "-".

The table definitions are showing up in Datasette as:

[RAMP_INTERSECTION] TEXT REFERENCES [RAMP_INTERSECTION]([key]),
[CHP_VEHTYPE_AT_FAULT] TEXT REFERENCES [CHP_VEHTYPE_AT_FAULT]([key]),

Those definitions are what I would expect them to be. However, if I look at the CHP_VEHTYPE_AT_FAULT lookup table, single-digit numbers are zero-padded in its definition, but they are not in the RAMP_INTERSECTION definition. Hmm. I'll also check whether the source data zero-pads either of them in use.

I'm going to look at the source data for these cases and then try to produce a much more minimal test case, too.

simonw commented on May 24, 2024

Demo here: https://rbl-collisions.fly.dev/records/collisions.csv?_labels=on&PRIMARY_RD=ASHBY+AV - compare with https://rbl-collisions.fly.dev/records/collisions?PRIMARY_RD=ASHBY+AV&_size=max

simonw commented on May 24, 2024

This could be because of "-" vs. "- " or similar.
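One way to make a trailing-space difference like that visible is to wrap each distinct value in delimiters. A quick sketch using an in-memory stand-in for the collisions table (the table and column names come from this issue; the data here is illustrative):

```python
import sqlite3

# In-memory stand-in for the collisions database from this issue.
conn = sqlite3.connect(":memory:")
conn.execute("create table collisions (CHP_VEHTYPE_AT_FAULT text)")
conn.executemany(
    "insert into collisions values (?)",
    [("-",), ("- ",), ("48",)],
)

# Bracket each value so trailing whitespace becomes visible.
rows = conn.execute(
    "select distinct '[' || CHP_VEHTYPE_AT_FAULT || ']' from collisions order by 1"
).fetchall()
for (v,) in rows:
    print(v)  # '[- ]' and '[-]' print as distinct values
```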

simonw commented on May 24, 2024

Schema of that table:

CREATE TABLE "collisions" (
   [CASE_ID] INTEGER PRIMARY KEY,
   [COLLISION_DATE] TEXT,
   [COLLISION_TIME] INTEGER,
   [OFFICER_ID] TEXT,
   [REPORTING_DISTRICT] TEXT,
   [DAY_OF_WEEK] INTEGER REFERENCES [DAY_OF_WEEK]([id]),
   [CNTY_CITY_LOC] INTEGER,
   [BEAT_NUMBER] TEXT,
   [PRIMARY_RD] TEXT,
   [SECONDARY_RD] TEXT,
   [DISTANCE] FLOAT,
   [DIRECTION] TEXT,
   [INTERSECTION] TEXT,
   [WEATHER_1] TEXT REFERENCES [WEATHER_1]([key]),
   [WEATHER_2] TEXT REFERENCES [WEATHER_2]([key]),
   [STATE_HWY_IND] TEXT,
   [CALTRANS_COUNTY] TEXT,
   [CALTRANS_DISTRICT] INTEGER,
   [STATE_ROUTE] INTEGER,
   [POSTMILE] FLOAT,
   [LOCATION_TYPE] TEXT REFERENCES [LOCATION_TYPE]([key]),
   [RAMP_INTERSECTION] TEXT REFERENCES [RAMP_INTERSECTION]([key]),
   [SIDE_OF_HWY] TEXT REFERENCES [SIDE_OF_HWY]([key]),
   [TOW_AWAY] TEXT,
   [COLLISION_SEVERITY] INTEGER REFERENCES [COLLISION_SEVERITY]([id]),
   [NUMBER_KILLED] INTEGER,
   [NUMBER_INJURED] INTEGER,
   [PARTY_COUNT] INTEGER,
   [PRIMARY_COLL_FACTOR] TEXT REFERENCES [PRIMARY_COLL_FACTOR]([key]),
   [PCF_VIOL_CATEGORY] TEXT REFERENCES [PCF_VIOL_CATEGORY]([key]),
   [PCF_VIOLATION] INTEGER,
   [PCF_VIOL_SUBSECTION] TEXT,
   [HIT_AND_RUN] TEXT,
   [TYPE_OF_COLLISION] TEXT REFERENCES [TYPE_OF_COLLISION]([key]),
   [MVIW] TEXT REFERENCES [MVIW]([key]),
   [PED_ACTION] TEXT REFERENCES [PED_ACTION]([key]),
   [ROAD_SURFACE] TEXT REFERENCES [ROAD_SURFACE]([key]),
   [ROAD_COND_1] TEXT REFERENCES [ROAD_COND_1]([key]),
   [ROAD_COND_2] TEXT REFERENCES [ROAD_COND_2]([key]),
   [LIGHTING] TEXT REFERENCES [LIGHTING]([key]),
   [CONTROL_DEVICE] TEXT REFERENCES [CONTROL_DEVICE]([key]),
   [PEDESTRIAN_ACCIDENT] TEXT,
   [BICYCLE_ACCIDENT] TEXT,
   [MOTORCYCLE_ACCIDENT] TEXT,
   [TRUCK_ACCIDENT] TEXT,
   [NOT_PRIVATE_PROPERTY] TEXT,
   [ALCOHOL_INVOLVED] TEXT,
   [STWD_VEHTYPE_AT_FAULT] TEXT REFERENCES [STWD_VEHTYPE_AT_FAULT]([key]),
   [CHP_VEHTYPE_AT_FAULT] TEXT REFERENCES [CHP_VEHTYPE_AT_FAULT]([key]),
   [COUNT_SEVERE_INJ] INTEGER,
   [COUNT_VISIBLE_INJ] INTEGER,
   [COUNT_COMPLAINT_PAIN] INTEGER,
   [COUNT_PED_KILLED] INTEGER,
   [COUNT_PED_INJURED] INTEGER,
   [COUNT_BICYCLIST_KILLED] INTEGER,
   [COUNT_BICYCLIST_INJURED] INTEGER,
   [COUNT_MC_KILLED] INTEGER,
   [COUNT_MC_INJURED] INTEGER,
   [PRIMARY_RAMP] TEXT REFERENCES [PRIMARY_RAMP]([key]),
   [SECONDARY_RAMP] TEXT REFERENCES [SECONDARY_RAMP]([key]),
   [LATITUDE] FLOAT,
   [LONGITUDE] FLOAT,
   [ADDRESS] TEXT,
   [SEVERITY_INDEX] TEXT
);

simonw commented on May 24, 2024

Most keys are text, but some are integers:

   [DAY_OF_WEEK] INTEGER REFERENCES [DAY_OF_WEEK]([id]),
   [COLLISION_SEVERITY] INTEGER REFERENCES [COLLISION_SEVERITY]([id]),

Those look OK to me:

CREATE TABLE [DAY_OF_WEEK] (
   [id] INTEGER PRIMARY KEY,
   [name] TEXT
);

CREATE TABLE [COLLISION_SEVERITY] (
   [id] INTEGER PRIMARY KEY,
   [name] TEXT
);

Both of those tables contain only integer key values.

precipice commented on May 24, 2024

If it's useful, you can see the integer versus text key split here:

https://github.com/radical-bike-lobby/switrs-db/blob/main/import-bicycle-crashes.sh#L129

simonw commented on May 24, 2024

Generated this query to try to spot columns that have missing key references:

select 'DAY_OF_WEEK' as column, count(*) from collisions where DAY_OF_WEEK not in (select id from DAY_OF_WEEK) union all
select 'WEATHER_1' as column, count(*) from collisions where WEATHER_1 not in (select key from WEATHER_1) union all
select 'WEATHER_2' as column, count(*) from collisions where WEATHER_2 not in (select key from WEATHER_2) union all
select 'LOCATION_TYPE' as column, count(*) from collisions where LOCATION_TYPE not in (select key from LOCATION_TYPE) union all
select 'RAMP_INTERSECTION' as column, count(*) from collisions where RAMP_INTERSECTION not in (select key from RAMP_INTERSECTION) union all
select 'SIDE_OF_HWY' as column, count(*) from collisions where SIDE_OF_HWY not in (select key from SIDE_OF_HWY) union all
select 'COLLISION_SEVERITY' as column, count(*) from collisions where COLLISION_SEVERITY not in (select id from COLLISION_SEVERITY) union all
select 'PRIMARY_COLL_FACTOR' as column, count(*) from collisions where PRIMARY_COLL_FACTOR not in (select key from PRIMARY_COLL_FACTOR) union all
select 'PCF_VIOL_CATEGORY' as column, count(*) from collisions where PCF_VIOL_CATEGORY not in (select key from PCF_VIOL_CATEGORY) union all
select 'TYPE_OF_COLLISION' as column, count(*) from collisions where TYPE_OF_COLLISION not in (select key from TYPE_OF_COLLISION) union all
select 'MVIW' as column, count(*) from collisions where MVIW not in (select key from MVIW) union all
select 'PED_ACTION' as column, count(*) from collisions where PED_ACTION not in (select key from PED_ACTION) union all
select 'ROAD_SURFACE' as column, count(*) from collisions where ROAD_SURFACE not in (select key from ROAD_SURFACE) union all
select 'ROAD_COND_1' as column, count(*) from collisions where ROAD_COND_1 not in (select key from ROAD_COND_1) union all
select 'ROAD_COND_2' as column, count(*) from collisions where ROAD_COND_2 not in (select key from ROAD_COND_2) union all
select 'LIGHTING' as column, count(*) from collisions where LIGHTING not in (select key from LIGHTING) union all
select 'CONTROL_DEVICE' as column, count(*) from collisions where CONTROL_DEVICE not in (select key from CONTROL_DEVICE) union all
select 'STWD_VEHTYPE_AT_FAULT' as column, count(*) from collisions where STWD_VEHTYPE_AT_FAULT not in (select key from STWD_VEHTYPE_AT_FAULT) union all
select 'CHP_VEHTYPE_AT_FAULT' as column, count(*) from collisions where CHP_VEHTYPE_AT_FAULT not in (select key from CHP_VEHTYPE_AT_FAULT) union all
select 'PRIMARY_RAMP' as column, count(*) from collisions where PRIMARY_RAMP not in (select key from PRIMARY_RAMP) union all
select 'SECONDARY_RAMP' as column, count(*) from collisions where SECONDARY_RAMP not in (select key from SECONDARY_RAMP)

Running that here produces:

column count(*)
DAY_OF_WEEK 0
WEATHER_1 0
WEATHER_2 0
LOCATION_TYPE 0
RAMP_INTERSECTION 14236
SIDE_OF_HWY 0
COLLISION_SEVERITY 0
PRIMARY_COLL_FACTOR 0
PCF_VIOL_CATEGORY 287
TYPE_OF_COLLISION 665
MVIW 17
PED_ACTION 0
ROAD_SURFACE 0
ROAD_COND_1 0
ROAD_COND_2 0
LIGHTING 0
CONTROL_DEVICE 0
STWD_VEHTYPE_AT_FAULT 0
CHP_VEHTYPE_AT_FAULT 1629
PRIMARY_RAMP 0
SECONDARY_RAMP 5
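Queries like the one above can also be generated mechanically from SQLite's own foreign-key metadata instead of by hand. A sketch using `PRAGMA foreign_key_list` (the helper name `build_orphan_check_sql` and the tiny demo schema are illustrative, not part of this issue's database):

```python
import sqlite3

def build_orphan_check_sql(conn, table):
    """Build one UNION ALL query counting rows whose foreign-key values
    have no matching row in the referenced table."""
    fks = conn.execute(f"pragma foreign_key_list([{table}])").fetchall()
    parts = []
    # foreign_key_list rows: (id, seq, table, from, to, on_update, on_delete, match)
    for _, _, ref_table, from_col, to_col, *_ in fks:
        parts.append(
            f"select '{from_col}' as [column], count(*) from [{table}] "
            f"where [{from_col}] not in (select [{to_col}] from [{ref_table}])"
        )
    return " union all\n".join(parts)

# Tiny demo schema with one dangling reference.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table WEATHER_1 ([key] text primary key, name text);
    insert into WEATHER_1 values ('A', 'Clear');
    create table collisions (
        id integer primary key,
        WEATHER_1 text references WEATHER_1([key])
    );
    insert into collisions values (1, 'A'), (2, 'Z');
""")
sql = build_orphan_check_sql(conn, "collisions")
print(conn.execute(sql).fetchall())  # [('WEATHER_1', 1)]
```

Note one caveat with this approach: `not in` against a subquery behaves unintuitively if the referenced key column contains NULLs.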

simonw commented on May 24, 2024

Looking at RAMP_INTERSECTION as an example - that table has empty strings instead of nulls for rows that are missing a value. Maybe that's the source of the Datasette bug?

https://rbl-collisions.fly.dev/records/collisions/4750401.json?_shape=array

[
  {
    "CASE_ID": 4750401,
    "COLLISION_DATE": "2012-04-10",
    "COLLISION_TIME": 1347,
    "OFFICER_ID": "23",
    "REPORTING_DISTRICT": "21",
    "DAY_OF_WEEK": 2,
    "CNTY_CITY_LOC": 103,
    "BEAT_NUMBER": "016",
    "PRIMARY_RD": "UNIVERSITY AV",
    "SECONDARY_RD": "6TH ST",
    "DISTANCE": 0,
    "DIRECTION": "",
    "INTERSECTION": "Y",
    "WEATHER_1": "B",
    "WEATHER_2": "C",
    "STATE_HWY_IND": "N",
    "CALTRANS_COUNTY": "",
    "CALTRANS_DISTRICT": "",
    "STATE_ROUTE": "",
    "POSTMILE": "",
    "LOCATION_TYPE": "",
    "RAMP_INTERSECTION": "",
    "SIDE_OF_HWY": "",
    "TOW_AWAY": "Y",
    "COLLISION_SEVERITY": 4,
    "NUMBER_KILLED": 0,
    "NUMBER_INJURED": 2,
    "PARTY_COUNT": 2,
    "PRIMARY_COLL_FACTOR": "A",
    "PCF_VIOL_CATEGORY": "12",
    "PCF_VIOLATION": 21453,
    "PCF_VIOL_SUBSECTION": "A",
    "HIT_AND_RUN": "N",
    "TYPE_OF_COLLISION": "D",
    "MVIW": "C",
    "PED_ACTION": "A",
    "ROAD_SURFACE": "B",
    "ROAD_COND_1": "H",
    "ROAD_COND_2": "-",
    "LIGHTING": "A",
    "CONTROL_DEVICE": "A",
    "PEDESTRIAN_ACCIDENT": "",
    "BICYCLE_ACCIDENT": "",
    "MOTORCYCLE_ACCIDENT": "",
    "TRUCK_ACCIDENT": "",
    "NOT_PRIVATE_PROPERTY": "Y",
    "ALCOHOL_INVOLVED": "",
    "STWD_VEHTYPE_AT_FAULT": "J",
    "CHP_VEHTYPE_AT_FAULT": "48",
    "COUNT_SEVERE_INJ": 0,
    "COUNT_VISIBLE_INJ": 0,
    "COUNT_COMPLAINT_PAIN": 2,
    "COUNT_PED_KILLED": 0,
    "COUNT_PED_INJURED": 0,
    "COUNT_BICYCLIST_KILLED": 0,
    "COUNT_BICYCLIST_INJURED": 0,
    "COUNT_MC_KILLED": 0,
    "COUNT_MC_INJURED": 0,
    "PRIMARY_RAMP": "-",
    "SECONDARY_RAMP": "-",
    "LATITUDE": "",
    "LONGITUDE": "",
    "ADDRESS": "UNIVERSITY AV and 6TH ST, Berkeley, CA",
    "SEVERITY_INDEX": "4"
  }
]
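If empty strings rather than NULLs turn out to be the culprit, one cleanup option would be to normalize them at import time. A hedged sketch (column names from this issue; this is not part of the fix that actually landed in Datasette):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table collisions (id integer primary key, RAMP_INTERSECTION text);
    insert into collisions values (1, ''), (2, '3'), (3, '- ');
""")

# Normalize: empty strings and whitespace-padded '-' placeholders become NULL.
conn.execute("""
    update collisions
    set RAMP_INTERSECTION = null
    where trim(RAMP_INTERSECTION) in ('', '-')
""")
rows = conn.execute(
    "select id, RAMP_INTERSECTION from collisions order by id"
).fetchall()
print(rows)  # [(1, None), (2, '3'), (3, None)]
```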

simonw commented on May 24, 2024

I grabbed a copy of the https://rbl-collisions.fly.dev/records.db DB and ran it locally. Here's where the error message comes from:

            if heading in expanded_columns:
                if cell is None:
                    new_row.extend(("", ""))
                else:
                    assert isinstance(cell, dict)
                    new_row.append(cell["value"])
                    new_row.append(cell["label"])
            else:
                new_row.append(cell)
        await writer.writerow(new_row)
except Exception as e:
    sys.stderr.write("Caught this error: {}\n".format(e))
    sys.stderr.flush()
    await r.write(str(e))
    return
await limited_writer.write(postamble)

I added a breakpoint() to get a debugger at that point, and found that the error is an assertion error:

560  	                            else:
561  	                                new_row.append(cell)
562  	                        await writer.writerow(new_row)
563  	            except Exception as ex:
564  	                breakpoint()
565  ->	                sys.stderr.write("Caught this error: {}\n".format(ex))
566  	                sys.stderr.flush()
567  	                await r.write(str(e))
568  	                return
569  	        await limited_writer.write(postamble)
570  	
(Pdb) ex
AssertionError()
(Pdb) type(ex)
<class 'AssertionError'>

simonw commented on May 24, 2024

On a hunch I applied this diff:

diff --git a/datasette/views/base.py b/datasette/views/base.py
index 0080b33c..000a2e75 100644
--- a/datasette/views/base.py
+++ b/datasette/views/base.py
@@ -554,16 +554,20 @@ async def stream_csv(datasette, fetch_data, request, database):
                                 if cell is None:
                                     new_row.extend(("", ""))
                                 else:
-                                    assert isinstance(cell, dict)
+                                    assert isinstance(
+                                        cell, dict
+                                    ), "cell {} should have been a dict".format(
+                                        repr(cell)
+                                    )
                                     new_row.append(cell["value"])
                                     new_row.append(cell["label"])
                             else:
                                 new_row.append(cell)
                         await writer.writerow(new_row)
-            except Exception as e:
-                sys.stderr.write("Caught this error: {}\n".format(e))
+            except Exception as ex:
+                sys.stderr.write("Caught this error: {}\n".format(ex))
                 sys.stderr.flush()
-                await r.write(str(e))
+                await r.write(str(ex))
                 return
         await limited_writer.write(postamble)

And now the error message that is displayed says:

Caught this error: cell '- ' should have been a dict

simonw commented on May 24, 2024

Using assertions here is bad in the first place, because they would be ignored if anyone ran Datasette with the python -O option.
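The `-O` point is easy to demonstrate: CPython strips `assert` statements entirely when run with optimizations, so the `isinstance` check would silently vanish. A quick sketch:

```python
import subprocess
import sys

code = "assert isinstance('- ', dict); print('reached')"

# Normal run: the assertion fires and the process exits non-zero.
normal = subprocess.run(
    [sys.executable, "-c", code], capture_output=True, text=True
)
print(normal.returncode)  # 1 (AssertionError)

# With -O, the assert is compiled away and execution continues.
optimized = subprocess.run(
    [sys.executable, "-O", "-c", code], capture_output=True, text=True
)
print(optimized.returncode, optimized.stdout.strip())  # 0 reached
```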

simonw commented on May 24, 2024

So the core bug is here:

if not expanded_columns:
    # Simple path
    await writer.writerow(row)
else:
    # Look for {"value": ..., "label": ...} dicts and expand
    new_row = []
    for heading, cell in zip(data["columns"], row):
        if heading in expanded_columns:
            if cell is None:
                new_row.extend(("", ""))
            else:
                assert isinstance(cell, dict)
                new_row.append(cell["value"])
                new_row.append(cell["label"])
        else:
            new_row.append(cell)
    await writer.writerow(new_row)

But I should clean up the way assertions are used here too.
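One conventional cleanup is to replace the `assert` with an explicit exception, so the check survives `python -O`. A sketch of that idea (the function name `expand_cell` is illustrative, not the actual patch):

```python
def expand_cell(cell):
    """Expand a label-expanded cell into a (value, label) pair.
    Raises TypeError instead of asserting, so the check is not
    stripped when Python runs with -O."""
    if cell is None:
        return ("", "")
    if not isinstance(cell, dict):
        raise TypeError("cell {!r} should have been a dict".format(cell))
    return (cell["value"], cell["label"])

print(expand_cell({"value": "48", "label": "Bicycle"}))  # ('48', 'Bicycle')
```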

simonw commented on May 24, 2024

The row that's breaking looks like this in the debugger:

{'ADDRESS': 'ASHBY AV and CLAIREMONT AV, Berkeley, CA',
 'ALCOHOL_INVOLVED': '',
 'BEAT_NUMBER': '008',
 'BICYCLE_ACCIDENT': '',
 'CALTRANS_COUNTY': 'ALA',
 'CALTRANS_DISTRICT': 4,
 'CASE_ID': 5101393,
 'CHP_VEHTYPE_AT_FAULT': '- ',
 'CNTY_CITY_LOC': 103,

Note that CHP_VEHTYPE_AT_FAULT is '- ', but that's a column which should have been expanded as a foreign key.

SQL query:

select
  CHP_VEHTYPE_AT_FAULT, count(*)
from
  collisions
where
  CHP_VEHTYPE_AT_FAULT not in (
    select
      key
    from
      CHP_VEHTYPE_AT_FAULT
  )
group by CHP_VEHTYPE_AT_FAULT

Running that as JSON returns:

[
  {
    "CHP_VEHTYPE_AT_FAULT": "",
    "count(*)": 561
  },
  {
    "CHP_VEHTYPE_AT_FAULT": "- ",
    "count(*)": 1065
  },
  {
    "CHP_VEHTYPE_AT_FAULT": "33",
    "count(*)": 1
  },
  {
    "CHP_VEHTYPE_AT_FAULT": "91",
    "count(*)": 1
  },
  {
    "CHP_VEHTYPE_AT_FAULT": "93",
    "count(*)": 1
  }
]

simonw commented on May 24, 2024

So I think the root bug here is that Datasette CSV export, with expand labels enabled, fails if any of the rows have a foreign key that doesn't correctly resolve to the other table.

simonw commented on May 24, 2024

I think the correct behavior here is to output the value column with the value from the table, and leave the label column blank.

I prototyped that up and here's what that export looks like with the CSV opened in Numbers:

(screenshot: exported CSV opened in Numbers, CleanShot 2023-12-22 at 14 48 00@2x)

Code:

diff --git a/datasette/views/base.py b/datasette/views/base.py
index 0080b33c..9ef403e0 100644
--- a/datasette/views/base.py
+++ b/datasette/views/base.py
@@ -554,16 +554,20 @@ async def stream_csv(datasette, fetch_data, request, database):
                                 if cell is None:
                                     new_row.extend(("", ""))
                                 else:
-                                    assert isinstance(cell, dict)
-                                    new_row.append(cell["value"])
-                                    new_row.append(cell["label"])
+                                    if not isinstance(
+                                        cell, dict
+                                    ):
+                                        new_row.extend((cell, ""))
+                                    else:
+                                        new_row.append(cell["value"])
+                                        new_row.append(cell["label"])
                             else:
                                 new_row.append(cell)
                         await writer.writerow(new_row)
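The patched logic can be expressed as a small standalone function. A sketch of the same behavior outside Datasette (the name `expand_csv_row` and its signature are illustrative, not the actual API):

```python
def expand_csv_row(columns, row, expanded_columns):
    """Expand {'value': ..., 'label': ...} cells into value/label column pairs.
    Cells that failed to resolve (plain values) get a blank label instead of
    crashing, mirroring the fix in this issue."""
    new_row = []
    for heading, cell in zip(columns, row):
        if heading in expanded_columns:
            if cell is None:
                new_row.extend(("", ""))
            elif isinstance(cell, dict):
                new_row.extend((cell["value"], cell["label"]))
            else:
                # Unresolved foreign key: keep the raw value, blank label.
                new_row.extend((cell, ""))
        else:
            new_row.append(cell)
    return new_row

print(expand_csv_row(
    ["id", "CHP_VEHTYPE_AT_FAULT"],
    [1, "- "],
    {"CHP_VEHTYPE_AT_FAULT"},
))  # [1, '- ', '']
```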

simonw commented on May 24, 2024

Need test coverage and I'll land this fix.

simonw commented on May 24, 2024

Wrote this test but it fails when I expect it to pass:

@pytest.mark.asyncio
async def test_table_csv_with_invalid_labels():
    # https://github.com/simonw/datasette/issues/2214
    ds = Datasette()
    await ds.invoke_startup()
    db = ds.add_memory_database("db_2214")
    await db.execute_write_script(
        """
        create table t1 (id integer primary key, name text);
        insert into t1 (id, name) values (1, 'one');
        insert into t1 (id, name) values (2, 'two');
        create table t2 (textid text primary key, value text);
        insert into t2 (textid, value) values ('a', 'alpha');
        insert into t2 (textid, value) values ('b', 'beta');
        create table if not exists maintable (
            id integer primary key,
            fk_integer integer references t2(id),
            fk_text text references t3(textid)
        );
        insert into maintable (id, fk_integer, fk_text) values (1, 1, 'a');
        insert into maintable (id, fk_integer, fk_text) values (2, 3, 'b'); -- invalid fk_integer
        insert into maintable (id, fk_integer, fk_text) values (3, 2, 'c'); -- invalid fk_text
    """
    )
    response = await ds.client.get("/db_2214/maintable.csv?_labels=1")
    assert response.status_code == 200
    assert response.text == (
        "id,fk_integer,fk_integer_label,fk_text,fk_text_label\r\n"
        "1,1,one,a,alpha\r\n"
        "2,3,,b,beta\r\n"
        "3,2,two,c,\r\n"
    )

It might be that the original bug only triggers under other circumstances, e.g. if none of the columns resolve correctly?

simonw commented on May 24, 2024

I got it to print cell, and saw this:

{'value': 1, 'label': '1'}
{'value': 'a', 'label': 'a'}
{'value': 3, 'label': '3'}
{'value': 'b', 'label': 'b'}
{'value': 2, 'label': '2'}
{'value': 'c', 'label': 'c'}

It was a bug in my test (the foreign keys referenced the wrong tables: fk_integer should reference t1(id) and fk_text should reference t2(textid)), now fixed.

simonw commented on May 24, 2024

I'm going to back-port this to the 0.64.x branch and release the fix.

precipice commented on May 24, 2024

I installed Datasette 0.64.6, re-ran my test, and was able to download the full export with no errors. Thank you so much for your help, @simonw!
