
Comments (22)

simonw commented on May 24, 2024

Fix now available in Datasette 0.64.6: https://docs.datasette.io/en/stable/changelog.html#v0-64-6

from datasette.

precipice commented on May 24, 2024

If I uncheck expand labels in the Advanced CSV export dialog, the error does not occur. Re-checking that box and re-running the export does cause the error to occur.

(screenshot: Advanced CSV export dialog, CleanShot 2023-12-06 at 23 34 58@2x)

precipice commented on May 24, 2024

I tested a bunch of the text-indexed columns and found two that (in combination) cause the export to fail in a fairly minimal case. They are RAMP_INTERSECTION and CHP_VEHTYPE_AT_FAULT, both of which are usually an int but sometimes a "-".

The table definitions are showing up in Datasette as:

[RAMP_INTERSECTION] TEXT REFERENCES [RAMP_INTERSECTION]([key]),
[CHP_VEHTYPE_AT_FAULT] TEXT REFERENCES [CHP_VEHTYPE_AT_FAULT]([key]),

Those definitions are what I would expect them to be. However, if I look at the CHP_VEHTYPE_AT_FAULT lookup table, single-digit numbers are zero-padded in its definition, but they are not in the RAMP_INTERSECTION definition. Hmm. I'll also check whether the source data zero-pads either of them in use.

I'm going to look at the source data for these cases and then try to produce a much more minimal test case, too.

simonw commented on May 24, 2024

Demo here: https://rbl-collisions.fly.dev/records/collisions.csv?_labels=on&PRIMARY_RD=ASHBY+AV - compare with https://rbl-collisions.fly.dev/records/collisions?PRIMARY_RD=ASHBY+AV&_size=max

simonw commented on May 24, 2024

This could be because of "-" vs. "- " or similar.
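One way to make a trailing-space difference like that visible is to wrap each distinct value in delimiters. A quick sketch using an in-memory stand-in for the collisions table (the table and column names come from this issue; the data here is illustrative):

```python
import sqlite3

# In-memory stand-in for the collisions database from this issue.
conn = sqlite3.connect(":memory:")
conn.execute("create table collisions (CHP_VEHTYPE_AT_FAULT text)")
conn.executemany(
    "insert into collisions values (?)",
    [("-",), ("- ",), ("48",)],
)

# Bracket each value so trailing whitespace becomes visible.
rows = conn.execute(
    "select distinct '[' || CHP_VEHTYPE_AT_FAULT || ']' from collisions order by 1"
).fetchall()
for (v,) in rows:
    print(v)  # '[- ]' and '[-]' print as distinct values
```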

simonw commented on May 24, 2024

Schema of that table:

CREATE TABLE "collisions" (
   [CASE_ID] INTEGER PRIMARY KEY,
   [COLLISION_DATE] TEXT,
   [COLLISION_TIME] INTEGER,
   [OFFICER_ID] TEXT,
   [REPORTING_DISTRICT] TEXT,
   [DAY_OF_WEEK] INTEGER REFERENCES [DAY_OF_WEEK]([id]),
   [CNTY_CITY_LOC] INTEGER,
   [BEAT_NUMBER] TEXT,
   [PRIMARY_RD] TEXT,
   [SECONDARY_RD] TEXT,
   [DISTANCE] FLOAT,
   [DIRECTION] TEXT,
   [INTERSECTION] TEXT,
   [WEATHER_1] TEXT REFERENCES [WEATHER_1]([key]),
   [WEATHER_2] TEXT REFERENCES [WEATHER_2]([key]),
   [STATE_HWY_IND] TEXT,
   [CALTRANS_COUNTY] TEXT,
   [CALTRANS_DISTRICT] INTEGER,
   [STATE_ROUTE] INTEGER,
   [POSTMILE] FLOAT,
   [LOCATION_TYPE] TEXT REFERENCES [LOCATION_TYPE]([key]),
   [RAMP_INTERSECTION] TEXT REFERENCES [RAMP_INTERSECTION]([key]),
   [SIDE_OF_HWY] TEXT REFERENCES [SIDE_OF_HWY]([key]),
   [TOW_AWAY] TEXT,
   [COLLISION_SEVERITY] INTEGER REFERENCES [COLLISION_SEVERITY]([id]),
   [NUMBER_KILLED] INTEGER,
   [NUMBER_INJURED] INTEGER,
   [PARTY_COUNT] INTEGER,
   [PRIMARY_COLL_FACTOR] TEXT REFERENCES [PRIMARY_COLL_FACTOR]([key]),
   [PCF_VIOL_CATEGORY] TEXT REFERENCES [PCF_VIOL_CATEGORY]([key]),
   [PCF_VIOLATION] INTEGER,
   [PCF_VIOL_SUBSECTION] TEXT,
   [HIT_AND_RUN] TEXT,
   [TYPE_OF_COLLISION] TEXT REFERENCES [TYPE_OF_COLLISION]([key]),
   [MVIW] TEXT REFERENCES [MVIW]([key]),
   [PED_ACTION] TEXT REFERENCES [PED_ACTION]([key]),
   [ROAD_SURFACE] TEXT REFERENCES [ROAD_SURFACE]([key]),
   [ROAD_COND_1] TEXT REFERENCES [ROAD_COND_1]([key]),
   [ROAD_COND_2] TEXT REFERENCES [ROAD_COND_2]([key]),
   [LIGHTING] TEXT REFERENCES [LIGHTING]([key]),
   [CONTROL_DEVICE] TEXT REFERENCES [CONTROL_DEVICE]([key]),
   [PEDESTRIAN_ACCIDENT] TEXT,
   [BICYCLE_ACCIDENT] TEXT,
   [MOTORCYCLE_ACCIDENT] TEXT,
   [TRUCK_ACCIDENT] TEXT,
   [NOT_PRIVATE_PROPERTY] TEXT,
   [ALCOHOL_INVOLVED] TEXT,
   [STWD_VEHTYPE_AT_FAULT] TEXT REFERENCES [STWD_VEHTYPE_AT_FAULT]([key]),
   [CHP_VEHTYPE_AT_FAULT] TEXT REFERENCES [CHP_VEHTYPE_AT_FAULT]([key]),
   [COUNT_SEVERE_INJ] INTEGER,
   [COUNT_VISIBLE_INJ] INTEGER,
   [COUNT_COMPLAINT_PAIN] INTEGER,
   [COUNT_PED_KILLED] INTEGER,
   [COUNT_PED_INJURED] INTEGER,
   [COUNT_BICYCLIST_KILLED] INTEGER,
   [COUNT_BICYCLIST_INJURED] INTEGER,
   [COUNT_MC_KILLED] INTEGER,
   [COUNT_MC_INJURED] INTEGER,
   [PRIMARY_RAMP] TEXT REFERENCES [PRIMARY_RAMP]([key]),
   [SECONDARY_RAMP] TEXT REFERENCES [SECONDARY_RAMP]([key]),
   [LATITUDE] FLOAT,
   [LONGITUDE] FLOAT,
   [ADDRESS] TEXT,
   [SEVERITY_INDEX] TEXT
);

simonw commented on May 24, 2024

Most keys are text, but some are integers:

   [DAY_OF_WEEK] INTEGER REFERENCES [DAY_OF_WEEK]([id]),
   [COLLISION_SEVERITY] INTEGER REFERENCES [COLLISION_SEVERITY]([id]),

Those look OK to me:

CREATE TABLE [DAY_OF_WEEK] (
   [id] INTEGER PRIMARY KEY,
   [name] TEXT
);

CREATE TABLE [COLLISION_SEVERITY] (
   [id] INTEGER PRIMARY KEY,
   [name] TEXT
);

Both of those tables contain only integer key values.

precipice commented on May 24, 2024

If it's useful, you can see the integer versus text key split here:

https://github.com/radical-bike-lobby/switrs-db/blob/main/import-bicycle-crashes.sh#L129

simonw commented on May 24, 2024

Generated this query to try to spot columns that have missing key references:

select 'DAY_OF_WEEK' as column, count(*) from collisions where DAY_OF_WEEK not in (select id from DAY_OF_WEEK) union all
select 'WEATHER_1' as column, count(*) from collisions where WEATHER_1 not in (select key from WEATHER_1) union all
select 'WEATHER_2' as column, count(*) from collisions where WEATHER_2 not in (select key from WEATHER_2) union all
select 'LOCATION_TYPE' as column, count(*) from collisions where LOCATION_TYPE not in (select key from LOCATION_TYPE) union all
select 'RAMP_INTERSECTION' as column, count(*) from collisions where RAMP_INTERSECTION not in (select key from RAMP_INTERSECTION) union all
select 'SIDE_OF_HWY' as column, count(*) from collisions where SIDE_OF_HWY not in (select key from SIDE_OF_HWY) union all
select 'COLLISION_SEVERITY' as column, count(*) from collisions where COLLISION_SEVERITY not in (select id from COLLISION_SEVERITY) union all
select 'PRIMARY_COLL_FACTOR' as column, count(*) from collisions where PRIMARY_COLL_FACTOR not in (select key from PRIMARY_COLL_FACTOR) union all
select 'PCF_VIOL_CATEGORY' as column, count(*) from collisions where PCF_VIOL_CATEGORY not in (select key from PCF_VIOL_CATEGORY) union all
select 'TYPE_OF_COLLISION' as column, count(*) from collisions where TYPE_OF_COLLISION not in (select key from TYPE_OF_COLLISION) union all
select 'MVIW' as column, count(*) from collisions where MVIW not in (select key from MVIW) union all
select 'PED_ACTION' as column, count(*) from collisions where PED_ACTION not in (select key from PED_ACTION) union all
select 'ROAD_SURFACE' as column, count(*) from collisions where ROAD_SURFACE not in (select key from ROAD_SURFACE) union all
select 'ROAD_COND_1' as column, count(*) from collisions where ROAD_COND_1 not in (select key from ROAD_COND_1) union all
select 'ROAD_COND_2' as column, count(*) from collisions where ROAD_COND_2 not in (select key from ROAD_COND_2) union all
select 'LIGHTING' as column, count(*) from collisions where LIGHTING not in (select key from LIGHTING) union all
select 'CONTROL_DEVICE' as column, count(*) from collisions where CONTROL_DEVICE not in (select key from CONTROL_DEVICE) union all
select 'STWD_VEHTYPE_AT_FAULT' as column, count(*) from collisions where STWD_VEHTYPE_AT_FAULT not in (select key from STWD_VEHTYPE_AT_FAULT) union all
select 'CHP_VEHTYPE_AT_FAULT' as column, count(*) from collisions where CHP_VEHTYPE_AT_FAULT not in (select key from CHP_VEHTYPE_AT_FAULT) union all
select 'PRIMARY_RAMP' as column, count(*) from collisions where PRIMARY_RAMP not in (select key from PRIMARY_RAMP) union all
select 'SECONDARY_RAMP' as column, count(*) from collisions where SECONDARY_RAMP not in (select key from SECONDARY_RAMP)

Running that here produces:

column count(*)
DAY_OF_WEEK 0
WEATHER_1 0
WEATHER_2 0
LOCATION_TYPE 0
RAMP_INTERSECTION 14236
SIDE_OF_HWY 0
COLLISION_SEVERITY 0
PRIMARY_COLL_FACTOR 0
PCF_VIOL_CATEGORY 287
TYPE_OF_COLLISION 665
MVIW 17
PED_ACTION 0
ROAD_SURFACE 0
ROAD_COND_1 0
ROAD_COND_2 0
LIGHTING 0
CONTROL_DEVICE 0
STWD_VEHTYPE_AT_FAULT 0
CHP_VEHTYPE_AT_FAULT 1629
PRIMARY_RAMP 0
SECONDARY_RAMP 5
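Queries like the one above can also be generated mechanically from SQLite's own foreign-key metadata instead of by hand. A sketch using `PRAGMA foreign_key_list` (the helper name `build_orphan_check_sql` and the tiny demo schema are illustrative, not part of this issue's database):

```python
import sqlite3

def build_orphan_check_sql(conn, table):
    """Build one UNION ALL query counting rows whose foreign-key values
    have no matching row in the referenced table."""
    fks = conn.execute(f"pragma foreign_key_list([{table}])").fetchall()
    parts = []
    # foreign_key_list rows: (id, seq, table, from, to, on_update, on_delete, match)
    for _, _, ref_table, from_col, to_col, *_ in fks:
        parts.append(
            f"select '{from_col}' as [column], count(*) from [{table}] "
            f"where [{from_col}] not in (select [{to_col}] from [{ref_table}])"
        )
    return " union all\n".join(parts)

# Tiny demo schema with one dangling reference.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table WEATHER_1 ([key] text primary key, name text);
    insert into WEATHER_1 values ('A', 'Clear');
    create table collisions (
        id integer primary key,
        WEATHER_1 text references WEATHER_1([key])
    );
    insert into collisions values (1, 'A'), (2, 'Z');
""")
sql = build_orphan_check_sql(conn, "collisions")
print(conn.execute(sql).fetchall())  # [('WEATHER_1', 1)]
```

Note one caveat with this approach: `not in` against a subquery behaves unintuitively if the referenced key column contains NULLs.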

simonw commented on May 24, 2024

Looking at RAMP_INTERSECTION as an example - that table has empty strings instead of nulls for rows that are missing a value. Maybe that's the source of the Datasette bug?

https://rbl-collisions.fly.dev/records/collisions/4750401.json?_shape=array

[
  {
    "CASE_ID": 4750401,
    "COLLISION_DATE": "2012-04-10",
    "COLLISION_TIME": 1347,
    "OFFICER_ID": "23",
    "REPORTING_DISTRICT": "21",
    "DAY_OF_WEEK": 2,
    "CNTY_CITY_LOC": 103,
    "BEAT_NUMBER": "016",
    "PRIMARY_RD": "UNIVERSITY AV",
    "SECONDARY_RD": "6TH ST",
    "DISTANCE": 0,
    "DIRECTION": "",
    "INTERSECTION": "Y",
    "WEATHER_1": "B",
    "WEATHER_2": "C",
    "STATE_HWY_IND": "N",
    "CALTRANS_COUNTY": "",
    "CALTRANS_DISTRICT": "",
    "STATE_ROUTE": "",
    "POSTMILE": "",
    "LOCATION_TYPE": "",
    "RAMP_INTERSECTION": "",
    "SIDE_OF_HWY": "",
    "TOW_AWAY": "Y",
    "COLLISION_SEVERITY": 4,
    "NUMBER_KILLED": 0,
    "NUMBER_INJURED": 2,
    "PARTY_COUNT": 2,
    "PRIMARY_COLL_FACTOR": "A",
    "PCF_VIOL_CATEGORY": "12",
    "PCF_VIOLATION": 21453,
    "PCF_VIOL_SUBSECTION": "A",
    "HIT_AND_RUN": "N",
    "TYPE_OF_COLLISION": "D",
    "MVIW": "C",
    "PED_ACTION": "A",
    "ROAD_SURFACE": "B",
    "ROAD_COND_1": "H",
    "ROAD_COND_2": "-",
    "LIGHTING": "A",
    "CONTROL_DEVICE": "A",
    "PEDESTRIAN_ACCIDENT": "",
    "BICYCLE_ACCIDENT": "",
    "MOTORCYCLE_ACCIDENT": "",
    "TRUCK_ACCIDENT": "",
    "NOT_PRIVATE_PROPERTY": "Y",
    "ALCOHOL_INVOLVED": "",
    "STWD_VEHTYPE_AT_FAULT": "J",
    "CHP_VEHTYPE_AT_FAULT": "48",
    "COUNT_SEVERE_INJ": 0,
    "COUNT_VISIBLE_INJ": 0,
    "COUNT_COMPLAINT_PAIN": 2,
    "COUNT_PED_KILLED": 0,
    "COUNT_PED_INJURED": 0,
    "COUNT_BICYCLIST_KILLED": 0,
    "COUNT_BICYCLIST_INJURED": 0,
    "COUNT_MC_KILLED": 0,
    "COUNT_MC_INJURED": 0,
    "PRIMARY_RAMP": "-",
    "SECONDARY_RAMP": "-",
    "LATITUDE": "",
    "LONGITUDE": "",
    "ADDRESS": "UNIVERSITY AV and 6TH ST, Berkeley, CA",
    "SEVERITY_INDEX": "4"
  }
]
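If empty strings rather than NULLs turn out to be the culprit, one cleanup option would be to normalize them at import time. A hedged sketch (column names from this issue; this is not part of the fix that actually landed in Datasette):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table collisions (id integer primary key, RAMP_INTERSECTION text);
    insert into collisions values (1, ''), (2, '3'), (3, '- ');
""")

# Normalize: empty strings and whitespace-padded '-' placeholders become NULL.
conn.execute("""
    update collisions
    set RAMP_INTERSECTION = null
    where trim(RAMP_INTERSECTION) in ('', '-')
""")
rows = conn.execute(
    "select id, RAMP_INTERSECTION from collisions order by id"
).fetchall()
print(rows)  # [(1, None), (2, '3'), (3, None)]
```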

simonw commented on May 24, 2024

I grabbed a copy of the https://rbl-collisions.fly.dev/records.db DB and ran it locally. Here's where the error message comes from:

            if heading in expanded_columns:
                if cell is None:
                    new_row.extend(("", ""))
                else:
                    assert isinstance(cell, dict)
                    new_row.append(cell["value"])
                    new_row.append(cell["label"])
            else:
                new_row.append(cell)
        await writer.writerow(new_row)
except Exception as e:
    sys.stderr.write("Caught this error: {}\n".format(e))
    sys.stderr.flush()
    await r.write(str(e))
    return
await limited_writer.write(postamble)

I added a breakpoint() to get a debugger at that point, and found that the error is an assertion error:

560  	                            else:
561  	                                new_row.append(cell)
562  	                        await writer.writerow(new_row)
563  	            except Exception as ex:
564  	                breakpoint()
565  ->	                sys.stderr.write("Caught this error: {}\n".format(ex))
566  	                sys.stderr.flush()
567  	                await r.write(str(e))
568  	                return
569  	        await limited_writer.write(postamble)
570  	
(Pdb) ex
AssertionError()
(Pdb) type(ex)
<class 'AssertionError'>

simonw commented on May 24, 2024

On a hunch I applied this diff:

diff --git a/datasette/views/base.py b/datasette/views/base.py
index 0080b33c..000a2e75 100644
--- a/datasette/views/base.py
+++ b/datasette/views/base.py
@@ -554,16 +554,20 @@ async def stream_csv(datasette, fetch_data, request, database):
                                 if cell is None:
                                     new_row.extend(("", ""))
                                 else:
-                                    assert isinstance(cell, dict)
+                                    assert isinstance(
+                                        cell, dict
+                                    ), "cell {} should have been a dict".format(
+                                        repr(cell)
+                                    )
                                     new_row.append(cell["value"])
                                     new_row.append(cell["label"])
                             else:
                                 new_row.append(cell)
                         await writer.writerow(new_row)
-            except Exception as e:
-                sys.stderr.write("Caught this error: {}\n".format(e))
+            except Exception as ex:
+                sys.stderr.write("Caught this error: {}\n".format(ex))
                 sys.stderr.flush()
-                await r.write(str(e))
+                await r.write(str(ex))
                 return
         await limited_writer.write(postamble)

And now the error message that is displayed says:

Caught this error: cell '- ' should have been a dict

simonw commented on May 24, 2024

Using assertions here is bad in the first place, because they would be ignored if anyone ran Datasette with the python -O option.
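The `-O` point is easy to demonstrate: CPython strips `assert` statements entirely when run with optimizations, so the `isinstance` check would silently vanish. A quick sketch:

```python
import subprocess
import sys

code = "assert isinstance('- ', dict); print('reached')"

# Normal run: the assertion fires and the process exits non-zero.
normal = subprocess.run(
    [sys.executable, "-c", code], capture_output=True, text=True
)
print(normal.returncode)  # 1 (AssertionError)

# With -O, the assert is compiled away and execution continues.
optimized = subprocess.run(
    [sys.executable, "-O", "-c", code], capture_output=True, text=True
)
print(optimized.returncode, optimized.stdout.strip())  # 0 reached
```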

simonw commented on May 24, 2024

So the core bug is here:

if not expanded_columns:
    # Simple path
    await writer.writerow(row)
else:
    # Look for {"value": ..., "label": ...} dicts and expand
    new_row = []
    for heading, cell in zip(data["columns"], row):
        if heading in expanded_columns:
            if cell is None:
                new_row.extend(("", ""))
            else:
                assert isinstance(cell, dict)
                new_row.append(cell["value"])
                new_row.append(cell["label"])
        else:
            new_row.append(cell)
    await writer.writerow(new_row)

But I should clean up the way assertions are used here too.
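One conventional cleanup is to replace the `assert` with an explicit exception, so the check survives `python -O`. A sketch of that idea (the function name `expand_cell` is illustrative, not the actual patch):

```python
def expand_cell(cell):
    """Expand a label-expanded cell into a (value, label) pair.
    Raises TypeError instead of asserting, so the check is not
    stripped when Python runs with -O."""
    if cell is None:
        return ("", "")
    if not isinstance(cell, dict):
        raise TypeError("cell {!r} should have been a dict".format(cell))
    return (cell["value"], cell["label"])

print(expand_cell({"value": "48", "label": "Bicycle"}))  # ('48', 'Bicycle')
```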

simonw commented on May 24, 2024

The row that's breaking looks like this in the debugger:

{'ADDRESS': 'ASHBY AV and CLAIREMONT AV, Berkeley, CA',
 'ALCOHOL_INVOLVED': '',
 'BEAT_NUMBER': '008',
 'BICYCLE_ACCIDENT': '',
 'CALTRANS_COUNTY': 'ALA',
 'CALTRANS_DISTRICT': 4,
 'CASE_ID': 5101393,
 'CHP_VEHTYPE_AT_FAULT': '- ',
 'CNTY_CITY_LOC': 103,

Note that CHP_VEHTYPE_AT_FAULT is '- ', but that's a column which should have been expanded as a foreign key.

SQL query:

select
  CHP_VEHTYPE_AT_FAULT, count(*)
from
  collisions
where
  CHP_VEHTYPE_AT_FAULT not in (
    select
      key
    from
      CHP_VEHTYPE_AT_FAULT
  )
group by CHP_VEHTYPE_AT_FAULT

Running that as JSON returns:

[
  {
    "CHP_VEHTYPE_AT_FAULT": "",
    "count(*)": 561
  },
  {
    "CHP_VEHTYPE_AT_FAULT": "- ",
    "count(*)": 1065
  },
  {
    "CHP_VEHTYPE_AT_FAULT": "33",
    "count(*)": 1
  },
  {
    "CHP_VEHTYPE_AT_FAULT": "91",
    "count(*)": 1
  },
  {
    "CHP_VEHTYPE_AT_FAULT": "93",
    "count(*)": 1
  }
]

simonw commented on May 24, 2024

So I think the root bug here is that Datasette CSV export, with expand labels enabled, fails if any of the rows have a foreign key that doesn't correctly resolve to the other table.

simonw commented on May 24, 2024

I think the correct behavior here is to output the value column with the value from the table, and leave the label column blank.

I prototyped that up and here's what that export looks like with the CSV opened in Numbers:

(screenshot: exported CSV opened in Numbers, CleanShot 2023-12-22 at 14 48 00@2x)

Code:

diff --git a/datasette/views/base.py b/datasette/views/base.py
index 0080b33c..9ef403e0 100644
--- a/datasette/views/base.py
+++ b/datasette/views/base.py
@@ -554,16 +554,20 @@ async def stream_csv(datasette, fetch_data, request, database):
                                 if cell is None:
                                     new_row.extend(("", ""))
                                 else:
-                                    assert isinstance(cell, dict)
-                                    new_row.append(cell["value"])
-                                    new_row.append(cell["label"])
+                                    if not isinstance(
+                                        cell, dict
+                                    ):
+                                        new_row.extend((cell, ""))
+                                    else:
+                                        new_row.append(cell["value"])
+                                        new_row.append(cell["label"])
                             else:
                                 new_row.append(cell)
                         await writer.writerow(new_row)
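The patched logic can be expressed as a small standalone function. A sketch of the same behavior outside Datasette (the name `expand_csv_row` and its signature are illustrative, not the actual API):

```python
def expand_csv_row(columns, row, expanded_columns):
    """Expand {'value': ..., 'label': ...} cells into value/label column pairs.
    Cells that failed to resolve (plain values) get a blank label instead of
    crashing, mirroring the fix in this issue."""
    new_row = []
    for heading, cell in zip(columns, row):
        if heading in expanded_columns:
            if cell is None:
                new_row.extend(("", ""))
            elif isinstance(cell, dict):
                new_row.extend((cell["value"], cell["label"]))
            else:
                # Unresolved foreign key: keep the raw value, blank label.
                new_row.extend((cell, ""))
        else:
            new_row.append(cell)
    return new_row

print(expand_csv_row(
    ["id", "CHP_VEHTYPE_AT_FAULT"],
    [1, "- "],
    {"CHP_VEHTYPE_AT_FAULT"},
))  # [1, '- ', '']
```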

simonw commented on May 24, 2024

Need test coverage and I'll land this fix.

simonw commented on May 24, 2024

Wrote this test but it fails when I expect it to pass:

@pytest.mark.asyncio
async def test_table_csv_with_invalid_labels():
    # https://github.com/simonw/datasette/issues/2214
    ds = Datasette()
    await ds.invoke_startup()
    db = ds.add_memory_database("db_2214")
    await db.execute_write_script(
        """
        create table t1 (id integer primary key, name text);
        insert into t1 (id, name) values (1, 'one');
        insert into t1 (id, name) values (2, 'two');
        create table t2 (textid text primary key, value text);
        insert into t2 (textid, value) values ('a', 'alpha');
        insert into t2 (textid, value) values ('b', 'beta');
        create table if not exists maintable (
            id integer primary key,
            fk_integer integer references t2(id),
            fk_text text references t3(textid)
        );
        insert into maintable (id, fk_integer, fk_text) values (1, 1, 'a');
        insert into maintable (id, fk_integer, fk_text) values (2, 3, 'b'); -- invalid fk_integer
        insert into maintable (id, fk_integer, fk_text) values (3, 2, 'c'); -- invalid fk_text
    """
    )
    response = await ds.client.get("/db_2214/maintable.csv?_labels=1")
    assert response.status_code == 200
    assert response.text == (
        "id,fk_integer,fk_integer_label,fk_text,fk_text_label\r\n"
        "1,1,one,a,alpha\r\n"
        "2,3,,b,beta\r\n"
        "3,2,two,c,\r\n"
    )

It might be that the original bug only triggers under other circumstances, e.g. if none of the columns resolve correctly?

simonw commented on May 24, 2024

I got it to print cell, and saw this:

{'value': 1, 'label': '1'}
{'value': 'a', 'label': 'a'}
{'value': 3, 'label': '3'}
{'value': 'b', 'label': 'b'}
{'value': 2, 'label': '2'}
{'value': 'c', 'label': 'c'}

It was a bug in my test (the foreign keys referenced the wrong tables: fk_integer should reference t1(id) and fk_text should reference t2(textid)), now fixed.

simonw commented on May 24, 2024

I'm going to back-port this to the 0.64.x branch and release the fix.

precipice commented on May 24, 2024

I installed Datasette 0.64.6, re-ran my test, and was able to download the full export with no errors. Thank you so much for your help, @simonw!
