gzt5142 / nldi-crawler-py
This project is forked from internetofwater/nldi-crawler
Network Linked Data Index Crawler
Home Page: https://labs.waterdata.usgs.gov/about-nldi/
License: Other
The docs folder is broken -- it still contains references to the template project from which I copied it.
To fix.
Demonstrate a successful and robust call to a data source URL to get its raw data.
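A minimal sketch of what that call might look like with httpx (which the testing issues below already assume we use); the function name and timeout are illustrative:

```python
import httpx

def fetch_source_data(url: str, timeout: float = 60.0) -> bytes:
    """Fetch raw data from a source URL, failing loudly rather than silently."""
    with httpx.Client(follow_redirects=True, timeout=timeout) as client:
        resp = client.get(url)
        resp.raise_for_status()  # raises httpx.HTTPStatusError on 4xx/5xx
        return resp.content
```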
We are starting to get bloated with options/arguments to the CLI. Let's refactor to use sub-commands. Here's what that would look like:
> nldi-cli ./configfile.toml list
... lists all sources ...
> nldi-cli validate
... connects to each source, verifying that it can get some data ...
> nldi-cli validate 13
... connects to source 13 to verify that it can get some data ...
> nldi-cli download 13
... connects to source 13 and fetches data to local disk. Does not process it ...
> nldi-cli ingest 13
... reads from source 13, processes data, and updates nldi-db ...
As usual, the global switches for 'help' and 'verbose' will apply, as will 'config' to specify the config.toml
file used to establish database connection information.
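A rough sketch of that layout, assuming the click library (the decorators and names here are illustrative, not settled API):

```python
import click

@click.group()
@click.option("--config", "config_file", type=click.Path(), help="TOML file with db connection info.")
@click.option("--verbose", is_flag=True)
@click.pass_context
def main(ctx, config_file, verbose):
    """nldi-cli entry point; the global switches live here."""
    ctx.obj = {"config": config_file, "verbose": verbose}

@main.command("list")
def list_sources():
    """Lists all sources."""

@main.command()
@click.argument("source_id", required=False, type=int)
def validate(source_id):
    """Verifies data can be fetched from SOURCE_ID, or from every source if omitted."""

@main.command()
@click.argument("source_id", type=int)
def download(source_id):
    """Fetches data from SOURCE_ID to local disk without processing it."""

@main.command()
@click.argument("source_id", type=int)
def ingest(source_id):
    """Reads SOURCE_ID, processes the data, and updates nldi-db."""
```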
The current implementation of the various source config options takes kind of the long way around the barn. Refactor to subclass existing native Python collection data structures.
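For example (a sketch with illustrative names): subclassing collections.UserDict keeps all of the native dict behavior for free while letting us hang source-specific helpers off the object:

```python
from collections import UserDict

class SourceConfig(UserDict):
    """Acts exactly like a dict of config options; adds only conveniences."""

    @property
    def table_name(self) -> str:
        # hypothetical helper -- derive the temp table name from the suffix
        return f"feature_{self.data['source_suffix']}"
```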
Figure out how to run the CLI from within a Docker container.
From the original SQL definitions:
To link a point to its catchments:
<update id="linkPoint" >
update nldi_data.${tempTableName,jdbcType=VARCHAR} upd_table
set comid = featureid
from nldi_data.${tempTableName,jdbcType=VARCHAR} src_table
join nhdplus.catchmentsp
on ST_covers(catchmentsp.the_geom, src_table.location)
where upd_table.crawler_source_id = src_table.crawler_source_id and
upd_table.identifier = src_table.identifier
</update>
To link a reach to its catchment:
<update id="linkReachMeasure">
update nldi_data.${tempTableName,jdbcType=VARCHAR} upd_table
set comid = nhdflowline_np21.nhdplus_comid
from nldi_data.${tempTableName,jdbcType=VARCHAR} src_table
join nhdplus.nhdflowline_np21
on nhdflowline_np21.reachcode = src_table.reachcode and
src_table.measure between nhdflowline_np21.fmeasure and nhdflowline_np21.tmeasure
where upd_table.crawler_source_id = src_table.crawler_source_id and
upd_table.identifier = src_table.identifier
</update>
Need to implement this using SQLAlchemy in the Python port.
It would be reasonably easy to just execute a 'raw' SQL statement such as the above using the SQLAlchemy mechanism -- but this opens up problems with injection, error-trapping, and also debugging. Need to 'translate' this into the appropriate API calls using the connection Engine().
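A first pass at the linkPoint update in SQLAlchemy Core (a sketch, assuming SQLAlchemy 1.4+; on PostgreSQL this renders as UPDATE ... FROM, which lets us drop the MyBatis self-join):

```python
from sqlalchemy import MetaData, Table, create_engine, func, update

engine = create_engine(db_uri)  # db_uri comes from config.toml
meta = MetaData()
tmp = Table(temp_table_name, meta, autoload_with=engine, schema="nldi_data")
catchments = Table("catchmentsp", meta, autoload_with=engine, schema="nhdplus")

# update nldi_data.<tmp> set comid = catchmentsp.featureid
# from nhdplus.catchmentsp where ST_covers(the_geom, location)
stmt = (
    update(tmp)
    .values(comid=catchments.c.featureid)
    .where(func.ST_Covers(catchments.c.the_geom, tmp.c.location))
)
with engine.begin() as conn:
    conn.execute(stmt)
```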
Exploring the notion that we should take the list of sources as a JSON, TSV, or other source rather than the relational database.
Refactor to use a repo pattern where the source can be plugged in... making the main business logic adapt to an interface rather than a db-specific implementation.
Initial testing successful -- and it has the side benefit of moving some source-specific functions into the object itself (because it's not tied to the ORM).
Proof-of-concept is done; refactor CLI to make use of it.
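The interface the business logic codes against might look like this (a sketch; the method names are illustrative):

```python
from typing import Iterable, Protocol

class SourceRepo(Protocol):
    """Anything that can produce CrawlerSource objects: SQL, CSV, JSON..."""

    def get(self, source_id: int) -> "CrawlerSource": ...
    def get_list(self) -> Iterable["CrawlerSource"]: ...
```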
Set up a crawler test database to mimic the production db.
Write Python classes for each type of ingestion, along with the mapping to relate those classes to the relevant tables in the nldi database.
In parsing the GeoJSON from source 12, I find that there's a bug in how we're processing features.
The source table identifies the name of a field within the `properties` member that uniquely identifies each feature. All of the data we've gotten back from other sources does this. Source 12 is strictly compliant with the most recent GeoJSON format specification, and returns the unique ID as a separate member:
{
  "type": "Feature",
  "properties": {
    "description": "Well drilled or set into subsurface...",
    "name": "Water Well",
    "@iot.selfLink": "https://st2.newmexicowaterdata.org/FROST-Server/v1.1/Things(3798)",
    "Locations": [ ... ],
    "Datastreams": [ ... ],
    "//": "Many other properties...",
    "WellID": "C08EDC8F-5365-4A0C-B472-F39F8785D52A",
    "agency": "NMBGMR",
    "PointID": "QY-0070",
    "WellDepth": 140.0,
    "source_id": 31305.0,
    "GeologicFormation": "231CHNL",
    "Altitude": 4130.0,
    "geoconnex": "https://geoconnex.us/nmwdi/st/locations/3798"
  },
  "geometry": {
    "type": "Point",
    "coordinates": [
      -103.6153979576192,
      35.071781644432996
    ]
  },
  "id": "3798"
}
Note that `id` is not a property. It is a member of this feature at the same level of the JSON tree as `type`, `properties`, and `geometry` -- just like the 2016 GeoJSON spec says it should be.
In all the other sources, a unique ID could be found in the `properties` member. I suppose we could do that here also, with the `PointID` field. But a more compliant parser would recognize the `id` member (if present) and use it.
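The compliant behavior could be as simple as this sketch (the function and parameter names are illustrative):

```python
def feature_identifier(feature: dict, id_property: str) -> str:
    """Prefer the feature-level 'id' member, per the GeoJSON spec;
    fall back to the property named in the crawler_source table."""
    if "id" in feature:
        return str(feature["id"])
    return str(feature["properties"][id_property])
```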
Current testing depends too heavily on the environment. Configure a better mock for the db for the unit tests.
I've not been keeping up with proper tests... refactor to make testing easier/better.
Also need to set up proper mocks and fixtures for httpx and PostgreSQL unit testing.
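For httpx, one option is a pytest fixture built on httpx.MockTransport, so unit tests never touch the network (a sketch; the fixture name and canned payload are illustrative):

```python
import httpx
import pytest

@pytest.fixture
def mock_source_client():
    """An httpx.Client whose transport serves canned GeoJSON."""
    def handler(request: httpx.Request) -> httpx.Response:
        return httpx.Response(200, json={"type": "FeatureCollection", "features": []})
    return httpx.Client(transport=httpx.MockTransport(handler))
```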
Instead of treating this fork and its upstream repo as separate products, they need to have unified deployment pipelines. This issue can be closed once images created from the default branch are pushed to nldi/crawler (link is for dev).
The `CrawlerSource` object holds information and methods about sources.
Right now, we have a separate function which takes that source and generates a collection of features (via `ijson` and a download function). Because that code will never be used without access to a specific source, I believe it would be better to turn it into a method on the `CrawlerSource` object.
sources = src.SQLRepo(db_uri)   # ...or CSVRepo() ...or JSONRepo()
source = sources.get(source_id=13)
for f in source.get_features():
    # yada, yada, yada
    ...
Upon reading the GeoJSON for a feature returned from a source, map it to a `Feature` and insert it into the correct table in the `nldi-demo` database.
The GeoJSON parsing is already sorted out using `ijson`. Just need to:
Related to internetofwater/nldi-db#97
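Something along these lines (a sketch; the `Feature` model and the source attribute names are hypothetical, and geometry handling is omitted):

```python
import ijson

def ingest_features(fp, session, source):
    """Stream features from an open GeoJSON file and stage ORM rows."""
    for feat in ijson.items(fp, "features.item"):
        props = feat["properties"]
        session.add(Feature(                  # hypothetical ORM model
            crawler_source_id=source.crawler_source_id,
            identifier=str(feat.get("id") or props[source.feature_id]),
            name=props.get(source.feature_name),
        ))
    session.commit()
```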
Create a pipeline reference to an existing pipeline.
This pipeline will be developed in the CHS GitLab and will lift the containerized crawler to AWS ECR.
Means by which SQL statements are sanitized to remove attempts at injection.
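The short answer is bound parameters: never interpolate user input into the SQL string itself. A sketch using SQLAlchemy's text() construct:

```python
from sqlalchemy import text

# The driver escapes :sid itself, so a hostile source_id can't break
# out of the query.
stmt = text("SELECT * FROM nldi_data.crawler_source WHERE crawler_source_id = :sid")
with engine.connect() as conn:
    rows = conn.execute(stmt, {"sid": source_id}).fetchall()
```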
Recommend inverting the dependency relationship between a "Feature" object and the ORM implementation of that object. Would rather have the ORM depend on the model than the other way around (as it is now).
While I don't think it makes sense to abstract away the implementation completely, I think it will be helpful to reduce coupling with the ORM for most of the business logic.
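SQLAlchemy's imperative mapping supports exactly this inversion: the domain class stays plain, and the ORM layer maps it afterward. A sketch (table and column names are illustrative):

```python
from dataclasses import dataclass
from typing import Optional

from sqlalchemy import Column, Integer, MetaData, String, Table
from sqlalchemy.orm import registry

@dataclass
class Feature:
    """Plain domain model -- knows nothing about the database."""
    identifier: str = ""
    comid: Optional[int] = None

# The ORM depends on the model, not the other way around.
mapper_registry = registry()
feature_table = Table(
    "feature", MetaData(schema="nldi_data"),   # illustrative names
    Column("identifier", String, primary_key=True),
    Column("comid", Integer),
)
mapper_registry.map_imperatively(Feature, feature_table)
```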
Need better error handling for failed db connections.
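At minimum, catch OperationalError at startup and exit with a readable message instead of a raw traceback (a sketch; the function name is illustrative):

```python
import sys

import sqlalchemy
from sqlalchemy.exc import OperationalError

def connect_or_exit(db_uri: str) -> "sqlalchemy.engine.Engine":
    """Verify the database is reachable before any crawling starts."""
    engine = sqlalchemy.create_engine(db_uri)
    try:
        with engine.connect():
            pass  # connection test only
    except OperationalError as err:
        sys.exit(f"Unable to connect to database: {err}")
    return engine
```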
Create a Python dev environment using Poetry and pyproject.toml.