gzt5142 / nldi-crawler-py

This project forked from internetofwater/nldi-crawler

Network Linked Data Index Crawler

Home Page: https://labs.waterdata.usgs.gov/about-nldi/

License: Other

Python 18.57% Jupyter Notebook 80.90% Dockerfile 0.52%

nldi-crawler-py's Introduction

👋 Hi, I'm Gene (@gzt5142)

nldi-crawler-py's People

Contributors

abriggs-usgs, codacy-badger, danielnaab, dblodgett-usgs, dependabot-preview[bot], dependabot-support, dependabot[bot], dsteinich, ewojtylko, kkehl-usgs, mbucknell, skaymen, ssoper-usgs


Forkers

webb-ben

nldi-crawler-py's Issues

fix documentation

The docs folder is broken -- it still contains references to the template project from which I copied it.

To be fixed.

Fetch Sample Data

Demonstrate a successful and robust call to a data source URL to retrieve its raw data.
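
As a starting point, here is a minimal sketch assuming httpx as the HTTP client (it is already in the test tooling); the retry count, timeout, and function name are illustrative, not settled API:

import httpx

def fetch_source_data(url: str, retries: int = 3, timeout: float = 30.0) -> bytes:
    """Download raw data from a crawler source URL, retrying on transient errors."""
    last_exc = None
    for attempt in range(retries):
        try:
            with httpx.Client(timeout=timeout, follow_redirects=True) as client:
                resp = client.get(url)
                resp.raise_for_status()      # raise on 4xx/5xx responses
                return resp.content          # raw bytes; caller decides how to parse
        except (httpx.TransportError, httpx.HTTPStatusError) as exc:
            last_exc = exc                   # transient failure; try again
    raise RuntimeError(f"Failed to fetch {url} after {retries} attempts") from last_exc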

use sub-commands for CLI

We are starting to get bloated with options/arguments to the CLI. Let's refactor to use sub-commands. Here's what that would look like:

> nldi-cli ./configfile.toml list
... lists all sources ...

> nldi-cli validate 
... connects to each source, verifying that it can get some data ...

> nldi-cli validate 13
... connects to source 13 to verify that it can get some data ...

> nldi-cli download 13
... connects to source 13 and fetches data to local disk. Does not process it ...

> nldi-cli ingest 13
... reads from source 13, processes data, and updates nldi-db ...

As usual, the global switches for 'help' and 'verbose' will apply, as will 'config' to specify the config.toml file used to establish database connection information.
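
A sketch of how that could be wired up, assuming click is the CLI framework; the decorator layout, option names, and command stubs below are assumptions, not the final interface:

import click

@click.group()
@click.option("--config", "config_file", type=click.Path(exists=True),
              help="TOML file with database connection information.")
@click.option("--verbose", is_flag=True, help="Chatty output.")
@click.pass_context
def main(ctx, config_file, verbose):
    """nldi-cli -- crawl NLDI data sources."""
    ctx.obj = {"config": config_file, "verbose": verbose}

@main.command(name="list")
@click.pass_context
def list_sources(ctx):
    """List all sources."""
    ...

@main.command()
@click.argument("source_id", type=int, required=False)
@click.pass_context
def validate(ctx, source_id):
    """Verify that one source (or all sources, if no ID is given) can return data."""
    ...

@main.command()
@click.argument("source_id", type=int)
@click.pass_context
def download(ctx, source_id):
    """Fetch a source's data to local disk without processing it."""
    ...

@main.command()
@click.argument("source_id", type=int)
@click.pass_context
def ingest(ctx, source_id):
    """Read a source, process its data, and update nldi-db."""
    ...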

Simplify SrcRepo Data Structures

The current implementation of the various source config options takes kind of the long way around the barn. Refactor to subclass existing native Python collection data structures.
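
One possible shape, as a sketch: subclass collections.UserDict so the repo behaves like a plain dict keyed by source ID. The add/validate_all methods and the source attributes they touch are hypothetical:

from collections import UserDict

class SrcRepo(UserDict):
    """Dict-like container of crawler sources, keyed by source ID."""

    def add(self, source) -> None:
        # UserDict keeps its items in self.data; "source_id" is a hypothetical attribute.
        self.data[source.source_id] = source

    def validate_all(self) -> bool:
        # "verify" is a hypothetical per-source method.
        return all(src.verify() for src in self.data.values())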

Containerize

Figure out how to run the CLI from within a Docker container.

Link Ingested Features to Basins

From the original SQL definitions:

To link a point to its catchments:

<update id="linkPoint" >
update nldi_data.${tempTableName,jdbcType=VARCHAR} upd_table
	set comid = featureid
from nldi_data.${tempTableName,jdbcType=VARCHAR} src_table
	join nhdplus.catchmentsp
	on ST_covers(catchmentsp.the_geom, src_table.location)
	where upd_table.crawler_source_id = src_table.crawler_source_id and
		upd_table.identifier = src_table.identifier
</update>

To link a reach to its catchment:

<update id="linkReachMeasure">
update nldi_data.${tempTableName,jdbcType=VARCHAR} upd_table
	set comid = nhdflowline_np21.nhdplus_comid
from nldi_data.${tempTableName,jdbcType=VARCHAR} src_table
		  join nhdplus.nhdflowline_np21
		  on nhdflowline_np21.reachcode = src_table.reachcode and
		   src_table.measure between nhdflowline_np21.fmeasure and nhdflowline_np21.tmeasure
		  where upd_table.crawler_source_id = src_table.crawler_source_id and
		   upd_table.identifier = src_table.identifier
</update>

This needs to be implemented with SQLAlchemy in the Python port.

It would be reasonably easy to just execute a 'raw' SQL statement such as the above through SQLAlchemy -- but that opens up problems with injection, error-trapping, and debugging. Better to 'translate' it into the appropriate API calls using the connection Engine().
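
A sketch of the point-linking statement expressed through SQLAlchemy Core instead of a raw string. The engine URI and the 'feature_tmp' table name are placeholders standing in for the templated ${tempTableName}; the column names simply mirror the SQL above:

from sqlalchemy import MetaData, Table, create_engine, func, update

engine = create_engine("postgresql+psycopg2://nldi@localhost/nldi")   # placeholder URI
meta = MetaData()

# Reflect the temp feature table and the catchment table.
# (GeoAlchemy2 would give the geometry columns proper types; not required for this statement.)
tmp = Table("feature_tmp", meta, autoload_with=engine, schema="nldi_data")
catchments = Table("catchmentsp", meta, autoload_with=engine, schema="nhdplus")
src = tmp.alias("src_table")

link_point = (
    update(tmp)
    .values(comid=catchments.c.featureid)
    .where(
        func.ST_Covers(catchments.c.the_geom, src.c.location),
        tmp.c.crawler_source_id == src.c.crawler_source_id,
        tmp.c.identifier == src.c.identifier,
    )
)

with engine.begin() as conn:     # values are bound parameters, not interpolated text
    conn.execute(link_point)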

Repository Pattern

Exploring the notion that we should take the list of sources from JSON, TSV, or another source rather than from the relational database.

Refactor to use a repo pattern where the source can be plugged in... making the main business logic adapt to an interface rather than a db-specific implementation.

Initial testing successful -- and it has the side benefit of moving some source-specific functions into the object itself (because it's not tied to the ORM).

Proof-of-concept is done; refactor CLI to make use of it.
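
A sketch of the interface the business logic would code against; the method names echo the iterator example further down, but the exact protocol is an assumption:

from typing import Iterable, Protocol

class SourceRepo(Protocol):
    """Interface the business logic codes against; SQLRepo, CSVRepo, and JSONRepo
    would each provide a concrete implementation."""

    def get(self, source_id: int) -> "CrawlerSource": ...
    def get_list(self) -> Iterable["CrawlerSource"]: ...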

Ingest Types

Create Python classes for each type of ingestion, along with the mapping that relates those classes to the relevant tables in the nldi database.

JSON parser needs to account for `id`

In parsing the GeoJSON from source 12, I find that there's a bug in how we're processing features.

The source table identifies the name of a column within the properties member that uniquely identifies each feature. All of the data we've gotten back from other sources follows this pattern. Source 12 is strictly compliant with the most recent GeoJSON format specification and returns the unique ID as a separate member:

{
    "type": "Feature",
    "properties": {
        "description": "Well drilled or set into subsurface...",
        "name": "Water Well",
        "@iot.selfLink": "https://st2.newmexicowaterdata.org/FROST-Server/v1.1/Things(3798)",
        "Locations": [ ... ] , 
        "Datastreams": [ ... ],
        "//": "Many other properties..."
        "WellID": "C08EDC8F-5365-4A0C-B472-F39F8785D52A",
        "agency": "NMBGMR",
        "PointID": "QY-0070",
        "WellDepth": 140.0,
        "source_id": 31305.0,
        "GeologicFormation": "231CHNL",
        "Altitude": 4130.0,
        "geoconnex": "https://geoconnex.us/nmwdi/st/locations/3798"
    },
    "geometry": {
        "type": "Point",
        "coordinates": [
            -103.6153979576192,
            35.071781644432996
        ]
    },
    "id": "3798"
}

Note that id is not a property. It is a member of this feature at the same level of the JSON tree as type, properties, and geometry, just as the 2016 GeoJSON spec (RFC 7946) says it should be.

In all the other sources, a unique ID could be found in the properties member. I suppose we could do that here also, with the PointID field. But a more compliant parser would recognize the id member (if present) and use it.
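
A sketch of that fallback logic, assuming the feature arrives as a parsed dict and the source's configured property name is passed in (both names here are hypothetical):

def feature_identifier(feature: dict, id_property: str) -> str:
    """Prefer the top-level "id" member (RFC 7946); fall back to the configured property."""
    if "id" in feature:
        return str(feature["id"])
    return str(feature["properties"][id_property])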

Mock database connection

Current testing depends too heavily on the environment. Configure a better mock for the database for the unit tests.

Improve test coverage

I've not been keeping up with proper tests... refactor to make testing easier/better.

Also need to set up proper mocks and fixtures for httpx and PostgreSQL unit testing.
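
On the httpx side, a sketch of a pytest fixture built on httpx.MockTransport, which answers requests with canned GeoJSON and never touches the network; the fixture name and payload are illustrative:

import httpx
import pytest

@pytest.fixture
def mock_geojson_client():
    """An httpx.Client whose transport answers every request with a tiny FeatureCollection."""
    def handler(request: httpx.Request) -> httpx.Response:
        payload = {"type": "FeatureCollection", "features": []}
        return httpx.Response(200, json=payload)

    return httpx.Client(transport=httpx.MockTransport(handler))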

Feature iterator/generator

The CrawlerSource object holds information and methods about sources.
Right now, we have a separate function which takes that source and generates a collection of features (via ijson and a download function). Because that code will never be used without access to a specific source, I believe it would be better to turn it into a method on the CrawlerSource object.

sources = src.SQLRepo(db_uri)   # ...or CSVRepo() ...or JSONrepo()
source = sources.get(source_id=13)
for f in source.get_features():
    ...  # yada, yada, yada
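
A sketch of what that method could look like on CrawlerSource, streaming features with ijson so a large GeoJSON response never has to fit in memory; the downloaded_path attribute is an assumption about how the download step hands off its file:

import ijson

class CrawlerSource:
    ...

    def get_features(self):
        """Yield one GeoJSON feature dict at a time from this source's downloaded file."""
        # "downloaded_path" is a hypothetical attribute set by the download step.
        with open(self.downloaded_path, "rb") as fh:
            yield from ijson.items(fh, "features.item")   # stream the "features" array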

Populate Feature Table

Upon reading the GeoJSON for a feature returned from a source, map it to a Feature and insert it into the correct table in the nldi-demo database.

The GeoJSON parsing is already sorted out using ijson. Just need to (sketched below):

  • Create ORM for the feature table
  • Create the table with a tmp name
  • Insert all features
  • Swap tmp with the 'real' feature table
  • Clean up
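
A sketch of the swap step, assuming SQLAlchemy against PostgreSQL; the URI, schema, and table names are placeholders echoing the SQL earlier in this issue list:

from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://nldi@localhost/nldi")   # placeholder URI

def swap_feature_table(tmp_name: str, real_name: str) -> None:
    """Atomically swap a freshly loaded temp table in for the live feature table.

    Table names come from the crawler_source configuration, not user input,
    so interpolating them here is acceptable; data values still travel as
    bound parameters elsewhere.
    """
    with engine.begin() as conn:                         # one transaction: all or nothing
        conn.execute(text(f"DROP TABLE IF EXISTS nldi_data.{real_name}_old"))
        conn.execute(text(f"ALTER TABLE IF EXISTS nldi_data.{real_name} RENAME TO {real_name}_old"))
        conn.execute(text(f"ALTER TABLE nldi_data.{tmp_name} RENAME TO {real_name}"))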

SQL sanitization

Establish a means by which SQL statements are sanitized to remove attempts at injection.
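
Bound parameters are the main line of defense. A sketch of the difference using SQLAlchemy's text() construct; the engine URI and table name are assumptions:

from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://nldi@localhost/nldi")   # placeholder URI
source_id = "13"                                                      # pretend this came from the CLI

# Unsafe: splicing the value into the statement text lets crafted input inject SQL.
unsafe = f"SELECT * FROM nldi_data.crawler_source WHERE crawler_source_id = {source_id}"

# Safe: the value travels as a bound parameter, never as SQL text.
stmt = text("SELECT * FROM nldi_data.crawler_source WHERE crawler_source_id = :sid")
with engine.connect() as conn:
    rows = conn.execute(stmt, {"sid": source_id}).fetchall()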

Feature dataclass

Recommend inverting the dependency relationship between a "Feature" object and the ORM implementation of that object. Would rather have the ORM depend on the model than the other way around (as it is now).

While I don't think it makes sense to abstract away the implementation completely, I think it will be helpful to reduce coupling with the ORM for most of the business logic.
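
One way to get that inversion with SQLAlchemy is imperative (classical) mapping: the dataclass stays a plain model with no ORM imports, and the mapping layer points at it. A sketch with placeholder table and column names:

from dataclasses import dataclass
from typing import Optional

from sqlalchemy import Column, Integer, String, Table
from sqlalchemy.orm import registry

@dataclass
class Feature:
    """Plain domain model -- no knowledge of the ORM."""
    identifier: str
    crawler_source_id: int
    name: str = ""
    comid: Optional[int] = None

# The mapping layer depends on the model, not the other way around.
mapper_registry = registry()
feature_table = Table(
    "feature_tmp", mapper_registry.metadata,             # placeholder table name
    Column("identifier", String, primary_key=True),
    Column("crawler_source_id", Integer, primary_key=True),
    Column("name", String),
    Column("comid", Integer, nullable=True),
)
mapper_registry.map_imperatively(Feature, feature_table)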
