gzt5142 / nldi-crawler-py
This project is forked from internetofwater/nldi-crawler
Network Linked Data Index Crawler
Home Page: https://labs.waterdata.usgs.gov/about-nldi/
License: Other
The docs folder is broken -- it still contains references to the template project from which I copied it.
To fix.
Demonstrate a successful and robust call to a data source URL to get its raw data.
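A minimal sketch of what that call might look like with httpx (which the testing issues below already assume we use); the function name and timeout are illustrative:

```python
import httpx

def fetch_source_data(url: str, timeout: float = 60.0) -> bytes:
    """Fetch raw data from a source URL, failing loudly rather than silently."""
    with httpx.Client(follow_redirects=True, timeout=timeout) as client:
        resp = client.get(url)
        resp.raise_for_status()  # raises httpx.HTTPStatusError on 4xx/5xx
        return resp.content
```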
We are starting to get bloated with options/arguments to the CLI. Let's refactor to use sub-commands. Here's what that would look like:
> nldi-cli ./configfile.toml list
... lists all sources ...
> nldi-cli validate
... connects to each source, verifying that it can get some data ...
> nldi-cli validate 13
... connects to source 13 to verify that it can get some data ...
> nldi-cli download 13
... connects to source 13 and fetches data to local disk. Does not process it ...
> nldi-cli ingest 13
... reads from source 13, processes data, and updates nldi-db ...
As usual, the global switches for 'help' and 'verbose' will apply, as will 'config' to specify the config.toml
file used to establish database connection information.
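A rough sketch of that layout, assuming the click library (the decorators and names here are illustrative, not settled API):

```python
import click

@click.group()
@click.option("--config", "config_file", type=click.Path(), help="TOML file with db connection info.")
@click.option("--verbose", is_flag=True)
@click.pass_context
def main(ctx, config_file, verbose):
    """nldi-cli entry point; the global switches live here."""
    ctx.obj = {"config": config_file, "verbose": verbose}

@main.command("list")
def list_sources():
    """Lists all sources."""

@main.command()
@click.argument("source_id", required=False, type=int)
def validate(source_id):
    """Verifies data can be fetched from SOURCE_ID, or from every source if omitted."""

@main.command()
@click.argument("source_id", type=int)
def download(source_id):
    """Fetches data from SOURCE_ID to local disk without processing it."""

@main.command()
@click.argument("source_id", type=int)
def ingest(source_id):
    """Reads SOURCE_ID, processes the data, and updates nldi-db."""
```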
The current implementation of the various source config options takes kind of the long way around the barn. Refactor to subclass existing native Python collection data structures.
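For example (a sketch with illustrative names): subclassing collections.UserDict keeps all of the native dict behavior for free while letting us hang source-specific helpers off the object:

```python
from collections import UserDict

class SourceConfig(UserDict):
    """Acts exactly like a dict of config options; adds only conveniences."""

    @property
    def table_name(self) -> str:
        # hypothetical helper -- derive the temp table name from the suffix
        return f"feature_{self.data['source_suffix']}"
```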
Figure out how to run the CLI from within a Docker container.
From the original SQL definitions:
To link a point to its catchments:
<update id="linkPoint" >
update nldi_data.${tempTableName,jdbcType=VARCHAR} upd_table
set comid = featureid
from nldi_data.${tempTableName,jdbcType=VARCHAR} src_table
join nhdplus.catchmentsp
on ST_covers(catchmentsp.the_geom, src_table.location)
where upd_table.crawler_source_id = src_table.crawler_source_id and
upd_table.identifier = src_table.identifier
</update>
To link a reach to its catchment:
<update id="linkReachMeasure">
update nldi_data.${tempTableName,jdbcType=VARCHAR} upd_table
set comid = nhdflowline_np21.nhdplus_comid
from nldi_data.${tempTableName,jdbcType=VARCHAR} src_table
join nhdplus.nhdflowline_np21
on nhdflowline_np21.reachcode = src_table.reachcode and
src_table.measure between nhdflowline_np21.fmeasure and nhdflowline_np21.tmeasure
where upd_table.crawler_source_id = src_table.crawler_source_id and
upd_table.identifier = src_table.identifier
</update>
Need to implement this using SQLAlchemy in the Python port.
It would be reasonably easy to just execute a 'raw' SQL statement such as the above using the SQLAlchemy mechanism -- but this opens up problems with injection, error-trapping, and also debugging. Need to 'translate' this into the appropriate API calls using the connection Engine().
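A first pass at the linkPoint update in SQLAlchemy Core (a sketch, assuming SQLAlchemy 1.4+; on PostgreSQL this renders as UPDATE ... FROM, which lets us drop the MyBatis self-join):

```python
from sqlalchemy import MetaData, Table, create_engine, func, update

engine = create_engine(db_uri)  # db_uri comes from config.toml
meta = MetaData()
tmp = Table(temp_table_name, meta, autoload_with=engine, schema="nldi_data")
catchments = Table("catchmentsp", meta, autoload_with=engine, schema="nhdplus")

# update nldi_data.<tmp> set comid = catchmentsp.featureid
# from nhdplus.catchmentsp where ST_covers(the_geom, location)
stmt = (
    update(tmp)
    .values(comid=catchments.c.featureid)
    .where(func.ST_Covers(catchments.c.the_geom, tmp.c.location))
)
with engine.begin() as conn:
    conn.execute(stmt)
```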
Exploring the notion that we should take the list of sources as a JSON, TSV, or other source rather than the relational database.
Refactor to use a repo pattern where the source can be plugged in... making the main business logic adapt to an interface rather than a db-specific implementation.
Initial testing successful -- and it has the side benefit of moving some source-specific functions into the object itself (because it's not tied to the ORM).
Proof-of-concept is done; refactor CLI to make use of it.
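The interface the business logic codes against might look like this (a sketch; the method names are illustrative):

```python
from typing import Iterable, Protocol

class SourceRepo(Protocol):
    """Anything that can produce CrawlerSource objects: SQL, CSV, JSON..."""

    def get(self, source_id: int) -> "CrawlerSource": ...
    def get_list(self) -> Iterable["CrawlerSource"]: ...
```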
Set up a crawler test database to mimic the production db.
Write Python classes for each type of ingestion, along with the mapping to relate those classes to the relevant tables in the nldi database.
In parsing the GeoJSON from source 12, I find that there's a bug in how we're processing features.
The source table identifies the name of a field within the `properties` member that uniquely identifies each feature. All of the data we've gotten back from other sources does this. Source 12 is strictly compliant with the most recent GeoJSON format specification, and returns the unique ID as a separate member:
{
  "type": "Feature",
  "properties": {
    "description": "Well drilled or set into subsurface...",
    "name": "Water Well",
    "@iot.selfLink": "https://st2.newmexicowaterdata.org/FROST-Server/v1.1/Things(3798)",
    "Locations": [ ... ],
    "Datastreams": [ ... ],
    "//": "Many other properties...",
    "WellID": "C08EDC8F-5365-4A0C-B472-F39F8785D52A",
    "agency": "NMBGMR",
    "PointID": "QY-0070",
    "WellDepth": 140.0,
    "source_id": 31305.0,
    "GeologicFormation": "231CHNL",
    "Altitude": 4130.0,
    "geoconnex": "https://geoconnex.us/nmwdi/st/locations/3798"
  },
  "geometry": {
    "type": "Point",
    "coordinates": [
      -103.6153979576192,
      35.071781644432996
    ]
  },
  "id": "3798"
}
Note that `id` is not a property. It is a member of this feature at the same level of the JSON tree as `type`, `properties`, and `geometry` -- just like the 2016 GeoJSON spec says it should be.
In all the other sources, a unique ID could be found in the `properties` member. I suppose we could do that here also, with the `PointID` field. But a more compliant parser would recognize the `id` member (if present) and use it.
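The compliant behavior could be as simple as this sketch (the function and parameter names are illustrative):

```python
def feature_identifier(feature: dict, id_property: str) -> str:
    """Prefer the feature-level 'id' member, per the GeoJSON spec;
    fall back to the property named in the crawler_source table."""
    if "id" in feature:
        return str(feature["id"])
    return str(feature["properties"][id_property])
```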
Current testing depends too heavily on the environment. Configure a better mock for the db for the unit tests.
I've not been keeping up with proper tests... refactor to make testing easier/better.
Also need to set up proper mocks and fixtures for httpx and PostgreSQL unit testing.
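For httpx, one option is a pytest fixture built on httpx.MockTransport, so unit tests never touch the network (a sketch; the fixture name and canned payload are illustrative):

```python
import httpx
import pytest

@pytest.fixture
def mock_source_client():
    """An httpx.Client whose transport serves canned GeoJSON."""
    def handler(request: httpx.Request) -> httpx.Response:
        return httpx.Response(200, json={"type": "FeatureCollection", "features": []})
    return httpx.Client(transport=httpx.MockTransport(handler))
```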
Instead of treating this fork and its upstream repo as separate products, they need to have unified deployment pipelines. This issue can be closed once images created from the default branch are pushed to nldi/crawler (link is for dev).
The `CrawlerSource` object holds information and methods about sources.
Right now, we have a separate function which takes that source and generates a collection of features (via `ijson` and a download function). Because that code will never be used without access to a specific source, I believe it would be better to turn it into a method on the `CrawlerSource` object.
sources = src.SQLRepo(db_uri)   # ...or CSVRepo() ...or JSONRepo()
source = sources.get(source_id=13)
for f in source.get_features():
    # yada, yada, yada
    ...
Upon reading the GeoJSON for a feature returned from a source, map it to a `Feature` and insert it into the correct table in the `nldi-demo` database.
The GeoJSON parsing is already sorted out using `ijson`. Just need to:
Related to internetofwater/nldi-db#97
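Something along these lines (a sketch; the `Feature` model and the source attribute names are hypothetical, and geometry handling is omitted):

```python
import ijson

def ingest_features(fp, session, source):
    """Stream features from an open GeoJSON file and stage ORM rows."""
    for feat in ijson.items(fp, "features.item"):
        props = feat["properties"]
        session.add(Feature(                  # hypothetical ORM model
            crawler_source_id=source.crawler_source_id,
            identifier=str(feat.get("id") or props[source.feature_id]),
            name=props.get(source.feature_name),
        ))
    session.commit()
```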
Create a pipeline reference to an existing pipeline.
This pipeline will be developed in the CHS GitLab and will lift the containerized crawler to AWS ECR.
Means by which SQL statements are sanitized to remove attempts at injection.
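The short answer is bound parameters: never interpolate user input into the SQL string itself. A sketch using SQLAlchemy's text() construct:

```python
from sqlalchemy import text

# The driver escapes :sid itself, so a hostile source_id can't break
# out of the query.
stmt = text("SELECT * FROM nldi_data.crawler_source WHERE crawler_source_id = :sid")
with engine.connect() as conn:
    rows = conn.execute(stmt, {"sid": source_id}).fetchall()
```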
Recommend inverting the dependency relationship between a "Feature" object and the ORM implementation of that object. Would rather have the ORM depend on the model than the other way around (as it is now).
While I don't think it makes sense to abstract away the implementation completely, I think it will be helpful to reduce coupling with the ORM for most of the business logic.
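SQLAlchemy's imperative mapping supports exactly this inversion: the domain class stays plain, and the ORM layer maps it afterward. A sketch (table and column names are illustrative):

```python
from dataclasses import dataclass
from typing import Optional

from sqlalchemy import Column, Integer, MetaData, String, Table
from sqlalchemy.orm import registry

@dataclass
class Feature:
    """Plain domain model -- knows nothing about the database."""
    identifier: str = ""
    comid: Optional[int] = None

# The ORM depends on the model, not the other way around.
mapper_registry = registry()
feature_table = Table(
    "feature", MetaData(schema="nldi_data"),   # illustrative names
    Column("identifier", String, primary_key=True),
    Column("comid", Integer),
)
mapper_registry.map_imperatively(Feature, feature_table)
```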
Need better error handling for failed db connections.
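At minimum, catch OperationalError at startup and exit with a readable message instead of a raw traceback (a sketch; the function name is illustrative):

```python
import sys

import sqlalchemy
from sqlalchemy.exc import OperationalError

def connect_or_exit(db_uri: str) -> "sqlalchemy.engine.Engine":
    """Verify the database is reachable before any crawling starts."""
    engine = sqlalchemy.create_engine(db_uri)
    try:
        with engine.connect():
            pass  # connection test only
    except OperationalError as err:
        sys.exit(f"Unable to connect to database: {err}")
    return engine
```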
Create a Python dev environment using Poetry and pyproject.toml.