
ENH: add read_postgis (dask-geopandas, open, 6 comments)

geopandas commented on June 19, 2024
ENH: add read_postgis


Comments (6)

jorisvandenbossche commented on June 19, 2024

One question that comes up here (and the same question arises for spatially repartitioning a dask.dataframe to conform to given regions): how do we deal with possible duplicates?

If you specify the regions for the spatial partitions and use ST_Intersects, a given geometry can intersect multiple regions.

In @mrocklin's prototype, he had some special code to check the representative point of a geometry instead, to avoid such duplicates: https://github.com/mrocklin/dask-geopandas/blob/8133969bf03d158f51faf85d020641e86c9a7e28/dask_geopandas/core.py#L339-L406


martinfleis commented on June 19, 2024

I was hoping that this would be resolved in the spatial repartitioning work, so we could just use the same principle here :).

Using an intersection with a representative point instead of the full geometry is one way, and it would resolve a lot of cases. But not all: you can have input points snapped to some grid and use the same grid for partitioning. Then one point could intersect the edge of 2 (or even 4) partitions (see the example below), so even that is not robust enough.
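For concreteness, a tiny shapely 2.0 example (with hypothetical coordinates) of a grid-aligned point matching two adjacent partition boxes:

```python
import shapely

pt = shapely.Point(1, 0.5)       # lies exactly on the shared edge
left = shapely.box(0, 0, 1, 1)
right = shapely.box(1, 0, 2, 1)

# intersects() includes the boundary, so the point matches both boxes
print(shapely.intersects(pt, left), shapely.intersects(pt, right))  # True True
```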

I would probably try the following:

  1. Generate a representative_point/centroid for each non-point geometry
  2. Partition based on the points (this is where ST_Intersects would be used)
  3. Find the duplicates (easy if we use query_bulk)
  4. Keep each duplicate in one partition only, no matter which one. Or use this clever thing in the code you linked - https://github.com/mrocklin/dask-geopandas/blob/8133969bf03d158f51faf85d020641e86c9a7e28/dask_geopandas/core.py#L394-L406

I am just afraid that it can be expensive.
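A minimal sketch of steps 1-4, assuming shapely 2.0 (whose STRtree.query plays the role of query_bulk here); `partition_polys` is a hypothetical list of partition geometries:

```python
import numpy as np
import geopandas as gpd
import shapely

def assign_partitions(gdf: gpd.GeoDataFrame, partition_polys):
    # 1. representative point for every geometry (points map to themselves)
    points = np.asarray(gdf.geometry.representative_point().values)
    # 2.+3. bulk intersection query; returns (point_idx, partition_idx) pairs,
    # so a point sitting on a shared edge appears 2 (or even 4) times
    tree = shapely.STRtree(np.asarray(partition_polys))
    point_idx, part_idx = tree.query(points, predicate="intersects")
    # 4. keep each duplicated point in its first matching partition only,
    # no matter which one
    first = np.unique(point_idx, return_index=True)[1]
    return point_idx[first], part_idx[first]
```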


jorisvandenbossche commented on June 19, 2024

> I am just afraid that it can be expensive.

When starting from an existing dask.dataframe (not necessarily in memory; it can also be backed by reading from e.g. parquet), a repartition is expensive anyway (since it does a full re-shuffle of the data), so the additional checks for intersects / duplicates might not be that much of a problem.
And for this case, your points above seem like a good workflow (and I think we can simply keep the duplicates in one of the partitions, no matter which one, without the clever touch/shift trick).

For reading from PostGIS it's a bit different, though. In this case, ideally the queries you run against PostGIS directly give you the correct data for the partitions of the GeoDataFrame, I think? (So we wouldn't use query_bulk on the data afterwards to check whether the partitioning is correct.) In which case the "logic" that determines which rows to select should live in the SQL query?


martinfleis commented on June 19, 2024

Are we actually able to put this logic into the SQL query? If we can get "clean" chunks from SQL that can be directly mapped to partitions, that is ideal, but that would not be a simple query. That is why I mentioned that it could be expensive: I imagined dumb reading from PostGIS and filtering on the dask side (which would require a unique id coming from PostGIS).


jorisvandenbossche commented on June 19, 2024

> Are we actually able to put this logic into the SQL query?

If we cannot (and we are thus left with the "dumb reading from PostGIS and filtering/repartitioning on the dask side"), I am not sure how useful the method would be.

Repartitioning is typically something you want to do once, and then "persist" the result or write it to disk using a more efficient format (one that preserves the spatial partitioning information). Certainly when the PostGIS table is larger than memory, doing the full shuffle on every read + calculation will not be very efficient, I think.


caspervdw commented on June 19, 2024

I have had good experience letting PostGIS handle the partitioning.

In general, the partitioning should be done right at the IO level, either by having the data preprocessed in a partitioned structure, or by having an index on the data that allows efficient spatial queries (and thus spatial partitioning). The PostGIS database falls into the second category.

When using the standard GiST index on the geometry column, the most efficient query would be "SELECT (...) WHERE geom_col && partition_box", i.e. not using ST_Intersects, which can become expensive for complex geometries.

You could then add a filter that checks whether a representative point lies in the bbox (using half-open intervals) to make the result unique in all cases; that is a rather efficient (but verbose) addition to the query, sketched below. You could also create the index "ON ST_Centroid(geometry)".
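For illustration, a hedged sketch of such a per-partition query through geopandas.read_postgis; the DSN, the table/column names (my_table, geom), the SRID, and the partition bounds are all hypothetical:

```python
import geopandas as gpd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@localhost/db")  # placeholder DSN
xmin, ymin, xmax, ymax = 0.0, 0.0, 10.0, 10.0  # one partition box

sql = f"""
SELECT *
FROM my_table
-- cheap bbox-overlap test that the GiST index can answer
WHERE geom && ST_MakeEnvelope({xmin}, {ymin}, {xmax}, {ymax}, 4326)
  -- half-open interval on the representative point keeps each row
  -- in exactly one partition box
  AND ST_X(ST_PointOnSurface(geom)) >= {xmin}
  AND ST_X(ST_PointOnSurface(geom)) <  {xmax}
  AND ST_Y(ST_PointOnSurface(geom)) >= {ymin}
  AND ST_Y(ST_PointOnSurface(geom)) <  {ymax}
"""
part = gpd.read_postgis(sql, engine, geom_col="geom")  # one partition's rows
```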

However, in my experience it is often faster to accept some duplicates and filter them out when the partitions are merged. The duplicates cause some superfluous computation, but this is an edge effect, and for many spatial analyses you actually need the duplicates to be there.

Another issue with an external database: how do you choose the partitions without doing a full scan through the data? I always used some square grid and supplied its extent myself, which mostly results in a big imbalance between partition sizes. That is certainly not a show-stopper, and when comparing geometries with array tiles it is actually a good idea.
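A minimal sketch of that square-grid approach, assuming shapely 2.0; the extent and the cell counts are hypothetical user inputs:

```python
import numpy as np
import shapely

def square_grid(xmin, ymin, xmax, ymax, nx, ny):
    """Return nx * ny partition boxes covering the supplied extent."""
    xs = np.linspace(xmin, xmax, nx + 1)
    ys = np.linspace(ymin, ymax, ny + 1)
    return [
        shapely.box(xs[i], ys[j], xs[i + 1], ys[j + 1])
        for i in range(nx)
        for j in range(ny)
    ]

# 16 equal boxes; the row counts per box can still be very imbalanced
boxes = square_grid(0, 0, 100, 100, 4, 4)
```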

For more balanced partitions: maybe we could leverage the internal index structure of the PostGIS R-tree?

