planetlabs / gpq

Utility for working with GeoParquet

Home Page: https://planetlabs.github.io/gpq/

License: Apache License 2.0

Dockerfile 0.09% Makefile 0.36% Go 93.98% HTML 0.55% JavaScript 5.01%
geojson geoparquet parquet

gpq's People

Contributors

cholmes, dependabot[bot], jtmiclat, tschaub


gpq's Issues

Support for reading from blob storage

It should be possible to use the describe, validate, and convert commands with blob storage resource names (e.g. s3://bucket/example.parquet, gs://bucket/example.parquet, azblob://bucket/example.parquet).

The gocloud.dev/blob package provides support for multiple cloud providers.
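
As a rough sketch of reading with that package (the bucket URL and key are placeholders), the driver is selected by the URL scheme:

package main

import (
	"context"
	"fmt"
	"io"
	"log"

	"gocloud.dev/blob"
	_ "gocloud.dev/blob/azureblob" // azblob:// support
	_ "gocloud.dev/blob/gcsblob"   // gs:// support
	_ "gocloud.dev/blob/s3blob"    // s3:// support
)

func main() {
	ctx := context.Background()

	// Open the bucket half of the resource name; the scheme picks the
	// driver registered by the blank imports above.
	bucket, err := blob.OpenBucket(ctx, "s3://bucket")
	if err != nil {
		log.Fatal(err)
	}
	defer bucket.Close()

	// Stream the object; a parquet reader would wrap this (or use
	// range reads, since parquet needs random access to the footer).
	r, err := bucket.NewReader(ctx, "example.parquet", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer r.Close()

	n, err := io.Copy(io.Discard, r)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("read", n, "bytes")
}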

I'm splitting this from #93.

More flexibility for convert?

The new convert functionality works great. It seems to handle only WKB; it would be great if it could handle WKT as well.

Another great addition would be the ability to use an alternate geometry column by supplying its name - the geometry column often isn't named 'geometry'.
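
gpq already depends on paulmach/orb, whose encoding packages make a WKT fallback cheap to sketch (the helper below is illustrative, not gpq code):

package main

import (
	"fmt"
	"log"

	"github.com/paulmach/orb"
	"github.com/paulmach/orb/encoding/wkb"
	"github.com/paulmach/orb/encoding/wkt"
)

// decodeGeometry tries WKB first and falls back to WKT - roughly the
// shape a more flexible convert could take.
func decodeGeometry(raw []byte) (orb.Geometry, error) {
	if g, err := wkb.Unmarshal(raw); err == nil {
		return g, nil
	}
	return wkt.UnmarshalString(string(raw))
}

func main() {
	g, err := decodeGeometry([]byte("POINT (1 2)"))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(g)
}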

GeoParquet 1.1 support

It'd be great to more broadly support GeoParquet 1.1. There's a range of what could be done with GPQ, in rough order of importance / effort:

  • Validate the new 1.1 features - recognize the new encodings, and check for the bbox (perhaps recommend adding it if it's not there)
  • Write support for the bounding box column - writing without trying to sort would be a good first step, just assuming/hoping the file has a decent sort order (see the sketch after this list)
  • An option to sort by R-tree or some other nice spatial ordering when writing the bounding box column
  • Read a subset of data leveraging the bbox column. This is less important right now since gpq has no notion of filtering by bounds / getting a subset, but if that were added it could efficiently grab Overture data and other large datasets.
  • Native encoding / GeoArrow support, for both read and write. This might not be a huge amount of work, as there's likely good Arrow support in the underlying parquet/arrow libraries.
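
On the bounding box point: the 1.1 covering column is just per-row xmin/ymin/xmax/ymax values, which are easy to derive from orb bounds. A minimal sketch (the struct is illustrative, not gpq code):

package main

import (
	"fmt"

	"github.com/paulmach/orb"
)

// bboxRow mirrors the per-row values of a GeoParquet 1.1 bounding box
// ("covering") column.
type bboxRow struct {
	Xmin, Ymin, Xmax, Ymax float64
}

func bboxFor(g orb.Geometry) bboxRow {
	b := g.Bound()
	return bboxRow{b.Min.X(), b.Min.Y(), b.Max.X(), b.Max.Y()}
}

func main() {
	poly := orb.Polygon{{{0, 0}, {4, 0}, {4, 3}, {0, 0}}}
	fmt.Printf("%+v\n", bboxFor(poly))
}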

gpq: not able to view the geoparquet output in QGIS 3.28

Hi,
I'm using 3.28.14-Firenze for Win.
If I drag & drop a gpq GeoParquet output file, the file is not rendered in QGIS and I only see a white background.
The attribute table also contains no records.

The source file is inside this zip archive, which contains some shapefiles:
https://www.istat.it/storage/cartografia/confini_amministrativi/non_generalizzati/2023/Limiti01012023.zip

I create the GeoParquet file as detailed below.

Am I doing something wrong?

Thank you

wget -O file.zip "https://www.istat.it/storage/cartografia/confini_amministrativi/non_generalizzati/2023/Limiti01012023.zip"

unzip -o file.zip -d .

ogr2ogr -f GeoJSON -t_srs EPSG:4326 comuni.geojson Limiti01012023/Com01012023/Com01012023_WGS84.shp -lco "RFC7946=YES"

gpq convert --compression="gzip" --max 1000 --from="geojson" comuni.geojson comuni_compressed.parquet

Question: Is there a plan to expose functionality as library code?

Hi,

I am interested in using gpq to generate GeoParquet files for Who's On First (WOF) data. Ideally I would like to do that by reading and writing data on a per-record basis rather than starting with a single GeoJSON file.

Poking through the code, it appears I can stream data to gpq via STDIN, which would allow me to use an approach similar to how we derive PMTiles from WOF data.

That would solve my immediate problem, but the functionality wrapped by the gpq command - specifically the convert functionality - would be generally useful to have as library code (outside of internal).
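
To make the ask concrete, here is a hypothetical shape for an exported, per-record API. None of these names exist in gpq today; they only sketch the surface being requested:

// Package geoparquet is a hypothetical exported package; today the
// equivalent logic lives under gpq's internal/ directory.
package geoparquet

import (
	"io"

	"github.com/paulmach/orb/geojson"
)

// FeatureWriter would let callers stream records one at a time instead
// of materializing a single GeoJSON document first.
type FeatureWriter interface {
	Write(feature *geojson.Feature) error
	Close() error
}

// NewFeatureWriter is an assumed constructor wrapping the destination
// parquet stream; the signature is illustrative only.
func NewFeatureWriter(w io.Writer) (FeatureWriter, error) {
	return nil, nil // the internal convert machinery would live here
}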

Support for convert to stdout

I'd like to do something like this:

gpq convert Cairo_Governorate.parquet --stdout --to=geojson | tippecanoe -o Cairo_Governorate.pmtiles --drop-densest-as-needed

Would this functionality be useful? It would require some changes in convert.go to allow for a blank positional output argument.
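
A minimal sketch of that change (illustrative, not the actual convert.go):

package main

import (
	"fmt"
	"io"
	"os"
)

// openOutput treats a blank positional output argument as stdout.
func openOutput(path string) (io.WriteCloser, error) {
	if path == "" {
		return os.Stdout, nil
	}
	return os.Create(path)
}

func main() {
	out, err := openOutput("") // no output argument given
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer out.Close()
	fmt.Fprintln(out, "GeoJSON would stream here")
}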

Support control over number of row groups as an option

When converting to GeoParquet it can be useful to set more row groups, for more efficient querying on large files. See opengeospatial/geoparquet#183

GDAL's equivalent is ROW_GROUP_SIZE: 'Defaults to 65536. Maximum number of rows per group.'

That seems reasonable, though I was using a default of about 20k rows for my experiments, so we could consider a smaller default - I didn't see negative effects. One caveat I read: if you have lots of parquet files, a smaller row group size can slow down getting stats on the whole set. I have around 500 individual parquet files, so perhaps the effect only kicks in at thousands or tens of thousands?
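
As a sketch of how such an option could map onto writer properties, assuming the Apache Arrow Go parquet module:

package main

import (
	"fmt"

	"github.com/apache/arrow/go/v14/parquet"
)

func main() {
	// A --row-group-length style option would feed straight into the
	// writer properties; 20,000 here is the experimental value from
	// above, not a recommendation.
	props := parquet.NewWriterProperties(
		parquet.WithMaxRowGroupLength(20_000),
	)
	fmt.Println("max rows per row group:", props.MaxRowGroupLength())
}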

GeoParquet 1.1 validation issues with GeoParquet test data

We just released GeoParquet 1.1, and when I tried gpq validate on the test data with the native encoding, it produced a stack trace:

% gpq validate data-multilinestring-encoding_wkb.parquet
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x2 addr=0x10 pc=0x10333abbc]

goroutine 1 [running]:
github.com/paulmach/orb/geojson.(*Geometry).Geometry(0x0?)
	/home/runner/go/pkg/mod/github.com/paulmach/[email protected]/geojson/geometry.go:49 +0x1c
github.com/planetlabs/gpq/internal/validator.(*Validator).Report(0x140008d94e0, {0x1045465a0?, 0x105580e00}, 0x140000533e0)
	/home/runner/work/gpq/gpq/internal/validator/validator.go:242 +0x1354
github.com/planetlabs/gpq/internal/validator.(*Validator).Validate(0x140007a8900?, {0x1045465a0, 0x105580e00}, {0x14cd0cb58?, 0x140004aa998?}, {0x140007a8990, 0x29})
	/home/runner/work/gpq/gpq/internal/validator/validator.go:103 +0x12c
github.com/planetlabs/gpq/cmd/gpq/command.(*ValidateCmd).Run(0x10554a058, 0x140007c3200)
	/home/runner/work/gpq/gpq/cmd/gpq/command/validate.go:47 +0x178
reflect.Value.call({0x1042c6360?, 0x10554a058?, 0x140009bfa78?}, {0x103b02567, 0x4}, {0x14000843638, 0x1, 0x1027b9dd8?})
	/opt/hostedtoolcache/go/1.21.5/x64/src/reflect/value.go:596 +0x994
reflect.Value.Call({0x1042c6360?, 0x10554a058?, 0x104273200?}, {0x14000843638?, 0x10451b840?, 0x140008d92b0?})
	/opt/hostedtoolcache/go/1.21.5/x64/src/reflect/value.go:380 +0x94
github.com/alecthomas/kong.callFunction({0x1042c6360?, 0x10554a058?, 0x0?}, 0x103b01e47?)
	/home/runner/go/pkg/mod/github.com/alecthomas/[email protected]/callbacks.go:98 +0x370
github.com/alecthomas/kong.(*Context).RunNode(0x140007c3200, 0x140008002d0, {0x140009bff08, 0x2, 0x140007c7701?})
	/home/runner/go/pkg/mod/github.com/alecthomas/[email protected]/context.go:765 +0x634
github.com/alecthomas/kong.(*Context).Run(0x104135b40?, {0x140009bff08?, 0x0?, 0x1026a9ea8?})
	/home/runner/go/pkg/mod/github.com/alecthomas/[email protected]/context.go:790 +0x138
main.main()
	/home/runner/work/gpq/gpq/cmd/gpq/main.go:32 +0x10c
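
The panic is a nil *geojson.Geometry being dereferenced inside the validator. A minimal sketch of the guard it would need (illustrative, not the actual fix):

package main

import (
	"errors"
	"fmt"

	"github.com/paulmach/orb"
	"github.com/paulmach/orb/geojson"
)

// geometryOf guards against the nil *geojson.Geometry that results
// when a geometry can't be decoded (e.g. an unrecognized encoding).
func geometryOf(g *geojson.Geometry) (orb.Geometry, error) {
	if g == nil {
		return nil, errors.New("geometry not decoded (unsupported encoding?)")
	}
	return g.Geometry(), nil
}

func main() {
	if _, err := geometryOf(nil); err != nil {
		fmt.Println("reported instead of panicking:", err)
	}
}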

It got similar results on the 'wkb' test data, but it worked just fine on the main 1.1 GeoParquet example. I also generated a 1.1 file with GDAL's arrow support (just converting without arrow didn't seem to make GDAL write 1.1); that didn't stack trace, and behaved as I'd expect given the validator hasn't been updated for 1.1:

Summary: Passed 12 checks, failed 3 checks, 5 checks not run.

 ✓ file must include a "geo" metadata key
 ✓ metadata must be a JSON object
 ✓ metadata must include a "version" string
 ✓ metadata must include a "primary_column" string
 ✓ metadata must include a "columns" object
 ✓ column metadata must include the "primary_column" name
 ✗ column metadata must include a valid "encoding" string
   ↳ unsupported encoding "point" for column "geom"
 ✓ column metadata must include a "geometry_types" list
 ✓ optional "crs" must be null or a PROJJSON object
 ✓ optional "orientation" must be a valid string
 ✓ optional "edges" must be a valid string
 ✓ optional "bbox" must be an array of 4 or 6 numbers
 ✓ optional "epoch" must be a number
 ✗ geometry columns must not be grouped
   ↳ column "geom" must not be a group
 ✗ geometry columns must be stored using the BYTE_ARRAY parquet type
   ↳ expected primitive column for "geom"

The gpq describe command worked well on all of these, even with arrow, which was nice.

Not able to convert a geojson file

Hi,
when I run gpq convert --from="geojson" tmp.geojson tmp.parquet I get

gpq: error: failed to generate converter from first 100 features

It's a 100 MB GeoJSON file that I created using ogr2ogr from an input shapefile.

What can I do to solve the problem?

Thank you

describe and validate remote geoparquet files

It'd be awesome if I could run gpq describe and gpq validate on URLs. Doing everything over https would be the default, but if s3:// and others are easy, that'd be nice too. I'd like to easily check row groups and validity on big remote resources. I feel like Brandon mentioned a Go library he liked for this, but I forget where I saw that, so I thought I'd open an issue.

Report info on row groups in describe?

The new 'describe' looks great; the table is super helpful. I was wondering if it might be possible to include information about row groups? I use DuckDB for this; I believe it just reports the total number of row groups, which seems totally fine if that's easier than figuring out the maximum number of rows per group.
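
For what it's worth, the parquet footer already carries this information. A sketch of reading it, assuming the Apache Arrow Go parquet module:

package main

import (
	"fmt"
	"log"

	"github.com/apache/arrow/go/v14/parquet/file"
)

func main() {
	rdr, err := file.OpenParquetFile("example.parquet", false)
	if err != nil {
		log.Fatal(err)
	}
	defer rdr.Close()

	// Both the total count and the per-group row counts come straight
	// from the footer metadata, so this is cheap to report.
	fmt.Println("row groups:", rdr.NumRowGroups())
	for i := 0; i < rdr.NumRowGroups(); i++ {
		fmt.Printf("  group %d: %d rows\n", i, rdr.RowGroup(i).NumRows())
	}
}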

describe on remote source.coop files not working

I've only tried files on Source Cooperative, so there's some chance the problem is specific to those. But whenever I try describe or validate on a remote URL it fails with the same error message:

% gpq describe https://beta.source.coop/cholmes/overture/geoparquet-country-quad-hive/country_iso=JM/Jamaica.parquet
gpq: error: command.DescribeCmd.Run(): failed to read 
"https://beta.source.coop/cholmes/overture/geoparquet-country-quad-hive/country_iso=JM/Jamaica.parquet" 
as parquet: parquet: file is smaller than indicated metadata size

I'm on 0.20.0

feature request: associate srs to geoparquet

Hi,
I have converted a parquet file to GeoParquet using gpq convert.
The metadata does not include any coordinate reference system info:

"metadata": {
    "version": "",
    "primary_column": "geometry",
    "columns": {
      "geometry": {
        "encoding": "WKB",
        "geometry_types": [
          "Polygon",
          "MultiPolygon"
        ],
        "bbox": [
          313279.2514000004,
          3933846.2156000007,
          1312016.1506000003,
          5220292.292199999
        ]
      }
    }
  }

It would be useful to have a CLI option on the convert command, something like ogr2ogr's: gpq convert -a_srs EPSG:32633.

Thank you for this useful tool

About compression: is it normal for it to be so low?

Hi,
I'm testing gpq on the official administrative boundaries of Italy. The source file is this zip file:
https://www.istat.it/storage/cartografia/confini_amministrativi/non_generalizzati/2023/Limiti01012023.zip

It has a folder structure with shapefiles in it. I am running the tests on the Limiti01012023/Com01012023/Com01012023_WGS84.shp file:

  • I convert it to GeoJSON using ogr2ogr;
  • using this GeoJSON I create a gzip-compressed GeoParquet file: it is 70 MB;
  • using the same GeoJSON I create an uncompressed GeoParquet file: it is 76 MB.

They are almost equal in size. Some notes:

  • if I gzip the uncompressed parquet file I get a 57 MB file
  • if I create a sozip shp version of the source file, I get a 59 MB file

I know these outputs can't be compared directly; still, the compression in the gpq output seems very limited. Is this normal?
Am I doing something wrong?

Below is how I tested everything.

Thank you

wget -O file.zip "https://www.istat.it/storage/cartografia/confini_amministrativi/non_generalizzati/2023/Limiti01012023.zip"

unzip -o file.zip -d .

ogr2ogr -f GeoJSON -t_srs EPSG:4326 comuni.geojson Limiti01012023/Com01012023/Com01012023_WGS84.shp -lco "RFC7946=YES"

gpq convert --compression="gzip" --max 1000 --from="geojson" comuni.geojson comuni_compressed.parquet

gpq convert --compression="uncompressed" --max 1000 --from="geojson" comuni.geojson comuni_uncompressed.parquet

ogr2ogr -t_srs EPSG:4326 Com01012023_WGS84.shp.zip Limiti01012023/Com01012023/Com01012023_WGS84.shp

Simple way to add metadata to geoparquet file

If you have a parquet file, I think it would be nice to be able to just provide the JSON metadata and write it to the file. Something like:

gpq add_metadata in.parquet out.geoparquet metadata.json
or
gpq convert in.parquet out.geoparquet --metadata="metadata.json"

In my use case I know the metadata up front and just need to add the GeoParquet metadata to the file.
This could also be a simple way to fix/override GeoParquet metadata in case of bugs such as #45.
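
A sketch of the front half of such a command: load the user-supplied metadata and sanity-check the required GeoParquet keys before it would be written under the file's "geo" footer key (the file name and checks are illustrative):

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
)

func main() {
	raw, err := os.ReadFile("metadata.json")
	if err != nil {
		log.Fatal(err)
	}
	var meta map[string]any
	if err := json.Unmarshal(raw, &meta); err != nil {
		log.Fatal(err)
	}
	// The GeoParquet spec requires these top-level keys.
	for _, key := range []string{"version", "primary_column", "columns"} {
		if _, ok := meta[key]; !ok {
			log.Fatalf("metadata is missing the required %q key", key)
		}
	}
	fmt.Println(`metadata ok; it would be stored under the "geo" footer key`)
}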

support newline delimited geojson?

Maybe I'm doing something wrong, but when I run planet data search SkySatCollect | gpq convert --from geojson --to geoparquet > gpq-out.parquet I just get one feature. Planet's API emits newline-delimited GeoJSON. It'd be great to be able to stream from it and other APIs, since collecting all the newline-delimited GeoJSON features into a single GeoJSON document can take a lot of memory.
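
A streaming reader for that input is cheap to sketch: scan stdin line by line and decode each feature independently, so memory stays bounded by the largest single feature (the buffer sizes below are arbitrary):

package main

import (
	"bufio"
	"fmt"
	"log"
	"os"

	"github.com/paulmach/orb/geojson"
)

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	scanner.Buffer(make([]byte, 0, 1024*1024), 64*1024*1024) // allow large features

	count := 0
	for scanner.Scan() {
		line := scanner.Bytes()
		if len(line) == 0 {
			continue // skip blank lines
		}
		feature, err := geojson.UnmarshalFeature(line)
		if err != nil {
			log.Fatal(err)
		}
		_ = feature // a streaming converter would write the feature here
		count++
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
	fmt.Fprintln(os.Stderr, "features:", count)
}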

Dealing with parquet without geometry columns

I was checking a few files to see if they were compliant, but wasn't looking super closely and ran convert on one that had no geometries in it. gpq happily converted it, and then describe showed:

╭────────────────────────────────────────────┬────────┬────────────┬────────────┬─────────────┬──────────┬────────────────┬────────┬────────╮
│ COLUMN                                     │ TYPE   │ ANNOTATION │ REPETITION │ COMPRESSION │ ENCODING │ GEOMETRY TYPES │ BOUNDS │ DETAIL │
├────────────────────────────────────────────┼────────┼────────────┼────────────┼─────────────┼──────────┼────────────────┼────────┼────────┤
│ CBSA Code                                  │ binary │ string     │ 0..1       │ zstd        │          │                │        │        │
│ Metropolitan Division Code                 │ double │            │ 0..1       │ zstd        │          │                │        │        │
│ CSA Code                                   │ double │            │ 0..1       │ zstd        │          │                │        │        │
│ CBSA Title                                 │ binary │ string     │ 0..1       │ zstd        │          │                │        │        │
│ Metropolitan/Micropolitan Statistical Area │ binary │ string     │ 0..1       │ zstd        │          │                │        │        │
│ Metropolitan Division Title                │ binary │ string     │ 0..1       │ zstd        │          │                │        │        │
│ CSA Title                                  │ binary │ string     │ 0..1       │ zstd        │          │                │        │        │
│ County/County Equivalent                   │ binary │ string     │ 0..1       │ zstd        │          │                │        │        │
│ State Name                                 │ binary │ string     │ 0..1       │ zstd        │          │                │        │        │
│ FIPS State Code                            │ binary │ string     │ 0..1       │ zstd        │          │                │        │        │
│ FIPS County Code                           │ binary │ string     │ 0..1       │ zstd        │          │                │        │        │
│ Central/Outlying County                    │ binary │ string     │ 0..1       │ zstd        │          │                │        │        │
│ stcofips                                   │ binary │ string     │ 0..1       │ zstd        │          │                │        │        │
├────────────────────────────────────────────┼────────┼────────────┴────────────┴─────────────┴──────────┴────────────────┴────────┴────────┤
│ ROWS                                       │ 1916   │                                                                                     │
│ VERSION                                    │ 1.0.0  │                                                                                     │
╰────────────────────────────────────────────┴────────┴─────────────────────────────────────────────────────────────────────────────────────╯

The 1.0.0 version threw me off a bit. I think it's technically valid per the spec, and it looks like gpq writes out the metadata, but I'm not sure we should call a parquet file without geometries 1.0.0.

The file does not validate:

 ✓ file must include a "geo" metadata key
 ✓ metadata must be a JSON object
 ✓ metadata must include a "version" string
 ✓ metadata must include a "primary_column" string
 ✓ metadata must include a "columns" object
 ✓ column metadata must include the "primary_column" name
 ✓ column metadata must include a valid "encoding" string
 ✓ column metadata must include a "geometry_types" list
 ✓ optional "crs" must be null or a PROJJSON object
 ✓ optional "orientation" must be a valid string
 ✓ optional "edges" must be a valid string
 ✓ optional "bbox" must be an array of 4 or 6 numbers
 ✓ optional "epoch" must be a number
 ✗ geometry columns must not be grouped
   ↳ missing geometry column "geometry"
 ! geometry columns must be stored using the BYTE_ARRAY parquet type
   ↳ not checked
 ! geometry columns must be required or optional, not repeated
   ↳ not checked
 ! all geometry values match the "encoding" metadata
   ↳ not checked
 ! all geometry types must be included in the "geometry_types" metadata (if not empty)
   ↳ not checked
 ! all polygon geometries must follow the "orientation" metadata (if present)
   ↳ not checked
 ! all geometries must fall within the "bbox" metadata (if present)
   ↳ not checked

It could be nice to run a 'has geometry column' check first, and just inform people that the data they're validating has no geometry.

It also might be nice to emit a warning when you try to convert a file that has no geometry - or even disallow it (perhaps with a force option).

Anyway, I think the situation is OK now, but we could likely help people a bit more. For a while yet we're going to see parquet files that aren't GeoParquet, and it'd be nice to help people along.
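
A sketch of the suggested pre-check, assuming the Apache Arrow Go parquet module (the file and column names are hardcoded for illustration):

package main

import (
	"fmt"
	"log"

	"github.com/apache/arrow/go/v14/parquet/file"
)

func main() {
	rdr, err := file.OpenParquetFile("example.parquet", false)
	if err != nil {
		log.Fatal(err)
	}
	defer rdr.Close()

	// Before running the geometry checks, confirm the primary geometry
	// column exists at all.
	if rdr.MetaData().Schema.ColumnIndexByName("geometry") < 0 {
		fmt.Println(`no "geometry" column: this is plain parquet, not GeoParquet`)
		return
	}
	fmt.Println("geometry column found; run the full validation")
}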

Better warnings / info in `describe` on non-compliant GeoParquet

The new describe is awesome, but if I feed it non-compliant GeoParquet there's little messaging that the file isn't quite right:

 gpq describe taxi.parquet 
╭────────────┬────────┬─────────────────────────────────┬────────────┬─────────────╮
│ COLUMN     │ TYPE   │ ANNOTATION                      │ REPETITION │ COMPRESSION │
├────────────┼────────┼─────────────────────────────────┼────────────┼─────────────┤
│ OBJECTID   │ int32  │ int(bitwidth=32, issigned=true) │ 0..1       │ snappy      │
│ Shape_Leng │ double │                                 │ 0..1       │ snappy      │
│ Shape_Area │ double │                                 │ 0..1       │ snappy      │
│ zone       │ binary │ string                          │ 0..1       │ snappy      │
│ LocationID │ int32  │ int(bitwidth=32, issigned=true) │ 0..1       │ snappy      │
│ borough    │ binary │ string                          │ 0..1       │ snappy      │
│ geom       │ binary │                                 │ 0..1       │ snappy      │
├────────────┼────────┼─────────────────────────────────┴────────────┴─────────────┤
│ ROWS       │ 262    │                                                            │
╰────────────┴────────┴────────────────────────────────────────────────────────────╯

If I convert the file first, I get an additional row:

├──────────┼────────┼────────────┴────────────┴─────────────┴──────────┴────────────────┴────────┴────────┤
│ ROWS     │ 3233   │                                                                                     │
│ VERSION  │ 1.0.0  │                                                                                     │
╰──────────┴────────┴─────────────────────────────────────────────────────────────────────────────────────╯

I think it'd be nice to always report something about compliance. For example, always show VERSION, but if the file is not compliant say 'non-compliant' (it might also be clearer to label it 'geoparquet version' or similar). It could also be nice to say when a file is 'compatible parquet' - e.g. its geom and data look like EPSG:4326 - and recommend people run gpq convert.

Not able to convert geojson files when the schema is not inferrable

Hi,

I am experiencing this issue with gpq:

gpq: error: failed to create schema after reading 39 features

Based on #142 the answer is clear: for one of the columns, there are no non-null values in any of the scanned features.
Indeed, if I edit the file and add just one value, everything works fine.

The problem is that, unlike in the linked issue, it is not possible for me to increase the number of rows scanned, because all the rows have nulls - and this case is pretty common in the files I am dealing with.

While this strict behaviour is understandable as a default, it is preventing me from adopting the tool. The ogr2ogr behaviour is maybe questionable (in my case the offending column is imported as a string instead of an int), but it at least produces usable output.

So perhaps an option like --drop-non-inferrable-columns or --import-ambiguous-columns-as-strings would be a useful escape hatch for gpq users.
(Pre-processing the JSON is of course an option too, but it's more involved.)
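
A sketch of the string fallback; the function and its inputs are illustrative, not gpq's actual inference code:

package main

import "fmt"

// inferType falls back to string when every sampled value is null,
// instead of failing the conversion outright.
func inferType(samples []any) string {
	for _, v := range samples {
		switch v.(type) {
		case nil:
			continue
		case bool:
			return "boolean"
		case float64:
			return "double"
		default:
			return "string"
		}
	}
	return "string" // all nulls: ambiguous, default to string
}

func main() {
	fmt.Println(inferType([]any{nil, nil, nil})) // string
	fmt.Println(inferType([]any{nil, 1.5}))      // double
}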

gpq convert output of Overture parquet files cannot be read by GDAL

I was testing the Overture Maps data and realised it is only available in parquet, not GeoParquet, format. As I understand it this is a use case for gpq, as mentioned in #57.

The tool runs fine and seems to produce output, but I cannot read that output using GDAL. Apologies if this is user error or should be a GDAL issue instead - please close if so.

Full steps to recreate are below (note I was running gpq on a Windows machine and testing the output on both Windows and Linux).

Download data:

aws s3 cp --region us-west-2 --no-sign-request --recursive s3://overturemaps-us-west-2/release/2023-10-19-alpha.0/theme=buildings C:\Temp\buildings.parquet

Run conversion:

$env:PATH += ";D:\Tools\gpq-windows-amd64"
gpq version
# 0.20.0

gpq convert part-00769-87dd7d19-acc8-4d4f-a5ba-20b407a79638.c000.zstd.parquet test.geo.parquet --from="parquet" --to="geoparquet"

# also tried without compression (no difference in terms of validity)

gpq convert part-00769-87dd7d19-acc8-4d4f-a5ba-20b407a79638.c000.zstd.parquet test.geo.parquet --from="parquet" --to="geoparquet" --compression="uncompressed"

gpq validate test.geo.parquet 

Summary: Passed 20 checks.

 ✓ file must include a "geo" metadata key
 ✓ metadata must be a JSON object
 ✓ metadata must include a "version" string
 ✓ metadata must include a "primary_column" string
 ✓ metadata must include a "columns" object
 ✓ column metadata must include the "primary_column" name
 ✓ column metadata must include a valid "encoding" string
 ✓ column metadata must include a "geometry_types" list
 ✓ optional "crs" must be null or a PROJJSON object
 ✓ optional "orientation" must be a valid string
 ✓ optional "edges" must be a valid string
 ✓ optional "bbox" must be an array of 4 or 6 numbers
 ✓ optional "epoch" must be a number
 ✓ geometry columns must not be grouped
 ✓ geometry columns must be stored using the BYTE_ARRAY parquet type
 ✓ geometry columns must be required or optional, not repeated
 ✓ all geometry values match the "encoding" metadata
 ✓ all geometry types must be included in the "geometry_types" metadata (if not empty)
 ✓ all polygon geometries must follow the "orientation" metadata (if present)
 ✓ all geometries must fall within the "bbox" metadata (if present)

QGIS opens the file but the attribute table is empty. Testing with ogrinfo:

ogrinfo --version
# GDAL 3.7.2, released 2023/09/05
ogrinfo test.geo.parquet

Warning 1: Field brand.names.common of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field brand.names.official of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field brand.names.alternate of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field brand.names.short of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field addresses of unhandled type list<element: struct<freeform: string, locality: string, postCode: string, region: string, country: string>> ignored
Warning 1: Field names.common of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field names.official of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field names.alternate of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field names.short of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field sources of unhandled type list<element: struct<property: string, dataset: string, recordId: string, confidence: double>> ignored
INFO: Open of `test.geo.parquet'
      using driver `Parquet' successful.
1: test.geo

Trying to read the data gives the likely cause of the issue: ERROR 1: ReadNext() failed: Malformed levels. min: 2 max: 2 out of range. Max Level: 1.

ogrinfo test.geo.parquet -al

Warning 1: Field brand.names.common of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field brand.names.official of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field brand.names.alternate of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field brand.names.short of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field addresses of unhandled type list<element: struct<freeform: string, locality: string, postCode: string, region: string, country: string>> ignored
Warning 1: Field names.common of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field names.official of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field names.alternate of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field names.short of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field sources of unhandled type list<element: struct<property: string, dataset: string, recordId: string, confidence: double>> ignored
INFO: Open of `test.geo.parquet'
      using driver `Parquet' successful.

Layer name: test.geo
Geometry: Unknown (any)
Feature Count: 815104
ERROR 1: ReadNext() failed: Malformed levels. min: 2 max: 2 out of range.  Max Level: 1
Layer SRS WKT:
GEOGCRS["WGS 84",
    ENSEMBLE["World Geodetic System 1984 ensemble",
        MEMBER["World Geodetic System 1984 (Transit)"],
        MEMBER["World Geodetic System 1984 (G730)"],
        MEMBER["World Geodetic System 1984 (G873)"],
        MEMBER["World Geodetic System 1984 (G1150)"],
        MEMBER["World Geodetic System 1984 (G1674)"],
        MEMBER["World Geodetic System 1984 (G1762)"],
        MEMBER["World Geodetic System 1984 (G2139)"],
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]],
        ENSEMBLEACCURACY[2.0]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    USAGE[
        SCOPE["Horizontal component of 3D system."],
        AREA["World."],
        BBOX[-90,-180,90,180]],
    ID["EPSG",4326]]
Data axis to CRS axis mapping: 2,1
Geometry Column = geometry
categories.main: String (0.0)
categories.alternate: StringList (0.0)
level: Integer (0.0)
socials: StringList (0.0)
subType: String (0.0)
numFloors: Integer (0.0)
entityId: String (0.0)
class: String (0.0)
sourceTags: String(JSON) (0.0)
localityType: String (0.0)
emails: StringList (0.0)
drivingSide: String (0.0)
adminLevel: Integer (0.0)
road: String (0.0)
isoCountryCodeAlpha2: String (0.0)
isoSubCountryCode: String (0.0)
updateTime: String (0.0)
wikidata: String (0.0)
confidence: Real (0.0)
defaultLanguage: String (0.0)
brand.wikidata: String (0.0)
isIntermittent: Integer(Boolean) (0.0)
connectors: StringList (0.0)
surface: String (0.0)
version: Integer (0.0)
phones: StringList (0.0)
id: String (0.0)
context: String (0.0)
height: Real (0.0)
maritime: Integer(Boolean) (0.0)
websites: StringList (0.0)
isSalt: Integer(Boolean) (0.0)
bbox.minx: Real (0.0)
bbox.maxx: Real (0.0)
bbox.miny: Real (0.0)
bbox.maxy: Real (0.0)
ERROR 1: ReadNext() failed: Malformed levels. min: 2 max: 2 out of range.  Max Level: 1

Testing with the GDAL validate script from here:


apt-get install python3-pip --fix-missing
python3 -m pip install jsonschema
python3 validate_geoparquet.py --check-data test.geo.parquet

Warning 1: Field brand.names.common of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field brand.names.official of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field brand.names.alternate of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field brand.names.short of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field addresses of unhandled type list<element: struct<freeform: string, locality: string, postCode: string, region: string, country: string>> ignored
Warning 1: Field names.common of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field names.official of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field names.alternate of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field names.short of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field sources of unhandled type list<element: struct<property: string, dataset: string, recordId: string, confidence: double>> ignored
Segmentation fault

Specify the dst crs in convert?

I don't think gpq currently provides a way to specify the target CRS. Also, I see that by default you use "OGC:CRS84" - what is your rationale for that? Why not, for example, "EPSG:4326"?

I'll add a little context on my use case. I just used gpq to convert a 'big' collection of parquet files to GeoParquet by simply running gpq convert non-geo.parquet valid-geo.parquet in a for loop. Further along in my processing chain I load these GeoParquet files using GeoPandas, but I ran into an issue: when the crs is "OGC:CRS84" it cannot be converted to an EPSG code. Although it's expected behaviour, I'm mostly just curious why you use "OGC:CRS84" instead of "EPSG:4326".

gdf = gpd.read_parquet("valid-geo.parquet")
print(gdf.crs.to_epsg()) # None
print(gdf.to_crs(4326).to_epsg()) # 4326

I'll probably change my routines from gdf.crs.to_epsg() to gdf.crs.to_string(), but I guess several others rely on to_epsg() as well when using GeoPandas, so I thought it was worth opening a discussion point here.

Convert: Compression toggle doesn't work?

I ran the following commands:

  • gpq convert C:\Dev\rapidai4eo\stac-v1.0\rapidai4eo_v1_source_pf\pf.parquet rapidai4eo_v1_source_pf_snappy.geoparquet
  • gpq convert C:\Dev\rapidai4eo\stac-v1.0\rapidai4eo_v1_source_pf\pf.parquet rapidai4eo_v1_source_pf_snappy.geoparquet --compression="snappy"

The files I get are exactly the same. I suspect both files are gzip compressed and the compression parameter was not taken into account.

gpq convert creates invalid file

I used gpq convert to create a geoparquet file from a parquet file:
gpq convert C:\Dev\rapidai4eo\stac-v1.0\rapidai4eo_v1_source_pf\pf.parquet rapidai4eo_v1_source_pf_snappy.geoparquet --compression="snappy"
Afterwards, I ran gpq validate on it and the new file is invalid.
It is missing the geoparquet version number (empty string).

v0.11.0 on Windows 10

More info about the data?

The new gpq validation is awesome, but it'd be nice if it were easy to get a few more bits of info:

  • Number of features
  • Geometry type
  • CRS info
  • Bounding box (I do see this in 'describe', but it'd be nice to have it at a higher level, to get a sense of the data without having to see all the columns).

I could see two routes for this:

  1. Report them as you're doing validation. It'd say something in the data section about how many features were validated as having proper info. And instead of just 'all geometry types must be included in geometry_types metadata' it could say 'geometry type metadata is Polygon, and all geometries are polygons'. And similarly with the bounding box - report the bounding box and report whether all geometries fall within it.

(This does highlight two potential 'warnings': when the reported bbox is much bigger than the actual bounds of the geometry, and when the geometry types are more flexible than needed - e.g. nothing is specified but all the data is actually polygons. Ideally there'd be nice quick operations in gpq to fix these.)

  2. Have an 'info' command like ogrinfo that just reports this info.

Issue converting a complicated GeoParquet file

I got an error when running gpq convert AR.parquet Argentina-overture.parquet - I was hoping it'd upgrade the file to 1.0.0 (and I could adjust row groups after getting the initial conversion working). AR.parquet is from https://data.source.coop/cholmes/overture/geoparquet-country-quad-2/AR.parquet, but I got issues with others at https://beta.source.coop/cholmes/overture/geoparquet-country-quad-2 too. They're files I converted from Overture's parquet distribution: loaded into DuckDB, written out as parquet, and then turned into GeoParquet with geopandas.

Error was:
gpq: error: command.ConvertCmd.Run(): transform generated an unexpected type, got struct, expected struct

Feature suggestion: extract command

Hi! Something I've found useful when working with GeoParquet files is creating subsets of data, either with a bbox or by excluding/selecting columns.

Rough suggested implementation:

gpq extract -bbox=120,10.1,121.4,11 -geom_col=geometry -exclude_cols=value,label source.geoparquet target.geoparquet

I can work on implementing this in the upcoming weeks, but I'd like to know if others would find it useful!
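
A sketch of the bbox half, using the orb types gpq already depends on (features inlined for illustration):

package main

import (
	"fmt"

	"github.com/paulmach/orb"
	"github.com/paulmach/orb/geojson"
)

func main() {
	// The -bbox argument from the suggested command, as min/max points.
	box := orb.Bound{Min: orb.Point{120, 10.1}, Max: orb.Point{121.4, 11}}

	features := []*geojson.Feature{
		geojson.NewFeature(orb.Point{120.5, 10.5}), // inside the box
		geojson.NewFeature(orb.Point{5, 5}),        // outside the box
	}

	// Keep features whose bounds intersect the requested box.
	for _, f := range features {
		if box.Intersects(f.Geometry.Bound()) {
			fmt.Println("keep:", f.Geometry)
		}
	}
}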
