planetlabs / gpq
Utility for working with GeoParquet
Home Page: https://planetlabs.github.io/gpq/
License: Apache License 2.0
If I upload a .geoparquet file to the web application, it complains that only .parquet is allowed.
I think it would be useful to also allow .geoparquet.
It should be possible to use the describe, validate, and convert commands with blob storage resource names (e.g. s3://bucket/example.parquet, gs://bucket/example.parquet, azblob://bucket/example.parquet).
The gocloud.dev/blob package provides support for multiple cloud providers.
I'm splitting this from #93.
The new convert stuff works great. It seems it only handles WKB; it would be great if it could handle WKT as well.
The other great addition would be to enable use of an alternate geometry column by supplying the column name - often the data's geometry column isn't named 'geometry'.
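For context, GeoParquet stores geometries as WKB, so WKT support would mean transcoding on the way in. Here is a minimal stdlib Python sketch of that transcoding for the simplest case (a 2D point) - not gpq code, just an illustration of the WKB layout involved:

```python
import re
import struct

def wkt_point_to_wkb(wkt: str) -> bytes:
    """Encode a 2D WKT POINT as little-endian WKB (what GeoParquet stores).

    Handles only the simplest case; a real converter (e.g. via a full WKT
    parser) would cover every geometry type.
    """
    match = re.match(r"POINT\s*\(\s*(\S+)\s+(\S+)\s*\)", wkt.strip(), re.IGNORECASE)
    if match is None:
        raise ValueError(f"not a 2D WKT point: {wkt!r}")
    x, y = float(match.group(1)), float(match.group(2))
    # byte-order flag (1 = little-endian), geometry type (1 = Point), x, y
    return struct.pack("<BIdd", 1, 1, x, y)

print(wkt_point_to_wkb("POINT (30 10)").hex())
```

The alternate-column half of the request is just plumbing by comparison: the converter would look up the supplied name instead of assuming 'geometry'.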
It'd be great to more broadly support GeoParquet 1.1. There's a range of what could be done with GPQ, in rough order of importance / effort:
As reported in #33 (comment), converting Parquet files with a geometry column containing WKT values does not work (when there is more than one row).
It should be easy for people to use the WASM build to convert to/from GeoParquet.
Hi,
I'm using QGIS 3.28.14-Firenze for Windows.
If I drag & drop a gpq GeoParquet output file, the file is not rendered in QGIS and I only see a white background.
The table view also contains no records.
The source file is contained in this zip file, which also contains some shapefiles:
https://www.istat.it/storage/cartografia/confini_amministrativi/non_generalizzati/2023/Limiti01012023.zip
I created the GeoParquet file in the way detailed below.
Am I doing something wrong?
Thank you
wget -O file.zip "https://www.istat.it/storage/cartografia/confini_amministrativi/non_generalizzati/2023/Limiti01012023.zip"
unzip -o file.zip -d .
ogr2ogr -f GeoJSON -t_srs EPSG:4326 comuni.geojson Limiti01012023/Com01012023/Com01012023_WGS84.shp -lco "RFC7946=YES"
gpq convert --compression="gzip" --max 1000 --from="geojson" comuni.geojson comuni_compressed.parquet
Hi,
I am interested in using gpq to generate GeoParquet files for Who's On First (WOF) data. Ideally I would like to do that by reading and writing data on a per-record basis rather than starting with a single GeoJSON file.
Poking through the code, it appears I can stream data to gpq via STDIN, which would allow me to use a similar approach to how we derive PMTiles from WOF data.
That would solve my immediate problem, but the functionality - specifically the convert functionality - wrapped by the gpq command would be generally useful to have as library code (outside of internal).
I'd like to do something like this:
gpq convert Cairo_Governorate.parquet --stdout --to=geojson | tippecanoe -o Cairo_Governorate.pmtiles --drop-densest-as-needed
Would this functionality be useful? It would require some changes in convert.go to allow for a blank positional output argument.
When converting to GeoParquet it can be useful to write more row groups, for more efficient querying of large files. See opengeospatial/geoparquet#183
GDAL's option is ROW_GROUP_SIZE: 'Defaults to 65536. Maximum number of rows per group.'
That seems reasonable, though I was using something like a 20k default size for my experiments, so we could consider a smaller default. I didn't see negative effects, but something I read said that if you have lots of Parquet files, a smaller row group size can slow down getting stats on the whole set. I have around 500 individual Parquet files, so perhaps it only comes into effect at thousands or tens of thousands?
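To make the tradeoff concrete, here is a tiny sketch of how group size maps to group count (the 1M-row figure is hypothetical):

```python
import math

def row_group_count(total_rows: int, rows_per_group: int) -> int:
    """How many row groups a writer produces for a given group size."""
    return math.ceil(total_rows / rows_per_group)

# Hypothetical 1M-row file: GDAL's 65536-row default vs the 20k groups
# mentioned above. Smaller groups allow finer pruning per query, but mean
# more group metadata to read when scanning stats across many files.
for size in (65_536, 20_000):
    print(size, row_group_count(1_000_000, size))
```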
We just released GeoParquet 1.1, and I tried gpq validate on the test data with the native encoding, and it got a stack trace:
% gpq validate data-multilinestring-encoding_wkb.parquet
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x2 addr=0x10 pc=0x10333abbc]
goroutine 1 [running]:
github.com/paulmach/orb/geojson.(*Geometry).Geometry(0x0?)
/home/runner/go/pkg/mod/github.com/paulmach/[email protected]/geojson/geometry.go:49 +0x1c
github.com/planetlabs/gpq/internal/validator.(*Validator).Report(0x140008d94e0, {0x1045465a0?, 0x105580e00}, 0x140000533e0)
/home/runner/work/gpq/gpq/internal/validator/validator.go:242 +0x1354
github.com/planetlabs/gpq/internal/validator.(*Validator).Validate(0x140007a8900?, {0x1045465a0, 0x105580e00}, {0x14cd0cb58?, 0x140004aa998?}, {0x140007a8990, 0x29})
/home/runner/work/gpq/gpq/internal/validator/validator.go:103 +0x12c
github.com/planetlabs/gpq/cmd/gpq/command.(*ValidateCmd).Run(0x10554a058, 0x140007c3200)
/home/runner/work/gpq/gpq/cmd/gpq/command/validate.go:47 +0x178
reflect.Value.call({0x1042c6360?, 0x10554a058?, 0x140009bfa78?}, {0x103b02567, 0x4}, {0x14000843638, 0x1, 0x1027b9dd8?})
/opt/hostedtoolcache/go/1.21.5/x64/src/reflect/value.go:596 +0x994
reflect.Value.Call({0x1042c6360?, 0x10554a058?, 0x104273200?}, {0x14000843638?, 0x10451b840?, 0x140008d92b0?})
/opt/hostedtoolcache/go/1.21.5/x64/src/reflect/value.go:380 +0x94
github.com/alecthomas/kong.callFunction({0x1042c6360?, 0x10554a058?, 0x0?}, 0x103b01e47?)
/home/runner/go/pkg/mod/github.com/alecthomas/[email protected]/callbacks.go:98 +0x370
github.com/alecthomas/kong.(*Context).RunNode(0x140007c3200, 0x140008002d0, {0x140009bff08, 0x2, 0x140007c7701?})
/home/runner/go/pkg/mod/github.com/alecthomas/[email protected]/context.go:765 +0x634
github.com/alecthomas/kong.(*Context).Run(0x104135b40?, {0x140009bff08?, 0x0?, 0x1026a9ea8?})
/home/runner/go/pkg/mod/github.com/alecthomas/[email protected]/context.go:790 +0x138
main.main()
/home/runner/work/gpq/gpq/cmd/gpq/main.go:32 +0x10c
It got similar results on the 'wkb' test data. But it did work just fine on the main 1.1 GeoParquet example. I also generated 1.1 files with GDAL's Arrow support (just converting without Arrow didn't seem to make GDAL write 1.1), and those didn't stack trace, and worked as I'd expect for a validator not yet updated to 1.1:
Summary: Passed 12 checks, failed 3 checks, 5 checks not run.
✓ file must include a "geo" metadata key
✓ metadata must be a JSON object
✓ metadata must include a "version" string
✓ metadata must include a "primary_column" string
✓ metadata must include a "columns" object
✓ column metadata must include the "primary_column" name
✗ column metadata must include a valid "encoding" string
↳ unsupported encoding "point" for column "geom"
✓ column metadata must include a "geometry_types" list
✓ optional "crs" must be null or a PROJJSON object
✓ optional "orientation" must be a valid string
✓ optional "edges" must be a valid string
✓ optional "bbox" must be an array of 4 or 6 numbers
✓ optional "epoch" must be a number
✗ geometry columns must not be grouped
↳ column "geom" must not be a group
✗ geometry columns must be stored using the BYTE_ARRAY parquet type
↳ expected primitive column for "geom"
The gpq describe commands all worked well, even with Arrow, which was nice.
The readme should have instructions on installing gpq.
Hi,
when I run gpq convert --from="geojson" tmp.geojson tmp.parquet
I get
gpq: error: failed to generate converter from first 100 features
It's a 100 MB GeoJSON that I created using ogr2ogr and an input shapefile.
What can I do to solve the problem?
Thank you
It'd be awesome if I could run gpq describe and gpq validate on URLs. Handling https would be the default, but if s3:// and others are easy, that'd be nice too. I'd like to easily check row groups and validity on big remote resources, so it'd be awesome if it supported this. I feel like Brandon mentioned some Go library he liked for that, but I forget where I saw it, so thought I'd put up an issue.
The new 'describe' looks great, the table is super helpful. Was wondering if it might be possible to include information about row groups? I use DuckDB for this, I believe it just reports the total number of row groups, which I think is totally fine if that's easier than figuring out the max number of rows per group.
I've only tried files on Source Cooperative, so there's some chance it's just a problem with those. But whenever I try describe or validate on a remote URL it doesn't work, with the same error message:
% gpq describe https://beta.source.coop/cholmes/overture/geoparquet-country-quad-hive/country_iso=JM/Jamaica.parquet
gpq: error: command.DescribeCmd.Run(): failed to read
"https://beta.source.coop/cholmes/overture/geoparquet-country-quad-hive/country_iso=JM/Jamaica.parquet"
as parquet: parquet: file is smaller than indicated metadata size
I'm on 0.20.0
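For what it's worth, that error message points at the Parquet footer check: the last 8 bytes of a Parquet file are a 4-byte little-endian metadata length followed by the magic PAR1, and a reader fails this way when the metadata claims more bytes than were actually fetched - which is what you'd see if a remote read came back truncated. A stdlib sketch of that check, run on synthetic bytes (a whole file would also carry a leading PAR1, omitted here):

```python
import struct

MAGIC = b"PAR1"

def footer_metadata_length(data: bytes) -> int:
    """Read the metadata length from a Parquet file's tail.

    The tail layout is [metadata][4-byte LE length]["PAR1"]. Raises when
    the buffer holds fewer bytes than the footer claims, which is the
    situation behind "file is smaller than indicated metadata size".
    """
    if len(data) < 8 or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file (missing PAR1 magic)")
    (meta_len,) = struct.unpack("<I", data[-8:-4])
    if meta_len + 8 > len(data):
        raise ValueError("file is smaller than indicated metadata size")
    return meta_len

# Synthetic tail: 16 bytes of fake metadata plus a well-formed footer.
fake = b"\x00" * 16 + struct.pack("<I", 16) + MAGIC
print(footer_metadata_length(fake))
```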
Hi,
I have converted a Parquet file to GeoParquet using gpq convert.
And in the metadata I do not have the coordinate reference system info:
"metadata": {
"version": "",
"primary_column": "geometry",
"columns": {
"geometry": {
"encoding": "WKB",
"geometry_types": [
"Polygon",
"MultiPolygon"
],
"bbox": [
313279.2514000004,
3933846.2156000007,
1312016.1506000003,
5220292.292199999
]
}
}
}
It would be useful to have a CLI option in the convert command, something like ogr2ogr's: gpq convert -a_srs EPSG:32633.
Thank you for this useful tool
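For reference, the GeoParquet "crs" field takes a PROJJSON object rather than an EPSG code string, so a hypothetical -a_srs option would have to expand the code into PROJJSON. A stdlib sketch of attaching a crs to existing "geo" metadata - the PROJJSON stub here is heavily abbreviated and for illustration only:

```python
import json

def add_crs(geo_metadata: str, column: str, projjson: dict) -> str:
    """Attach a PROJJSON "crs" object to one column of "geo" metadata."""
    meta = json.loads(geo_metadata)
    meta["columns"][column]["crs"] = projjson
    return json.dumps(meta)

# Abbreviated PROJJSON stub; real PROJJSON for EPSG:32633 carries the full
# datum, conversion, and axis definitions.
utm33n = {"type": "ProjectedCRS", "id": {"authority": "EPSG", "code": 32633}}

before = json.dumps({
    "version": "1.0.0",
    "primary_column": "geometry",
    "columns": {"geometry": {"encoding": "WKB", "geometry_types": ["Polygon"]}},
})
after = add_crs(before, "geometry", utm33n)
print(json.loads(after)["columns"]["geometry"]["crs"]["id"]["code"])
```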
Hi,
I'm testing gpq on the official administrative boundaries of Italy. The source file is this zip file:
https://www.istat.it/storage/cartografia/confini_amministrativi/non_generalizzati/2023/Limiti01012023.zip
It has a folder structure, with shapefiles in it. I am doing the tests on the Limiti01012023/Com01012023/Com01012023_WGS84.shp file:
They are almost equal in size. Some notes:
I know I can't directly compare these outputs; however, the compression in the gpq output seems very limited. Is that normal?
Am I doing something wrong?
Below the way I have tested all.
Thank you
wget -O file.zip "https://www.istat.it/storage/cartografia/confini_amministrativi/non_generalizzati/2023/Limiti01012023.zip"
unzip -o file.zip -d .
ogr2ogr -f GeoJSON -t_srs EPSG:4326 comuni.geojson Limiti01012023/Com01012023/Com01012023_WGS84.shp -lco "RFC7946=YES"
gpq convert --compression="gzip" --max 1000 --from="geojson" comuni.geojson comuni_compressed.parquet
gpq convert --compression="uncompressed" --max 1000 --from="geojson" comuni.geojson comuni_uncompressed.parquet
ogr2ogr -t_srs EPSG:4326 Com01012023_WGS84.shp.zip Limiti01012023/Com01012023/Com01012023_WGS84.shp
I think it would be nice, if you have a Parquet file, to be able to just provide the JSON metadata and write it to the file. Something like:
gpq add_metadata in.parquet out.geoparquet metadata.json
or
gpq convert in.parquet out.geoparquet --metadata="metadata.json"
In my use case I know the metadata upfront and just need to add the geoparquet metadata to the file.
It could also be a simple way to fix/override GeoParquet metadata in case of bugs such as #45
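A hypothetical add_metadata command would presumably want to sanity-check the supplied JSON before attaching it. A stdlib sketch of the first few checks, mirroring the top of gpq validate's checklist (not gpq code):

```python
import json

REQUIRED_TOP = ("version", "primary_column", "columns")

def check_geo_metadata(text: str) -> list:
    """Return a list of problems with a candidate "geo" metadata document.

    Mirrors the first few gpq validate checks; a real implementation would
    also validate each column's encoding and geometry_types.
    """
    problems = []
    try:
        meta = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(meta, dict):
        return ["metadata must be a JSON object"]
    for key in REQUIRED_TOP:
        if key not in meta:
            problems.append(f'metadata must include a "{key}"')
    primary = meta.get("primary_column")
    if primary and primary not in meta.get("columns", {}):
        problems.append('column metadata must include the "primary_column" name')
    return problems

print(check_geo_metadata('{"version": "1.0.0"}'))
```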
The new Overture Maps release has Parquet with WKB geometries, but when I try to convert it I get:
% gpq convert 20230725_211237_00132_5p54t_25816df1-b864-49c0-a9a3-a13da4f37a90 out2.parquet --from=parquet --to=geoparquet
gpq: error: encoding parquet data page: encoding not supported for type BYTE_ARRAY
Sample data is at https://storage.googleapis.com/open-geodata/ch/20230725_211237_00132_5p54t_3b7d7eb3-dd9c-442a-a9b9-404dc936c5d9
Maybe I'm doing something wrong, but I'm running planet data search SkySatCollect | gpq convert --from geojson --to geoparquet > gpq-out.parquet
and I just get one feature. Planet's API emits newline-delimited GeoJSON. It'd be great to be able to stream from it and other APIs, as collecting all the newline-delimited GeoJSON features into a single GeoJSON document can take a lot of memory.
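The streaming part can be done with constant memory by wrapping newline-delimited features into a FeatureCollection on the fly. A stdlib sketch of that wrapper (not gpq code; a streaming converter inside gpq would presumably process features one at a time the same way):

```python
import io
import json
from typing import IO

def ndjson_to_feature_collection(src: IO[str], dst: IO[str]) -> int:
    """Wrap newline-delimited GeoJSON features into a FeatureCollection.

    Processes one line at a time, so memory use stays constant no matter
    how large the input stream is. Returns the number of features written.
    """
    dst.write('{"type": "FeatureCollection", "features": [')
    count = 0
    for line in src:
        line = line.strip()
        if not line:
            continue
        feature = json.loads(line)  # validates each line is standalone JSON
        if count:
            dst.write(",")
        dst.write(json.dumps(feature))
        count += 1
    dst.write("]}")
    return count

# Demo with an in-memory stream standing in for the API response.
src = io.StringIO(
    '{"type": "Feature", "geometry": null, "properties": {"n": 1}}\n'
    '{"type": "Feature", "geometry": null, "properties": {"n": 2}}\n'
)
dst = io.StringIO()
print(ndjson_to_feature_collection(src, dst))
```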
I was checking a few files to see if they were compliant, but wasn't looking super closely and ran convert on one that had no geometries in it. GPQ happily converted it, and then describe showed:
╭────────────────────────────────────────────┬────────┬────────────┬────────────┬─────────────┬──────────┬────────────────┬────────┬────────╮
│ COLUMN │ TYPE │ ANNOTATION │ REPETITION │ COMPRESSION │ ENCODING │ GEOMETRY TYPES │ BOUNDS │ DETAIL │
├────────────────────────────────────────────┼────────┼────────────┼────────────┼─────────────┼──────────┼────────────────┼────────┼────────┤
│ CBSA Code │ binary │ string │ 0..1 │ zstd │ │ │ │ │
│ Metropolitan Division Code │ double │ │ 0..1 │ zstd │ │ │ │ │
│ CSA Code │ double │ │ 0..1 │ zstd │ │ │ │ │
│ CBSA Title │ binary │ string │ 0..1 │ zstd │ │ │ │ │
│ Metropolitan/Micropolitan Statistical Area │ binary │ string │ 0..1 │ zstd │ │ │ │ │
│ Metropolitan Division Title │ binary │ string │ 0..1 │ zstd │ │ │ │ │
│ CSA Title │ binary │ string │ 0..1 │ zstd │ │ │ │ │
│ County/County Equivalent │ binary │ string │ 0..1 │ zstd │ │ │ │ │
│ State Name │ binary │ string │ 0..1 │ zstd │ │ │ │ │
│ FIPS State Code │ binary │ string │ 0..1 │ zstd │ │ │ │ │
│ FIPS County Code │ binary │ string │ 0..1 │ zstd │ │ │ │ │
│ Central/Outlying County │ binary │ string │ 0..1 │ zstd │ │ │ │ │
│ stcofips │ binary │ string │ 0..1 │ zstd │ │ │ │ │
├────────────────────────────────────────────┼────────┼────────────┴────────────┴─────────────┴──────────┴────────────────┴────────┴────────┤
│ ROWS │ 1916 │ │
│ VERSION │ 1.0.0 │ │
╰────────────────────────────────────────────┴────────┴─────────────────────────────────────────────────────────────────────────────────────╯
The 1.0.0 version threw me off a bit. I think it's technically valid per the spec, and it looks like gpq writes out the metadata, but I'm not sure we should call a Parquet file without geometries 1.0.0.
The file does not validate:
✓ file must include a "geo" metadata key
✓ metadata must be a JSON object
✓ metadata must include a "version" string
✓ metadata must include a "primary_column" string
✓ metadata must include a "columns" object
✓ column metadata must include the "primary_column" name
✓ column metadata must include a valid "encoding" string
✓ column metadata must include a "geometry_types" list
✓ optional "crs" must be null or a PROJJSON object
✓ optional "orientation" must be a valid string
✓ optional "edges" must be a valid string
✓ optional "bbox" must be an array of 4 or 6 numbers
✓ optional "epoch" must be a number
✗ geometry columns must not be grouped
↳ missing geometry column "geometry"
! geometry columns must be stored using the BYTE_ARRAY parquet type
↳ not checked
! geometry columns must be required or optional, not repeated
↳ not checked
! all geometry values match the "encoding" metadata
↳ not checked
! all geometry types must be included in the "geometry_types" metadata (if not empty)
↳ not checked
! all polygon geometries must follow the "orientation" metadata (if present)
↳ not checked
! all geometries must fall within the "bbox" metadata (if present)
↳ not checked
It could be nice to do a 'has geometry column' check first, and just inform people that the data they're validating does not have a geometry.
It also might be nice to emit a warning when you try to convert a file that does not have a geometry. Or it could even be disallowed (maybe with some force option).
Anyway, I think the situation is OK now, but we could likely help people a bit more. I think we're going to see a period where there are Parquet files that aren't GeoParquet, and it'd be nice to help people along.
The new describe is awesome, but if I feed it non-compliant GeoParquet there's little messaging that my file is not quite right:
gpq describe taxi.parquet
╭────────────┬────────┬─────────────────────────────────┬────────────┬─────────────╮
│ COLUMN │ TYPE │ ANNOTATION │ REPETITION │ COMPRESSION │
├────────────┼────────┼─────────────────────────────────┼────────────┼─────────────┤
│ OBJECTID │ int32 │ int(bitwidth=32, issigned=true) │ 0..1 │ snappy │
│ Shape_Leng │ double │ │ 0..1 │ snappy │
│ Shape_Area │ double │ │ 0..1 │ snappy │
│ zone │ binary │ string │ 0..1 │ snappy │
│ LocationID │ int32 │ int(bitwidth=32, issigned=true) │ 0..1 │ snappy │
│ borough │ binary │ string │ 0..1 │ snappy │
│ geom │ binary │ │ 0..1 │ snappy │
├────────────┼────────┼─────────────────────────────────┴────────────┴─────────────┤
│ ROWS │ 262 │ │
╰────────────┴────────┴────────────────────────────────────────────────────────────╯
If I convert it then I get an additional row:
├──────────┼────────┼────────────┴────────────┴─────────────┴──────────┴────────────────┴────────┴────────┤
│ ROWS │ 3233 │ │
│ VERSION │ 1.0.0 │ │
╰──────────┴────────┴─────────────────────────────────────────────────────────────────────────────────────╯
I think it'd be nice to always say something about compliance. For example, always show VERSION, but if the file is not compliant then say non-compliant (it might also be nice to call it 'geoparquet version' or something). It could also be nice to say if it's a 'compatible parquet' file - i.e. its geometry and data look like EPSG:4326 - and recommend people use gpq convert.
I think it would be interesting (e.g. for investigating #46) to report the compression method in gpq describe.
Hi,
I am experiencing this issue with gpq:
gpq: error: failed to create schema after reading 39 features
Based on #142 the answer is clear: there are no non-null values in any of the features for one of the columns.
Indeed, if I edit the file and add just one everything works fine.
The problem is that unlike in the linked issue it is not possible for me to increase the amount of rows scanned because all the rows have nulls, and this is a case that is pretty common with the files I am dealing with.
While this strict behaviour is understandable by default, it is preventing me from adopting the tool. The ogr2ogr behaviour is maybe questionable (in my case the incriminating column is being added as a string instead of an int), but it at least produces an output that is usable.
So perhaps an option like --drop-non-inferrable-columns or --import-ambiguous-columns-as-strings would be a useful escape hatch for gpq users.
(Pre-processing the JSON is of course an option too, but more involved.)
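As a pre-processing workaround, the offending columns can be found up front by scanning the GeoJSON for properties that are null in every feature. A stdlib sketch (the property names here are made up):

```python
import json

def all_null_properties(geojson_text: str) -> set:
    """Return property names that are null (or absent) in every feature."""
    features = json.loads(geojson_text)["features"]
    seen = set()
    non_null = set()
    for feature in features:
        for key, value in (feature.get("properties") or {}).items():
            seen.add(key)
            if value is not None:
                non_null.add(key)
    return seen - non_null

# Made-up feature collection where "code" is null everywhere.
doc = json.dumps({"type": "FeatureCollection", "features": [
    {"type": "Feature", "geometry": None, "properties": {"name": "a", "code": None}},
    {"type": "Feature", "geometry": None, "properties": {"name": "b", "code": None}},
]})
print(all_null_properties(doc))
```

Columns flagged this way could then be dropped or given an explicit string value before handing the file to gpq.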
I was testing the Overture Maps data and realised it is only available in Parquet and not GeoParquet format. As I understand it, this is a use case for gpq, as mentioned in #57.
The tool runs fine and seems to produce output, but I cannot read this using GDAL. Apologies if this is user error or should be a GDAL issue instead - please close if this is the case.
Full steps to recreate are below (note I was using gpq on a Windows machine, and testing the output on both Windows and Linux).
Download data:
aws s3 cp --region us-west-2 --no-sign-request --recursive s3://overturemaps-us-west-2/release/2023-10-19-alpha.0/theme=buildings C:\Temp\buildings.parquet
Run conversion:
$env:PATH += ";D:\Tools\gpq-windows-amd64"
gpq version
# 0.20.0
gpq convert part-00769-87dd7d19-acc8-4d4f-a5ba-20b407a79638.c000.zstd.parquet test.geo.parquet --from="parquet" --to="geoparquet"
# also tried without compression (no difference in terms of validity)
gpq convert part-00769-87dd7d19-acc8-4d4f-a5ba-20b407a79638.c000.zstd.parquet test.geo.parquet --from="parquet" --to="geoparquet" --compression="uncompressed"
gpq validate test.geo.parquet
Summary: Passed 20 checks.
✓ file must include a "geo" metadata key
✓ metadata must be a JSON object
✓ metadata must include a "version" string
✓ metadata must include a "primary_column" string
✓ metadata must include a "columns" object
✓ column metadata must include the "primary_column" name
✓ column metadata must include a valid "encoding" string
✓ column metadata must include a "geometry_types" list
✓ optional "crs" must be null or a PROJJSON object
✓ optional "orientation" must be a valid string
✓ optional "edges" must be a valid string
✓ optional "bbox" must be an array of 4 or 6 numbers
✓ optional "epoch" must be a number
✓ geometry columns must not be grouped
✓ geometry columns must be stored using the BYTE_ARRAY parquet type
✓ geometry columns must be required or optional, not repeated
✓ all geometry values match the "encoding" metadata
✓ all geometry types must be included in the "geometry_types" metadata (if not empty)
✓ all polygon geometries must follow the "orientation" metadata (if present)
✓ all geometries must fall within the "bbox" metadata (if present)
QGIS opens the file but the attribute table is empty. Testing with ogrinfo:
ogrinfo --version
# GDAL 3.7.2, released 2023/09/05
ogrinfo test.geo.parquet
Warning 1: Field brand.names.common of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field brand.names.official of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field brand.names.alternate of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field brand.names.short of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field addresses of unhandled type list<element: struct<freeform: string, locality: string, postCode: string, region: string, country: string>> ignored
Warning 1: Field names.common of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field names.official of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field names.alternate of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field names.short of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field sources of unhandled type list<element: struct<property: string, dataset: string, recordId: string, confidence: double>> ignored
INFO: Open of `test.geo.parquet'
using driver `Parquet' successful.
1: test.geo
Trying to read the data gives the likely cause of the issue: ERROR 1: ReadNext() failed: Malformed levels. min: 2 max: 2 out of range. Max Level: 1.
ogrinfo test.geo.parquet -al
Warning 1: Field brand.names.common of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field brand.names.official of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field brand.names.alternate of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field brand.names.short of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field addresses of unhandled type list<element: struct<freeform: string, locality: string, postCode: string, region: string, country: string>> ignored
Warning 1: Field names.common of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field names.official of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field names.alternate of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field names.short of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field sources of unhandled type list<element: struct<property: string, dataset: string, recordId: string, confidence: double>> ignored
INFO: Open of `test.geo.parquet'
using driver `Parquet' successful.
Layer name: test.geo
Geometry: Unknown (any)
Feature Count: 815104
ERROR 1: ReadNext() failed: Malformed levels. min: 2 max: 2 out of range. Max Level: 1
Layer SRS WKT:
GEOGCRS["WGS 84",
ENSEMBLE["World Geodetic System 1984 ensemble",
MEMBER["World Geodetic System 1984 (Transit)"],
MEMBER["World Geodetic System 1984 (G730)"],
MEMBER["World Geodetic System 1984 (G873)"],
MEMBER["World Geodetic System 1984 (G1150)"],
MEMBER["World Geodetic System 1984 (G1674)"],
MEMBER["World Geodetic System 1984 (G1762)"],
MEMBER["World Geodetic System 1984 (G2139)"],
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]],
ENSEMBLEACCURACY[2.0]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
CS[ellipsoidal,2],
AXIS["geodetic latitude (Lat)",north,
ORDER[1],
ANGLEUNIT["degree",0.0174532925199433]],
AXIS["geodetic longitude (Lon)",east,
ORDER[2],
ANGLEUNIT["degree",0.0174532925199433]],
USAGE[
SCOPE["Horizontal component of 3D system."],
AREA["World."],
BBOX[-90,-180,90,180]],
ID["EPSG",4326]]
Data axis to CRS axis mapping: 2,1
Geometry Column = geometry
categories.main: String (0.0)
categories.alternate: StringList (0.0)
level: Integer (0.0)
socials: StringList (0.0)
subType: String (0.0)
numFloors: Integer (0.0)
entityId: String (0.0)
class: String (0.0)
sourceTags: String(JSON) (0.0)
localityType: String (0.0)
emails: StringList (0.0)
drivingSide: String (0.0)
adminLevel: Integer (0.0)
road: String (0.0)
isoCountryCodeAlpha2: String (0.0)
isoSubCountryCode: String (0.0)
updateTime: String (0.0)
wikidata: String (0.0)
confidence: Real (0.0)
defaultLanguage: String (0.0)
brand.wikidata: String (0.0)
isIntermittent: Integer(Boolean) (0.0)
connectors: StringList (0.0)
surface: String (0.0)
version: Integer (0.0)
phones: StringList (0.0)
id: String (0.0)
context: String (0.0)
height: Real (0.0)
maritime: Integer(Boolean) (0.0)
websites: StringList (0.0)
isSalt: Integer(Boolean) (0.0)
bbox.minx: Real (0.0)
bbox.maxx: Real (0.0)
bbox.miny: Real (0.0)
bbox.maxy: Real (0.0)
ERROR 1: ReadNext() failed: Malformed levels. min: 2 max: 2 out of range. Max Level: 1
Testing with the GDAL validate script from here:
apt-get install python3-pip --fix-missing
python3 -m pip install jsonschema
python3 validate_geoparquet.py --check-data test.geo.parquet
Warning 1: Field brand.names.common of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field brand.names.official of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field brand.names.alternate of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field brand.names.short of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field addresses of unhandled type list<element: struct<freeform: string, locality: string, postCode: string, region: string, country: string>> ignored
Warning 1: Field names.common of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field names.official of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field names.alternate of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field names.short of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field sources of unhandled type list<element: struct<property: string, dataset: string, recordId: string, confidence: double>> ignored
Segmentation fault
I don't think gpq currently provides a way to specify the target CRS. Also, I see that by default you use "OGC:CRS84" - what is your rationale for that? Why not, for example, use "EPSG:4326"?
I'll add a little bit of context on my use case. I just used gpq to convert a 'big' collection of Parquet files to GeoParquet by simply running gpq convert non-geo.parquet valid-geo.parquet in a for loop. Further along my processing chain I load these GeoParquet files using GeoPandas, but I ran into an issue: when the crs == "OGC:CRS84" it cannot be converted to an EPSG code. Although it's expected behaviour, I'm mostly just curious why you use "OGC:CRS84" instead of "EPSG:4326".
gdf = gpd.read_parquet("valid-geo.parquet")
print(gdf.crs.to_epsg()) # None
print(gdf.to_crs(4326).to_epsg()) # 4326
I'll probably change my routines from gdf.crs.to_epsg() to gdf.crs.to_string(), but I guess several others rely on to_epsg() as well when using GeoPandas, so I thought it was worth opening a discussion point here.
I ran the following commands:
gpq convert C:\Dev\rapidai4eo\stac-v1.0\rapidai4eo_v1_source_pf\pf.parquet rapidai4eo_v1_source_pf_snappy.geoparquet
gpq convert C:\Dev\rapidai4eo\stac-v1.0\rapidai4eo_v1_source_pf\pf.parquet rapidai4eo_v1_source_pf_snappy.geoparquet --compression="snappy"
The files I get are exactly the same. I suspect both files are gzip compressed and the compression parameter was not taken into account.
I used gpq convert to create a geoparquet file from a parquet file:
gpq convert C:\Dev\rapidai4eo\stac-v1.0\rapidai4eo_v1_source_pf\pf.parquet rapidai4eo_v1_source_pf_snappy.geoparquet --compression="snappy"
Afterwards, I ran gpq validate on it, and the new file is invalid: it is missing the GeoParquet version number (empty string).
v0.11.0 on Windows 10
The new gpq validation is awesome, but it'd be nice if it were easy to get a few more bits of info:
I could see two routes for this:
(This does highlight two potential 'warnings': if the reported bbox is much bigger than the actual bounds of the geometry, and if the geometry types list is more flexible than needed - like when it isn't specified but all the data is actually Polygons. Ideally there'd be nice quick operations in gpq to fix these.)
Trying to figure out if a parquet file at https://open.quiltdata.com/b/spatial-ucr/tree/census/administrative/counties.parquet is valid geoparquet. Run 'describe' and 'validate' and get:
gpq: error: command.ValidateCmd.Run(): unable to parse geo metadata: json: cannot unmarshal string into Go struct field GeometryColumn.columns.crs of type geoparquet.Proj
Note that 'convert' works fine, and then I can describe/validate the result.
I got an error when running gpq convert AR.parquet Argentina-overture.parquet - I was hoping it'd upgrade the file to 1.0.0 (and I could adjust row groups after I got the initial conversion working). AR.parquet is from https://data.source.coop/cholmes/overture/geoparquet-country-quad-2/AR.parquet, but I got issues with others at https://beta.source.coop/cholmes/overture/geoparquet-country-quad-2 too. They're files that I converted from Overture's Parquet distribution, loaded into DuckDB, wrote out as Parquet, and then used GeoPandas to turn into GeoParquet.
Error was:
gpq: error: command.ConvertCmd.Run(): transform generated an unexpected type, got struct, expected struct
Hi! Something I found useful when working with GeoParquet files is creating subsets of data, either by bbox or by excluding/selecting columns.
Rough suggested implementation:
gpq extract -bbox=120,10.1,121.4,11 -geom_col=geometry -exclude_cols=value,label source.geoparquet target.geoparquet
I can work on the implementation of this in the upcoming weeks but would like to know if others would find this useful!
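The bbox part of the proposed extract boils down to a bounds-intersection test per feature. A stdlib sketch of that test against GeoJSON-style coordinates (gpq itself would decode WKB; the polygon here is illustrative):

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # minx, miny, maxx, maxy

def bounds(coords) -> BBox:
    """Bounding box of arbitrarily nested GeoJSON-style coordinates."""
    xs, ys = [], []
    stack = [coords]
    while stack:
        item = stack.pop()
        if isinstance(item[0], (int, float)):  # reached a position [x, y]
            xs.append(item[0])
            ys.append(item[1])
        else:
            stack.extend(item)
    return min(xs), min(ys), max(xs), max(ys)

def intersects(a: BBox, b: BBox) -> bool:
    """True when two bounding boxes overlap (touching edges count)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

# The bbox from the suggested command, against an illustrative polygon ring.
query = (120.0, 10.1, 121.4, 11.0)
polygon = [[[120.5, 10.5], [120.9, 10.5], [120.9, 10.9], [120.5, 10.5]]]
print(intersects(bounds(polygon), query))
```

Column exclusion/selection would be simpler still: project the Parquet schema down to the requested columns before copying row groups.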