geoarrow-r's Introduction

GeoArrow Specification

This repository contains a specification for storing geospatial data in Apache Arrow and Arrow-compatible data structures and formats.

The Apache Arrow project specifies a standardized, language-independent columnar memory format. It enables shared computational libraries, zero-copy shared memory, streaming messaging, and interprocess communication, and it is supported by many programming languages and data libraries.

Spatial information can be represented as a collection of discrete objects using points, lines and polygons (i.e., vector data). The Simple Feature Access standard provides a widely used abstraction, defining a set of geometries: Point, LineString, Polygon, MultiPoint, MultiLineString, MultiPolygon, and GeometryCollection. Next to a geometry, simple features can also have non-spatial attributes that describe the feature.

Geospatial data often comes in tabular format, with one or more columns with feature geometries and additional columns with feature attributes. The Arrow columnar memory model is well-suited to store both vector features and their attribute data. The GeoArrow specification defines how the vector features (geometries) can be stored in Arrow (and Arrow-compatible) data structures.

This repository contains the specifications for:

  • The memory layout for storing geometries in an Arrow array
  • The Arrow extension type definitions that ensure type-level metadata (e.g., CRS) is propagated when used in Arrow implementations

Defining a standard and efficient way to store geospatial data in the Arrow memory layout enables interoperability between different tools and ensures geospatial tools can leverage the growing Apache Arrow ecosystem:

  • Efficient, columnar file formats. Leveraging the performant and compact storage of Apache Parquet as a vector data format in geospatial tools using GeoParquet
  • Accelerated between-process geospatial data exchange using Apache Arrow IPC message format and Apache Arrow Flight
  • Zero-copy in-process geospatial data transport using the Apache Arrow C Data Interface (e.g., GDAL)
  • Shared libraries for geospatial data type representation and computation for query engines that support columnar data formats (e.g., Velox, DuckDB, and Acero)

Relationship with GeoParquet

The GeoParquet specification originally started in this repo, but was moved out into its own repo, leaving this repo to focus on the Arrow-specific specifications (Arrow layout and extension type metadata). Whereas GeoParquet is a file-level metadata specification, GeoArrow is a field-level metadata and memory layout specification that applies in-memory (e.g., an Arrow array), on disk (e.g., using Parquet readers/writers provided by an Arrow implementation), and over the wire (e.g., using the Arrow IPC format).


Implementations

  • geoarrow-c: geospatial type system and generic coordinate-shuffling library written in C with bindings in C++, R, and Python
  • geoarrow-rs: Rust implementation of the GeoArrow specification and bindings to GeoRust algorithms for efficient spatial operations on GeoArrow memory. Includes JavaScript (WebAssembly) bindings.
  • geoarrow-python: Python bindings to geoarrow-c and geoarrow-rs that provide integrations with libraries like pyarrow, pandas, and geopandas.
  • geoarrow-wasm: WebAssembly module based on geoarrow-rs

geoarrow-r's Issues

Support RecordBatch with geoarrow

Using some arrow-rs, geoarrow-rust, and extendr magic, I am able to return a RecordBatch with a geoarrow array in it to R as a nanoarrow_array_stream. However, using geoarrow-r I have not been able to get this as a geoarrow array. I can get it into a data.frame, but without a proper geometry column:

#> ℹ Loading serdesri
furl <- ""
url <- paste0(furl, "/query?where=1=1&outFields=*&f=json&resultRecordCount=100")
req <- httr2::request(url)
resp <- httr2::req_perform(req)
json <- httr2::resp_body_string(resp)

# parse body as RecordBatch
res <- parse_esri_json_raw_geoarrow(resp$body, 2)
#> <nanoarrow_array_stream struct<OBJECTID: int64, NAME: string, STATE_NAME: string, STATE_FIPS: string, FIPS: string, SQMI: double, POPULATION: int32, POP_SQMI: double, STATE_ABBR: string, COUNTY_FIPS: string, Shape__Area: double, Shape__Length: double, geometry: geoarrow.polygon{list<rings: list<vertices: fixed_size_list(2)<xy: double>>>}>>
#>  $ get_schema:function ()  
#>  $ get_next  :function (schema = x$get_schema(), validate = TRUE)  
#>  $ release   :function ()  

x <-
#> Warning in warn_unregistered_extension_type(x): geometry: Converting unknown
#> extension geoarrow.polygon{list<rings: list<vertices: fixed_size_list(2)<xy:
#> double>>>} as storage type
#> Warning in warn_unregistered_extension_type(storage): geometry: Converting
#> unknown extension geoarrow.polygon{list<rings: list<vertices:
#> fixed_size_list(2)<xy: double>>>} as storage type
#> <list_of<list_of<list_of<double>>>[6]>
#> [[1]]
#> <list_of<list_of<double>>[1]>
#> [[1]]
#> <list_of<double>[39]>
#> ... truncated it for everyone's sake

as_geoarrow_array() method for sf objects

I suspected that as_geoarrow_array() would work for anything for which wk::is_handleable() returns TRUE. However, as_geoarrow_array() fails on sf objects but succeeds on s3 objects.

WKB with non-2D dimensions doesn't follow ISO encoding

I noticed that WKB geometries with XYZ, XYM, or XYZM coordinate dimensions use bits 30 and 31 (the two most-significant bits) of the int32 geometry type at offset 1 of the WKB, rather than the 2D_code+1000, 2D_code+2000, and 2D_code+3000 codes used by ISO WKB.
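
To make the difference concrete, here is a minimal Python sketch (standard library only) that reads the type code of the outermost geometry and reports whether the dimension is encoded with high-bit flags or with ISO thousands offsets. The function names `wkb_type_info()` and `to_iso_type_code()` are hypothetical and are not part of geoarrow; the flag constants are the ones commonly used by EWKB-style writers.

```python
import struct

# High-bit dimension flags used by EWKB-style writers.
EWKB_Z = 0x80000000
EWKB_M = 0x40000000
EWKB_SRID = 0x20000000


def wkb_type_info(wkb: bytes):
    """Return (base_type, has_z, has_m, flavor) for the outermost geometry."""
    byte_order = wkb[0]  # 0 = big endian, 1 = little endian
    fmt = "<I" if byte_order == 1 else ">I"
    (code,) = struct.unpack_from(fmt, wkb, 1)  # type code lives at offset 1

    if code & (EWKB_Z | EWKB_M | EWKB_SRID):
        # High bits set: dimensions are encoded as flags.
        return code & 0xFF, bool(code & EWKB_Z), bool(code & EWKB_M), "ewkb"

    # ISO: the thousands digit carries the dimension (1 = Z, 2 = M, 3 = ZM).
    dims, base = divmod(code, 1000)
    return base, dims in (1, 3), dims in (2, 3), "iso" if dims else "2d"


def to_iso_type_code(base_type: int, has_z: bool, has_m: bool) -> int:
    """Build the ISO type code: 2D code + 1000 for Z, + 2000 for M."""
    return base_type + 1000 * has_z + 2000 * has_m


# POINT Z (1 2 3), little endian, written both ways.
ewkb_point_z = struct.pack("<BIddd", 1, 1 | EWKB_Z, 1.0, 2.0, 3.0)
iso_point_z = struct.pack("<BIddd", 1, 1001, 1.0, 2.0, 3.0)

print(wkb_type_info(ewkb_point_z))
print(wkb_type_info(iso_point_z))
```

A writer that wants ISO output could decode the flags once and re-emit the type code with `to_iso_type_code()`; nested geometries in multi-part types would need the same treatment for each part.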

Handling of geoparquet when not loading `geoarrow`

First of all, thanks for this awesome work. It's been great to see the progress on all this :-)

In the example on the readme, you load a .parquet file that contains a geometry column. Since there is no separate naming format or convention (e.g., .geo.parquet or .geoparquet), I might not know that there is a geometry in there, so I just load arrow and open the dataset as normal. Looking at the geometry column would then be confusing to me. The behavior differs depending on whether I have the geoarrow package loaded or not.


open_dataset("~/Desktop/nc.parquet") |>
  head(n = 1) |>
  pull(geometry, as_vector = TRUE)
#> <arrow_binary[1]>
#> [1] 01, 06, 00, 00, 00, 01, 00, 00, 00, 01, 03, 00, 00, 00, 01, 00, 00, 00, 1b, 00, 00, 00, 00, 00, 00, a0, 41, 5e, 54, c0, 00, 00, ...

open_dataset("~/Desktop/nc.parquet") |>
  head(n = 1) |>
  pull(geometry, as_vector = TRUE)
#> <geoarrow_wkb[1]>
#> [1] MULTIPOLYGON (((-81.47276 36.23436, -81.54084 36.27251, -81.56198 36.27359, -81.63306 36.34069, -81.74107 36.39178, -81.69828 36.47178...

This issue should perhaps be in the R arrow package, but I'm wondering if arrow should detect when a geometry column is present and adjust its behavior (the metadata is in there, so this information is known). For example, when calling collect(), should there be a warning that a geometry column is being collected and that geoarrow::st_collect() might be the better option (as in #21)? Or a warning when opening a geoparquet without geoarrow loaded?


nc = open_dataset("~/Desktop/nc.parquet") 
# We know there is a geometry from the metadata
#> [1] "{\"version\":\"0.3.0\",\"primary_column\":\"geometry\",\"columns\":{\"geometry\":{\"encoding\":\"WKB\",\"crs\":\"GEOGCS[\\\"NAD27\\\",DATUM[\\\"North_American_Datum_1927\\\",SPHEROID[\\\"Clarke 1866\\\",6378206.4,294.978698213898]],PRIMEM[\\\"Greenwich\\\",0],UNIT[\\\"degree\\\",0.0174532925199433,AUTHORITY[\\\"EPSG\\\",\\\"9122\\\"]],AXIS[\\\"Latitude\\\",NORTH],AXIS[\\\"Longitude\\\",EAST],AUTHORITY[\\\"EPSG\\\",\\\"4267\\\"]]\",\"bbox\":[-84.3239,33.882,-75.457,36.5896],\"geometry_type\":\"MultiPolygon\"}}}"
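
As an illustration of the detection idea, here is a minimal Python sketch (standard library only) that inspects a metadata string like the one above and lists the geometry columns it declares. The helper name `geometry_columns()` is hypothetical, the metadata literal is an abbreviated copy of the output above with the CRS string omitted, and how the string is obtained from the file is left out.

```python
import json

# Abbreviated copy of the metadata value shown above (CRS string omitted).
geo_metadata = (
    '{"version":"0.3.0","primary_column":"geometry",'
    '"columns":{"geometry":{"encoding":"WKB",'
    '"bbox":[-84.3239,33.882,-75.457,36.5896],'
    '"geometry_type":"MultiPolygon"}}}'
)


def geometry_columns(geo_json):
    """Map each declared geometry column name to its encoding."""
    if geo_json is None:
        return {}  # no geospatial metadata: treat as a plain table
    meta = json.loads(geo_json)
    return {
        name: spec.get("encoding")
        for name, spec in meta.get("columns", {}).items()
    }


cols = geometry_columns(geo_metadata)
if cols:
    # A reader could emit a warning here instead of printing.
    print(f"geometry columns present: {cols}")
```

A check of this kind only needs the schema-level metadata, so it could run when the dataset is opened rather than when it is collected.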

`st_collect()`, `st_as_sf()`, and default conversion from Arrow to R

Right now, geoarrow doesn't convert to sf by default and instead maintains a zero-copy shell around the ChunkedArray from whence it came. This is instantaneous and is kind of like ALTREP for geometry, since we can't do ALTREP on lists like Arrow does for character, integer, and factor. This is up to 10x faster and prevents a full copy of the geometry column. I also rather like that it maintains neutrality between terra, sf, vapour, wk, or others that may come along in the future...who are we to guess where the user wants to put the geometry column next? The destination could be Arrow itself (e.g., via group_by() %>% write_dataset()), or the column could get dropped, filtered, or rearranged before calling an sf method.

However, 99% of the time a user just wants an sf object. After #20 we can use sf::st_as_sf() on an arrow_dplyr_query to collect() it into an sf object, and @boshek suggested st_collect(), which is a way better name and is more explicit than st_as_sf(). There are also st_geometry(), st_crs(), st_bbox(), and st_as_crs() methods for the geoarrow_vctr column; however, we still get an awkward error if we collect() and then try to convert to sf:

vctr <- geoarrow::geoarrow(wk::wkt("POINT (0 1)", crs = "OGC:CRS84"))
df <- data.frame(geometry = vctr)
#> Error in st_sf(x, ..., agr = agr, sf_column_name = sf_column_name): no simple features geometry column present

That might be solvable in sf, although I'd like to give the current implementation a chance to get tested to collect feedback on whether this is or is not a problem for anybody before committing to the current zero-copy-shell-by-default.

Release geoarrow 0.1.0

(this is still a few months off, but is a hook to keep track of/discuss progress related to the initial release)

First release:

Prepare for release:

  • devtools::build_readme()
  • urlchecker::url_check()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • rhub::check(platform = 'ubuntu-rchk')
  • rhub::check_with_sanitizers()
  • Review pkgdown reference index for, e.g., missing topics
  • Draft blog post

Submit to CRAN:

  • usethis::use_version('minor')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • usethis::use_news_md()
  • Finish blog post
  • Tweet
  • Add link to blog post in pkgdown news menu

Handle multiple dimensions among features/respect strict = TRUE

A few options:

  • Error (what happens now)
  • Fill extra dimensions with NaN if strict is TRUE and there are extra dimensions
  • Drop dimensions if strict is TRUE and the dimension isn't supposed to be there

Perhaps all of those (make a user opt in to extra dimensions filled with NaN)? Either way, strict = TRUE might not be respected or might give a different error because the schemas aren't compatible (clearly this isn't tested).
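
The three options can be sketched for a single coordinate as follows. This is a Python illustration only, not geoarrow's implementation; the function `coerce_coord()` and its `on_mismatch` argument are hypothetical names introduced for the example.

```python
import math


def coerce_coord(coord, n_dims, on_mismatch="error"):
    """Coerce one coordinate tuple to exactly n_dims values.

    on_mismatch:
      "error": raise if the dimensions differ
      "fill":  pad missing trailing dimensions with NaN
      "drop":  discard trailing dimensions that should not be there
    """
    if len(coord) == n_dims:
        return tuple(coord)
    if on_mismatch == "fill" and len(coord) < n_dims:
        return tuple(coord) + (math.nan,) * (n_dims - len(coord))
    if on_mismatch == "drop" and len(coord) > n_dims:
        return tuple(coord[:n_dims])
    raise ValueError(f"expected {n_dims} dimensions, got {len(coord)}")


print(coerce_coord((1.0, 2.0), 3, on_mismatch="fill"))
print(coerce_coord((1.0, 2.0, 3.0), 2, on_mismatch="drop"))
```

Keeping "error" as the default matches the current behavior described above, while the other two modes are explicit opt-ins.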

point-default.parquet is not readable with pyarrow / arrow C++

>>> import pyarrow.parquet as pq
>>> pq.read_table('inst/example_parquet/point-default.parquet')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/even/arrow/cpp/build/myvenv/lib/python3.8/site-packages/pyarrow/", line 1996, in read_table
    return, use_threads=use_threads,
  File "/home/even/arrow/cpp/build/myvenv/lib/python3.8/site-packages/pyarrow/", line 1831, in read
    table = self._dataset.to_table(
  File "pyarrow/_dataset.pyx", line 323, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 2311, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Expected all lists to be of size=2 but index 3 had size=0

On an OGR Parquet driver I'm developing, I can also reproduce the same issue with a NULL Point. It seems that the Arrow C++ library doesn't correctly handle writing (or reading? I'm not sure which side is broken) a NULL entry for a FixedSizeList in the Parquet format (this works correctly for Feather). The workaround I found is to write a POINT EMPTY instead of a NULL entry.
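
A minimal Python sketch of that workaround, assuming POINT EMPTY is represented as a coordinate pair whose values are all NaN (an assumption of this sketch, not something stated above). The helper names are hypothetical, and building the actual Arrow array from the result is left out.

```python
import math

# Assumed stand-in for POINT EMPTY: every ordinate is NaN.
EMPTY_POINT = (math.nan, math.nan)


def points_for_fixed_size_list(points):
    """Replace missing points so every entry has exactly two values."""
    return [EMPTY_POINT if p is None else tuple(p) for p in points]


def is_empty_point(p):
    """True if every ordinate of the point is NaN."""
    return all(math.isnan(v) for v in p)


rows = points_for_fixed_size_list([(30.0, 10.0), None, (0.0, 1.0)])
print([is_empty_point(p) for p in rows])
```

The cost of this approach is that a reader can no longer tell a NULL feature from an empty point, so it only suits data where that distinction does not matter.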

Error when loading the example - missing Z values

Hi there,

When I try to load a geoparquet (including the example), I get the following error. I think this is to do with the Z dimension, because when I tried it with a geoparquet that has a Z dimension it loaded fine.

nc <- sf::read_sf(system.file("shape/nc.shp", package = "sf"))
write_geoparquet(nc, "nc.parquet")
read_geoparquet_sf("nc.parquet")

Error in geoarrow_schema_wkb(name = schema$name, format = schema$format, : startsWith(format, "w:") || isTRUE(format %in% c("z", "Z")) is not TRUE

`geoarrow()` drops CRS sometimes?

# convert geometry to geoarrow encoding
  geom <- as_geoarrow(
    schema_override = geoarrow_schema_wkb()
  # TODO: this shouldn't drop CRS but it does
  geom <- geoarrow(geom)

geoarrow still drops CRS

Hello, I see that #16 is marked as closed. However I still encounter this issue on a newly installed version of the package on Windows (R 4.2.2).

# Fetch data and transform to sf
countries <- world(resolution=2, level=0, path = ".") 
countries <- st_as_sf(countries)

# Write and read the output
write_geoparquet(countries, "countries.parquet")
countries_reload <- read_geoparquet("countries.parquet")

# There is no crs in the reloaded file
# Coordinate Reference System: NA

# There was one in the initial object
# Coordinate Reference System:
#   User input: GEOGCRS["unknown",
#     DATUM["World Geodetic System 1984",
#         ELLIPSOID["WGS 84",6378137,298.257223563,
#             LENGTHUNIT["metre",1]],
#         ID["EPSG",6326]],
#     PRIMEM["Greenwich",0,
#         ANGLEUNIT["degree",0.0174532925199433],
#         ID["EPSG",8901]],
#     CS[ellipsoidal,2],
#         AXIS["longitude",east,
#             ORDER[1],
#             ANGLEUNIT["degree",0.0174532925199433,
#                 ID["EPSG",9122]]],
#         AXIS["latitude",north,
#             ORDER[2],
#             ANGLEUNIT["degree",0.0174532925199433,
#                 ID["EPSG",9122]]]] 
#   wkt:
# GEOGCRS["unknown",
#     DATUM["World Geodetic System 1984",
#         ELLIPSOID["WGS 84",6378137,298.257223563,
#             LENGTHUNIT["metre",1]],
#         ID["EPSG",6326]],
#     PRIMEM["Greenwich",0,
#         ANGLEUNIT["degree",0.0174532925199433],
#         ID["EPSG",8901]],
#     CS[ellipsoidal,2],
#         AXIS["longitude",east,
#             ORDER[1],
#             ANGLEUNIT["degree",0.0174532925199433,
#                 ID["EPSG",9122]]],
#         AXIS["latitude",north,
#             ORDER[2],
#             ANGLEUNIT["degree",0.0174532925199433,
#                 ID["EPSG",9122]]]] 

geoarrow schema not interpretable by `arrow::as_schema()`

The schema created by infer_geoarrow_schema() cannot be parsed by arrow::as_schema(). I am also having problems parsing the schema using Rust FFI bindings. I wonder if these could be related.

x <- sf::st_read(system.file("shape/nc.shp", package = "sf")) |> 
  sf::st_geometry() |> 
#> Reading layer `nc' from data source 
#>   `/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/sf/shape/nc.shp' 
#>   using driver `ESRI Shapefile'
#> Simple feature collection with 100 features and 14 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> Geodetic CRS:  NAD27

geoarrow::infer_geoarrow_schema(x) |> 
#> Error: Invalid: Cannot import schema: ArrowSchema describes non-struct type geoarrow.multipolygon <CRS: {
#>   "$schema": "https://pro...

Created on 2024-01-28 with reprex v2.0.2

Notebook Viewer in RStudio errors when viewing a geoarrow_vctr

#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>     filter, lag
#> The following objects are masked from 'package:base':
#>     intersect, setdiff, setequal, union
library(arrow, warn.conflicts = FALSE)

bucket <- s3_bucket("voltrondata-public-datasets")
ds <- open_dataset(bucket$path("phl-parking"))
ds %>% 
  head() %>% 
#> # A tibble: 6 × 13
#>   anon_ticket_number issue_datetime      state anon_plate_id division location  
#>                <int> <dttm>              <chr>         <int>    <int> <chr>     
#> 1              39985 2011-12-31 21:17:00 PA          1606959       NA 832 N 40T…
#> 2              41812 2011-12-31 21:54:00 PA           503820       NA 7200 N 19…
#> 3              41814 2011-12-31 21:45:00 PA          1102245       NA 7900 PROV…
#> 4              46288 2011-12-31 20:09:00 NJ           427139       NA 450 N 6TH…
#> 5              46289 2011-12-31 20:10:00 NJ           308463       NA 448 N 6TH…
#> 6              46290 2011-12-31 20:12:00 PA          1585402       NA 446 N 6TH…
#> # … with 7 more variables: violation_desc <chr>, fine <dbl>,
#> #   issuing_agency <chr>, gps <lgl>, zip_code <int>, geometry <grrw_pnt>,
#> #   year <int>

Except in an RStudio Notebook, where I get:

Error in wk_handle.geoarrow_vctr(handleable, wkt_format_handler(precision = precision,  : 
  `` is an external pointer to NULL

Release geoarrow 0.2.0

First release:

Prepare for release:

  • git pull
  • urlchecker::url_check()
  • devtools::build_readme()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • git push
  • Draft blog post

Submit to CRAN:

  • usethis::use_version('minor')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • Finish & publish blog post
  • Add link to blog post in pkgdown news menu
  • usethis::use_github_release()
  • usethis::use_dev_version(push = TRUE)
  • usethis::use_news_md()
  • Tweet

