arctern-io / arctern

License: Apache License 2.0

Languages: CMake 2.62%, Shell 6.14%, Cuda 1.96%, C++ 43.68%, Python 34.50%, C 0.21%, Dockerfile 0.90%, Groovy 0.30%, TSQL 9.69%
Topics: gis, gis-platform, giscience, geospatial, geolocation, gpu-acceleration, gpu-programming

arctern's Introduction

Arctern Docs

Arctern Documentation (Chinese)

Overview

Arctern is a fast, scalable spatial-temporal analytics framework.

Scalability is key to building productive data science pipelines. To address this challenge, we launched Arctern, an open-source spatial-temporal analytics framework for boosting end-to-end data science performance. Arctern aims to improve scalability in two respects:

  • A unified data analytics and processing interface across platforms, from laptops to clusters and the cloud.
  • A rich and consistent set of algorithms and models, including trajectory processing, spatial clustering, and regression, across the stages of the data science pipeline.

Arctern's approach and current progress

We adopt GeoPandas's interface and plan to build GeoDataFrame/GeoSeries implementations that scale both up and out. On top of GeoDataFrame/GeoSeries, we will develop a consistent spatial-temporal algorithm set across execution environments.

We have developed an efficient multi-threaded GeoSeries implementation, and a distributed version is in progress. In the latest version, 0.2.0, Arctern achieves a 24x speedup over GeoPandas. Even under single-threaded execution, Arctern outperforms GeoPandas by 7x on average. The detailed evaluation results are illustrated in the figure below.

We are also conducting experimental GPU acceleration for spatial-temporal data analysis and rendering. So far, Arctern provides six GPU-accelerated rendering methods and eight spatial-relation operations, which outperform their CPU-based counterparts by up to 36x.

In the next few releases, our team will focus on:

  • Developing a distributed version of GeoSeries. Our first distributed implementation of GeoDataFrame/GeoSeries will be based on Spark and has been developed in sync with Spark 3.0 since its preview release. Spark's support for GPU scheduling and column-based processing is highly in line with our idea of high-performance spatial-temporal data processing. Besides, the newly introduced Koalas interface offers a promising option for implementing consistent GeoDataFrame/GeoSeries interfaces on Spark.
  • Enriching our spatial-temporal algorithm set. We will concentrate on KNN search and trajectory analysis in the project's early stages.

arctern's People

Contributors

become-nice, bigsheeper, czpmango, czs007, emma-song, fluorinedog, guorentong, guoxiangzhou, jeffoverflow, liangliu, loguo, longjiquan, neza2017, shengjh, superbigdove, talentan, xiaocai2333, xige-16, yxm1536


arctern's Issues

WKT ambiguity

I found that the WKT format does not specify a coordinate system, which means that a WKT string can be converted to a spatial object in any coordinate system. This may be an issue to consider, since arctern's current interfaces are defined in terms of WKT.

I did the following tests to verify the above view:

select st_distance('LINESTRING (11 2,3 4)'::geometry,'POLYGON ((0 0,0 1,3 3,1 0,0 0))'::geometry) ; -- sql1
select st_distance('LINESTRING (11 2,3 4)'::geography,'POLYGON ((0 0,0 1,3 3,1 0,0 0))'::geography) ; -- sql2
select st_distance('LINESTRING (11 2,3 4)'::geography,'POLYGON ((0 0,0 1,3 3,1 0,0 0))'::geometry) ; -- sql3

The results are :

sql1 : 0.970142500145332
sql2 : 107417.14877794
sql3 : 107417.14877794   (just same as sql2)

You can see that the sql1 and sql2 results are different.

Therefore, I tried adding extra information to the WKT string to avoid the ambiguity caused by the above phenomenon.

Here is my test SQL statement (I chose POINT and LINESTRING to avoid possible errors):

SELECT st_distance(
ST_Transform(ST_GeomFromText('POINT (1 1)',4326),3857),
ST_Transform(ST_GeomFromText('LINESTRING (0 0,0 1)',4326),3857)
); -- sql4

select st_distance('POINT(1 1)'::geography,'LINESTRING(0 0,0 1)'::geography); -- sql5

The results are :

sql4 : 111319.490793272
sql5 : 111302.64933943

The results of sql4 and sql5 are close to each other. I am not sure whether the difference is error introduced by the coordinate system transformation, but this also shows that adding the extra information avoids the ambiguity described above.

Note: all tests were run in postgis.
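The ambiguity can also be seen without a database: the same coordinates give a unitless planar distance when treated as plain geometry, and a distance in meters when treated as lon/lat on a sphere. Below is a stdlib-only sketch; the spherical radius and the choice of measuring to the nearest vertex are assumptions of this illustration, not arctern or postgis internals.

```python
import math

def planar_point_segment_distance(px, py, ax, ay, bx, by):
    """Euclidean distance from point P to segment AB, in coordinate units."""
    abx, aby = bx - ax, by - ay
    apx, apy = px - ax, py - ay
    denom = abx * abx + aby * aby
    t = 0.0 if denom == 0 else max(0.0, min(1.0, (apx * abx + apy * aby) / denom))
    cx, cy = ax + t * abx, ay + t * aby
    return math.hypot(px - cx, py - cy)

def haversine_m(lon1, lat1, lon2, lat2, radius=6_371_000.0):
    """Great-circle distance in meters between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))

# 'POINT (1 1)' vs 'LINESTRING (0 0, 0 1)':
planar = planar_point_segment_distance(1, 1, 0, 0, 0, 1)  # 1.0, in "degrees"
# Treating the coordinates as lon/lat, measure to the nearest vertex (0, 1):
geodesic = haversine_m(1, 1, 0, 1)                        # roughly 111 km
```

The two numbers mirror the sql4/sql5 gap above: identical WKT coordinates, wildly different distances depending on the interpretation.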

ST_IsValid bug and other function implementation related to IsValid

I found some differences in arctern's parsing rules for WKT strings: some inputs that produce an error in postgis do not in arctern.

I tested the ST_IsValid function in arctern :

def run_st_tmp(spark):
    register_funcs(spark)
    input = []

    input.extend([('POINT (1 8 2 4 )kdjff',)])
    input.extend([('POLYGON ((1 1,1 2,2 2,2 1,1 1)),((dkjfkjd0 0,1 -1,3 4,-2 3,0 0))',)])

    df = spark.createDataFrame(data=input, schema=['geos']).cache()
    df.createOrReplaceTempView("t1")
    spark.sql("select ST_IsValid_UDF(geos) from t1").show(100,0)

I got the following results :

+--------------------+
|ST_IsValid_UDF(geos)|
+--------------------+
|    true            |
|    true            |
+--------------------+

Our ST_IsValid implementation first calls OGRGeometryFactory::createFromWkt, but createFromWkt's input checking is weak, so it does not produce correct results for malformed input.

I also looked at the implementation of other functions. There is no IsValid check before calling the gdal API. The gdal API documentation states:

"Geometry validity is not checked. In case you are unsure of the validity of the input geometries, call IsValid() before, otherwise the result might be wrong."

So here are two suggestions:

  • The ordering of OGRGeometryFactory::createFromWkt's legitimacy check and the OGR_G_IsValid check needs to be examined.
  • The gdal functions are not responsible for validity checking, so our other functions should do an IsValid check before calling the gdal C API.

Got the following error while running spark tests: run_st_transform

The error occurs when running run_st_transform(spark_session).

file path: GIS/spark/pyspark/example/gis/spark_udf_ex.py

ERROR 1: PROJ: proj_create_from_database: Open of /home/liangliu/anaconda3/envs/zgis_dev/share/proj failed
terminate called after throwing an instance of 'std::runtime_error*'

ST_Overlaps bug

I got different output when I used a specific wkt as input to the ST_Overlaps function (compared to geospark).


  • geospark test :
    spark.sql("SELECT ST_Overlaps ( ST_GeomFromWKT('POLYGON ((0 0,0 1,1 1,1 0,0 0))') , ST_GeomFromWKT('MULTIPOLYGON ( ((0 0, 0 2, 2 3,2 0,0 0)) )') )").show(false)
    output : false

  • GIS test :
    wkt_arrow_array1 = { POLYGON ((0 0,0 1,1 1,1 0,0 0))}
    wkt_arrow_array2 = { MULTIPOLYGON ( ((0 0, 0 2, 2 3,2 0,0 0)) )}
    zilliz::gis::ST_Overlaps(wkt_arrow_array1,wkt_arrow_array2)
    output : true

  • postgis test :
    select st_overlaps('POLYGON ((0 0,0 1,1 1,1 0,0 0))'::geometry,'MULTIPOLYGON ( ((0 0, 0 2, 2 3,2 0,0 0)) )'::geometry);
    output : false

st_distance difference

in postgis, the distance to an empty geometry is empty (NULL):

postgres=# SELECT ST_distance('POINT EMPTY'::geometry,'POINT(1 2)'::geometry);
 st_distance
-------------

(1 row)

in arctern, the result is 0

different st_equals results between arctern and postgis

in arctern, the results for the following data are all false:
select st_equals_udf(left, right) as geos from test_equals

in postgis, these queries all return true:
select st_equals('LINESTRING (0 0, 10 10)'::geometry, 'LINESTRING (0 0, 5 5, 10 10)'::geometry);
select st_equals('LINESTRING (10 10, 0 0)'::geometry, 'LINESTRING (0 0, 5 5, 10 10)'::geometry);
select st_equals('LINESTRING(0 0, 1 1)'::geometry, 'LINESTRING(1 1, 0 0)'::geometry);
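A likely explanation: st_equals in postgis is spatial rather than textual, and 'LINESTRING (0 0, 5 5, 10 10)' traces exactly the same point set as 'LINESTRING (0 0, 10 10)' because the extra vertex (5 5) is collinear with the endpoints. The on-segment test that such a spatial-equality check reduces to can be sketched as follows (the tolerance eps is an assumption of this sketch):

```python
def on_segment(px, py, ax, ay, bx, by, eps=1e-9):
    """True if point P lies on segment AB (collinear and within its bounds)."""
    cross = (bx - ax) * (py - ay) - (by - ay) * (px - ax)
    if abs(cross) > eps:
        return False  # not collinear with AB
    dot = (px - ax) * (bx - ax) + (py - ay) * (by - ay)
    return -eps <= dot <= (bx - ax) ** 2 + (by - ay) ** 2 + eps

# The extra vertex of the longer linestring lies on the shorter one:
on_segment(5, 5, 0, 0, 10, 10)    # True
# Vertex direction does not matter for the traced point set:
on_segment(5, 5, 10, 10, 0, 0)    # True
```

A byte-for-byte comparison of the WKT strings (or of their vertex lists) would return false for all three queries, which matches the behavior reported for arctern.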

ST_Union_Aggr exception

I got an exception when running my test code below:

from pyspark.sql import SparkSession
from zilliz_pyspark import register_funcs

def run_st_union(spark):
    test_df = spark.read.json("/xxx/st_union.json").cache()
    test_df.createOrReplaceTempView("st_union")
    register_funcs(spark)
    spark.sql("select ST_Union_Aggr_UDF(geos) from (select ST_PolygonFromEnvelope_UDF(a,c,b,d) as geos from st_union) as foo").show(100,0)

#main here.

st_union.json looks like:

{"a": 13.9, "c": 82.2, "b": 19.1, "d": 83.4}
{"a": 10.1, "c": 91.9, "b": 19.7, "d": 98.3}
{"a": 16.1, "c": 93.3, "b": 16.6, "d": 94.0}
{"a": 11.0, "c": 88.3, "b": 18.7, "d": 98.2}
{"a": 13.9, "c": 82.2, "b": 19.1, "d": 83.4}
{"a": 12.0, "c": 81.5, "b": 16.2, "d": 90.6}
{"a": 10.4, "c": 87.5, "b": 11.7, "d": 92.2}
{"a": 15.5, "c": 88.7, "b": 18.6, "d": 98.4}
{"a": 14.8, "c": 83.0, "b": 16.9, "d": 85.6}
{"a": 10.8, "c": 83.9, "b": 16.5, "d": 84.4}
{"a": 12.5, "c": 80.8, "b": 14.8, "d": 97.1}

The error message is:

ERROR 1: TopologyException: Input geom 0 is invalid: Self-intersection at or near point 14.899999999999999 95.099999999999994 at 14.899999999999999 95.099999999999994
20/02/29 15:43:16 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)

postgis test :

sql :
drop table t1;
create table t1 (a real,c real,b real,d real);
insert into t1 values 
(10.1,91.9,19.7,98.3),
(16.1,93.3,16.6,94.0),
(11.0,88.3,18.7,98.2),
(13.9,82.2,19.1,83.4),
(12.0,81.5,16.2,90.6),
(10.4,87.5,11.7,92.2),
(15.5,88.7,18.6,98.4),
(14.8,83.0,16.9,85.6),
(10.8,83.9,16.5,84.4),
(12.5,80.8,14.8,97.1)
;
select st_astext(st_union(geo)) from (select st_makeEnvelope(a,c,b,d) as geo from t1) as foo;

result :
 POLYGON((16.8999996185303 83.4000015258789,19.1000003814697 83.4000015258789,19.1000003814697 82.1999969482422,16.2000007629395 82.1999969482422,16.2000007629395 81.5,14.8000001907349 81.5,14.8000001907349 80.8000030517578,12.5 80.8000030517578,12.5 81.5,12 81.5,12 83.9000015258789,10.8000001907349 83.9000015258789,10.8000001907349 84.4000015258789,12 84.4000015258789,12 88.3000030517578,11.6999998092651 88.3000030517578,11.6999998092651 87.5,10.3999996185303 87.5,10.3999996185303 91.9000015258789,10.1000003814697 91.9000015258789,10.1000003814697 98.3000030517578,15.5 98.3000030517578,15.5 98.4000015258789,18.6000003814697 98.4000015258789,18.6000003814697 98.3000030517578,19.7000007629395 98.3000030517578,19.7000007629395 91.9000015258789,18.7000007629395 91.9000015258789,18.7000007629395 88.3000030517578,16.2000007629395 88.3000030517578,16.2000007629395 85.5999984741211,16.8999996185303 85.5999984741211,16.8999996185303 83.4000015258789))
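For reference when checking the inputs to the union, ST_PolygonFromEnvelope(min_x, min_y, max_x, max_y) builds an axis-aligned rectangle. A minimal pure-Python sketch of that construction (the counter-clockwise vertex order is an assumption of this sketch):

```python
def polygon_from_envelope(min_x, min_y, max_x, max_y):
    """WKT rectangle for the envelope: a closed ring, counter-clockwise."""
    return ("POLYGON (({0} {1}, {2} {1}, {2} {3}, {0} {3}, {0} {1}))"
            .format(min_x, min_y, max_x, max_y))

# First row of st_union.json: a=13.9, c=82.2, b=19.1, d=83.4
polygon_from_envelope(13.9, 82.2, 19.1, 83.4)
# 'POLYGON ((13.9 82.2, 19.1 82.2, 19.1 83.4, 13.9 83.4, 13.9 82.2))'
```

Each such rectangle is individually valid, so the TopologyException above points at the intermediate results of the aggregation rather than at the envelope construction.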

ST_Buffer bug

I got different output when I used a specific wkt as input to the ST_Buffer function (compared to geospark).


  • geospark test :
    spark.sql("SELECT ST_Buffer( ST_GeomFromWKT('MULTIPOLYGON ( ((0 0, 1 4, 1 0,0 0)), ((0 0,1 0,0 1,0 0)) )') , 0)").show(1,0)
    output : POLYGON ((0.2 0.8, 1 4, 1 0, 0.2 0.8))

  • GIS test :
    wkt_arrow_array = {MULTIPOLYGON ( ((0 0, 1 4, 1 0,0 0)), ((0 0,1 0,0 1,0 0)) ) }
    zilliz::gis::ST_Buffer(wkt_arrow_array,0)
    output : POLYGON ((0 0,0 1,0.2 0.8,1 4,1 0,0 0))

  • postgis test :
    select st_astext(st_buffer('MULTIPOLYGON ( ((0 0, 1 4, 1 0,0 0)), ((0 0,1 0,0 1,0 0)) )'::geometry,0))
    output : POLYGON((0 0,0 1,0.2 0.8,1 4,1 0,0 0))

st_isvalid difference

in postgis:

select st_isvalid('POINT (30)');
select st_isvalid('POINT (,)');
select st_isvalid('POINT (a b)');
select st_isvalid('MULTIPOINT ()');
select st_isvalid('MULTIPOINT (,)');
select st_isvalid('POINT(1 2 3 4 5 6 7)');
select st_isvalid('LINESTRING(1 1)');
select st_isvalid('MULTIPOINT(1 1, 2 2');

all of these return an ERROR when executed in psql

in arctern, all of them return false
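One way to reconcile the two behaviors is to distinguish "the WKT failed to parse" from "the geometry parsed but is invalid", instead of folding both into false. A library-agnostic sketch of that distinction, using a deliberately naive grammar for 2D POINT only (the regex and the return labels are illustrative assumptions, not arctern's implementation):

```python
import re

# Naive grammar for a 2D POINT; a real parser covers the full WKT spec.
_POINT_RE = re.compile(r"^POINT \(-?\d+(\.\d+)? -?\d+(\.\d+)?\)$")

def classify_point_wkt(wkt):
    """'parse-error' for malformed text, 'ok' for a well-formed 2D POINT."""
    return "ok" if _POINT_RE.match(wkt) else "parse-error"

classify_point_wkt("POINT (1 2)")   # 'ok'
classify_point_wkt("POINT (30)")    # 'parse-error' -- postgis raises here
classify_point_wkt("POINT (a b)")   # 'parse-error'
```

With this split, malformed input could raise (as postgis does) while genuinely invalid geometries such as self-intersecting polygons return false.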

st_envelope_udf different result with postgis

in our st_envelope_udf function, the result for 'POLYGON EMPTY' is 'POINT (0 0)'

actually, the results differ from postgis for the envelopes of all empty geometry types

postgis:
select st_astext(st_envelope('POLYGON EMPTY'::geometry));
result:
st_astext

POLYGON EMPTY

Some geometry cases that are not valid

My test code :

from osgeo import ogr

p0 =ogr.CreateGeometryFromWkt('POINT (1 8)')
p1 =ogr.CreateGeometryFromWkt('MULTIPOINT (1 1,3 4)')
p2 =ogr.CreateGeometryFromWkt('LINESTRING (1 1,1 2,2 3)')
p3 =ogr.CreateGeometryFromWkt('MULTILINESTRING ((1 1,1 2),(2 4,1 9,1 8))' )
p4 =ogr.CreateGeometryFromWkt('MULTILINESTRING ((1 1,3 4))')
p5 =ogr.CreateGeometryFromWkt('POLYGON ((1 1,1 2,2 2,2 1,1 1))')
p6 =ogr.CreateGeometryFromWkt('POLYGON ((1 1,1 2,2 2,2 1,1 1)),((0 0,1 -1,3 4,-2 3,0 0))') 
p7 =ogr.CreateGeometryFromWkt('POLYGON ((1 1,1 2,2 2,2 1,1 1),(0 0,1 -1,3 4,-2 3,0 0))')
p8 =ogr.CreateGeometryFromWkt('MULTIPOLYGON (((1 1,1 2,2 2,2 1,1 1)),((0 0,1 -1,3 4,-2 3,0 0)) )')
p9 =ogr.CreateGeometryFromWkt('POINT EMPTY')
p10=ogr.CreateGeometryFromWkt('LINESTRING EMPTY')
p11=ogr.CreateGeometryFromWkt('POLYGON EMPTY')
p12=ogr.CreateGeometryFromWkt('MULTIPOINT EMPTY')
p13=ogr.CreateGeometryFromWkt('MULTILINESTRING EMPTY')
p14=ogr.CreateGeometryFromWkt('MULTIPOLYGON EMPTY')
p15=ogr.CreateGeometryFromWkt('GEOMETRYCOLLECTION EMPTY')
p16=ogr.CreateGeometryFromWkt('CIRCULARSTRING (0 2, -1 1,0 0, 0.5 0, 1 0, 2 1, 1 2, 0.5 2, 0 2)')
p17=ogr.CreateGeometryFromWkt('COMPOUNDCURVE(CIRCULARSTRING(0 2, -1 1,0 0),(0 0, 0.5 0, 1 0),CIRCULARSTRING( 1 0, 2 1, 1 2),(1 2, 0.5 2, 0 2))')
p18=ogr.CreateGeometryFromWkt('GEOMETRYCOLLECTION ( LINESTRING ( 90 190, 120 190, 50 60, 130 10, 190 50, 160 90, 10 150, 90 190 ), POINT(90 190) ) ')
p19=ogr.CreateGeometryFromWkt('MULTICURVE ((5 5, 3 5, 3 3, 0 3), CIRCULARSTRING (0 0, 0.2 1, 0.5 1.4), COMPOUNDCURVE (CIRCULARSTRING (0 0,1 1,1 0),(1 0,0 1)))')
p20=ogr.CreateGeometryFromWkt('CURVEPOLYGON(CIRCULARSTRING(0 0, 4 0, 4 4, 0 4, 0 0),(1 1, 3 3, 3 1, 1 1))')
p21=ogr.CreateGeometryFromWkt('CURVEPOLYGON(COMPOUNDCURVE(CIRCULARSTRING(0 0,2 0, 2 1, 2 3, 4 3),(4 3, 4 5, 1 4, 0 0)), CIRCULARSTRING(1.7 1, 1.4 0.4, 1.6 0.4, 1.6 0.5, 1.7 1) )')
p22=ogr.CreateGeometryFromWkt('MULTISURFACE(CURVEPOLYGON(CIRCULARSTRING(0 0, 4 0, 4 4, 0 4, 0 0),(1 1, 3 3, 3 1, 1 1)),((10 10, 14 12, 11 10, 10 10),(11 11, 11.5 11, 11 11.5, 11 11)))')
p23=ogr.CreateGeometryFromWkt('MULTISURFACE Z (CURVEPOLYGON Z (CIRCULARSTRING Z (-2 0 0, -1 -1 1, 0 0 2, 1 -1 3, 2 0 4, 0 2 2, -2 0 0), (-1 0 1, 0 0.5 2, 1 0 3, 0 1 3, -1 0 1)), ((7 8 7, 10 10 5, 6 14 3, 4 11 4, 7 8 7)))')
p24=ogr.CreateGeometryFromWkt('MULTISURFACE (CURVEPOLYGON (CIRCULARSTRING (-2 0, -1 -1, 0 0, 1 -1, 2 0, 0 2, -2 0), (-1 0, 0 0.5, 1 0, 0 1, -1 0)), ((7 8, 10 10, 6 14, 4 11, 7 8)))')
p25=ogr.CreateGeometryFromWkt('POLYHEDRALSURFACE (((0 0,0 0,0 1,0 0)),((0 0,0 1,1 0,0 0)),((0 0,1 0,0 0,0 0)),((1 0,0 1,0 0,1 0)))')
p26=ogr.CreateGeometryFromWkt('TRIANGLE ((1 2,4 5,7 8,1 2))')
p27=ogr.CreateGeometryFromWkt('TIN ( ((0 0, 0 0, 0 1, 0 0)), ((0 0, 0 1, 1 1, 0 0)) )')

isValid = [p.IsValid() for p in
           (p0, p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11, p12, p13, p14,
            p15, p16, p17, p18, p19, p20, p21, p22, p23, p24, p25, p26, p27)]

test result:

All geometries return True except p7, p8, p22, p25, p26, and p27, which return False.

ST_Union_Aggr_UDF error

ST_Union_Aggr_UDF throws an exception when a multipolygon is combined with other geometries.

arctern test code :

def run_st_union(spark):
    register_funcs(spark)
    test_data1 = []
    test_data1.extend([('MULTIPOINT (1 1,3 4)',)])
    test_data1.extend([('LINESTRING (1 1,1 2,2 3)',)]) 
    test_data1.extend([('MULTILINESTRING ((1 1,1 2),(2 4,1 9,1 8))',)])
    test_data1.extend([('MULTILINESTRING ((1 1,3 4))',)])
    test_data1.extend([('POLYGON ((1 1,1 2,2 2,2 1,1 1))',)])
    test_data1.extend([('MULTIPOLYGON ( ((1 1,1 2,2 2,2 1,1 1)),((0 0,1 -1,3 4,-2 3,0 0)) )',)]) # topologyEX
    union_aggr_df1 = spark.createDataFrame(data=test_data1, schema=['geos']).cache()
    union_aggr_df1.createOrReplaceTempView("union_aggr1")
    rs = spark.sql("select ST_Union_Aggr_UDF(geos) from union_aggr1").show(100,0) 

postgis sql :

drop table if exists test_union;
create table test_union (geos geometry);
insert into test_union values 
('MULTIPOINT (1 1,3 4)'),
('LINESTRING (1 1,1 2,2 3)'),
('MULTILINESTRING ((1 1,1 2),(2 4,1 9,1 8))'), 
('MULTILINESTRING ((1 1,3 4))'),
('POLYGON ((1 1,1 2,2 2,2 1,1 1))'),
('MULTIPOLYGON (((1 1,1 2,2 2,2 1,1 1)),((0 0,1 -1,3 4,-2 3,0 0)) )')
;
select st_astext(st_union(geos)) from test_union;

arctern result :

ERROR 1: TopologyException: Input geom 1 is invalid: Self-intersection at or near point 1.8 1 at 1.8 1
ERROR 10: Pointer 'hGeom' is NULL in 'OGR_G_ExportToWkt'.

terminate called after throwing an instance of 'std::runtime_error'
  what():  gdal error code = 6

postgis result :

GEOMETRYCOLLECTION(LINESTRING(2 4,1 9,1 8),POLYGON((2 1.5,2 1,1.8 1,1 -1,0 0,-2 3,3 4,2 1.5)))

ST_Contains bug

I got different output when I used a specific wkt as input to the ST_Contains function (compared to geospark).


  • geospark test :
    spark.sql("SELECT ST_Contains( ST_GeomFromWKT('POLYGON ((0 0,4 0,4 4,0 4,0 0))') , ST_GeomFromWKT('POINT (4 0)') )").show(false)
    output : true

  • GIS test :
    wkt_arrow_array1 = { POLYGON ((0 0,4 0,4 4,0 4,0 0))}
    wkt_arrow_array2 = { POINT (4 0)}
    zilliz::gis::ST_Contains(wkt_arrow_array1,wkt_arrow_array2)
    output : false

  • postgis test :
    select st_contains('POLYGON ((0 0,4 0,4 4,0 4,0 0))'::geometry,'POINT (4 0)'::geometry);
    output : false
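arctern and postgis agree here because, under OGC semantics, ST_Contains requires the interiors of the two geometries to intersect, so a point lying exactly on the polygon's boundary is not contained; geospark's true is the deviation. A sketch of a point-in-polygon test with that boundary exclusion (ray casting; the function name and tolerance are assumptions of this sketch):

```python
def point_in_polygon_interior(px, py, ring):
    """True only if (px, py) is strictly inside the closed ring.
    Boundary points return False, matching ST_Contains semantics."""
    n = len(ring)
    inside = False
    for i in range(n):
        ax, ay = ring[i]
        bx, by = ring[(i + 1) % n]
        # On-edge check: collinear and within the segment's bounding box.
        cross = (bx - ax) * (py - ay) - (by - ay) * (px - ax)
        if (abs(cross) < 1e-12
                and min(ax, bx) <= px <= max(ax, bx)
                and min(ay, by) <= py <= max(ay, by)):
            return False
        # Ray casting: toggle for each edge crossed by a ray going right.
        if (ay > py) != (by > py):
            x_cross = ax + (py - ay) * (bx - ax) / (by - ay)
            if px < x_cross:
                inside = not inside
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
point_in_polygon_interior(4, 0, square)   # False: (4 0) is a vertex
point_in_polygon_interior(2, 2, square)   # True: strictly inside
```

ST_Covers is the predicate that would return true for the boundary point.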

Conda environment conflicts with the system environment

Describe the bug
The version of libprotobuf is 2.6.1 in the system environment, but it is 3.11.0 in the conda environment. When I execute the unittest, the program reports an error.

Steps/Code to reproduce behavior

[libprotobuf FATAL google/protobuf/stubs/common.cc:87] This program was compiled against version 2.6.1 of the Protocol Buffer runtime library, which is not compatible with the installed version (3.11.0).  Contact the program author for an update.  If you compiled the program yourself, make sure that your headers are from the same version of Protocol Buffers as your link-time library.  (Version verification failed in "/build/mir-O8_xaj/mir-0.26.3+16.04.20170605/obj-x86_64-linux-gnu/src/protobuf/mir_protobuf.pb.cc".)

[2020-02-17T13:26:45.455Z] terminate called after throwing an instance of 'google::protobuf::FatalException'

[2020-02-17T13:26:45.455Z]   what():  This program was compiled against version 2.6.1 of the Protocol Buffer runtime library, which is not compatible with the installed version (3.11.0).  Contact the program author for an update.  If you compiled the program yourself, make sure that your headers are from the same version of Protocol Buffers as your link-time library.  (Version verification failed in "/build/mir-O8_xaj/mir-0.26.3+16.04.20170605/obj-x86_64-linux-gnu/src/protobuf/mir_protobuf.pb.cc".)

Expected behavior

The unittest executes and returns correct results in docker.

Environment details

  • Ubuntu 18.04 x86_64
  • Docker version 19.03.1
  • GIS v0.1.0 GPU build environment Docker image
  • conda branch

ST_IsValid crashes if the input is not a valid geometry

If the input is not a valid geometry, like 'Im not polygon', ST_IsValid will crash and throw an exception with the error message:

unknown file: Failure
C++ exception with description "gdal error code = 3" thrown in the test body.

This is the test code, and it throws the exception:

arrow::StringBuilder string_builder;
std::shared_ptr<arrow::Array> polygons;
string_builder.Append("my is not polygon");
string_builder.Finish(&polygons);
auto valid_mark = ST_IsValid(polygons);

What would happen if C++ throws an exception?

In the CHECK_GDAL macro, we throw an exception of std::runtime_error if gdal returns an error. Would python catch this exception? And what happens in pyspark when the C++ code throws?

Add python wrapper for render engine

The following design issues need to be discussed:

  1. Should we use pyarrow as the interface?
  2. How do we organize vega as part of the interface?
  3. Do we need a map for passing metadata?

st_npoints difference

in postgis:
select st_npoints(st_geomfromtext('POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))'));
select st_npoints(st_geomfromtext('POLYGON ((1 2, 3 4, 5 6, 1 2))'));
select st_npoints(st_geomfromtext('POLYGON ((1 1, 3 1, 3 3, 1 3, 1 1))'));
select st_npoints(st_geomfromtext('MULTIPOINT(0 0, 7 7)'));
select st_npoints(st_geomfromtext('GEOMETRYCOLLECTION(POINT(1 1), LINESTRING( 1 1 , 2 2, 3 3))'));
select st_npoints(st_geomfromtext('POINT EMPTY'));

results
5
4
5
2
4
0

in arctern:
results
0
0
0
0
0
1
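Since st_npoints just counts coordinate tuples, the expected values above can be cross-checked by counting coordinate pairs in the WKT text. A stdlib sketch that handles only the 2D cases shown and treats EMPTY geometries as zero (both assumptions of this illustration):

```python
import re

# One 2D coordinate pair: two numbers separated by whitespace.
_COORD_RE = re.compile(r"-?\d+(?:\.\d+)?\s+-?\d+(?:\.\d+)?")

def npoints(wkt):
    """Count 2D coordinate pairs in a WKT string; EMPTY geometries give 0."""
    return len(_COORD_RE.findall(wkt))

npoints("POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))")  # 5
npoints("MULTIPOINT(0 0, 7 7)")                           # 2
npoints("POINT EMPTY")                                    # 0
```

These counts match the postgis column above, including the closing vertex of each polygon ring being counted.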

st_intersection issue

for the data below:
{"left": "POLYGON ((40 21, 40 22, 40 23, 40 21))", "right": "POLYGON ((2 2, 9 2, 9 9, 2 9, 2 2))"}
{"left": "POINT(1 3)", "right": "LINESTRING (0 0, 10 10)"}
{"left": "POINT(-1 4)", "right": "LINESTRING (0 0, 10 10)"}
{"left": "POINT(10 1)", "right": "LINESTRING (0 0, 10 10)"}
{"left": "POINT(7 9)", "right": "LINESTRING (0 0, 10 10)"}

in arctern:
{"ST_Intersection_UDF(left, right)":"POLYGON EMPTY"}
{"ST_Intersection_UDF(left, right)":"POINT EMPTY"}
{"ST_Intersection_UDF(left, right)":"POINT EMPTY"}
{"ST_Intersection_UDF(left, right)":"POINT EMPTY"}
{"ST_Intersection_UDF(left, right)":"POINT EMPTY"}

in postgis:
GEOMETRYCOLLECTION EMPTY
GEOMETRYCOLLECTION EMPTY
GEOMETRYCOLLECTION EMPTY
GEOMETRYCOLLECTION EMPTY
GEOMETRYCOLLECTION EMPTY

ST_Length bug

I got different output when I used a polygon's wkt as input to the ST_Length function (compared to geospark).


  • geospark test :
    spark.sql("SELECT ST_Length(ST_GeomFromWKT('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))'))").show(1,0)
    output : 4.0

    spark.sql("SELECT ST_Length(ST_GeomFromWKT('MULTIPOLYGON ( ((0 0, 1 4, 1 0,0 0)))'))").show(1,0)
    output : 9.123105625617661

    spark.sql("SELECT ST_Length(ST_GeomFromWKT('MULTIPOLYGON ( ((0 0, 0 4, 4 4, 4 0, 0 0)), ((0 0, 0 1, 4 1, 4 0, 0 0)) )'))").show(1,0)
    output : 26.0


  • GIS test :
    wkt_arrow_array = {POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0)) ,MULTIPOLYGON ( ((0 0, 1 4, 1 0,0 0)) ) , MULTIPOLYGON ( ((0 0, 0 4, 4 4, 4 0, 0 0)), ((0 0, 0 1, 4 1, 4 0, 0 0)) )}
    zilliz::gis::ST_Length(wkt_arrow_array)
    output : 0 , 0 , 0

  • postgis test
    output : 0 , 0 , 0

st_envelope_udf results for empty geometries differ from postgis

our sql:
select st_envelope_udf(geos) as geos from test_envelope
input:
{"geos": "POLYGON EMPTY"}
{"geos": "LINESTRING EMPTY"}
{"geos": "POINT EMPTY"}
{"geos": "MULTIPOLYGON EMPTY"}
{"geos": "MULTILINESTRING EMPTY"}
{"geos": "MULTIPOINT EMPTY"}
{"geos": "GEOMETRYCOLLECTION EMPTY"}

result:
{"geos":"POINT (0 0)"}
{"geos":"POINT (0 0)"}
{"geos":"POINT (0 0)"}
{"geos":"POINT (0 0)"}
{"geos":"POINT (0 0)"}
{"geos":"POINT (0 0)"}
{"geos":"POINT (0 0)"}

in POSTGIS
sqls:
select st_astext(st_envelope('POLYGON EMPTY'::geometry));
select st_astext(st_envelope('LINESTRING EMPTY'::geometry));
select st_astext(st_envelope('POINT EMPTY'::geometry));
select st_astext(st_envelope('MULTIPOLYGON EMPTY'::geometry));
select st_astext(st_envelope('MULTILINESTRING EMPTY'::geometry));
select st_astext(st_envelope('MULTIPOINT EMPTY'::geometry));
select st_astext(st_envelope('GEOMETRYCOLLECTION EMPTY'::geometry));

result:
POLYGON EMPTY
LINESTRING EMPTY
POINT EMPTY
MULTIPOLYGON EMPTY
MULTILINESTRING EMPTY
MULTIPOINT EMPTY
GEOMETRYCOLLECTION EMPTY

conda branch cannot be compiled with multiple threads

I encountered the following problem when compiling with "make -j10":
/GIS/cpp/src/render/utils/my_zlib_compress.h:1:33: fatal error: stb/stb_image_write.h: No such file or directory
However, compiling passes when using plain "make".

check geometry type before call ST_Area and ST_Length

The following warnings are printed when I run the unittest:

[ RUN      ] geometry_test.test_ST_Area
Warning 1: OGR_G_Area() called against non-surface geometry type.
Warning 1: OGR_G_Area() called against non-surface geometry type.
Warning 1: OGR_G_Area() called against non-surface geometry type.
[       OK ] geometry_test.test_ST_Area (1 ms)
[ RUN      ] geometry_test.test_ST_Centroid
[       OK ] geometry_test.test_ST_Centroid (0 ms)
[ RUN      ] geometry_test.test_ST_Length
Warning 1: OGR_G_Length() called against a non-curve geometry type.
Warning 1: OGR_G_Length() called against a non-curve geometry type.
Warning 1: OGR_G_Length() called against a non-curve geometry type.
Warning 1: OGR_G_Length() called against a non-curve geometry type.
Warning 1: OGR_G_Length() called against a non-curve geometry type.
Warning 1: OGR_G_Length() called against a non-curve geometry type.
[       OK ] geometry_test.test_ST_Length (0 ms)

So, I suggest checking the geometry type before calling ST_Area and ST_Length.
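The suggested dispatch can be sketched in pure Python: compute area only for surface types (shoelace formula), length only for curve types, and return 0 otherwise, which is also what postgis does. The type tags and the single-ring simplification are assumptions of this sketch, not arctern's API:

```python
import math

def ring_area(ring):
    """Shoelace area of a 2D ring given as [(x, y), ...] without the
    repeated closing vertex (the zip below closes it)."""
    s = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

def line_length(coords):
    """Sum of segment lengths of a 2D linestring."""
    return sum(math.dist(a, b) for a, b in zip(coords, coords[1:]))

def st_area(geom_type, coords):
    # Only surface types have area; curves and points give 0.
    return ring_area(coords) if geom_type == "POLYGON" else 0.0

def st_length(geom_type, coords):
    # Only curve types have length; surfaces and points give 0.
    return line_length(coords) if geom_type == "LINESTRING" else 0.0

square = [(1, 1), (1, 2), (2, 2), (2, 1)]
st_area("POLYGON", square)                          # 1.0
st_length("POLYGON", square)                        # 0.0
st_length("LINESTRING", [(1, 1), (1, 2), (2, 3)])   # 1 + sqrt(2)
```

With the type gate in front, the gdal calls never see a geometry of the wrong dimension, and the warnings above disappear.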

st_area_udf of a linestring should be 0

sql: select st_area_udf(geos) as my_area from test_area

data: {"geos": "LINESTRING (77.29 29.07,77.42 29.26,77.27 29.31,77.29 29.07)"}

result: {"my_area":0.01750000000000007}

expected: 0.0

I guess that in this case the linestring was treated as a polygon.

Add pod tolerations to Jenkins slave pods

Describe the solution you'd like
Add pod tolerations to Jenkins slave pods

st_issimple difference

postgis
SELECT ST_isSimple('POLYGON ((1 2, 3 4, 5 6, 1 2))'::geometry);
result:
t (means true)

in arctern, the result is false

select st_isvalid_udf(null) raise exception

sql:
select st_isvalid_udf(null)

This raises an exception; however, GeoSpark does not raise an exception for the same input.

log:

org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/bin/spark/python/lib/pyspark.zip/pyspark/worker.py", line 577, in main
    eval_type = read_int(infile)
  File "/usr/local/bin/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 837, in read_int
    raise EOFError
EOFError

        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:484)
        at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:99)
        at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:49)
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:437)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:726)
        at org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anon$1.hasNext(InMemoryRelation.scala:132)
        at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
        at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
        at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1370)
        at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1297)
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1361)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1185)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:360)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:311)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:127)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:441)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:444)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.UnsupportedOperationException: Unsupported data type: null

ST_Centroid bug

I got different output when I used a specific wkt as input to the ST_Centroid function (compared to geospark).


  • geospark test :
    spark.sql("SELECT ST_Centroid(ST_GeomFromWKT('MULTIPOLYGON ( ((0 0, 1 4, 1 0,0 0)), ((0 0,1 0,0 1,0 0)) )'))").show(1,0)
    output : POINT (0.7777777777777778 1.6666666666666667)

  • GIS test :
    wkt_arrow_array = {MULTIPOLYGON ( ((0 0, 1 4, 1 0,0 0)), ((0 0,1 0,0 1,0 0)) )}
    zilliz::gis::ST_Centroid(wkt_arrow_array)
    output : POINT (0.6 1.13333333333333)

  • postgis test :
    select st_astext(st_centroid('MULTIPOLYGON ( ((0 0, 1 4, 1 0,0 0)), ((0 0,1 0,0 1,0 0)) )'::geometry));
    output : POINT(0.6 1.13333333333333)

ST_Area bug

I got different output when I used a specific wkt as input to the ST_Area function (compared to geospark).


  • geospark test :
    spark.sql("SELECT ST_Area(ST_GeomFromWKT('LINESTRING (0 0, 1 0, 1 1, 0 0)'))").show(1,0)
    output : 0

    spark.sql("SELECT ST_Area(ST_GeomFromWKT('MULTIPOLYGON ( ((0 0, 1 4, 1 0,0 0)), ((0 0,1 0,0 1,0 0)) ) '))").show(1,0)
    output : 1.5


  • GIS test :
    wkt_arrow_array = {LINESTRING (0 0, 1 0, 1 1, 0 0) , MULTIPOLYGON ( ((0 0, 1 4, 1 0,0 0)), ((0 0,1 0,0 1,0 0)) ) }
    zilliz::gis::ST_Area(wkt_arrow_array)
    output : 0.5 ,2.5
