mapoflife / fossa
This project was forked from robinkraft/fossa.
Chewing through GBIF species occurrence data like lemurs. The fossa (Cryptoprocta ferox) is a cat-like carnivorous mammal from Madagascar.
Need to know the tuple size limit. Some of our names will create tuples over that limit, so we'll need to detect that limit and run a workaround.
Let's hook in our cartodb-clj lib so that we can push the INSERT lines from #4 directly to CartoDB. We can do it during the query or from the resulting text files.
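The HTTP side of that push might look like the following Python sketch (the real hook would go through cartodb-clj; the account name and key below are placeholders, and the statement is interpolated as-is):

```python
from urllib.parse import urlencode

def sql_api_request(account, api_key, statement):
    """Build the URL and form body for a CartoDB SQL API call.
    POSTing `body` to `url` runs `statement` against the account's tables."""
    url = "https://%s.cartodb.com/api/v2/sql" % account
    body = urlencode({"q": statement, "api_key": api_key})
    return url, body
```

Batching several INSERT lines per request (semicolon-separated in `q`) would cut round trips when loading from the text files.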
Looks like some of the latlons in the raw GBIF data are non-numeric. I'd like to filter out bad values for now, but we should also look at the incoming data to see why we're seeing values like 0:0 for latitude.
cascading.pipe.OperatorException: [1e1c1d25-6153-4f9e-840...][sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)] operator Each failed executing operation
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:94)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:38)
at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:60)
at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:33)
at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:60)
at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:33)
at cascading.flow.stream.FunctionEachStage$1.collect(FunctionEachStage.java:67)
at cascading.tuple.TupleEntryCollector.safeCollect(TupleEntryCollector.java:93)
at cascading.tuple.TupleEntryCollector.add(TupleEntryCollector.java:86)
at cascading.operation.Identity.operate(Identity.java:110)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:86)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:38)
at cascading.flow.stream.FunctionEachStage$1.collect(FunctionEachStage.java:67)
at cascading.tuple.TupleEntryCollector.safeCollect(TupleEntryCollector.java:93)
at cascading.tuple.TupleEntryCollector.add(TupleEntryCollector.java:86)
at cascading.operation.Identity.operate(Identity.java:110)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:86)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:38)
at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:60)
at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:33)
at cascading.flow.stream.FunctionEachStage$1.collect(FunctionEachStage.java:67)
at cascading.tuple.TupleEntryCollector.safeCollect(TupleEntryCollector.java:93)
at cascading.tuple.TupleEntryCollector.add(TupleEntryCollector.java:86)
at cascalog.ClojureMap.operate(ClojureMap.java:35)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:86)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:38)
at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:60)
at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:33)
at cascading.flow.stream.FunctionEachStage$1.collect(FunctionEachStage.java:67)
at cascading.tuple.TupleEntryCollector.safeCollect(TupleEntryCollector.java:93)
at cascading.tuple.TupleEntryCollector.add(TupleEntryCollector.java:86)
at cascading.operation.Identity.operate(Identity.java:110)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:86)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:38)
at cascading.flow.stream.SourceStage.map(SourceStage.java:102)
at cascading.flow.stream.SourceStage.run(SourceStage.java:58)
at cascading.flow.hadoop.FlowMapper.run(FlowMapper.java:124)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:441)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.RuntimeException: java.lang.NumberFormatException: Invalid number: 0:0
at cascalog.ClojureCascadingBase.applyFunction(ClojureCascadingBase.java:71)
at cascalog.ClojureMap.operate(ClojureMap.java:34)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:86)
... 43 more
Caused by: java.lang.NumberFormatException: Invalid number: 0:0
at clojure.lang.LispReader.readNumber(LispReader.java:253)
at clojure.lang.LispReader.read(LispReader.java:171)
at clojure.lang.RT.readString(RT.java:1707)
at clojure.core$read_string.invoke(core.clj:3361)
at clojure.lang.Var.invoke(Var.java:415)
at clojure.lang.AFn.applyToHelper(AFn.java:161)
at clojure.lang.Var.applyTo(Var.java:532)
at cascalog.ClojureCascadingBase.applyFunction(ClojureCascadingBase.java:68)
... 45 more
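A minimal filter along these lines (a Python sketch; the project's actual predicate would be a Clojure op in the Cascalog query) would reject values like 0:0 before any numeric parsing:

```python
def valid_latlon(lat, lon):
    """True when both strings parse as floats and are on the map;
    non-numeric values such as "0:0" are rejected."""
    try:
        return -90.0 <= float(lat) <= 90.0 and -180.0 <= float(lon) <= 180.0
    except ValueError:
        return False
```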
I think it's safe, but can we blast all feature branches other than feature/fix-4?
In the output, we want only unique names. Then for each unique name, we want only unique points in the MULTIPOINT.
Write and run a Cascalog query against GBIF that outputs a textline of unique name counts. For example:
passer domesticus 1500000
puma concolor 23000
...
See #24 for context.
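The shape of that output, sketched in Python over tab-separated rows (the actual job is a Cascalog query, and the name column index below is an assumption):

```python
from collections import Counter

def name_counts(rows, name_col=0):
    """Emit "name count" lines, most frequent first, from raw text rows."""
    counts = Counter(row.split("\t")[name_col] for row in rows)
    return ["%s %d" % pair for pair in counts.most_common()]
```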
Per a request from Rob and Walter, filter out eBird records via the dataresourceid column. Pinging GBIF now for the eBird dataresourceid.
...
Given a coordinate with a latitude of 3.14, keep it as 3.14 instead of 3.1400000.
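String-level trimming avoids float round-tripping entirely; a Python sketch (the project handles this in Clojure, cf. handle-zeros in fossa.utils, so this is just an illustration):

```python
def trim_trailing_zeros(coord):
    """Trim padding zeros: "3.1400000" -> "3.14"; "3.14" and "100" pass through."""
    if "." not in coord:
        return coord
    return coord.rstrip("0").rstrip(".")
```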
Basically for each unique name we'll store a MULTIPOINT of all unique points. We'll also store an array of OccurrenceID strings, one per point. For points with multiple IDs, the value will be a list of CSV IDs. The max points to test is 2 million.
Here's how to create the table on CartoDB:
SELECT AddGeometryColumn('points', 'the_geom_multipoint', 4326, 'MULTIPOINT', 2)
ALTER TABLE points ADD COLUMN occids text[]
Then we need to load in 2 million points like this:
INSERT INTO points (name, occids, the_geom_multipoint) values ('testname', '{"1","10,11,12,13"}', st_geomfromtext('MULTIPOINT ((0.896666666667 9.93166666667), (19.583334 47.166668))', 4326))
And finally test the performance of this query:
SELECT (ST_DumpPoints(ST_Transform(t.the_geom_multipoint,3857))).geom as the_geom_webmercator, unnest(t.occids) from gbif_points_test as t WHERE t.name = 'testname'
If the performance isn't great, Vizz thinks we might consider unpacking points to a new table once they are uploaded.
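The point-dedup and CSV-ID scheme above might look like this in Python (a sketch; the record layout and WKT formatting are assumptions):

```python
from collections import OrderedDict

def build_multipoint(records):
    """records: (lon, lat, occurrence_id) string triples for one name.
    Duplicate points collapse into one WKT point whose occids entry is a
    comma-separated list of IDs, matching the INSERT example above."""
    by_point = OrderedDict()
    for lon, lat, occid in records:
        by_point.setdefault((lon, lat), []).append(occid)
    occids = [",".join(ids) for ids in by_point.values()]
    wkt = "MULTIPOINT (%s)" % ", ".join("(%s %s)" % pt for pt in by_point)
    return occids, wkt
```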
These latlon pairs are problems:
-14.21 -6.77973E+12
10716879872 -84983709696
A representative stacktrace for the second case follows. Fortunately there are only maybe 250 of these. But they aren't handled cleanly at the moment.
There are a few classes of bad data we need to handle. First, the lat or lon could be missing entirely (i.e. "\N" or "N"). Second, they could be valid numbers but have type-conversion and formatting issues (e.g. for very large numbers). Third, they could simply be outside the valid latlon ranges (i.e. off the map).
Latlons should probably just be handled totally separately from other fields.
Caused by: java.lang.NumberFormatException: For input string: "-6779730000000"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
at java.lang.Integer.parseInt(Integer.java:461)
at java.lang.Integer.parseInt(Integer.java:499)
at fossa.utils$handle_zeros.invoke(utils.clj:157)
at fossa.utils$round_to.invoke(utils.clj:168)
at clojure.lang.AFn.applyToHelper(AFn.java:163)
at clojure.lang.AFn.applyTo(AFn.java:151)
at clojure.core$apply.invoke(core.clj:603)
at clojure.core$partial$fn__444.doInvoke(core.clj:2343)
at clojure.lang.RestFn.invoke(RestFn.java:408)
at clojure.core$map$fn__465.invoke(core.clj:2432)
at clojure.lang.LazySeq.sval(LazySeq.java:42)
at clojure.lang.LazySeq.seq(LazySeq.java:60)
at clojure.lang.RT.seq(RT.java:473)
at clojure.core$seq.invoke(core.clj:133)
at clojure.core$concat$fn__106.invoke(core.clj:662)
at clojure.lang.LazySeq.sval(LazySeq.java:42)
at clojure.lang.LazySeq.seq(LazySeq.java:60)
at clojure.lang.LazySeq.toArray(LazySeq.java:140)
at cascalog.Util.coerceToTuple(Util.java:116)
at cascalog.ClojureMap.operate(ClojureMap.java:35)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:86)
... 49 more
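The three cases might be separated like this (a Python sketch; the real handling would be Clojure predicates, and the "missing" markers are taken from the \N / N values above):

```python
def classify_latlon(lat, lon):
    """Bucket a raw GBIF lat/lon string pair into one of the failure
    modes described above, or "ok" if it is usable."""
    if lat in ("\\N", "N") or lon in ("\\N", "N"):
        return "missing"
    try:
        flat, flon = float(lat), float(lon)  # "-6.77973E+12" parses fine here
    except ValueError:
        return "invalid"                     # e.g. "0:0"
    if not (-90.0 <= flat <= 90.0 and -180.0 <= flon <= 180.0):
        return "out-of-range"                # e.g. 10716879872 -84983709696
    return "ok"
```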
Again, just a space saver.
To save space, encode seasonality with integer strings (e.g., "0") instead of "N summer", etc.
Create our deployment infrastructure. Let's ride on lein-emr for this.
We basically want to output INSERT statement lines from the Cascalog query.
For a result vector:
[["Acidobacteria"
["242135095" "244666043"]
"MULTIPOINT ((9.93166666667 0.896666666667), (15.509722 73.88306))"]]
We would get:
INSERT INTO gbif_points (name, occids, the_geom_multipoint) values ('Acidobacteria', '{"242135095", "244666043"}', ST_GeomFromText('MULTIPOINT ((9.93166666667 0.896666666667), (15.509722 73.88306))', 4326))
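A Python sketch of that transformation (values are interpolated without SQL escaping, so a name containing a quote would break the statement; a real version should escape or parameterize):

```python
def row_to_insert(name, occids, multipoint, table="gbif_points"):
    """Render one result row as an INSERT statement like the example above."""
    occid_array = "{%s}" % ", ".join('"%s"' % o for o in occids)
    return ("INSERT INTO %s (name, occids, the_geom_multipoint) "
            "values ('%s', '%s', ST_GeomFromText('%s', 4326))"
            % (table, name, occid_array, multipoint))
```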
If year, month, or precision is non-numeric, replace the value with empty string.
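A small cleanup helper along those lines (Python sketch):

```python
def numeric_or_empty(value):
    """Keep the value if it parses as a number, else return ""."""
    try:
        float(value)
        return value
    except ValueError:
        return ""
```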
June-Aug = "summer"
Sep-Nov = "fall"
Dec-Feb = "winter"
Mar-May = "spring"
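Combining this mapping with the integer-string encoding above (the 0-3 codes below are an assumption; the issue doesn't pin them down):

```python
# Assumed codes; the issue asks for integer strings but doesn't fix them.
SEASON_CODES = {"winter": "0", "spring": "1", "summer": "2", "fall": "3"}

MONTH_SEASONS = {12: "winter", 1: "winter", 2: "winter",
                 3: "spring", 4: "spring", 5: "spring",
                 6: "summer", 7: "summer", 8: "summer",
                 9: "fall", 10: "fall", 11: "fall"}

def month_to_season_code(month):
    """Map a month number (1-12) to its encoded season string."""
    return SEASON_CODES[MONTH_SEASONS[month]]
```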