Comments (14)
Robin/Aaron --- Let's get to the bottom of the non-numeric data before
throwing it out. Aaron has some scripts to do data cleaning on some weird
entries that might be part of the problem. Good?
On Fri, Nov 9, 2012 at 8:05 AM, Robin Kraft [email protected]:
Looks like some of the latlons in the raw GBIF data are non-numeric. I'd
like to filter out bad values for now, but we should also look at the
incoming data to see why we're seeing values like 0:0 for latitude.
cascading.pipe.OperatorException: [1e1c1d25-6153-4f9e-840...][sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)] operator Each failed executing operation
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:94)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:38)
at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:60)
at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:33)
at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:60)
at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:33)
at cascading.flow.stream.FunctionEachStage$1.collect(FunctionEachStage.java:67)
at cascading.tuple.TupleEntryCollector.safeCollect(TupleEntryCollector.java:93)
at cascading.tuple.TupleEntryCollector.add(TupleEntryCollector.java:86)
at cascading.operation.Identity.operate(Identity.java:110)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:86)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:38)
at cascading.flow.stream.FunctionEachStage$1.collect(FunctionEachStage.java:67)
at cascading.tuple.TupleEntryCollector.safeCollect(TupleEntryCollector.java:93)
at cascading.tuple.TupleEntryCollector.add(TupleEntryCollector.java:86)
at cascading.operation.Identity.operate(Identity.java:110)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:86)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:38)
at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:60)
at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:33)
at cascading.flow.stream.FunctionEachStage$1.collect(FunctionEachStage.java:67)
at cascading.tuple.TupleEntryCollector.safeCollect(TupleEntryCollector.java:93)
at cascading.tuple.TupleEntryCollector.add(TupleEntryCollector.java:86)
at cascalog.ClojureMap.operate(ClojureMap.java:35)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:86)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:38)
at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:60)
at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:33)
at cascading.flow.stream.FunctionEachStage$1.collect(FunctionEachStage.java:67)
at cascading.tuple.TupleEntryCollector.safeCollect(TupleEntryCollector.java:93)
at cascading.tuple.TupleEntryCollector.add(TupleEntryCollector.java:86)
at cascading.operation.Identity.operate(Identity.java:110)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:86)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:38)
at cascading.flow.stream.SourceStage.map(SourceStage.java:102)
at cascading.flow.stream.SourceStage.run(SourceStage.java:58)
at cascading.flow.hadoop.FlowMapper.run(FlowMapper.java:124)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:441)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.RuntimeException: java.lang.NumberFormatException: Invalid number: 0:0
at cascalog.ClojureCascadingBase.applyFunction(ClojureCascadingBase.java:71)
at cascalog.ClojureMap.operate(ClojureMap.java:34)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:86)
... 43 more
Caused by: java.lang.NumberFormatException: Invalid number: 0:0
at clojure.lang.LispReader.readNumber(LispReader.java:253)
at clojure.lang.LispReader.read(LispReader.java:171)
at clojure.lang.RT.readString(RT.java:1707)
at clojure.core$read_string.invoke(core.clj:3361)
at clojure.lang.Var.invoke(Var.java:415)
at clojure.lang.AFn.applyToHelper(AFn.java:161)
at clojure.lang.Var.applyTo(Var.java:532)
at cascalog.ClojureCascadingBase.applyFunction(ClojureCascadingBase.java:68)
... 45 more
Reply to this email directly or view it on GitHub: https://github.com//issues/11.
from fossa.
Huh, so what other strange values are we seeing aside from 0:0?
from fossa.
Here are a few:
22?
44d
37.5023?N
65d
52?
from fossa.
Turns out they're just formatted degrees minutes seconds instead of decimal degrees.
5°52.5'N
7°14'35"N
33d 18m s N
40d 30m s N
This gist includes about a hundred samples.
from fossa.
What do the ?latitudeinterpreted and ?latitudeinterpreted values look like for these records? Same?
from fossa.
OK, just had a quick call with Robin.
Let's delete this line (https://github.com/MapofLife/fossa/blob/develop/src/clj/fossa/core.clj#L33) and then replace the latlon-valid? implementation with this one (note we're converting lat/lon to numbers):
(defn latlon-valid?
  "Return true if lat and lon are valid, otherwise return false."
  [lat lon]
  (try
    (let [{:keys [lat-min lat-max lon-min lon-max]} latlon-range
          [lat lon] (map read-string [lat lon])]
      (and (<= lat lat-max)
           (>= lat lat-min)
           (<= lon lon-max)
           (>= lon lon-min)))
    (catch Exception e false)))
This will filter out invalid lat/lon values without throwing exceptions. Gold?
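For comparison, here is a minimal sketch of the same check that avoids read-string. It is not the code that went into the pull request. The bounds in latlon-range are an assumption (the thread references latlon-range but never shows its value), and parse-coord is a hypothetical helper. The regex guard is there because Double/parseDouble alone accepts Java float suffixes, so a value like 44d would otherwise parse as 44.0.

```clojure
;; Assumed bounds; the real latlon-range is not shown in this thread.
(def latlon-range
  {:lat-min -90.0 :lat-max 90.0 :lon-min -180.0 :lon-max 180.0})

(defn parse-coord
  "Hypothetical helper: return s as a double when it is a plain decimal
  number (optional sign, digits, optional fraction), otherwise nil."
  [s]
  (when (string? s)
    (let [t (.trim ^String s)]
      (when (re-matches #"[+-]?\d+(\.\d+)?" t)
        (Double/parseDouble t)))))

(defn latlon-valid?
  "Return true if lat and lon are decimal strings within latlon-range,
  otherwise return false. Never throws on malformed input."
  [lat lon]
  (let [{:keys [lat-min lat-max lon-min lon-max]} latlon-range
        lat (parse-coord lat)
        lon (parse-coord lon)]
    (boolean (and lat lon
                  (<= lat-min lat lat-max)
                  (<= lon-min lon lon-max)))))

;; Usage: values like those reported above are rejected rather than thrown on.
(latlon-valid? "37.5023" "-122.1")  ; well-formed decimal degrees
(latlon-valid? "0:0" "10")          ; malformed latitude
(latlon-valid? "44d" "10")          ; Java float suffix, rejected by the regex
```

Parsing with a regex plus Double/parseDouble also sidesteps the Clojure reader entirely, which matters if the input is untrusted.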
from fossa.
FYI the above mods are proposed for pull request #10.
from fossa.
How many records are in the wrong format, total? If significant, could we
convert to decimal degrees instead of losing perfectly valid (but
improperly formatted) data?
from fossa.
Out of 75+ million or so records, I saw ~53k problem records. So less than 0.1% of observations. If they're distributed evenly throughout the whole dataset, I'd expect us to drop about 250k records.
from fossa.
I'm assuming 375m records. I only got through 75 million before I stopped.
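The extrapolation can be checked with a few lines of arithmetic. This sketch uses only the figures quoted in the thread (about 53k problem records in the first 75 million scanned, and an assumed total of 375 million) and assumes the bad records are spread evenly; the var names are made up for illustration.

```clojure
;; Figures quoted in the thread; all are approximate.
(def problem-records 53000)
(def scanned-records 75000000)
(def assumed-total   375000000)

;; Fraction of scanned records with a non-numeric lat/lon (about 0.0007, i.e. roughly 0.07%).
(def problem-rate (/ problem-records (double scanned-records)))

;; Expected drops if that rate holds across the assumed total (about 265,000).
(def expected-drops (* problem-rate assumed-total))
```

That lands in the same ballpark as the ~250k estimate; a larger total scales the number proportionally.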
from fossa.
There are about 450 million. Robin, can you look around for a Java library for converting these coordinate representations to decimal degrees? I'll look too.
from fossa.
Pretty small percentage, and the records vary in representation and
encoding, so conversion might be a bit costly for all cases. Might be
worth it to do conversion of the two most common cases and let the others
drop.
-r
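As a rough illustration of what converting the common cases could look like, here is a sketch that handles strings shaped like two of the samples quoted earlier in the thread (5°52.5'N and 7°14'35"N). This is only an assumption about which formats are the most common ones; the pattern, the name dms->decimal, and the decision to return nil for anything else (including the "33d 18m s N" style) are all illustrative and not part of fossa.

```clojure
;; Assumed shape: degrees with °, optional minutes with ', optional seconds
;; with ", then a hemisphere letter. Anything else is left unconverted.
(def dms-pattern
  #"\s*(\d+(?:\.\d+)?)°\s*(?:(\d+(?:\.\d+)?)'\s*)?(?:(\d+(?:\.\d+)?)\"\s*)?([NSEW])\s*")

(defn dms->decimal
  "Convert a degrees/minutes/seconds string to decimal degrees.
  Returns nil when s does not match dms-pattern. Does not range-check
  minutes or seconds; a later validity filter is still needed."
  [s]
  (when-let [[_ d m sec hemi] (and (string? s) (re-matches dms-pattern s))]
    (let [parse #(if % (Double/parseDouble %) 0.0)   ; missing part counts as 0
          value (+ (parse d)
                   (/ (parse m) 60.0)
                   (/ (parse sec) 3600.0))]
      ;; South and west are negative in decimal degrees.
      (if (#{"S" "W"} hemi) (- value) value))))

;; Usage with samples from this thread:
(dms->decimal "5°52.5'N")     ; degrees and decimal minutes
(dms->decimal "7°14'35\"N")   ; degrees, minutes, seconds
(dms->decimal "0:0")          ; unrecognized, so nil and the record would drop
```

Records that come back nil would still be dropped by the existing validity filter, so a converter like this could be added later without changing the rest of the flow.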
from fossa.
Yeah, this opens a can of worms for sure. A quick look and I'm not finding good libraries for these conversions. I propose we drop records without valid decimal degrees. Once we have this end-to-end flow working we can come back and make adjustments as needed.
from fossa.
Just chatted with Rob, we're on the same page. Closing issue.
from fossa.