Comments (14)

robgur commented on August 12, 2024

Robin/Aaron: let's get to the bottom of the non-numeric data before
throwing it out. Aaron has some scripts for cleaning weird entries that
might be part of the problem. Good?

On Fri, Nov 9, 2012 at 8:05 AM, Robin Kraft wrote:

Looks like some of the latlons in the raw GBIF data are non-numeric. I'd
like to filter out bad values for now, but we should also look at the
incoming data to see why we're seeing values like 0:0 for latitude.

cascading.pipe.OperatorException: [1e1c1d25-6153-4f9e-840...][sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)] operator Each failed executing operation
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:94)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:38)
at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:60)
at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:33)
at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:60)
at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:33)
at cascading.flow.stream.FunctionEachStage$1.collect(FunctionEachStage.java:67)
at cascading.tuple.TupleEntryCollector.safeCollect(TupleEntryCollector.java:93)
at cascading.tuple.TupleEntryCollector.add(TupleEntryCollector.java:86)
at cascading.operation.Identity.operate(Identity.java:110)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:86)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:38)
at cascading.flow.stream.FunctionEachStage$1.collect(FunctionEachStage.java:67)
at cascading.tuple.TupleEntryCollector.safeCollect(TupleEntryCollector.java:93)
at cascading.tuple.TupleEntryCollector.add(TupleEntryCollector.java:86)
at cascading.operation.Identity.operate(Identity.java:110)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:86)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:38)
at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:60)
at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:33)
at cascading.flow.stream.FunctionEachStage$1.collect(FunctionEachStage.java:67)
at cascading.tuple.TupleEntryCollector.safeCollect(TupleEntryCollector.java:93)
at cascading.tuple.TupleEntryCollector.add(TupleEntryCollector.java:86)
at cascalog.ClojureMap.operate(ClojureMap.java:35)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:86)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:38)
at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:60)
at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:33)
at cascading.flow.stream.FunctionEachStage$1.collect(FunctionEachStage.java:67)
at cascading.tuple.TupleEntryCollector.safeCollect(TupleEntryCollector.java:93)
at cascading.tuple.TupleEntryCollector.add(TupleEntryCollector.java:86)
at cascading.operation.Identity.operate(Identity.java:110)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:86)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:38)
at cascading.flow.stream.SourceStage.map(SourceStage.java:102)
at cascading.flow.stream.SourceStage.run(SourceStage.java:58)
at cascading.flow.hadoop.FlowMapper.run(FlowMapper.java:124)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:441)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.RuntimeException: java.lang.NumberFormatException: Invalid number: 0:0
at cascalog.ClojureCascadingBase.applyFunction(ClojureCascadingBase.java:71)
at cascalog.ClojureMap.operate(ClojureMap.java:34)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:86)
... 43 more
Caused by: java.lang.NumberFormatException: Invalid number: 0:0
at clojure.lang.LispReader.readNumber(LispReader.java:253)
at clojure.lang.LispReader.read(LispReader.java:171)
at clojure.lang.RT.readString(RT.java:1707)
at clojure.core$read_string.invoke(core.clj:3361)
at clojure.lang.Var.invoke(Var.java:415)
at clojure.lang.AFn.applyToHelper(AFn.java:161)
at clojure.lang.Var.applyTo(Var.java:532)
at cascalog.ClojureCascadingBase.applyFunction(ClojureCascadingBase.java:68)
... 45 more


from fossa.

eightysteele commented on August 12, 2024

Huh, so what other strange values are we seeing aside from 0:0?

robinkraft commented on August 12, 2024

Here are a few:

22?
44d
37.5023?N
65d
52?

robinkraft commented on August 12, 2024

Turns out they're just formatted as degrees, minutes, seconds instead of decimal degrees.

5°52.5'N
7°14'35"N
33d 18m s N
40d 30m s N

This gist includes about a hundred samples.
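
For reference, the usual conversion is decimal = degrees + minutes/60 + seconds/3600, negated for the S and W hemispheres. A rough Clojure sketch covering only the formats listed above might look like the following. `dms-pattern` and `dms->decimal` are hypothetical names, not part of fossa, and values written with `?` in place of a unit marker are deliberately left unparsed.

```clojure
;; Hypothetical helper, not part of fossa: handles only the DMS shapes
;; quoted in this thread (degree sign or "d", optional minutes, optional
;; seconds whose number may be missing, then a hemisphere letter).
(def dms-pattern
  ;; \u00B0 is the degree sign
  #"^\s*(\d+(?:\.\d+)?)\s*[\u00B0d]\s*(?:(\d+(?:\.\d+)?)\s*['m]\s*)?(?:(\d+(?:\.\d+)?)?\s*[\"s]\s*)?([NSEW])\s*$")

(defn dms->decimal
  "Convert a degrees/minutes/seconds string to signed decimal degrees.
  Returns nil when the string does not match dms-pattern."
  [s]
  (when-let [[_ deg mins secs hemi] (and s (re-matches dms-pattern s))]
    (let [parse #(if % (Double/parseDouble %) 0.0) ; missing part counts as 0
          value (+ (parse deg)
                   (/ (parse mins) 60.0)
                   (/ (parse secs) 3600.0))]
      ;; South and West are negative by convention
      (if (#{"S" "W"} hemi) (- value) value))))
```

With this sketch, 5°52.5'N comes out as 5.875, and anything that does not match returns nil so it can still be dropped.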

eightysteele commented on August 12, 2024

What do the ?latitudeinterpreted and ?longitudeinterpreted values look like for these records? Same?

eightysteele commented on August 12, 2024

OK, just had a quick call with Robin.

Let's delete this line (https://github.com/MapofLife/fossa/blob/develop/src/clj/fossa/core.clj#L33) and then replace the latlon-valid? implementation with this one (note we're converting lat/lon to numbers):

(defn latlon-valid?
  "Return true if lat and lon are valid, otherwise return false."
  [lat lon]
  (try
    (let [{:keys [lat-min lat-max lon-min lon-max]} latlon-range
          [lat lon] (map read-string [lat lon])]
      (and (<= lat lat-max)
           (>= lat lat-min)
           (<= lon lon-max)
           (>= lon lon-min)))
    (catch Exception e false)))

This will filter out invalid lat/lon values without throwing exceptions. Good?
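
A minimal sketch of how a filter like this behaves on the values seen so far. The bounds in `latlon-range` are assumed here (the real definition lives in fossa's source and isn't shown in this thread). `clojure.edn/read-string` (available from Clojure 1.5) is swapped in for `read-string` because the core reader can evaluate code embedded in untrusted input, and the `number?` checks are an addition too.

```clojure
(require '[clojure.edn :as edn])

;; Assumed bounds for illustration; fossa defines its own latlon-range.
(def latlon-range {:lat-min -90 :lat-max 90 :lon-min -180 :lon-max 180})

(defn latlon-valid?
  "Return true if lat and lon parse as numbers inside latlon-range,
  otherwise return false."
  [lat lon]
  (try
    (let [{:keys [lat-min lat-max lon-min lon-max]} latlon-range
          ;; edn/read-string throws NumberFormatException on tokens
          ;; like 0:0 or 22?, which the catch below turns into false
          [lat lon] (map edn/read-string [lat lon])]
      (and (number? lat)
           (number? lon)
           (<= lat-min lat lat-max)
           (<= lon-min lon lon-max)))
    (catch Exception _ false)))

;; Example calls
(latlon-valid? "37.5" "-122.3")
(latlon-valid? "0:0" "10")
(latlon-valid? "44d" "10")
```

One caveat: both readers stop after the first form, so a value like "40 30" would be read as 40 and pass. A stricter parse may be worth it if such values show up.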

eightysteele commented on August 12, 2024

FYI the above mods are proposed for pull request #10.

robgur commented on August 12, 2024

How many records are in the wrong format, total? If significant, could we
convert to decimal degrees instead of losing perfectly valid (but
improperly formatted) data?

robinkraft commented on August 12, 2024

Out of 75+ million or so records, I saw ~53k problem records, so less than 0.1% of observations. If they're distributed evenly throughout the whole dataset, I expect we'd drop about 250k records.

robinkraft commented on August 12, 2024

I'm assuming 375m records. I only got through 75 million before I stopped.
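
As a rough check on those numbers, assuming the ~53k problem records found in the first 75 million are representative of all 375 million, the arithmetic can be sketched as below. The function names are made up for illustration.

```clojure
(defn bad-rate-percent
  "Share of problem records in the sample, as a percentage."
  [bad-in-sample sample-size]
  (* 100.0 (/ bad-in-sample sample-size)))

(defn projected-bad-records
  "Scale the sample's problem-record count up to the full dataset,
  assuming problem records are evenly distributed."
  [bad-in-sample sample-size total-size]
  (quot (* bad-in-sample total-size) sample-size))

(bad-rate-percent 53000 75000000)                 ; roughly 0.07 percent
(projected-bad-records 53000 75000000 375000000)  ; 265000
```

So the expected loss is on the order of a quarter of a million records, about 0.07% of the data.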

eightysteele commented on August 12, 2024

There are about 450 million. Robin, can you look around for a Java library for converting these coordinate representations to decimal degrees? I'll look too.

robgur commented on August 12, 2024

Pretty small percentage, and the records vary in representation and
encoding, so conversion might be a bit costly for all cases. Might be
worth it to do conversion of the two most common cases and let the others
drop.
-r

eightysteele commented on August 12, 2024

Yeah, this opens a can of worms for sure. A quick look and I'm not finding good libraries for these conversions. I propose we drop records without valid decimal degrees. Once we have this end-to-end flow working we can come back and make adjustments as needed.

eightysteele commented on August 12, 2024

Just chatted with Rob, we're on the same page. Closing issue.
