opensextant / xponents

Geographic place, date/time, and pattern entity extraction toolkit, along with text extraction from unstructured data and GIS output formatters.

License: Apache License 2.0

Java 85.44% Python 10.18% Shell 2.62% Batchfile 1.76%
document-conversion geocoding geonames geoparsing geotagging information-extraction nlp solr tika

xponents's Introduction

OpenSextant

The Open Spatial Extraction and Tagging (OpenSextant) software provides an unstructured textual data geotagging and geocoding capability. The U.S. Government Joint Improvised Explosive Device Defeat Organization (JIEDDO) developed this capability in coordination with other U.S. government agencies and is pleased to provide this as open source software using an Apache 2.0 license. The software relies upon the open source General Architecture for Text Engineering (GATE) natural language processing software and the Apache Solr search software. Please see below for instructions on how to access the source code and binaries.

OpenSextant Suite

This suite comprises various projects for geospatial and temporal extraction. The core module is OpenSextantToolbox, which produces a GATE plugin and a toolkit for controlling the overall extraction and geocoding pipeline using that plugin.

Modules


Commons -- Common parent classes, data model and core utilities. TBD

Xponents -- Extractors

  • XText document conversion (to plain text)
  • XCoord coordinate extraction
  • XTemporal date/time extraction
  • FlexPat regular-expression-based pattern extraction (the framework underlying XCoord and XTemporal)

OpenSextantToolbox -- A GATE-based plugin and various main programs for geotagging/geocoding

Gazetteer -- A Solr-based gazetteer supporting mainly NGA Geonames, USGS place data, and ad hoc catalogs

LanguageResources -- Linguistic tuning data

doc -- Documentation, user manuals, developer guides

Peer Projects


SolrTextTagger -- A text tagging solution for high-volume word lists or data sets

GISCore -- An API for managing GIS data formats.

  • geodesy geodetic primitives and routines used by OpenSextant and GISCore
  • giscore the main GISCore API which supports IO and data manipulation on GIS data

Additional content:

Testing -- (RELEASE TBD) test data and programs to give you ideas of what is possible.

GeocoderEval -- (RELEASE TBD) a framework and ground truth we have developed for evaluating OpenSextant and other geotaggers.

Getting Started Using OpenSextant

In the OpenSextant binary distribution you will find ./script/default.env. It contains OPENSEXTANT_HOME and other useful shell settings. A Windows version is TBD.

To geocode files and folders, please use the reference script:

  $OPENSEXTANT_HOME/script/geocode.sh   <input> <output> <format>

where:

  • input -- an input file or folder
  • output -- an output file or folder (depends on format)
  • format -- the format of your output: one of GDB, CSV, Shapefile, WKT, KML

Getting Started Integrating OpenSextant


Javadoc is located at OPENSEXTANT_HOME/doc/javadoc. Typical ad hoc integration will be through the o.m.o.apps.SimpleGeocoder class, which leverages o.m.o.processing.TextInput on input and GeocodingResult/Geocoding as output classes.

Integration documentation is in progress, as of April 2013.

The main library JARs of interest are:

OpenSextantToolbox.jar, opensextant-apps.jar, opensextant-commons.jar

And the various Xponents: xtext.jar, xcoord.jar, xtemporal.jar, flexpat.jar

As of release time 2013-Q1, we are working on documenting and honing dependencies with other libraries, as well as our internal dependencies.

Getting Started Developing OpenSextant


For more information see ./doc/OpenSextantToolbox/doc/OpenSextant Developers Guide.docx

Set your maven proxy settings; see ./doc/developer/ for hints.

Ensure that JAVA_HOME environment variable is pointed at a Java 7 JDK.

Otherwise you may encounter Javadoc and/or compilation errors.

In the source tree, run "ant". This will build the various required components and prepare a release:

  cd ./opensextant

  # see that things compile
  ant compile

  # the release step compiles all modules and prepares a release
  ant release

Alternatively, Maven can be used to build Commons, Xponents, and SolrTextTagger. For example:

 cd Xponents
 mvn install 

But complete Maven build support is not planned at this time.

xponents's People

Contributors

andrequina, dlutz2, dsmiley, gavin-black, jgibson, mubaldino


xponents's Issues

Improve SolrProxy to not depend on EmbeddedSolrServer

Even if you don't want to use SolrProxy with an EmbeddedSolrServer, SolrProxy has that dependency, which in turn pulls in a ton of Solr machinery. If SolrProxy.initialize_embedded(home, core) were implemented in a static inner class, then EmbeddedSolrServer wouldn't be required if you didn't want it.
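A minimal sketch of the static-inner-class idea (all names hypothetical; `EmbeddedServer` is a stand-in for EmbeddedSolrServer): because the heavy class is referenced only inside the nested holder, the JVM never loads it unless the embedded path is actually invoked.

```java
public class SolrProxySketch {
    static int heavyLoads = 0;  // counts constructions of the heavy stand-in

    // Stand-in for the heavyweight EmbeddedSolrServer dependency.
    static class EmbeddedServer {
        EmbeddedServer(String home, String core) { heavyLoads++; }
    }

    // Inner holder: EmbeddedServer is referenced only here, so callers who
    // never invoke initializeEmbedded() never pull it in.
    static class EmbeddedHolder {
        static EmbeddedServer create(String home, String core) {
            return new EmbeddedServer(home, core);
        }
    }

    public static Object initializeEmbedded(String home, String core) {
        return EmbeddedHolder.create(home, core);
    }

    // Lightweight path: no embedded-server classes touched.
    public static String describe() {
        return "SolrProxy (embedded not loaded)";
    }
}
```

Callers using only a remote Solr connection would then never trigger the embedded dependency.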

Add intuitive output in tester tools

Type of Feature:

  • Collaboration or partnership
  • Improvement or clarification
  • New Processing

Description of Feature

XCoord, XTemp, etc -- all Examples in xponents-demo.sh should report "output" files or results in a clear manner. Things end up in ./results but you are not told that.

Trivial "Do Do" false-positives

Describe the bug
"Do. Do", "do. Do", "in Do", etc. are still common false positives.

To Reproduce
Xponents 3.3

Expected behavior
Better filtering of these. Likely use a spaCy NER model to provide POS tags and eliminate obvious errors.

"Centers for Disease Control" in Kenya

Describe the bug
USGS gazetteer entries

To Reproduce
USGS entry for "CDC" or "Centers for Disease Control" is an exact match for that agency in Kenya.
The USGS entry for the US "CDC" -- "Centers for Disease Control and Prevention" -- is spelled with the singular "Center for Disease Control" and is not the complete name.

  • incorrect match occurs and is coded as Kenya consistently.
  • inadequate entries for USGS -- should fix with additional information sources.

Code scanning with Sonarqube

Type of Feature:

  • Collaboration or partnership
  • Improvement or clarification
  • New Processing

Description of Feature
Integrate Sonar code scanning as an option to pre-screen releases.
Deploy and package Sonar scan with offline docker image, given it nearly doubles the maven dependencies.

Gazetteer 2.0 -- Python ETL

Type of Feature:

  • Collaboration or partnership
  • Improvement or clarification
  • New Processing

Description of Feature

Use Python Pandas and SQLite to stage all data sources in order to support the Merged Gazetteer output.
The current Gazetteer project is dependent on Kettle v6 to v9 and Java 8. There is now some incompatibility of the project with a git checkout on linux -- Kettle "spoon" script outputs an error on "Line 130, Column 69: Invalid Escape Sequence" ... but does not mention what file or what phase of processing.

This is not worth fixing in Kettle and Gaz project. Much easier to reimplement.

SolrGazetteer doesn't close streams

SolrGazetteer reads from some IO streams it creates but it never closes them. This occurs in loadFeatureMetaMap and loadCountryNameMap.

Reader countryIO = new InputStreamReader(getClass().getResourceAsStream("/country-names-2013.csv"));
try {
    // ... read and parse ...
} finally {
    countryIO.close();
}
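Since Java 7, try-with-resources is a tidier alternative to the explicit finally block. A self-contained sketch of the idiom (a StringReader stands in for the classpath resource; the method name is illustrative only):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class StreamCloseSketch {
    // Reads the first CSV row; try-with-resources closes the reader
    // automatically, even if readLine() throws.
    static String firstRow(Reader src) {
        try (BufferedReader in = new BufferedReader(src)) {
            return in.readLine();
        } catch (IOException e) {
            // rethrow unchecked to keep the sketch simple
            throw new RuntimeException(e);
        }
    }
}
```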

SolrGazetteer doesn't have way to close Solr connection

SolrGazetteer creates a connection to Solr but never closes it. SolrGazetteer should have a close() method that closes its connection to Solr.

Another option to consider, more in line with dependency-injection strategies, is for SolrGazetteer not to create the Solr connection itself; instead it would take it via a setter or constructor parameter. Then it would not be in charge of closing the resource, because it wouldn't be its creator. Generally, the creators of closeable resources are the ones responsible for closing them.
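A sketch of the injection approach (all class names hypothetical, not the actual SolrGazetteer API): the gazetteer borrows a client it did not create, so the caller keeps responsibility for closing it.

```java
import java.io.Closeable;

public class GazetteerSketch {
    interface SolrConnection extends Closeable { String query(String q); }

    private final SolrConnection solr;

    // Injected, not created: this class never owns the connection.
    GazetteerSketch(SolrConnection solr) { this.solr = solr; }

    String lookup(String name) { return solr.query(name); }

    // A fake connection that records whether it was closed.
    static class FakeConnection implements SolrConnection {
        boolean closed = false;
        public String query(String q) { return "hit:" + q; }
        public void close() { closed = true; }
    }

    // Demo: the creator, not the gazetteer, closes the resource.
    static boolean demo() {
        FakeConnection conn = new FakeConnection();
        try (FakeConnection c = conn) {
            new GazetteerSketch(c).lookup("Boise");
        }
        return conn.closed;
    }
}
```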

Precision not reported correctly on found coordinates

Describe the bug
Verify the precision reported on coordinate extraction: prec=900 is reported for a DMS match with seconds resolution; it should be +/- 30 m.

Confidence is also not reported by the REST API.

To Reproduce
Xponents 3.3.2

Expected behavior

  • lat,lon = 45˚ 45' 45" x 33˚ 33' 33" ... precision should be < 30 m.
  • Confidence for a solid deg/min/sec match should be 90+.
  • MGRS confidence should be 90+ for a grid with offset at 1 km precision, and lower with less precision.
  • Pure decimal-degree coordinates probably vary in confidence: decimal degrees with a hemisphere symbol rate 90. Without a hemisphere or other indicator of geographic nature, it is just a decimal pair (1.45352 4.55577); confidence ~50 for DD without symbols.
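As a back-of-envelope check on the expected numbers (a spherical-earth approximation, not the project's actual geodesy code): one arc-second of latitude is about 31 m, which is why a seconds-resolution DMS match should report roughly +/- 30 m rather than 900 m.

```java
public class PrecisionSketch {
    // Mean meters per degree of latitude, spherical-earth approximation.
    static final double METERS_PER_DEGREE_LAT = 111_320.0;

    // Approximate ground resolution, in meters, of one arc-second.
    static double metersPerArcSecond() {
        return METERS_PER_DEGREE_LAT / 3600.0;  // ~30.9 m
    }
}
```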

Improved connection with Solr 8.x and future "contributions" sections of Solr manual

Type of Feature:

  • Collaboration or partnership
  • Improvement or clarification
  • New Processing

Description of Feature
I see the success of the Tagger handler (the follow-on of SolrTextTagger). It's great to see the geonames reference, etc., but preserving only the "naive tagger" mention and not much more leaves a gaping hole to be filled.
https://lucene.apache.org/solr/guide/8_6/the-tagger-handler.html#tagger-performance-tips

We can list a handful of successful NLP and other uses of the TextTagger. The main example here is its use in our various OpenSextant implementations (Xponents, GATE Toolbox, etc.) and production-ready packaging such as: https://hub.docker.com/r/mubaldino/opensextant

So what is needed is to understand how to register this interest with the Solr committers, and what sort of connection exists between Solr and its users. Ideally, a "contributor" could be someone who contributes applications of Solr that are registered/vetted in a new part of the Community portion of the Solr site.
https://lucene.apache.org/solr/community.html#how-to-contribute -- I can see how I can contribute to the Solr code base, but I have no interest or time there. So the https://lucene.apache.org/solr/ home page is missing a venue for its community to understand who is building on top of and applying Solr.

Model: see the spaCy Universe (https://spacy.io/universe), where contributors are folded in directly with the project. That home page has a completely different feel for how a dev community operates, and it highlights a broader sense of "contributor".

I hate to see Solr fall behind, but it is hard to be heard if you are not a committer.

Marc

Geopy as a possible target

Type of Feature:
[ X ] Collaboration or partnership
[ ] Improvement or clarification
[ X ] New Processing

Description of Feature
https://geopy.readthedocs.io/en/stable -- Support a Geopy usage:

  from geopy.geocoders.opensextant import Xponents
  xp = Xponents()
  pt = xp.geocode("45.7878E 14.000N")
  # Parsed coordinate

  place = xp.reverse( pt )
  # Closest named location 

  pt = xp.geocode("Yarmouth, ME")
  # pt = just the best possible match for the above.

  pts = xp.geocode("Yarmouth", single=False)  ## Ambiguous, so this should return multiple

  pts = xp.geocode(" when in Yarmouth (down east Maine) hit the Harraseeket  lunch counter for lobster" )
  # All possible locations found in text.


Apply Feature Type in weighting evidence and confidence.

Type of Feature:
[ ] Collaboration or partnership
[x] Improvement or clarification
[ ] New Processing

Description of Feature
Account for feature class and even coding when disambiguating locations and then also assigning confidence.

  • "Boise" / feature H/STMI -- an intermittent stream
  • "Boise" / feature P/PPL -- a major city, state capital.

If "Boise" is mentioned, we should score the P/PPL location higher and eventually choose it if there is no other relevant context to say otherwise. Confidence as well should reflect how confident we are in this.

When there is sufficient evidence to indicate the stream/hydro feature is the place in question, that evidence will have to surpass that for other possibilities.
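To make the weighting concrete, a toy sketch (the weight values and method names are invented for illustration, not the actual Xponents scoring): absent other evidence, a populated place outranks an intermittent stream.

```java
public class FeatureBiasSketch {
    // Hypothetical prior weight by GeoNames feature class/code;
    // real values would be tuned against evaluation data.
    static double classPrior(String featClass, String featCode) {
        if ("P".equals(featClass)) {
            return "PPLC".equals(featCode) ? 0.90 : 0.80;  // capitals highest
        }
        if ("A".equals(featClass)) return 0.70;  // administrative regions
        if ("H".equals(featClass)) return 0.30;  // hydro features, e.g. STMI
        return 0.50;                             // everything else
    }
}
```

With such a prior, the Boise H/STMI candidate would only win when contextual evidence outweighs the P/PPL default.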

TaxCat person_names improvements

File encoding is not respected -- data is read in by the Python scripts as bytes but is not consistently decoded from UTF-8 bytes to a Unicode string.

Also various false positives -- Census names can be short, confusable false positives in a language-specific manner; e.g., the last name "Le" is also a French stop word.

Solr 8.4+ upgrade

Type of Feature:
[ ] Collaboration or partnership
[X ] Improvement or clarification
[ ] New Processing

Description of Feature
Solr versions before 8.2 have a security bug in the import handler.

Solr 5.x Build

SolrTextTagger is now at 2.2, with Solr 5.5 as the maximum supported version. There are some limitations in going to Solr 6.0 on STT v2.3-dev.

Solr 4.10 is EOL, given that Solr 6.0 is out.

SolrGazetteer lacks SolrProxy configurability

SolrGazetteer configures its Solr connection via the global "solr.solr.home" system property, which in turn is also used by other components. So basically it forces you to use an embedded SolrServer. Instead, it should be configurable, similar to what I describe in issue #5 -- offer a setter to set the SolrProxy.
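A sketch of the precedence this asks for (class and method names hypothetical): an explicitly set value wins; the global system property becomes only a fallback.

```java
public class ConfigSketch {
    private String solrHome;  // null until set explicitly

    // Setter-based configuration, as the issue suggests.
    void setSolrHome(String home) { this.solrHome = home; }

    // Explicit setting takes precedence over the global property value.
    String resolveSolrHome(String systemPropertyValue) {
        return (solrHome != null) ? solrHome : systemPropertyValue;
    }
}
```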

Artificial name_bias entries

For example, 60,000 gazetteer rows are marked with a negative name_bias, yet they appear to be relatively unique names.

  • name_bias = -0.50 for "Compo Yacht Club" ... the same as for the names "Conference" and "Compañia Seis". "Conference" is obviously the outlier that is correctly marked; the others are unique, specific names.

TaxonMatcher configure() called twice

It appears that PlaceGeocoder is calling TaxonMatcher's configure method twice.

PlaceGeocoder lines:

299 personMatcher = new TaxonMatcher();
300 personMatcher.excludeTaxons("place."); /* but allow org., person., etc. */
301 personMatcher.configure();

Instantiation of TaxonMatcher() calls configure(). It is called again on line 301, resulting in the creation of two SolrProxies, the second of which is not properly cleaned up on close.
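Besides dropping the redundant call at line 301, the leak could be fixed by making configure() idempotent. A sketch of that guard (the shape is hypothetical; a static counter stands in for SolrProxy construction):

```java
public class MatcherSketch {
    static int proxiesCreated = 0;  // stands in for SolrProxy construction
    private boolean configured = false;

    // Constructor configures, as TaxonMatcher's does.
    MatcherSketch() { configure(); }

    // Idempotent: a second call is a no-op instead of leaking a second proxy.
    final void configure() {
        if (configured) return;
        configured = true;
        proxiesCreated++;
    }
}
```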

Refactor resource loading, again

Resource files must be available to SolrResourceLoader from ./lib in order to load into core.

  1. TagFilter -- loading files for GazetteerUpdateProcessorFactory is not necessary. Only basic items are needed.
  2. Instead of using items in optional JARs (e.g., the Kuromoji analyzer), use locally available ./conf/lang/* files (/lang/stopwords_ja.txt, for example).
2017-01-06 23:11:08,266 ERROR [coreLoadExecutor-5-thread-1] org.opensextant.extractors.geo.GazetteerUpdateProcessorFactory: Init failure
java.io.IOException: No such stop filter file /org/apache/lucene/analysis/ja/stopwords.txt
	at org.opensextant.extractors.geo.TagFilter.loadLanguageStopwords(TagFilter.java:87)
	at org.opensextant.extractors.geo.TagFilter.<init>(TagFilter.java:72)
	at org.opensextant.extractors.geo.GazetteerUpdateProcessorFactory.init(GazetteerUpdateProcessorFactory.java:84)
	at org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:611)
	at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:2268)
	at org.apache.solr.update.processor.UpdateRequestProcessorChain.init(UpdateRequestProcessorChain.java:119)
	at org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:609)

Solr 6 support

Add SolrTextTagger 2.4 support with Solr 6.4+ index

Future planning
LuceneRevolution 2017, etc.

NullPointer in taxon matcher and placegeocoder

Describe the bug
.tagset on Taxon is non-null only if .addTags() is called. Added .hasTags() to check whether .tagset is not null. Preferably, use the API method .getTags() rather than accessing the tagset attribute directly.

To Reproduce
Tested with a couple of nationalities that had no tagset (country code), e.g., Bajan and Azeri.

Expected behavior
Taxon class users should use .hasTags() to check if tagset is set.
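A minimal sketch of the guarded-accessor pattern the fix describes (simplified; not the actual Taxon class):

```java
import java.util.HashSet;
import java.util.Set;

public class TaxonSketch {
    private Set<String> tagset = null;  // non-null only after addTag()

    void addTag(String t) {
        if (tagset == null) tagset = new HashSet<>();
        tagset.add(t);
    }

    // Callers check this before touching tags, avoiding the NPE.
    boolean hasTags() { return tagset != null && !tagset.isEmpty(); }

    // Use the accessor, never the field directly.
    Set<String> getTags() { return tagset; }
}
```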

Create a docker offline build

Type of Feature:
[ ] Collaboration or partnership
[ x ] Improvement or clarification
[ ] New Processing

Description of Feature

  • pre-packaged Maven/Java development image
  • allow recompilation of code from inside a self-contained docker image

Decimal degrees not extracted from text

Describe the bug
Unsure if this is a feature request or a bug report, but we'll start here. I expected decimal degrees within a brick of text to be detected much like UTM, MGRS, etc. Instead, the service returned zero results.

I was just examining the core configuration for geocoord patterns, and I'm not 100% sure my test case even matches the patterns. We see this pattern in text frequently without the degree symbol, and it would be great if the service supported it:

42.312,102.121 42.312, 102.121

To Reproduce
I'm running the service using the latest 3.3 Docker image on an EC2 instance running Amazon Linux 2. I'm exercising the endpoint using curl, as in the following example:

curl -XPOST http://localhost:8888/xlayer/rest/process --data '{"text":"I flew to 42.312,102.121"}' | jq .

and see the following response:

{ "response": { "status": "ok", "numfound": 0 }, "annotations": [] }

Expected behavior
I would have expected the extractor to identify 42.312,102.121 as a location in decimal degrees and perform the same reverse geocode that occurs when utilizing the MGRS or DMS coords.

Thanks so much!!
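For illustration only, a minimal pattern along these lines (not the actual FlexPat rule; a real rule would also need hemisphere handling and range checks such as |lat| <= 90 and |lon| <= 180):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DecimalDegreeSketch {
    // Two decimals with at least 3 fractional digits, comma-separated,
    // with optional whitespace after the comma.
    static final Pattern DD_PAIR =
        Pattern.compile("(-?\\d{1,2}\\.\\d{3,})\\s*,\\s*(-?\\d{1,3}\\.\\d{3,})");

    // Returns {lat, lon} for the first match, or null if none found.
    static double[] find(String text) {
        Matcher m = DD_PAIR.matcher(text);
        if (!m.find()) return null;
        return new double[] { Double.parseDouble(m.group(1)),
                              Double.parseDouble(m.group(2)) };
    }
}
```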

Maven install fails on Xponents 3.4 master

Attempted to run mvn install on the latest master code (3.4) which failed with the following error:

[ERROR] Failed to execute goal on project opensextant-xponents: Could not resolve dependencies for project org.opensextant:opensextant-xponents:jar:3.4-SNAPSHOT: Could not find artifact org.opensextant:opensextant-xponents-core:jar:3.4-SNAPSHOT in maven-restlet-talend (https://maven.restlet.talend.com)

ZIP Code data -- from geonames.org

Type of Feature:

  • Collaboration or partnership
  • Improvement or clarification
  • New Processing

Description of Feature
Pull in Zip code entries from geonames.org as taggable text -- supported on-demand, not by default.

Given the style of the text, it would make more sense to create a separate "postal" index, apart from the gazetteer index.

Update Restlet or migrate back to 2.3.12

Describe the bug
Restlet has migrated over to Talend open source.
Major changes in JAR provisioning

To Reproduce
Attempt maven build pulling in Xponents dependencies inside a docker.
Error from JDK indicating "PKIX" exception validating source.

Expected behavior
Better documentation on how to securely access Talend's https:// site.

Use StopFilterFactory

TODO -- consider using org/apache/lucene/analysis/core/StopFilterFactory to load stop terms. This would help generalize the import and use of the ./solr4/gazetteer/conf/lang/stopwords* files that are already there for Solr indexing.

This was not completed, as Chinese, Korean, and Vietnamese terms were missing from Solr's default conf/lang files. But StopFilterFactory could still load such simple "wordset" lists provided from other sources.
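The "wordset" format is just one term per line with '#' comments, so loading extra lists (e.g., Chinese or Vietnamese stopwords from other sources) is straightforward. A sketch of a loader (hypothetical helper, not the Lucene factory itself):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

public class WordsetSketch {
    // Parses the simple "wordset" stop-file format:
    // one term per line; '#' starts a comment line; blanks ignored.
    static Set<String> loadWordset(Reader src) {
        Set<String> terms = new HashSet<>();
        try (BufferedReader in = new BufferedReader(src)) {
            String line;
            while ((line = in.readLine()) != null) {
                String term = line.trim();
                if (term.isEmpty() || term.startsWith("#")) continue;
                terms.add(term);
            }
        } catch (IOException e) {
            throw new RuntimeException(e);  // unchecked for sketch brevity
        }
        return terms;
    }
}
```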

PhoneticFilter experimentation

"Baz zAz" = two tokens, with likely phonetic keys bz, zaz. But if we find "Bazzaz" ==> bzaz, the resulting phonetics are the same, yet difficult to match.

"Deir ezzor" vs. "Der ez Zor" -- again, similar phonetics in a bigram or trigram, but hard to compare if the phonetics are not computed as such.

Preferred Country or Location for REST or other calls

Type of Feature:

  • Collaboration or partnership
  • Improvement or clarification
  • New Processing

Description of Feature
For geotagging, allow API calls to take in a list of preferred countries or locations (geohashes for now) that help scope what the caller thinks are the most relevant results.

SolrMatcherSupport configurability

SolrMatcherSupport's initialize() method currently examines global system properties to decide where it will find Solr. In general, global system properties can be handy, but they shouldn't be the only means of configuring things. There may be more than one Solr server in use, particularly in development (local indexing and remote gazetteer catalogs). This can be fixed by simply adding a setter for "solr", and then not overriding a non-null value in initialize(). Also, in the case that "solr.url" is set, it would be better to concatenate the getCoreName() method's value to the URL (with an extra '/') so that if I use a solr.url system property I can have, say, both the "gazetteer" and "tax" cores while still using one solr.url set to http://..../solr/. Otherwise I can't use two SolrMatcherSupports for remote Solr.
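The concatenation described above, sketched (helper name hypothetical): append the core name, inserting a '/' only when needed, so one solr.url can serve both the "gazetteer" and "tax" cores.

```java
public class SolrUrlSketch {
    // Joins a base Solr URL and a core name, tolerating a trailing slash.
    static String coreUrl(String solrUrl, String coreName) {
        return solrUrl.endsWith("/") ? solrUrl + coreName
                                     : solrUrl + "/" + coreName;
    }
}
```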

Jython 2.7 support

Type of Feature:
[ ] Collaboration or partnership
[ ] Improvement or clarification
[ X ] New Processing

Description of Feature

use Jython 2.7 for basic usage of API

Test Latest TextTagger in other languages/scripts

Describe the bug
TextTagger usage with languages other than English.

To Reproduce

  • Java or Python version: Any Java (openjdk 8 and 12)
  • Usage: Arabic text produces a "zero-length token" exception from TextTagger process()
  • Data input:
  • Did you enable logging (level = DEBUG)?
  • Other notes:
15:59:47.288 [main] ERROR org.apache.solr.handler.RequestHandlerBase - java.lang.IllegalArgumentException: term:  analyzed to a zero-length token
	at org.apache.solr.handler.tagger.Tagger.process(Tagger.java:142)
	at org.apache.solr.handler.tagger.TaggerRequestHandler.handleRequestBody(TaggerRequestHandler.java:231)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:2551)
	at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:191)
	at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:194)
	at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:211)
	at org.opensextant.extraction.SolrMatcherSupport.tagTextCallSolrTagger(SolrMatcherSupport.java:181)
	at org.opensextant.extractors.geo.GazetteerMatcher.tagText(GazetteerMatcher.java:444)
	at org.opensextant.extractors.geo.GazetteerMatcher.tagText(GazetteerMatcher.java:404)
	at org.opensextant.extractors.geo.PlaceGeocoder.extract(PlaceGeocoder.java:475)
	at org.opensextant.extractors.test.TestPlaceGeocoder.tagFile(TestPlaceGeocoder.java:57)
	at org.opensextant.extractors.test.TestPlaceGeocoder.main(TestPlaceGeocoder.java:164)

Expected behavior

More reasonable behavior is expected from TextTagger -- it's possible the whole Solr 7.x assembly needs to be replaced with a clean setup and fully reindexed data.

Quarterly Gazetteer Release, 2020-Q1

Type of Feature:
[ ] Collaboration or partnership
[ X ] Improvement or clarification
[ ] New Processing

Description of Feature
Gazetteer update.
Include testing of updates on nationalities.csv

Parks marked as org taxons, not locations

"National Parks" and other specific park entries that are well-known (in JRC, for example) might be marked as an organization taxon, and therefore not marked as a location.

DOMResult class conflict

javax.xml.transform.dom.DOMResult class appears to interfere with XML config file parsing.

  • Solr 4.x, 5.x
  • Java8
