opensextant / xponents

Geographic place, date/time, and pattern entity extraction toolkit, along with text extraction from unstructured data and GIS output formatters.

License: Apache License 2.0

Java 85.44% Python 10.18% Shell 2.62% Batchfile 1.76%
document-conversion geocoding geonames geoparsing geotagging information-extraction nlp solr tika

xponents's Introduction

OpenSextant

The Open Spatial Extraction and Tagging (OpenSextant) software provides an unstructured textual data geotagging and geocoding capability. The U.S. Government Joint Improvised Explosive Device Defeat Organization (JIEDDO) developed this capability in coordination with other U.S. government agencies and is pleased to provide this as open source software using an Apache 2.0 license. The software relies upon the open source General Architecture for Text Engineering (GATE) natural language processing software and the Apache Solr search software. Please see below for instructions on how to access the source code and binaries.

OpenSextant Suite

This suite comprises various projects for geospatial and temporal extraction. The core module is OpenSextantToolbox, which produces a GATE plugin and a toolkit for controlling the overall extraction and geocoding pipeline using that plugin.

Modules


Commons -- Common parent classes, data model and core utilities. TBD

Xponents -- Extractors

  • XText document conversion (to plain text)
  • XCoord coordinate extraction
  • XTemporal date/time extraction
  • FlexPat regular-expression-based pattern extraction (the framework underlying XCoord and XTemporal)

OpenSextantToolbox -- A GATE-based plugin and various main programs for geotagging/geocoding

Gazetteer -- A Solr-based gazetteer supporting mainly NGA Geonames, USGS place data, and ad hoc catalogs

LanguageResources -- Linguistic tuning data

doc -- Documentation, user manuals, developer guides

Peer Projects


SolrTextTagger -- A text tagging solution for high-volume word lists or data sets

GISCore -- An API for managing GIS data formats.

  • geodesy geodetic primitives and routines used by OpenSextant and GISCore
  • giscore the main GISCore API which supports IO and data manipulation on GIS data

Additional content:

Testing -- (RELEASE TBD) test data and programs to give you ideas of what is possible.

GeocoderEval -- (RELEASE TBD) a framework and ground truth we have developed for evaluating OpenSextant and other geotaggers.

Getting Started Using OpenSextant

In the OpenSextant binary distribution you will find ./script/default.env. It contains OPENSEXTANT_HOME and other useful shell settings. A Windows version is TBD.

To geocode files and folders, please use the reference script:

  $OPENSEXTANT_HOME/script/geocode.sh   <input> <output> <format>

where:

  • input -- an input file or folder
  • output -- an output file or folder (depends on format)
  • format -- the format of your output: one of GDB, CSV, Shapefile, WKT, KML

Getting Started Integrating OpenSextant


Javadoc is located at OPENSEXTANT_HOME/doc/javadoc. Typical ad hoc integration will be through the o.m.o.apps.SimpleGeocoder class, which leverages o.m.o.processing.TextInput on input and GeocodingResult/Geocoding as output classes.

Integration documentation is in progress, as of April 2013.

The main library JARs of interest are:

OpenSextantToolbox.jar, opensextant-apps.jar, opensextant-commons.jar

And the various Xponents: xtext.jar, xcoord.jar, xtemporal.jar, flexpat.jar

As of release time 2013-Q1, we are working on documenting and honing dependencies with other libraries, as well as our internal dependencies.

Getting Started Developing OpenSextant


For more information see ./doc/OpenSextantToolbox/doc/OpenSextant Developers Guide.docx

Set your maven proxy settings; see ./doc/developer/ for hints.

Ensure that JAVA_HOME environment variable is pointed at a Java 7 JDK.

Otherwise you may encounter Javadoc and/or compilation errors.

In the source tree, run "ant". This will build the various required components and prepare a release:

  cd ./opensextant

  # see that things compile
  ant compile

  # the release step compiles all modules and prepares a release
  ant release

Alternatively, Maven can be used to build Commons, Xponents, and SolrTextTagger. For example:

 cd Xponents
 mvn install 

But complete Maven build support is not planned at this time.

xponents's People

Contributors

andrequina, dlutz2, dsmiley, gavin-black, jgibson, mubaldino


xponents's Issues

Improve SolrProxy to not depend on EmbeddedSolrServer

Even if you don't want to use SolrProxy with an EmbeddedSolrServer, SolrProxy has that dependency, which in turn pulls in a ton of Solr machinery. If SolrProxy.initialize_embedded(home, core) were implemented in a static inner class, then EmbeddedSolrServer wouldn't be required if you didn't want it.
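A minimal sketch of the static-inner-class idea (all names hypothetical; `EmbeddedServer` is a stand-in for EmbeddedSolrServer): because the heavy class is referenced only inside the nested holder, the JVM never loads it unless the embedded path is actually invoked.

```java
public class SolrProxySketch {
    static int heavyLoads = 0;  // counts constructions of the heavy stand-in

    // Stand-in for the heavyweight EmbeddedSolrServer dependency.
    static class EmbeddedServer {
        EmbeddedServer(String home, String core) { heavyLoads++; }
    }

    // Inner holder: EmbeddedServer is referenced only here, so callers who
    // never invoke initializeEmbedded() never pull it in.
    static class EmbeddedHolder {
        static EmbeddedServer create(String home, String core) {
            return new EmbeddedServer(home, core);
        }
    }

    public static Object initializeEmbedded(String home, String core) {
        return EmbeddedHolder.create(home, core);
    }

    // Lightweight path: no embedded-server classes touched.
    public static String describe() {
        return "SolrProxy (embedded not loaded)";
    }
}
```

Callers using only a remote Solr connection would then never trigger the embedded dependency.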

Add intuitive output in tester tools

Type of Feature:

  • Collaboration or partnership
  • Improvement or clarification
  • New Processing

Description of Feature

XCoord, XTemp, etc -- all Examples in xponents-demo.sh should report "output" files or results in a clear manner. Things end up in ./results but you are not told that.

Trivial "Do Do" false-positives

Describe the bug
"Do. Do", "do. Do", "in Do", etc. are still common false positives.

To Reproduce
Xponents 3.3

Expected behavior
Better filtering of these. Likely use a spaCy NER model to provide POS tags and eliminate obvious errors.

"Centers for Disease Control" in Kenya

Describe the bug
USGS gazetteer entries

To Reproduce
USGS entry for "CDC" or "Centers for Disease Control" is an exact match for that agency in Kenya.
The USGS entry for the US "CDC" -- "Centers for Disease Control and Prevention" -- is spelled with the singular "Center for Disease Control" and is not the complete name.

  • incorrect match occurs and is coded as Kenya consistently.
  • inadequate entries for USGS -- should fix with additional information sources.

Code scanning with Sonarqube

Type of Feature:

  • Collaboration or partnership
  • Improvement or clarification
  • New Processing

Description of Feature
Integrate Sonar code scanning as an option to pre-screen releases.
Deploy and package Sonar scan with offline docker image, given it nearly doubles the maven dependencies.

Gazetteer 2.0 -- Python ETL

Type of Feature:

  • Collaboration or partnership
  • Improvement or clarification
  • New Processing

Description of Feature

Use Python Pandas and SQLite to stage all data sources in order to support the Merged Gazetteer output.
The current Gazetteer project is dependent on Kettle v6 to v9 and Java 8. There is now some incompatibility of the project with a git checkout on linux -- Kettle "spoon" script outputs an error on "Line 130, Column 69: Invalid Escape Sequence" ... but does not mention what file or what phase of processing.

This is not worth fixing in Kettle and Gaz project. Much easier to reimplement.

SolrGazetteer doesn't close streams

SolrGazetteer reads from some IO streams it creates but it never closes them. This occurs in loadFeatureMetaMap and loadCountryNameMap.

Reader countryIO = new InputStreamReader(getClass().getResourceAsStream("/country-names-2013.csv"));
try {
    // ... read and parse ...
} finally {
    countryIO.close();
}
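Since Java 7, try-with-resources is a tidier alternative to the explicit finally block. A self-contained sketch of the idiom (a StringReader stands in for the classpath resource; the method name is illustrative only):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class StreamCloseSketch {
    // Reads the first CSV row; try-with-resources closes the reader
    // automatically, even if readLine() throws.
    static String firstRow(Reader src) {
        try (BufferedReader in = new BufferedReader(src)) {
            return in.readLine();
        } catch (IOException e) {
            // rethrow unchecked to keep the sketch simple
            throw new RuntimeException(e);
        }
    }
}
```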

SolrGazetteer doesn't have way to close Solr connection

SolrGazetteer creates a connection to Solr but never closes it. SolrGazetteer should have a close() method that closes its connection to Solr.

Another option to consider, more in line with dependency-injection strategies, is for SolrGazetteer not to create the Solr connection itself; instead it would take it via a setter or constructor parameter. Then it would not be in charge of closing the resource, because it wouldn't be its creator. Generally, the creators of closeable resources are the ones responsible for closing them.
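A sketch of the injection approach (all class names hypothetical, not the actual SolrGazetteer API): the gazetteer borrows a client it did not create, so the caller keeps responsibility for closing it.

```java
import java.io.Closeable;

public class GazetteerSketch {
    interface SolrConnection extends Closeable { String query(String q); }

    private final SolrConnection solr;

    // Injected, not created: this class never owns the connection.
    GazetteerSketch(SolrConnection solr) { this.solr = solr; }

    String lookup(String name) { return solr.query(name); }

    // A fake connection that records whether it was closed.
    static class FakeConnection implements SolrConnection {
        boolean closed = false;
        public String query(String q) { return "hit:" + q; }
        public void close() { closed = true; }
    }

    // Demo: the creator, not the gazetteer, closes the resource.
    static boolean demo() {
        FakeConnection conn = new FakeConnection();
        try (FakeConnection c = conn) {
            new GazetteerSketch(c).lookup("Boise");
        }
        return conn.closed;
    }
}
```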

Precision not reported correctly on found coordinates

Describe the bug
Verify the precision reported on coordinate extraction: prec=900 is reported for a DMS match with seconds resolution; it should be +/- 30 m.

Confidence is also not reported by the REST API.

To Reproduce
Xponents 3.3.2

Expected behavior

  • lat,lon = 45˚ 45' 45" x 33˚ 33' 33" ... precision should be < 30 m.
  • Confidence for a solid deg/min/sec match should be 90+.
  • MGRS confidence should be 90+ for a grid with offset at 1 km precision, and lower with less precision.
  • Pure decimal-degree coordinates probably vary in confidence: decimal degrees with a hemisphere symbol rate 90. Without a hemisphere or other indicator of geographic nature, it is just a decimal pair (1.45352 4.55577); confidence ~50 for DD without symbols.
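As a back-of-envelope check on the expected numbers (a spherical-earth approximation, not the project's actual geodesy code): one arc-second of latitude is about 31 m, which is why a seconds-resolution DMS match should report roughly +/- 30 m rather than 900 m.

```java
public class PrecisionSketch {
    // Mean meters per degree of latitude, spherical-earth approximation.
    static final double METERS_PER_DEGREE_LAT = 111_320.0;

    // Approximate ground resolution, in meters, of one arc-second.
    static double metersPerArcSecond() {
        return METERS_PER_DEGREE_LAT / 3600.0;  // ~30.9 m
    }
}
```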

Improved connection with Solr 8.x and future "contributions" sections of Solr manual

Type of Feature:

  • Collaboration or partnership
  • Improvement or clarification
  • New Processing

Description of Feature
I see the success of the Tagger handler (the follow-on of SolrTextTagger). It's great to see the geonames reference, etc., but preserving only the "naive tagger" mention and not much more leaves a gaping hole to be filled.
https://lucene.apache.org/solr/guide/8_6/the-tagger-handler.html#tagger-performance-tips

We can list a handful of successful NLP and other uses of the TextTagger. The main example here is its use in our various OpenSextant implementations (Xponents, GATE Toolbox, etc.) and production-ready packaging such as: https://hub.docker.com/r/mubaldino/opensextant

So what is needed is to understand how to register this interest with the Solr committers, and what sort of connection exists between Solr and its users. Ideally, a "contributor" could be someone who contributes applications of Solr that are registered/vetted in a new part of the Community portion of the Solr site.
https://lucene.apache.org/solr/community.html#how-to-contribute -- I can see how I can contribute to the Solr code base, but I have no interest or time there. So the https://lucene.apache.org/solr/ home page is missing a venue for its community to understand who is building on top of and applying Solr.

Model: see the spaCy Universe (https://spacy.io/universe), where contributors are folded in directly with the project. That home page has a completely different feel for how a dev community operates, and it highlights a broader sense of "contributor".

I hate to see Solr fall behind, but it is hard to be heard if you are not a committer.

Marc

Geopy as a possible target

Type of Feature:
[ X ] Collaboration or partnership
[ ] Improvement or clarification
[ X ] New Processing

Description of Feature
https://geopy.readthedocs.io/en/stable -- Support a Geopy usage:

  from geopy.geocoders.opensextant import Xponents
  xp = Xponents()
  pt = xp.geocode("45.7878E 14.000N")
  # Parsed coordinate

  place = xp.reverse( pt )
  # Closest named location 

  pt = xp.geocode("Yarmouth, ME")
  # pt = just the best possible match for the above.

  pts = xp.geocode("Yarmouth", single=False)  ## Ambiguous, so this should return multiple

  pts = xp.geocode(" when in Yarmouth (down east Maine) hit the Harraseeket  lunch counter for lobster" )
  # All possible locations found in text.


Apply Feature Type in weighting evidence and confidence.

Type of Feature:
[ ] Collaboration or partnership
[x] Improvement or clarification
[ ] New Processing

Description of Feature
Account for feature class and even coding when disambiguating locations and then also assigning confidence.

  • "Boise" / feature H/STMI -- an intermittent stream
  • "Boise" / feature P/PPL -- a major city, state capital.

If "Boise" is mentioned, we should score the P/PPL location higher and eventually choose it if there is no other relevant context to say otherwise. Confidence as well should reflect how confident we are in this.

When there is sufficient evidence to indicate the stream/hydro feature is the place in question, that evidence will have to surpass that for other possibilities.
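To make the weighting concrete, a toy sketch (the weight values and method names are invented for illustration, not the actual Xponents scoring): absent other evidence, a populated place outranks an intermittent stream.

```java
public class FeatureBiasSketch {
    // Hypothetical prior weight by GeoNames feature class/code;
    // real values would be tuned against evaluation data.
    static double classPrior(String featClass, String featCode) {
        if ("P".equals(featClass)) {
            return "PPLC".equals(featCode) ? 0.90 : 0.80;  // capitals highest
        }
        if ("A".equals(featClass)) return 0.70;  // administrative regions
        if ("H".equals(featClass)) return 0.30;  // hydro features, e.g. STMI
        return 0.50;                             // everything else
    }
}
```

With such a prior, the Boise H/STMI candidate would only win when contextual evidence outweighs the P/PPL default.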

TaxCat person_names improvements

File encoding is not respected -- data is read in by the Python scripts as bytes but is not consistently decoded from UTF-8 bytes to a Unicode string.

Also various false positives -- Census names can be short, confusable false positives in a language-specific manner; e.g., the last name "Le" is also a French stop word.

Solr 8.4+ upgrade

Type of Feature:
[ ] Collaboration or partnership
[X ] Improvement or clarification
[ ] New Processing

Description of Feature
Solr versions before 8.2 have a security bug in the import handler.

Solr 5.x Build

SolrTextTagger is now at 2.2, with Solr 5.5 as the maximum supported version. There are some limitations in going to Solr 6.0 on STT v2.3-dev.

Solr 4.10 is EOL, given that Solr 6.0 is out.

SolrGazetteer lacks SolrProxy configurability

SolrGazetteer configures its Solr connection via the global "solr.solr.home" system property, which in turn is also used by other components. So basically it forces you to use an embedded SolrServer. Instead, it should be configurable, similar to what I describe in issue #5 -- offer a setter to set the SolrProxy.
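A sketch of the precedence this asks for (class and method names hypothetical): an explicitly set value wins; the global system property becomes only a fallback.

```java
public class ConfigSketch {
    private String solrHome;  // null until set explicitly

    // Setter-based configuration, as the issue suggests.
    void setSolrHome(String home) { this.solrHome = home; }

    // Explicit setting takes precedence over the global property value.
    String resolveSolrHome(String systemPropertyValue) {
        return (solrHome != null) ? solrHome : systemPropertyValue;
    }
}
```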

Artificial name_bias entries

For example, 60,000 gazetteer rows are marked with a negative name_bias, yet they appear to be relatively unique names.

  • name_bias = -0.50 for "Compo Yacht Club" ... the same as for the names "Conference" and "Compañia Seis". "Conference" is obviously the outlier that is correctly marked; the others are unique, specific names.

TaxonMatcher configure() called twice

It appears that PlaceGeocoder is calling TaxonMatcher's configure method twice.

PlaceGeocoder lines:

299 personMatcher = new TaxonMatcher();
300 personMatcher.excludeTaxons("place."); /* but allow org., person., etc. */
301 personMatcher.configure();

Instantiation of TaxonMatcher() calls configure(). It is called again on line 301, resulting in the creation of two SolrProxies, the second of which is not properly cleaned up on close.
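Besides dropping the redundant call at line 301, the leak could be fixed by making configure() idempotent. A sketch of that guard (the shape is hypothetical; a static counter stands in for SolrProxy construction):

```java
public class MatcherSketch {
    static int proxiesCreated = 0;  // stands in for SolrProxy construction
    private boolean configured = false;

    // Constructor configures, as TaxonMatcher's does.
    MatcherSketch() { configure(); }

    // Idempotent: a second call is a no-op instead of leaking a second proxy.
    final void configure() {
        if (configured) return;
        configured = true;
        proxiesCreated++;
    }
}
```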

Refactor resource loading, again

Resource files must be available to SolrResourceLoader from ./lib in order to load into core.

  1. TagFilter -- loading files for GazetteerUpdateProcessorFactory is not necessary. Only basic items are needed.
  2. Instead of using items in optional JARs (e.g., the Kuromoji analyzer), use locally available ./conf/lang/* files (/lang/stopwords_ja.txt, for example).
2017-01-06 23:11:08,266 ERROR [coreLoadExecutor-5-thread-1] org.opensextant.extractors.geo.GazetteerUpdateProcessorFactory: Init failure
java.io.IOException: No such stop filter file /org/apache/lucene/analysis/ja/stopwords.txt
	at org.opensextant.extractors.geo.TagFilter.loadLanguageStopwords(TagFilter.java:87)
	at org.opensextant.extractors.geo.TagFilter.<init>(TagFilter.java:72)
	at org.opensextant.extractors.geo.GazetteerUpdateProcessorFactory.init(GazetteerUpdateProcessorFactory.java:84)
	at org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:611)
	at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:2268)
	at org.apache.solr.update.processor.UpdateRequestProcessorChain.init(UpdateRequestProcessorChain.java:119)
	at org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:609)

Solr 6 support

Add SolrTextTagger 2.4 support with Solr 6.4+ index

Future planning
LuceneRevolution 2017, etc.

NullPointer in taxon matcher and placegeocoder

Describe the bug
.tagset on Taxon is non-null only if .addTags() is called. Added .hasTags() to check whether .tagset is not null. Preferably, use the API method .getTags() rather than accessing the tagset attribute directly.

To Reproduce
Tested with a couple of nationalities that had no tagset (country code), e.g., Bajan and Azeri.

Expected behavior
Taxon class users should use .hasTags() to check if tagset is set.
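A minimal sketch of the guarded-accessor pattern the fix describes (simplified; not the actual Taxon class):

```java
import java.util.HashSet;
import java.util.Set;

public class TaxonSketch {
    private Set<String> tagset = null;  // non-null only after addTag()

    void addTag(String t) {
        if (tagset == null) tagset = new HashSet<>();
        tagset.add(t);
    }

    // Callers check this before touching tags, avoiding the NPE.
    boolean hasTags() { return tagset != null && !tagset.isEmpty(); }

    // Use the accessor, never the field directly.
    Set<String> getTags() { return tagset; }
}
```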

Create a docker offline build

Type of Feature:
[ ] Collaboration or partnership
[ x ] Improvement or clarification
[ ] New Processing

Description of Feature

  • pre-packaged Maven/Java development image
  • allow recompilation of code from inside a self-contained docker image

Decimal degrees not extracted from text

Describe the bug
Unsure if this is a feature request or a bug report, but we'll start here. I expected decimal degrees within a brick of text to be detected much like UTM, MGRS, etc. Instead, the service returned zero results.

I was just examining the core configuration for geocoord patterns, and I'm not 100% sure my test case even matches the patterns. We see this pattern in text frequently without the degree symbol, and it would be great if the service supported it:

42.312,102.121 42.312, 102.121

To Reproduce
I'm running the service using the latest 3.3 Docker image on an EC2 instance running Amazon Linux 2. I'm exercising the endpoint using curl, as in the following example:

curl -XPOST http://localhost:8888/xlayer/rest/process --data '{"text":"I flew to 42.312,102.121"}' | jq .

and see the following response:

{ "response": { "status": "ok", "numfound": 0 }, "annotations": [] }

Expected behavior
I would have expected the extractor to identify 42.312,102.121 as a location in decimal degrees and perform the same reverse geocode that occurs when utilizing the MGRS or DMS coords.

Thanks so much!!
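For illustration only, a minimal pattern along these lines (not the actual FlexPat rule; a real rule would also need hemisphere handling and range checks such as |lat| <= 90 and |lon| <= 180):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DecimalDegreeSketch {
    // Two decimals with at least 3 fractional digits, comma-separated,
    // with optional whitespace after the comma.
    static final Pattern DD_PAIR =
        Pattern.compile("(-?\\d{1,2}\\.\\d{3,})\\s*,\\s*(-?\\d{1,3}\\.\\d{3,})");

    // Returns {lat, lon} for the first match, or null if none found.
    static double[] find(String text) {
        Matcher m = DD_PAIR.matcher(text);
        if (!m.find()) return null;
        return new double[] { Double.parseDouble(m.group(1)),
                              Double.parseDouble(m.group(2)) };
    }
}
```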

Maven install fails on Xponents 3.4 master

Attempted to run mvn install on the latest master code (3.4) which failed with the following error:

[ERROR] Failed to execute goal on project opensextant-xponents: Could not resolve dependencies for project org.opensextant:opensextant-xponents:jar:3.4-SNAPSHOT: Could not find artifact org.opensextant:opensextant-xponents-core:jar:3.4-SNAPSHOT in maven-restlet-talend (https://maven.restlet.talend.com)

ZIP Code data -- from geonames.org

Type of Feature:

  • Collaboration or partnership
  • Improvement or clarification
  • New Processing

Description of Feature
Pull in Zip code entries from geonames.org as taggable text -- supported on-demand, not by default.

Given the style of the text, it would make more sense to create a separate "postal" index, apart from the gazetteer index.

Update Restlet or migrate back to 2.3.12

Describe the bug
Restlet has migrated over to Talend open source.
Major changes in JAR provisioning

To Reproduce
Attempt maven build pulling in Xponents dependencies inside a docker.
Error from JDK indicating "PKIX" exception validating source.

Expected behavior
Better documentation on how to securely access Talend's https:// site.

Use StopFilterFactory

TODO -- consider using org/apache/lucene/analysis/core/StopFilterFactory to load stop terms. This would help generalize the import and use of the ./solr4/gazetteer/conf/lang/stopwords* files that are already there for Solr indexing.

This was not completed, as Chinese, Korean, and Vietnamese terms were missing from Solr's default conf/lang files. But StopFilterFactory could still load such simple "wordset" lists provided from other sources.
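The "wordset" format is just one term per line with '#' comments, so loading extra lists (e.g., Chinese or Vietnamese stopwords from other sources) is straightforward. A sketch of a loader (hypothetical helper, not the Lucene factory itself):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

public class WordsetSketch {
    // Parses the simple "wordset" stop-file format:
    // one term per line; '#' starts a comment line; blanks ignored.
    static Set<String> loadWordset(Reader src) {
        Set<String> terms = new HashSet<>();
        try (BufferedReader in = new BufferedReader(src)) {
            String line;
            while ((line = in.readLine()) != null) {
                String term = line.trim();
                if (term.isEmpty() || term.startsWith("#")) continue;
                terms.add(term);
            }
        } catch (IOException e) {
            throw new RuntimeException(e);  // unchecked for sketch brevity
        }
        return terms;
    }
}
```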

PhoneticFilter experimentation

"Baz zAz" = two tokens, with likely phonetic keys bz, zaz. But if we find "Bazzaz" ==> bzaz, the resulting phonetics are the same, yet difficult to match.

"Deir ezzor" vs. "Der ez Zor" -- again, similar phonetics in a bigram or trigram, but hard to compare if the phonetics are not computed as such.

Preferred Country or Location for REST or other calls

Type of Feature:

  • Collaboration or partnership
  • Improvement or clarification
  • New Processing

Description of Feature
For geotagging, allow API calls to take in a list of preferred countries or locations (geohashes for now) that help scope what the caller thinks are the most relevant results.

SolrMatcherSupport configurability

SolrMatcherSupport's initialize() method currently examines global system properties to decide where it will find Solr. In general, global system properties can be handy, but they shouldn't be the only means of configuring things. There may be more than one Solr server in use, particularly in development (local indexing and remote gazetteer catalogs). This can be fixed by simply adding a setter for "solr", and then not overriding a non-null value in initialize(). Also, in the case that "solr.url" is set, it would be better to concatenate the getCoreName() method's value to the URL (with an extra '/') so that if I use a solr.url system property I can have, say, both the "gazetteer" and "tax" cores while still using one solr.url set to http://..../solr/. Otherwise I can't use two SolrMatcherSupports for remote Solr.
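The concatenation described above, sketched (helper name hypothetical): append the core name, inserting a '/' only when needed, so one solr.url can serve both the "gazetteer" and "tax" cores.

```java
public class SolrUrlSketch {
    // Joins a base Solr URL and a core name, tolerating a trailing slash.
    static String coreUrl(String solrUrl, String coreName) {
        return solrUrl.endsWith("/") ? solrUrl + coreName
                                     : solrUrl + "/" + coreName;
    }
}
```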

Jython 2.7 support

Type of Feature:
[ ] Collaboration or partnership
[ ] Improvement or clarification
[ X ] New Processing

Description of Feature

use Jython 2.7 for basic usage of API

Test Latest TextTagger in other languages/scripts

Describe the bug
TextTagger usage with languages other than English.

To Reproduce

  • Java or Python version: Any Java (openjdk 8 and 12)
  • Usage: Arabic text produces a "zero-length token" exception from TextTagger process()
  • Data input:
  • Did you enable logging (level = DEBUG)?
  • Other notes:
15:59:47.288 [main] ERROR org.apache.solr.handler.RequestHandlerBase - java.lang.IllegalArgumentException: term:  analyzed to a zero-length token
	at org.apache.solr.handler.tagger.Tagger.process(Tagger.java:142)
	at org.apache.solr.handler.tagger.TaggerRequestHandler.handleRequestBody(TaggerRequestHandler.java:231)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:2551)
	at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:191)
	at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:194)
	at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:211)
	at org.opensextant.extraction.SolrMatcherSupport.tagTextCallSolrTagger(SolrMatcherSupport.java:181)
	at org.opensextant.extractors.geo.GazetteerMatcher.tagText(GazetteerMatcher.java:444)
	at org.opensextant.extractors.geo.GazetteerMatcher.tagText(GazetteerMatcher.java:404)
	at org.opensextant.extractors.geo.PlaceGeocoder.extract(PlaceGeocoder.java:475)
	at org.opensextant.extractors.test.TestPlaceGeocoder.tagFile(TestPlaceGeocoder.java:57)
	at org.opensextant.extractors.test.TestPlaceGeocoder.main(TestPlaceGeocoder.java:164)

Expected behavior

More reasonable behavior is expected from TextTagger -- it's possible the whole Solr 7.x assembly needs to be replaced with a clean setup and fully reindexed data.

Quarterly Gazetteer Release, 2020-Q1

Type of Feature:
[ ] Collaboration or partnership
[ X ] Improvement or clarification
[ ] New Processing

Description of Feature
Gazetteer update.
Include testing of updates on nationalities.csv

Parks marked as org taxons, not locations

"National Parks" and other specific park entries that are well-known (in JRC, for example) might be marked as an organization taxon, and therefore not marked as a location.

DOMResult class conflict

javax.xml.transform.dom.DOMResult class appears to interfere with XML config file parsing.

  • Solr 4.x, 5.x
  • Java8
