
Extract and visualize locations from any file

License: Apache License 2.0

Topics: geoparser, docker, gazetteer, tika, tika-server, extract, django, visualize-locations, solr, covid-19


GeoParser

The GeoParser is a software tool that can process information from any type of file, extract geographic coordinates, and visualize locations on a map. Users who want a geographical representation of information or data can search for locations with the GeoParser, either through a search index or by uploading files from their computer. The GeoParser parses the files and visualizes cities or latitude-longitude points on the map. Once the information is parsed and points are plotted, users can filter their results by density, or by searching for a keyword and applying a "facet" to the parsed information. On the map, users can click on location points to reveal more information about each location and how it relates to their search.

Installation (Docker)

  1. docker build -t nasajplmemex/geo-parser --no-cache -f Dockerfile .
  2. docker-compose up -d
  3. Visit http://localhost:8000 in your browser
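
To confirm the stack came up before opening the browser, you can check the containers (a minimal sketch; the exact service names depend on the docker-compose.yml in this repo):

    # list the services started by docker-compose and their state
    docker-compose ps

    # the web UI should answer on port 8000 once the containers are up
    curl -I http://localhost:8000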

Try it out to help fight COVID!

GeoParser has been updated with a new, easy-to-use Docker install, along with an example that downloads the COVID-19 literature dataset and visualizes its locations. Use that example to explore and test GeoParser on real data.

Installation (Manual)

Requirements

  1. Python 2.7
  2. pip
  3. Django
  4. Tika Python

Install Requirements

  1. Install the Python requirements:
pip install -r requirements.txt
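
Since the project targets Python 2.7, it may help to isolate the install in a virtualenv (a minimal sketch, assuming virtualenv is installed and python2.7 is on your PATH):

    # create and activate an isolated Python 2.7 environment
    virtualenv -p python2.7 venv
    . venv/bin/activate

    # install the project requirements into it
    pip install -r requirements.txt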

How to Run the Application

  1. Run Solr. Change directory to where you cloned the project, then start the bundled Solr (a quick status check follows below):

    cd Solr/solr-5.3.1/
    ./bin/solr start
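
To verify Solr came up, ask it for its status (standard Solr CLI):

    ./bin/solr status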

  2. Clone lucene-geo-gazetteer repo

    git clone https://github.com/chrismattmann/lucene-geo-gazetteer.git
    cd lucene-geo-gazetteer
    mvn install assembly:assembly
    export PATH="$PATH:$(pwd)/src/main/bin"   # add lucene-geo-gazetteer/src/main/bin to your PATH
    

    make sure it is working

    lucene-geo-gazetteer --help
    usage: lucene-geo-gazetteer
     -b,--build <gazetteer file>           The Path to the Geonames
                                           allCountries.txt
     -h,--help                             Print this message.
     -i,--index <directoryPath>            The path to the Lucene index
                                           directory to either create or read
     -s,--search <set of location names>   Location names to search the
                                           Gazetteer for
    
  3. You will now need to build a gazetteer using the Geonames.org dataset (about 1.2 GB).

    cd lucene-geo-gazetteer
    curl -O http://download.geonames.org/export/dump/allCountries.zip
    unzip allCountries.zip
    lucene-geo-gazetteer -i geoIndex -b allCountries.txt
    

    make sure it is working

    lucene-geo-gazetteer -s Pasadena Texas
    [
      {"Texas" : [
        "Texas",
        "-91.92139",
        "18.05333"
      ]},
      {"Pasadena" : [
        "Pasadena",
        "-74.06446",
        "4.6964"
      ]}
    ]
    

Now start the lucene-geo-gazetteer server (a quick check follows):

lucene-geo-gazetteer -server
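
To confirm the gazetteer server is answering, query its search endpoint (assuming the default port 8765 used by lucene-geo-gazetteer; adjust if your build differs):

    curl "http://localhost:8765/api/search?s=Pasadena"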
  4. Run the Tika server on port 8001, as described at https://cwiki.apache.org/confluence/display/TIKA/GeoTopicParser (the port can be configured via config.txt); a start-command sketch follows below.
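A minimal start-command sketch, assuming you have fetched the location-ner-model and geotopic-mime resources as described in the wiki above (paths and the Tika version are illustrative):

    java -classpath location-ner-model:geotopic-mime:tika-server-1.13.jar \
         org.apache.tika.server.TikaServerCli --port 8001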

  5. Make sure you can extract locations from the Tika server

curl -T /path/to/polar.geot -H "Content-Disposition: attachment; filename=polar.geot" http://localhost:8001/rmeta

You can obtain the file [here](https://raw.githubusercontent.com/chrismattmann/geotopicparser-utils/master/geotopics/polar.geot)

The output should look like this:

[
   {
      "Content-Type":"application/geotopic",
      "Geographic_LATITUDE":"39.76",
      "Geographic_LONGITUDE":"-98.5",
      "Geographic_NAME":"United States",
      "Optional_LATITUDE1":"27.33931",
      "Optional_LONGITUDE1":"-108.60288",
      "Optional_NAME1":"China",
      "X-Parsed-By":[
         "org.apache.tika.parser.DefaultParser",
         "org.apache.tika.parser.geo.topic.GeoParser"
      ],
      "X-TIKA:parse_time_millis":"1634",
      "resourceName":"polar.geot"
   }
]
  6. Run the Django server:

    python manage.py runserver

  7. Open http://localhost:8000/ in your browser.

Note: please refer to the wiki page on this GitHub repository, which can act as a guide on how to use GeoParser.


Contributors

antrromet, aravindram, chrismattmann, danlamanna, dependabot[bot], lawongsta, mboustani, smadha


geoparser's Issues

Have Girder log in when the page loads

When the GeoParser app loads, have Girder log in so that the app can use Girder.
Username: girder
Password: girder

Base64 encoding: Z2lyZGVyOmdpcmRlcg==
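
For reference, that token is just the credential pair base64-encoded, as used in an HTTP Basic Authorization header:

    echo -n 'girder:girder' | base64
    # Z2lyZGVyOmdpcmRlcg==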

Show list of uploaded files on front-end

Uploaded files appear as soon as they are uploaded, but if the page is refreshed they are no longer shown.

Solution:
Have the server look in the uploads folder and send the list of files to the front-end.

Use Solr instance to store geoparsed data

Use Solr for storing and retrieving the geoparsed data.
One collection, called "Uploaded_Files", will store all geoparsed uploaded files.
The schema could be something close to:

file_name = <file_name>
extracted_text = <extracted_text>
location_names = [list of location names]
lat/lon = [list of location dictionaries (key: location name, value: lat/lon)]

Note: this schema may change as we learn more about Solr spatial queries. A sketch of a candidate document follows below.
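
A sketch of what one document in the "Uploaded_Files" collection might look like, using the field names proposed above (the file name and all values are purely illustrative):

    {
      "file_name": "report.pdf",
      "extracted_text": "full text extracted from the file",
      "location_names": ["Pasadena", "Texas"],
      "lat_lon": [
        {"Pasadena": [34.14778, -118.14452]},
        {"Texas": [31.25044, -99.25061]}
      ]
    }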

Geoparsing for uploaded file

The server should start the geoparsing process as soon as files are uploaded, and send the status back to the front-end.

Show more metadata on popups and link back to Solr doc

Need to show more data on popups so that individual bubbles can be analyzed.

  • "title": "Title of doc, fetched as configured"
  • "descr": "Short description, fetched as per configuration"
  • "url": "Link to html/image"
  • "solr_url": "Link to solr_doc"

Updated return_points URL

@smadha I have updated the "return_points" URL to work with both uploaded files and crawled data.
Now you can call return_points to get points from both.
However, you need to update the code that currently calls return_points; here is the update:

For uploaded files: http://localhost:8000/return_points/<file_name>/uploaded_files
Example: http://localhost:8000/return_points/5c0024-25.pdf/uploaded_files

For crawled data: http://localhost:8000/return_points/<solr_url>/<core_name>
Example: http://localhost:8000/return_points/http://crawl.dyndns.org/solr/domain

Please update both.

Need to modify Solr schema

As we are indexing millions of records, I can see a lot of issues with Solr.

We initially did a lot of handholding, editing data types and using encodings, but it is now constantly failing with OutOfMemoryError. I have tried increasing memory up to 1.5 GB, but that only buys us some extra time.

At the moment I am trying to create a new schema which addresses the limitations below:

  • Instead of one single huge document, we scale points across multiple documents.
  • We must be able to produce search results for a location. The query will be a LOCATION.

I am planning the following:

  • one core in the geoparser Solr instance for every index
  • one document in the geoparser Solr instance for each document in the index
  • one domain can cover multiple cores; the core syntax will be domain_[id]

Right now I am doing this only for indexed data, on a new branch, since uploaded files work without issues.

@MBoustani

Progress bar for uploaded file

The progress bar should show the status of the file being uploaded, as well as of the file being geoparsed, with status text underneath.

Merging CherryPy webserver with Girder

GeoParser at this stage uses two servers as its backend:
1. CherryPy as the web server
2. Girder for the file system and for running jobs

Besides the fact that running two servers at the same time increases the chance of application failure, there is also a cross-domain issue when CherryPy calls Girder on a different port; therefore the two can be merged into one, with Girder taking care of everything.

Scroll or collapse files in menu

Each file in the menu shows its results underneath; if the result list is long, it goes off screen.
Each file/result should be collapsible, and it should be possible to scroll up and down.

Support Crawled indexed data

Crawled data is usually indexed into either Solr or Elasticsearch.
GeoParser should be able to take the URL of either of these indexing machines plus a domain name, scan the whole index, and geoparse it.
The results (location names and points) will be stored internally in Solr, alongside the path to the crawled data.

Mock REST services

The APIs below need to be mocked (example calls follow the list):

[API signature], [Method],
[Sample Response]

  1. /upload POST
    Response code - 200
  2. /status/%file_id% GET

    {
      "name": "File Name",
      "status": "Message to be displayed to user",
      "stepCount": 4,
      "parsedInfo": [
        {
          "lat": -34.6037232,
          "lon": -58.3815931,
          "name": "Aires Argentina",
          "refCount": 10,
          "refContext": "Test line 4 in file uploaded",
          "refUrl": "https://geo1.ggpht.com/cbk?panoid=wkEz-Hwmc44EnMsE7SuXBw&output=thumbnail"
        },
        {
          "lat": 19.4302678,
          "lon": -99.1373136,
          "name": "Mexico City, Mexico",
          "refCount": 2,
          "refContext": "Test line 42 in file uploaded",
          "refUrl": "https://geo0.ggpht.com/cbk?output=thumbnail&thumb=2&panoid=3DKyddof6dWPw3tx5BULbQ&w=96&h=64&yaw=176"
        }
      ]
    }
  3. /search/index/%keyword% GET
     Response: same shape as the /status/%file_id% response above.
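
A sketch of exercising these mocks from the command line (the upload form-field name and the file_id format are assumptions, not part of the spec above):

    # upload a file to the mock endpoint (form field name assumed)
    curl -X POST -F "file=@/path/to/doc.pdf" http://localhost:8000/upload

    # poll parse status for an uploaded file (file_id format assumed)
    curl http://localhost:8000/status/doc.pdf

    # keyword search against the index
    curl http://localhost:8000/search/index/polar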

https://drive.google.com/open?id=1ASR0j0lzT8GqifZ0ep6WMBV9SaAOENPHUIqUrrR7dbo

GeoParser plugin for Girder

The GeoParser plugin for Girder can have multiple jobs running through Girder.
Each job can be called via a REST URL and will return its results as JSON.

Be able to remove uploaded files on front-end

After each file is uploaded, its name appears under the "upload file" section.
@smadha Can you please put a remove icon next to each file, or check with @lawongsta about how to remove a file, perhaps by sending a request to the server?
@smadha should we use a REST URL to send the remove command to the server for each file?

Server to use CherryPy instead of Flask

We have decided to use CherryPy for the server instead of Flask.
CherryPy is more stable, and since we are going to join with other Memex geo projects, we should use the same technologies they do.
