
Extract and visualize locations from any file

License: Apache License 2.0

Topics: geoparser, docker, gazetteer, tika, tika-server, extract, django, visualize-locations, solr, covid-19


GeoParser

The GeoParser is a software tool that can process information from any type of file, extract geographic coordinates, and visualize locations on a map. Users who want a geographical representation of information or data can search for locations with the GeoParser, either through a search index or by uploading files from their computer. The GeoParser parses the files and visualizes cities or latitude-longitude points on the map. Once the information is parsed and points are plotted, users can filter their results by density, or by searching for a keyword and applying a "facet" to the parsed information. On the map, users can click on location points to reveal more information about each location and how it relates to their search.

Installation (Docker)

  1. docker build -t nasajplmemex/geo-parser --no-cache -f Dockerfile .
  2. docker-compose up -d
  3. Visit http://localhost:8000 in your browser
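
To confirm the stack came up before opening the browser, you can check the containers (a minimal sketch; the exact service names depend on the docker-compose.yml in this repo):

    # list the services started by docker-compose and their state
    docker-compose ps

    # the web UI should answer on port 8000 once the containers are up
    curl -I http://localhost:8000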

Try it out to help fight COVID!

GeoParser has been updated with a new, easy-to-use Docker install, along with an example that downloads the COVID-19 literature dataset and visualizes its locations. Use that example to explore and test GeoParser on real data.

Installation (Manual)

Requirements

  1. Python 2.7
  2. pip
  3. Django
  4. Tika Python

Install Requirements

  1. Install the Python requirements:
pip install -r requirements.txt
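
Since the project targets Python 2.7, it may help to isolate the install in a virtualenv (a minimal sketch, assuming virtualenv is installed and python2.7 is on your PATH):

    # create and activate an isolated Python 2.7 environment
    virtualenv -p python2.7 venv
    . venv/bin/activate

    # install the project requirements into it
    pip install -r requirements.txt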

How to Run the Application

  1. Run Solr. Change directory to where you cloned the project, then start the bundled Solr (a quick status check follows below):

    cd Solr/solr-5.3.1/
    ./bin/solr start
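
To verify Solr came up, ask it for its status (standard Solr CLI):

    ./bin/solr status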

  2. Clone lucene-geo-gazetteer repo

    git clone https://github.com/chrismattmann/lucene-geo-gazetteer.git
    cd lucene-geo-gazetteer
    mvn install assembly:assembly
    export PATH="$PATH:$(pwd)/src/main/bin"   # add lucene-geo-gazetteer/src/main/bin to your PATH
    

    make sure it is working

    lucene-geo-gazetteer --help
    usage: lucene-geo-gazetteer
     -b,--build <gazetteer file>           The Path to the Geonames
                                           allCountries.txt
     -h,--help                             Print this message.
     -i,--index <directoryPath>            The path to the Lucene index
                                           directory to either create or read
     -s,--search <set of location names>   Location names to search the
                                           Gazetteer for
    
  3. You will now need to build a gazetteer using the Geonames.org dataset (about 1.2 GB).

    cd lucene-geo-gazetteer
    curl -O http://download.geonames.org/export/dump/allCountries.zip
    unzip allCountries.zip
    lucene-geo-gazetteer -i geoIndex -b allCountries.txt
    

    make sure it is working

    lucene-geo-gazetteer -s Pasadena Texas
    [
      {"Texas" : [
        "Texas",
        "-91.92139",
        "18.05333"
      ]},
      {"Pasadena" : [
        "Pasadena",
        "-74.06446",
        "4.6964"
      ]}
    ]
    

Now start the lucene-geo-gazetteer server (a quick check follows):

lucene-geo-gazetteer -server
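
To confirm the gazetteer server is answering, query its search endpoint (assuming the default port 8765 used by lucene-geo-gazetteer; adjust if your build differs):

    curl "http://localhost:8765/api/search?s=Pasadena"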
  4. Run the Tika server on port 8001, as described at https://cwiki.apache.org/confluence/display/TIKA/GeoTopicParser (the port can be configured via config.txt); a start-command sketch follows below.
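A minimal start-command sketch, assuming you have fetched the location-ner-model and geotopic-mime resources as described in the wiki above (paths and the Tika version are illustrative):

    java -classpath location-ner-model:geotopic-mime:tika-server-1.13.jar \
         org.apache.tika.server.TikaServerCli --port 8001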

  5. Make sure you can extract locations from the Tika server

curl -T /path/to/polar.geot -H "Content-Disposition: attachment; filename=polar.geot" http://localhost:8001/rmeta

You can obtain the file [here](https://raw.githubusercontent.com/chrismattmann/geotopicparser-utils/master/geotopics/polar.geot)

The output should look like this:

[
   {
      "Content-Type":"application/geotopic",
      "Geographic_LATITUDE":"39.76",
      "Geographic_LONGITUDE":"-98.5",
      "Geographic_NAME":"United States",
      "Optional_LATITUDE1":"27.33931",
      "Optional_LONGITUDE1":"-108.60288",
      "Optional_NAME1":"China",
      "X-Parsed-By":[
         "org.apache.tika.parser.DefaultParser",
         "org.apache.tika.parser.geo.topic.GeoParser"
      ],
      "X-TIKA:parse_time_millis":"1634",
      "resourceName":"polar.geot"
   }
]
  6. Run the Django server:

    python manage.py runserver

  7. Open http://localhost:8000/ in your browser.

Note: please refer to the wiki page on this GitHub repository, which can act as a guide on how to use GeoParser.


Contributors

antrromet, aravindram, chrismattmann, danlamanna, dependabot[bot], lawongsta, mboustani, smadha


geoparser's Issues

Have Girder log in when the page loads

When the GeoParser app loads, have Girder log in so that the app can use Girder.
Username: girder
Password: girder

Base64 encoding: Z2lyZGVyOmdpcmRlcg==
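
For reference, that token is just the credential pair base64-encoded, as used in an HTTP Basic Authorization header:

    echo -n 'girder:girder' | base64
    # Z2lyZGVyOmdpcmRlcg==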

Show list of uploaded files on front-end

Uploaded files appear as soon as they are uploaded, but if the page is refreshed they are no longer shown.

Solution:
Have the server look in the uploads folder and send the list of files to the front-end.

Use Solr instance to store geoparsed data

Use Solr for storing and retrieving the geoparsed data.
One collection, called "Uploaded_Files", will store all geoparsed uploaded files.
The schema could be something close to:

file_name = <file_name>
extracted_text = <extracted_text>
location_names = [list of location names]
lat/lon = [list of location dictionaries (key: location name, value: lat/lon)]

Note: this schema may change as we learn more about Solr spatial queries. A sketch of a candidate document follows below.
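
A sketch of what one document in the "Uploaded_Files" collection might look like, using the field names proposed above (the file name and all values are purely illustrative):

    {
      "file_name": "report.pdf",
      "extracted_text": "full text extracted from the file",
      "location_names": ["Pasadena", "Texas"],
      "lat_lon": [
        {"Pasadena": [34.14778, -118.14452]},
        {"Texas": [31.25044, -99.25061]}
      ]
    }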

Geoparsing for uploaded file

The server should start the geoparsing process as soon as files are uploaded, and send the status back to the front-end.

Show more metadata on popups and link back to Solr doc

Need to show more data on popups so that individual bubbles can be analyzed.

  • "title": "Title of doc, fetched as configured"
  • "descr": "Short description, fetched as per configuration"
  • "url": "Link to html/image"
  • "solr_url": "Link to solr_doc"

Updated return_points URL

@smadha I have updated the "return_points" URL to work with both uploaded files and crawled data.
Now you can call return_points to get points from both.
However, you need to update the code that currently calls return_points; here is the update:

For uploaded files: http://localhost:8000/return_points/<file_name>/uploaded_files
Example: http://localhost:8000/return_points/5c0024-25.pdf/uploaded_files

For crawled data: http://localhost:8000/return_points/<solr_url>/<core_name>
Example: http://localhost:8000/return_points/http://crawl.dyndns.org/solr/domain

Please update both.

Need to modify Solr schema

As we are indexing millions of records, I can see a lot of issues with Solr.

We initially did a lot of handholding, editing data types and using encodings, but it is now constantly failing with OutOfMemoryError. I have tried increasing memory up to 1.5 GB, but that only buys us some extra time.

At the moment I am trying to create a new schema which addresses the limitations below:

  • Instead of one single huge document, we scale points across multiple documents.
  • We must be able to produce search results for a location. The query will be a LOCATION.

I am planning the following:

  • one core in the geoparser Solr instance for every index
  • one document in the geoparser Solr instance for each document in the index
  • one domain can cover multiple cores; the core syntax will be domain_[id]

Right now I am doing this only for indexed data, on a new branch, since uploaded files work without issues.

@MBoustani

Progress bar for uploaded file

The progress bar should show the status of the file being uploaded, as well as of the file being geoparsed, with status text underneath.

Merging CherryPy webserver with Girder

GeoParser at this stage uses two servers as its backend:
1. CherryPy as the web server
2. Girder for the file system and for running jobs

Besides the fact that running two servers at the same time increases the chance of application failure, there is also a cross-domain issue when CherryPy calls Girder on a different port; therefore the two can be merged into one, with Girder taking care of everything.

Scroll or collapse files in menu

Each file in the menu shows its results underneath; if the result list is long, it goes off screen.
Each file/result should be collapsible, and it should be possible to scroll up and down.

Support Crawled indexed data

Crawled data is usually indexed into either Solr or Elasticsearch.
GeoParser should be able to take the URL of either of these indexing machines plus a domain name, scan the whole index, and geoparse it.
The results (location names and points) will be stored internally in Solr, alongside the path to the crawled data.

Mock REST services

The APIs below need to be mocked (example calls follow the list):

[API signature], [Method],
[Sample Response]

  1. /upload POST
    Response code - 200
  2. /status/%file_id% GET

    {
      "name": "File Name",
      "status": "Message to be displayed to user",
      "stepCount": 4,
      "parsedInfo": [
        {
          "lat": -34.6037232,
          "lon": -58.3815931,
          "name": "Aires Argentina",
          "refCount": 10,
          "refContext": "Test line 4 in file uploaded",
          "refUrl": "https://geo1.ggpht.com/cbk?panoid=wkEz-Hwmc44EnMsE7SuXBw&output=thumbnail"
        },
        {
          "lat": 19.4302678,
          "lon": -99.1373136,
          "name": "Mexico City, Mexico",
          "refCount": 2,
          "refContext": "Test line 42 in file uploaded",
          "refUrl": "https://geo0.ggpht.com/cbk?output=thumbnail&thumb=2&panoid=3DKyddof6dWPw3tx5BULbQ&w=96&h=64&yaw=176"
        }
      ]
    }
  3. /search/index/%keyword% GET
     Response: same shape as the /status/%file_id% response above.
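
A sketch of exercising these mocks from the command line (the upload form-field name and the file_id format are assumptions, not part of the spec above):

    # upload a file to the mock endpoint (form field name assumed)
    curl -X POST -F "file=@/path/to/doc.pdf" http://localhost:8000/upload

    # poll parse status for an uploaded file (file_id format assumed)
    curl http://localhost:8000/status/doc.pdf

    # keyword search against the index
    curl http://localhost:8000/search/index/polar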

https://drive.google.com/open?id=1ASR0j0lzT8GqifZ0ep6WMBV9SaAOENPHUIqUrrR7dbo

GeoParser plugin for Girder

The GeoParser plugin for Girder can have multiple jobs running through Girder.
Each job can be called via a REST URL and will return its results as JSON.

Be able to remove uploaded files on front-end

After each file is uploaded, its name appears under the "upload file" section.
@smadha Can you please put a remove icon next to each file, or check with @lawongsta about how to remove a file, perhaps by sending a request to the server?
@smadha should we use a REST URL to send the remove command to the server for each file?

Server to use CherryPy instead of Flask

We have decided to use CherryPy for the server instead of Flask.
CherryPy is more stable, and since we are going to join with other Memex geo projects, we should use the same technologies they do.
