uw-cosmos / cosmos

Knowledge base construction from raw scientific documents

Python 94.41% Shell 1.05% Dockerfile 1.47% JavaScript 0.14% Jupyter Notebook 2.57% Mako 0.06% Makefile 0.09% Batchfile 0.10% Perl 0.12%

cosmos's People

Contributors

akshatabhat, ankur-gos, cambro, davenquinn, dependabot[bot], deven-biehler, ilmcconnell, iross, johnkn, jw-mcgrath, lizhelongs, mwestphall, paul841029, richardscottoz, ryansun117, sverma25, zifanl


cosmos's Issues

Ingest Pipeline fails with ''

Code: branch apiv1
Re-create: run on cosmos0003, with two gpus in dask cluster (spawn_dask_cluster_2_gpu.sh), run ingestion script (ingest_documents_timing.sh)
Result: Ingestion fails with error on this document: 5e78f996998e17af8265ef95.pdf

Error:

ERROR :: 2020-08-13 11:17:04,329 ::
Traceback (most recent call last):
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/ingest.py", line 198, in pdf_to_images
meta, limit = parse_pdf(filename)
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/utils/pdf_extractor.py", line 57, in parse_pdf
texts.append(line.get_text().strip())
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/utils/pdf_extractor.py", line 57, in parse_pdf
texts.append(line.get_text().strip())
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/bdb.py", line 88, in trace_dispatch
return self.dispatch_line(frame)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/bdb.py", line 113, in dispatch_line
if self.quitting: raise BdbQuit
bdb.BdbQuit
distributed.worker - WARNING - Compute Failed
Function: pdf_to_images
args: ('/hdd/iain/covid_docs_25Mar_all/5e78f996998e17af8265ef95.pdf')
kwargs: {}
Exception: Exception('Parsing error', '')
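
Since one malformed PDF can fail a whole run, an option worth considering (a sketch, not current behavior) is to wrap the per-document task so failures are quarantined and logged rather than propagated:

```python
import logging
import shutil
from pathlib import Path

from ingest.ingest import pdf_to_images  # module path per the traceback above

def safe_pdf_to_images(filename, quarantine_dir="quarantine"):
    """Wrap the pipeline's pdf_to_images so one bad PDF is set aside and
    logged instead of failing the whole Dask graph. The quarantine
    directory is a hypothetical addition, not existing behavior."""
    try:
        return pdf_to_images(filename)
    except Exception as e:
        logging.error("Skipping %s: %s", filename, e)
        Path(quarantine_dir).mkdir(exist_ok=True)
        shutil.move(filename, str(Path(quarantine_dir) / Path(filename).name))
        return None
```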

get context for returned objects

It would be super helpful to have an API route that took an extracted object id and an object type as arguments and returned spatially-adjacent objects of that type, for context.

Example: get the body text elements immediately above and below an equation.
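
A rough sketch of the proposed route (Flask used only for illustration; the route path, field names, and in-memory store are all hypothetical):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical stand-in for the real object store: each object has an id,
# a class, a page number, and a bbox (x0, y0, x1, y1), origin at top-left.
OBJECTS = [
    {"id": "eq1", "cls": "Equation", "page": 3, "bbox": (50, 300, 500, 350)},
    {"id": "t1", "cls": "Body Text", "page": 3, "bbox": (50, 100, 500, 290)},
    {"id": "t2", "cls": "Body Text", "page": 3, "bbox": (50, 360, 500, 600)},
]

@app.route("/api/object/<object_id>/context")
def object_context(object_id):
    obj_type = request.args.get("type", "Body Text")
    target = next(o for o in OBJECTS if o["id"] == object_id)
    same_page = [o for o in OBJECTS
                 if o["page"] == target["page"] and o["cls"] == obj_type]
    # "Above" and "below" by vertical bbox position.
    above = [o for o in same_page if o["bbox"][3] <= target["bbox"][1]]
    below = [o for o in same_page if o["bbox"][1] >= target["bbox"][3]]
    return jsonify({"id": object_id, "above": above, "below": below})
```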

Autobuild docs

Instead of having to run `make github` locally and commit the resultant docs/ folder, there should be a GitHub Action to do it for us.

Cleanup old cruft

There are still relics (docker-compose files, Dockerfiles, scripts) from previous workflows scattered throughout. We need a thorough cleaning to make sure there aren't vestigial pieces around to confuse users (sorry, @ilmcconnell !)

API: should return confidence for header objects

Children now return base_confidence and postprocessing_confidence keys. The header should also return these values.

Also, since the shape of the header and each child item should be the same, it would be easier for the frontend if the API response returned a nested signature {header: <extraction>, children: <extraction>[]} instead of the current spread of the header props {header_id, ..., children: <extraction>[]}.
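
For concreteness, a sketch of the two shapes (the example field values are placeholders):

```python
# Current: header props spread at the top level.
current = {
    "header_id": "abc123",
    "base_confidence": 0.91,
    "postprocessing_confidence": 0.88,
    "children": [
        {"id": "c1", "base_confidence": 0.85, "postprocessing_confidence": 0.80},
    ],
}

# Proposed: header and each child share one <extraction> shape.
proposed = {
    "header": {"id": "abc123", "base_confidence": 0.91,
               "postprocessing_confidence": 0.88},
    "children": [
        {"id": "c1", "base_confidence": 0.85, "postprocessing_confidence": 0.80},
    ],
}
```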

Ingest pipeline fails with ValueError: not enough values to unpack (expected 3, got 0)

Code: branch apiv1
Re-create: run on cosmos0003, with one gpu in dask cluster (spawn_dask_cluster.sh), run ingestion script (ingest_documents_timing.sh)
Result: Ingestion fails with error on these documents:
5a15d754cf58f10f6558336a.pdf
5e7b01ae998e17af82663153.pdf
5e7a522d998e17af82661c81.pdf
5e7a522d998e17af82661c82.pdf
5e7ac2a2998e17af82662b2f.pdf
5cd955ca0b45c76caf8922bc.pdf
5e7900aa998e17af8265f0c8.pdf
5e7900aa998e17af8265f0c7.pdf

It looks like these docs are mostly indexes and lists, not really fitting scientific-paper structure.

Dask Cluster Error:

Function: xgboost_postprocess
args: ('tmp/images/5a15d754cf58f10f6558336a.pdf_2.pkl')
kwargs: {}
Exception: ValueError('not enough values to unpack (expected 3, got 0)')

Python Error:

(cosmos) [imcconnell2@cosmos0003 Cosmos]$ bash cli/ingest_documents_timing.sh
DEBUG :: 2020-08-14 00:55:48,331 :: Using selector: EpollSelector
Traceback (most recent call last):
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/scripts/ingest_documents.py", line 43, in <module>
ingest_documents()
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/scripts/ingest_documents.py", line 39, in ingest_documents
aggregations=aggregation)
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/ingest.py", line 66, in ingest
aggregations=aggregations)
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/ingest.py", line 93, in _ingest_local
images = [i.result() for i in images]
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/ingest.py", line 93, in
images = [i.result() for i in images]
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/distributed/client.py", line 223, in result
raise exc.with_traceback(tb)
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/process_page.py", line 71, in xgboost_postprocess
objects = postprocess(dp.postprocess_model, dp.classes, objects)
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/process/postprocess/xgboost_model/inference.py", line 9, in run_inference
p_bb, _, texts = zip(*page_objs)
ValueError: not enough values to unpack (expected 3, got 0)
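
For what it's worth, the unpack fails because zip(*page_objs) yields nothing when a page has no detected objects, which is plausible for these index/list pages. A minimal guard along these lines (a sketch against the run_inference frame shown above; the signature is assumed) would let empty pages pass through:

```python
def run_inference(page_objs, model, classes):  # signature assumed
    # Index-like pages can yield zero detected objects, which makes
    # zip(*page_objs) produce nothing and the 3-way unpack fail.
    if not page_objs:
        return []
    p_bb, _, texts = zip(*page_objs)
    # ... rest of run_inference unchanged
```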

Segmentation for figure parts

It would be useful to parse figures internally by part, for example when there are parts A and B representing separate graphs within the same figure with one caption.

See second example in #82

Ingest Pipeline fails with 'can't concat int to bytes'

Code: branch apiv1
Re-create: run on cosmos0003, with two gpus in dask cluster (spawn_dask_cluster_2_gpu.sh), run ingestion script (ingest_documents_timing.sh)
Result: Ingestion fails with error on this document: 5e7a19df998e17af826615ac.pdf
Error:

ERROR :: 2020-08-11 17:56:01,387 :: can't concat int to bytes
Traceback (most recent call last):
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/ingest.py", line 198, in pdf_to_images
meta, limit = parse_pdf(filename)
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/utils/pdf_extractor.py", line 45, in parse_pdf
interpreter.process_page(page)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 895, in process_page
self.render_contents(page.resources, page.contents, ctm=ctm)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 908, in render_contents
self.execute(list_value(streams))
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 936, in execute
func()
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 506, in do_s
self.do_S()
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 499, in do_S
self.curpath)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/converter.py", line 115, in paint_path
pts.append(apply_matrix_pt(self.ctm, (p[i], p[i+1])))
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/utils.py", line 138, in apply_matrix_pt
return a * x + c * y + e, b * x + d * y + f
TypeError: can't concat int to bytes
distributed.worker - WARNING - Compute Failed
Function: pdf_to_images
args: ('/hdd/iain/covid_docs_25Mar_all/5e7a19df998e17af826615ac.pdf')
kwargs: {}
Exception: Exception('Parsing error', "can't concat int to bytes")

duplicated entities in response

It still seems that duplication of results is common. Example in the current interface at time of posting: search "infection rate" with defaults and go to equations. The first result is duplicated: same object, same reference.


Anserini-backed API returns invalid page number

The Anserini API returns page: 0 for every page of results, although the actual contents of the response do change. This confuses infinite-scroll pagination on the frontend.

stanford-corenlp-full-2018-10-05.zip link may not work

When http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip is not reachable, running docker-compose up will get stuck at
step 39/51: RUN wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip.

A workaround is to manually download the .zip file from the link above, unzip it, and place the contents in the same directory as the Dockerfile.

The download steps in the Dockerfile also need to be commented out:

USER user

# RUN wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip
# RUN unzip stanford-corenlp-full-2018-10-05.zip
# RUN rm stanford-corenlp-full-2018-10-05.zip

sort/filter options for anserini, ES search: pub date, journal, publisher

It would be very useful to be able to sort the response, currently returned in order of a combination of "confidence" and query matching, by other metadata. The big one would be publication date: it will be common for scientists to want to see the latest results first. Secondary filtering would be by journal. Publisher filtering would be a convenience for communication with publishers (mostly, though not exclusively; it could be useful for science too).
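
For concreteness, a sketch of what this could look like against Elasticsearch (elasticsearch-py 7.x style; the index name and the pub_date/journal/publisher fields are assumptions about the mapping, not the current schema):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # address is an assumption

body = {
    "query": {
        "bool": {
            "must": [{"match": {"content": "thermal conductivity"}}],
            # Filter by journal (field name assumed; "Geothermics" is
            # just an illustrative value).
            "filter": [{"term": {"journal.keyword": "Geothermics"}}],
        }
    },
    # Latest results first (assumes a pub_date date field in the mapping).
    "sort": [{"pub_date": {"order": "desc"}}],
}
results = es.search(index="cosmos", body=body)
```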

good segmentation errors to try and fix

Text normalization and cleanup

Text extracted from the PDF (or OCR) is currently used as-is, which propagates source inconsistencies into the COSMOS output. Normalization would improve recall and general utility.

To be added for context field:

Some images returned in API are not present in source papers.

I did a search within the new IODP set for COSMOS, through the xDD API, for the “Maastrichtian” geologic period, which abuts the K-T boundary. Results can be viewed here. This brings back a lot of good data, but one interesting thing is that, starting at about result #10 and continuing for 8 or so iterations, it appears we are returning duplicate figures, attributed to different papers. For at least one of the papers, the figure is not included in the PDF version of the paper.

Perhaps things are getting shuffled around in the pipeline, or there's something weird going on with how IODP journal articles are being split for ingestion.

Here are API links for some of the offending objects:
https://xdd.wisc.edu/sets/iodp/cosmos/api/object/a74d37aff204b26ba74ffc3abd472ed67ffc1486
https://xdd.wisc.edu/sets/iodp/cosmos/api/object/0fa4bdb330241854711b83be10bb453fda026851
https://xdd.wisc.edu/sets/iodp/cosmos/api/object/0b115ea35ed2f41828ea857a2aff4dadf5ebf262
https://xdd.wisc.edu/sets/iodp/cosmos/api/object/24a4846d2fd59f7544db53bc06d7cbc7dd261ac5

And here's the figure in question:
[figure: K-T-core]

Unknown runtime specified nvidia (using CPU)

Starting cosmos_redis_1           ... done
ERROR: for cosmos_cosmos_1  Cannot create container for service cosmos: Unknown runtime specified nvidia

ERROR: for cosmos  Cannot create container for service cosmos: Unknown runtime specified nvidia
ERROR: Encountered errors while bringing up the project.

If you encounter this error when trying to create and start the containers defined in docker-compose.yml, the file needs to be modified.

Comment out line 5 (runtime: nvidia) to fix the error and use the default runtime option:

  1 version: "2.3"
  2 services:
  3     cosmos:
  4         build: .
  5         runtime: nvidia
  6         ipc: host
  7         volumes:
  8             - .:/cosmos/
  9             - ${INPUT_DIR}:/input/
 10             - ${OUTPUT_DIR}:/output/
 11         command: "python run.py /input  -w torch_model/model_weights.pth -t 4 -o /output/ -d ${DEVICE} -k"

Quick evaluation results

Here is an assessment of the results COSMOS returned on the geothermal dataset (bigram model) for the search terms "thermal conductivity", "geochemistry", and "porosity", with the permalink to each success/failure included. Is there an ideal place to put this information?

table checks.xlsx

Using cosmos with a cpu

Hi!

I am trying to get cosmos running to extract text, tables, figures, etc. from PDFs by following the getting started instructions here. I would like to run the containers on a CPU, so I updated the .env file to include "-cpu" and switched every mention of DEVICE to 'cpu'. However, when running the docker-compose -f deployment/docker-compose-ingest.yml -p cosmos up command, I get the following error.

ERROR: pull access denied for uwcosmos/cosmos-base-cpu, repository does not exist or may require 'docker login': denied: requested access to the resource is denied

Do these images not exist, or am I making some Docker usage mistake? (I am a relative novice with Docker.)

Thank you!

Help flag for run.py does not accurately reflect arguments

To reproduce:

  1. Start docker instance and run:
python run.py -h

Output:

usage: run.py [-h] [--rawfolder RAWFOLDER] [--outputfolder OUTPUTFOLDER]

optional arguments:
  -h, --help            show this help message and exit
  --rawfolder RAWFOLDER
  --outputfolder OUTPUTFOLDER

Expected output:

An accurate reflection of the arguments at the top of run.py
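
For reference, a parser matching the flags run.py is actually invoked with elsewhere on this page (python run.py /input -w torch_model/model_weights.pth -t 4 -o /output/ -d ${DEVICE} -k) might look like this sketch; the long option names, defaults, and help strings are assumptions:

```python
import argparse

parser = argparse.ArgumentParser(description="COSMOS ingestion entry point")
parser.add_argument("rawfolder", help="directory of input PDFs")
parser.add_argument("-w", "--weights", help="path to model weights (.pth)")
parser.add_argument("-t", "--threads", type=int, default=1,
                    help="number of worker threads (assumed meaning)")
parser.add_argument("-o", "--outputfolder", help="directory for output")
parser.add_argument("-d", "--device", default="cpu",
                    help="compute device, e.g. cpu or cuda")
parser.add_argument("-k", action="store_true",
                    help="keep intermediate files (assumed meaning)")
args = parser.parse_args()
```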

One word in two lines

When a single word wraps across two lines, we currently keep only the coordinate information of the first bounding box.
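
A minimal sketch of the likely fix, assuming bboxes are (x0, y0, x1, y1) tuples: keep the union of both line fragments' boxes rather than only the first.

```python
def merge_bboxes(bbox_a, bbox_b):
    """Union of two (x0, y0, x1, y1) boxes, so a word wrapped across two
    lines keeps coordinates covering both fragments."""
    return (min(bbox_a[0], bbox_b[0]), min(bbox_a[1], bbox_b[1]),
            max(bbox_a[2], bbox_b[2]), max(bbox_a[3], bbox_b[3]))

# e.g. "micro-" at the end of one line and "biology" starting the next:
print(merge_bboxes((100, 700, 180, 712), (50, 686, 110, 698)))
# -> (50, 686, 180, 712)
```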

Feedback API

As we put more verticals into production, we need to start collecting more robust feedback so we can improve our search and segmentation models.

This can help us track segmentation failures, search failures, and other failures, and iterate.

Exiting with code 137

When running the docker image in the build model phase, it may exit with code 137.

If this happens, increase the memory available to Docker in its settings/preferences. The default is 2GB; increase it to at least 4GB, and if the issue persists, keep increasing it.

(visual of how to change memory in preferences below)
https://www.petefreitag.com/item/848.cfm

"Section" class seems to be the only label for body text

Using this response as guide:
https://xdddev.chtc.io/sets/xdd-covid-19/api/v2_beta/search?query=Remdesivir,MERS&inclusive=true&ignore_bytes=true

There are a ton of text blocks returned here with cls=section that are clearly not sections.

Trying to be explicit:
https://xdddev.chtc.io/sets/xdd-covid-19/api/v2_beta/search?query=Remdesivir,MERS&inclusive=true&ignore_bytes=true&type=Body%20Text

All of the cls values are still "Section".

Making the search more general:
https://xdddev.chtc.io/sets/xdd-covid-19/api/v2_beta/search?query=Remdesivir&inclusive=true&ignore_bytes=true
There are some cls="equation" objects that seem to be large text blocks too, along with abundant cls="Section".

Ingest Pipeline fails - 'PSKeyword' object is not iterable

Code: branch apiv1
Re-create: run on cosmos0003, with two gpus in dask cluster (spawn_dask_cluster_2_gpu.sh), run ingestion script (ingest_documents_timing.sh)
Result: Ingestion fails with error on these documents: 5e7a4b17998e17af82661ba6.pdf
593b59cdcf58f13177abb084.pdf
5e79e8ce998e17af82660f6f.pdf

Error:

ERROR :: 2020-08-12 13:05:58,334 :: 'PSKeyword' object is not iterable
Traceback (most recent call last):
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/ingest.py", line 198, in pdf_to_images
meta, limit = parse_pdf(filename)
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/utils/pdf_extractor.py", line 45, in parse_pdf
interpreter.process_page(page)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 895, in process_page
self.render_contents(page.resources, page.contents, ctm=ctm)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 908, in render_contents
self.execute(list_value(streams))
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 933, in execute
func(*args)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 803, in do_TJ
self.graphicstate.copy())
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfdevice.py", line 83, in render_string
graphicstate)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfdevice.py", line 96, in render_string_horizontal
for cid in font.decode(obj):
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdffont.py", line 523, in decode
return bytearray(bytes) # map(ord, bytes)
TypeError: 'PSKeyword' object is not iterable
distributed.worker - WARNING - Compute Failed
Function: pdf_to_images
args: ('/hdd/iain/covid_docs_25Mar_all/5e7a4b17998e17af82661ba6.pdf')
kwargs: {}
Exception: Exception('Parsing error', "'PSKeyword' object is not iterable")

multi-part figures often not merged

Much of the time, figures with multiple parts are segmented separately and not merged properly in post-processing, leaving multiple "figures" that cannot be matched to multiple "captions". This makes key figure parts impossible to retrieve (i.e., they have no text and are associated with no captions). It's also likely causing lower confidence on Figure proposals, because "chopped-up" figure and table elements are out of distribution (by definition, relative to our training data).

Revisiting the merging step for figures specifically is needed. Tables would also benefit from more sophisticated merging.
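
As a starting point for that revisit, a sketch of a proximity-based merge pass (the gap threshold and the stacked-panels assumption would need tuning):

```python
def maybe_merge(fig_a, fig_b, max_gap=25):
    """Merge two Figure bboxes (x0, y0, x1, y1) when they overlap
    horizontally and sit within max_gap pixels vertically, i.e. the
    typical layout of stacked multi-part figures. Returns the merged
    box, or None if the pair should stay separate."""
    ax0, ay0, ax1, ay1 = fig_a
    bx0, by0, bx1, by1 = fig_b
    h_overlap = min(ax1, bx1) - max(ax0, bx0) > 0
    # Vertical gap between the boxes (negative if they already overlap).
    v_gap = max(ay0, by0) - min(ay1, by1)
    if h_overlap and v_gap <= max_gap:
        return (min(ax0, bx0), min(ay0, by0), max(ax1, bx1), max(ay1, by1))
    return None
```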

Ingest pipeline fails with RecursionError

Code: branch apiv1
Re-create: run on cosmos0003, with two gpus in dask cluster (spawn_dask_cluster_2_gpu.sh), run ingestion script (ingest_documents_timing.sh)
Result: Ingestion fails with error on this document:
571f511a642f88805083527c.pdf

Error:

RecursionError: maximum recursion depth exceeded
distributed.worker - WARNING - Compute Failed
Function: pdf_to_images
args: ('/hdd/iain/covid_docs_25Mar_all/571f511a642f88805083527c.pdf')
kwargs: {}
Exception: Exception('Parsing error', 'maximum recursion depth exceeded')
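
pdfminer can recurse deeply on malformed or self-referential PDF object trees. One workaround to try (a sketch, with a guessed limit): temporarily raise the interpreter's recursion ceiling around parse_pdf and skip the document if it still blows up.

```python
import sys

from ingest.utils.pdf_extractor import parse_pdf  # path per the tracebacks above

def parse_pdf_deep(filename, limit=20000):
    """Temporarily raise the recursion limit for pathological PDFs, and
    give up on the document if it is still too deep."""
    old = sys.getrecursionlimit()
    sys.setrecursionlimit(limit)
    try:
        return parse_pdf(filename)
    except RecursionError:
        return None  # still too deep: skip this document
    finally:
        sys.setrecursionlimit(old)
```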

Obfuscate page_ids

Pages should have UUIDs (or similar) instead of an integer id. Right now, a user could trivially step through a range of page ids and recreate a full PDF.
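
The swap itself is small, e.g. with Python's uuid module:

```python
import uuid

# Instead of a guessable auto-increment integer id:
page_id = str(uuid.uuid4())
# e.g. '8c4f2f6e-0c6e-4f7e-9a6e-2b1f1a3d9b42': non-sequential,
# so neighboring pages can't be enumerated by incrementing the id.
```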

API route to summarize COSMOS vertical corpus

An API route containing details about the underlying corpus in a vertical set (e.g., COVID19) should be made available.
This route should include basic data size statistics as well as a breakdown of number of documents by publisher and journal title (when available). It would be wise to pre-compute this and store it in a lookup table that gets updated when new documents are added.
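
A sketch of the pre-computation step (the document field names are assumptions):

```python
from collections import Counter

def summarize_corpus(documents):
    """Build the lookup-table payload for a vertical's summary route.
    `documents` is an iterable of dicts with optional 'publisher',
    'journal', and 'n_bytes' keys (names are assumptions)."""
    summary = {
        "n_documents": 0,
        "total_bytes": 0,
        "by_publisher": Counter(),
        "by_journal": Counter(),
    }
    for doc in documents:
        summary["n_documents"] += 1
        summary["total_bytes"] += doc.get("n_bytes", 0)
        summary["by_publisher"][doc.get("publisher", "unknown")] += 1
        summary["by_journal"][doc.get("journal", "unknown")] += 1
    # Recompute (or update incrementally) when new documents are added.
    return summary
```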

Add augmentation for rotating tables.

One augmentation to consider is rotating docs. We might want to rotate only docs that contain just a table or just a figure, as those are the rotations most likely to occur in the wild. E.g.:

[attached image: example of a rotated table]
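
A sketch of such an augmentation with Pillow (the right-angle set and the table/figure-only gating are assumptions):

```python
import random

from PIL import Image

def rotate_page(path, angles=(90, 180, 270)):
    """Rotate a page image by a random right angle, padding with white.
    Intended only for pages containing just a table or just a figure;
    the corresponding annotation boxes would need the same transform."""
    img = Image.open(path)
    angle = random.choice(angles)
    # expand=True grows the canvas so nothing is cropped off.
    return img.rotate(angle, expand=True, fillcolor="white")
```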

Entity discovery

Using this issue to track an upcoming update where entities are discovered and clustered at the corpus level, with support for linking to existing knowledge bases.

Leverage in-text table mentions to expand context recall

(lots of development already done on this, just writing up a quick issue for tracking purposes)

Executive summary

COSMOS extracts tables, figures, and their associated mentions. However, in-text mentions of these objects could expand contextual windows for these, or provide context for cases where caption association is missed.
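
For context, the core of the mention-detection step looks something like this sketch (the regex and window size are assumptions, not the code already written):

```python
import re

MENTION = re.compile(r"\b(Table|Figure|Fig\.?)\s+(\d+[A-Za-z]?)", re.IGNORECASE)

def find_mentions(text, window=200):
    """Find in-text references like 'Table 3' or 'Fig. 2a' and return the
    surrounding context window for each, to attach to the object."""
    hits = []
    for m in MENTION.finditer(text):
        start, end = max(0, m.start() - window), m.end() + window
        hits.append({"label": m.group(0), "context": text[start:end]})
    return hits
```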

@ilmcconnell could you write up a brief summary of the current status and the lingering items before it's ready for a final release?
