uw-cosmos / cosmos

Knowledge base construction from raw scientific documents

Python 94.41% Shell 1.05% Dockerfile 1.47% JavaScript 0.14% Jupyter Notebook 2.57% Mako 0.06% Makefile 0.09% Batchfile 0.10% Perl 0.12%

cosmos's People

Contributors

akshatabhat, ankur-gos, cambro, davenquinn, dependabot[bot], deven-biehler, ilmcconnell, iross, johnkn, jw-mcgrath, lizhelongs, mwestphall, paul841029, richardscottoz, ryansun117, sverma25, zifanl


cosmos's Issues

Ingest Pipeline fails with ''

Code: branch apiv1
Re-create: run on cosmos0003, with two gpus in dask cluster (spawn_dask_cluster_2_gpu.sh), run ingestion script (ingest_documents_timing.sh)
Result: Ingestion fails with error on this document: 5e78f996998e17af8265ef95.pdf

Error:

ERROR :: 2020-08-13 11:17:04,329 ::
Traceback (most recent call last):
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/ingest.py", line 198, in pdf_to_images
meta, limit = parse_pdf(filename)
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/utils/pdf_extractor.py", line 57, in parse_pdf
texts.append(line.get_text().strip())
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/utils/pdf_extractor.py", line 57, in parse_pdf
texts.append(line.get_text().strip())
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/bdb.py", line 88, in trace_dispatch
return self.dispatch_line(frame)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/bdb.py", line 113, in dispatch_line
if self.quitting: raise BdbQuit
bdb.BdbQuit
distributed.worker - WARNING - Compute Failed
Function: pdf_to_images
args: ('/hdd/iain/covid_docs_25Mar_all/5e78f996998e17af8265ef95.pdf')
kwargs: {}
Exception: Exception('Parsing error', '')
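
Since one malformed PDF can fail a whole run, an option worth considering (a sketch, not current behavior) is to wrap the per-document task so failures are quarantined and logged rather than propagated:

```python
import logging
import shutil
from pathlib import Path

from ingest.ingest import pdf_to_images  # module path per the traceback above

def safe_pdf_to_images(filename, quarantine_dir="quarantine"):
    """Wrap the pipeline's pdf_to_images so one bad PDF is set aside and
    logged instead of failing the whole Dask graph. The quarantine
    directory is a hypothetical addition, not existing behavior."""
    try:
        return pdf_to_images(filename)
    except Exception as e:
        logging.error("Skipping %s: %s", filename, e)
        Path(quarantine_dir).mkdir(exist_ok=True)
        shutil.move(filename, str(Path(quarantine_dir) / Path(filename).name))
        return None
```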

get context for returned objects

It would be super helpful to have an API route that took an extracted object id and an object type as arguments and returned spatially-adjacent objects of that type, for context.

Example: get the body text elements immediately above and below an equation.
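
A rough sketch of the proposed route (Flask used only for illustration; the route path, field names, and in-memory store are all hypothetical):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical stand-in for the real object store: each object has an id,
# a class, a page number, and a bbox (x0, y0, x1, y1), origin at top-left.
OBJECTS = [
    {"id": "eq1", "cls": "Equation", "page": 3, "bbox": (50, 300, 500, 350)},
    {"id": "t1", "cls": "Body Text", "page": 3, "bbox": (50, 100, 500, 290)},
    {"id": "t2", "cls": "Body Text", "page": 3, "bbox": (50, 360, 500, 600)},
]

@app.route("/api/object/<object_id>/context")
def object_context(object_id):
    obj_type = request.args.get("type", "Body Text")
    target = next(o for o in OBJECTS if o["id"] == object_id)
    same_page = [o for o in OBJECTS
                 if o["page"] == target["page"] and o["cls"] == obj_type]
    # "Above" and "below" by vertical bbox position.
    above = [o for o in same_page if o["bbox"][3] <= target["bbox"][1]]
    below = [o for o in same_page if o["bbox"][1] >= target["bbox"][3]]
    return jsonify({"id": object_id, "above": above, "below": below})
```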

Autobuild docs

Instead of having to run `make github` locally and commit the resultant docs/ folder, there should be a GitHub Action to do it for us.

Cleanup old cruft

There are still relics (docker-compose files, Dockerfiles, scripts) from previous workflows scattered throughout. We need a thorough cleaning to make sure there aren't vestigial pieces around to confuse users (sorry, @ilmcconnell !)

API: should return confidence for header objects

Children now return base_confidence and postprocessing_confidence keys. The header should also return these values.

Also, since the shape of the header and each child item should be the same, it would be easier for the frontend if the API response returned a nested signature {header: <extraction>, children: <extraction>[]} instead of the current spread of the header props {header_id, ..., children: <extraction>[]}.
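
For concreteness, a sketch of the two shapes (the example field values are placeholders):

```python
# Current: header props spread at the top level.
current = {
    "header_id": "abc123",
    "base_confidence": 0.91,
    "postprocessing_confidence": 0.88,
    "children": [
        {"id": "c1", "base_confidence": 0.85, "postprocessing_confidence": 0.80},
    ],
}

# Proposed: header and each child share one <extraction> shape.
proposed = {
    "header": {"id": "abc123", "base_confidence": 0.91,
               "postprocessing_confidence": 0.88},
    "children": [
        {"id": "c1", "base_confidence": 0.85, "postprocessing_confidence": 0.80},
    ],
}
```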

Ingest pipeline fails with ValueError: not enough values to unpack (expected 3, got 0)

Code: branch apiv1
Re-create: run on cosmos0003, with one gpu in dask cluster (spawn_dask_cluster.sh), run ingestion script (ingest_documents_timing.sh)
Result: Ingestion fails with error on these documents:
5a15d754cf58f10f6558336a.pdf
5e7b01ae998e17af82663153.pdf
5e7a522d998e17af82661c81.pdf
5e7a522d998e17af82661c82.pdf
5e7ac2a2998e17af82662b2f.pdf
5cd955ca0b45c76caf8922bc.pdf
5e7900aa998e17af8265f0c8.pdf
5e7900aa998e17af8265f0c7.pdf

It looks like these docs are mostly indexes and lists, not really fitting scientific-paper structure.

Dask Cluster Error:

Function: xgboost_postprocess
args: ('tmp/images/5a15d754cf58f10f6558336a.pdf_2.pkl')
kwargs: {}
Exception: ValueError('not enough values to unpack (expected 3, got 0)')

Python Error:

(cosmos) [imcconnell2@cosmos0003 Cosmos]$ bash cli/ingest_documents_timing.sh
DEBUG :: 2020-08-14 00:55:48,331 :: Using selector: EpollSelector
Traceback (most recent call last):
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/scripts/ingest_documents.py", line 43, in <module>
ingest_documents()
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/scripts/ingest_documents.py", line 39, in ingest_documents
aggregations=aggregation)
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/ingest.py", line 66, in ingest
aggregations=aggregations)
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/ingest.py", line 93, in _ingest_local
images = [i.result() for i in images]
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/ingest.py", line 93, in
images = [i.result() for i in images]
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/distributed/client.py", line 223, in result
raise exc.with_traceback(tb)
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/process_page.py", line 71, in xgboost_postprocess
objects = postprocess(dp.postprocess_model, dp.classes, objects)
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/process/postprocess/xgboost_model/inference.py", line 9, in run_inference
p_bb, _, texts = zip(*page_objs)
ValueError: not enough values to unpack (expected 3, got 0)
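
For what it's worth, the unpack fails because zip(*page_objs) yields nothing when a page has no detected objects, which is plausible for these index/list pages. A minimal guard along these lines (a sketch against the run_inference frame shown above; the signature is assumed) would let empty pages pass through:

```python
def run_inference(page_objs, model, classes):  # signature assumed
    # Index-like pages can yield zero detected objects, which makes
    # zip(*page_objs) produce nothing and the 3-way unpack fail.
    if not page_objs:
        return []
    p_bb, _, texts = zip(*page_objs)
    # ... rest of run_inference unchanged
```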

Segmentation for figure parts

It would be useful to parse figures internally by part, for example when there are parts A and B representing separate graphs within the same figure with one caption.

See second example in #82

Ingest Pipeline fails with 'can't concat int to bytes'

Code: branch apiv1
Re-create: run on cosmos0003, with two gpus in dask cluster (spawn_dask_cluster_2_gpu.sh), run ingestion script (ingest_documents_timing.sh)
Result: Ingestion fails with error on this document: 5e7a19df998e17af826615ac.pdf
Error:

ERROR :: 2020-08-11 17:56:01,387 :: can't concat int to bytes
Traceback (most recent call last):
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/ingest.py", line 198, in pdf_to_images
meta, limit = parse_pdf(filename)
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/utils/pdf_extractor.py", line 45, in parse_pdf
interpreter.process_page(page)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 895, in process_page
self.render_contents(page.resources, page.contents, ctm=ctm)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 908, in render_contents
self.execute(list_value(streams))
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 936, in execute
func()
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 506, in do_s
self.do_S()
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 499, in do_S
self.curpath)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/converter.py", line 115, in paint_path
pts.append(apply_matrix_pt(self.ctm, (p[i], p[i+1])))
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/utils.py", line 138, in apply_matrix_pt
return a * x + c * y + e, b * x + d * y + f
TypeError: can't concat int to bytes
distributed.worker - WARNING - Compute Failed
Function: pdf_to_images
args: ('/hdd/iain/covid_docs_25Mar_all/5e7a19df998e17af826615ac.pdf')
kwargs: {}
Exception: Exception('Parsing error', "can't concat int to bytes")

duplicated entities in response

It still seems that duplication of results is common. Example in the current interface at time of posting: search "infection rate" with defaults and go to equations. The first result is duplicated: same object, same reference.


Anserini-backed API returns invalid page number

The Anserini API returns page: 0 for every page of results, although the actual contents of the response do change. This confuses infinite-scroll pagination on the frontend.

stanford-corenlp-full-2018-10-05.zip link may not work

When http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip is not reachable, running docker-compose up will get stuck at
step 39/51: RUN wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip.

A workaround is to manually download the .zip file from the link above, unzip it, and place the contents in the same directory as the Dockerfile.

The download steps in the Dockerfile also need to be commented out:

USER user

# RUN wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip
# RUN unzip stanford-corenlp-full-2018-10-05.zip
# RUN rm stanford-corenlp-full-2018-10-05.zip

sort/filter options for anserini, ES search: pub date, journal, publisher

It would be very useful to be able to sort the response, currently returned in order of a combination of "confidence" and query matching, by other metadata. The big one would be publication date: it will be common for scientists to want to see the latest results first. Secondary filtering would be by journal. Publisher filtering would be a convenience for communication with publishers (mostly, though not exclusively; it could be useful for science too).
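
For concreteness, a sketch of what this could look like against Elasticsearch (elasticsearch-py 7.x style; the index name and the pub_date/journal/publisher fields are assumptions about the mapping, not the current schema):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # address is an assumption

body = {
    "query": {
        "bool": {
            "must": [{"match": {"content": "thermal conductivity"}}],
            # Filter by journal (field name assumed; "Geothermics" is
            # just an illustrative value).
            "filter": [{"term": {"journal.keyword": "Geothermics"}}],
        }
    },
    # Latest results first (assumes a pub_date date field in the mapping).
    "sort": [{"pub_date": {"order": "desc"}}],
}
results = es.search(index="cosmos", body=body)
```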

good segmentation errors to try and fix

Text normalization and cleanup

Text extracted from the PDF (or OCR) is currently used as-is, which propagates source inconsistencies into the COSMOS output. Normalization would improve recall and general utility.

To be added for context field:

Some images returned in API are not present in source papers.

I did a search within the new IODP set for COSMOS, through the xDD API, for the “Maastrichtian” geologic period, which abuts the K-T boundary. Results can be viewed here. This brings back a lot of good data, but one interesting thing is that, starting at about result #10 and continuing for 8 or so iterations, it appears we are returning duplicate figures, attributed to different papers. For at least one of the papers, the figure is not included in the PDF version of the paper.

Perhaps things are getting shuffled around in the pipeline, or there's something weird going on with how IODP journal articles are being split for ingestion.

Here are API links for some of the offending objects:
https://xdd.wisc.edu/sets/iodp/cosmos/api/object/a74d37aff204b26ba74ffc3abd472ed67ffc1486
https://xdd.wisc.edu/sets/iodp/cosmos/api/object/0fa4bdb330241854711b83be10bb453fda026851
https://xdd.wisc.edu/sets/iodp/cosmos/api/object/0b115ea35ed2f41828ea857a2aff4dadf5ebf262
https://xdd.wisc.edu/sets/iodp/cosmos/api/object/24a4846d2fd59f7544db53bc06d7cbc7dd261ac5

And here's the figure in question:
[figure: K-T-core]

Unknown runtime specified nvidia (using CPU)

Starting cosmos_redis_1           ... done
ERROR: for cosmos_cosmos_1  Cannot create container for service cosmos: Unknown runtime specified nvidia

ERROR: for cosmos  Cannot create container for service cosmos: Unknown runtime specified nvidia
ERROR: Encountered errors while bringing up the project.

If you encounter this error when trying to create and start the containers defined in docker-compose.yml, the file needs to be modified.

Comment out line 5 (runtime: nvidia) to fix the error and use the default runtime option:

  1 version: "2.3"
  2 services:
  3     cosmos:
  4         build: .
  5         runtime: nvidia
  6         ipc: host
  7         volumes:
  8             - .:/cosmos/
  9             - ${INPUT_DIR}:/input/
 10             - ${OUTPUT_DIR}:/output/
 11         command: "python run.py /input  -w torch_model/model_weights.pth -t 4 -o /output/ -d ${DEVICE} -k"

Quick evaluation results

Here is an assessment of the results COSMOS returned on the geothermal dataset (bigram model) for the search terms "thermal conductivity", "geochemistry", and "porosity", with the permalink to each success/failure included. Is there an ideal place to put this information?

table checks.xlsx

Using cosmos with a cpu

Hi!

I am trying to get cosmos running to extract text, tables, figures, etc. from PDFs by following the getting started instructions here. I would like to run the containers on a CPU, so I updated the .env file to include "-cpu" and switched every mention of DEVICE to 'cpu'. However, when running the docker-compose -f deployment/docker-compose-ingest.yml -p cosmos up command, I get the following error.

ERROR: pull access denied for uwcosmos/cosmos-base-cpu, repository does not exist or may require 'docker login': denied: requested access to the resource is denied

Do these images not exist, or am I making some Docker usage mistake? (I am a relative novice with Docker.)

Thank you!

Help flag for run.py does not accurately reflect arguments

To reproduce:

  1. Start docker instance and run:
python run.py -h

Output:

usage: run.py [-h] [--rawfolder RAWFOLDER] [--outputfolder OUTPUTFOLDER]

optional arguments:
  -h, --help            show this help message and exit
  --rawfolder RAWFOLDER
  --outputfolder OUTPUTFOLDER

Expected output:

An accurate reflection of the arguments at the top of run.py
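
For reference, a parser matching the flags run.py is actually invoked with elsewhere on this page (python run.py /input -w torch_model/model_weights.pth -t 4 -o /output/ -d ${DEVICE} -k) might look like this sketch; the long option names, defaults, and help strings are assumptions:

```python
import argparse

parser = argparse.ArgumentParser(description="COSMOS ingestion entry point")
parser.add_argument("rawfolder", help="directory of input PDFs")
parser.add_argument("-w", "--weights", help="path to model weights (.pth)")
parser.add_argument("-t", "--threads", type=int, default=1,
                    help="number of worker threads (assumed meaning)")
parser.add_argument("-o", "--outputfolder", help="directory for output")
parser.add_argument("-d", "--device", default="cpu",
                    help="compute device, e.g. cpu or cuda")
parser.add_argument("-k", action="store_true",
                    help="keep intermediate files (assumed meaning)")
args = parser.parse_args()
```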

One word in two lines

When a single word wraps across two lines, we currently keep only the coordinate information of the first bounding box.
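
A minimal sketch of the likely fix, assuming bboxes are (x0, y0, x1, y1) tuples: keep the union of both line fragments' boxes rather than only the first.

```python
def merge_bboxes(bbox_a, bbox_b):
    """Union of two (x0, y0, x1, y1) boxes, so a word wrapped across two
    lines keeps coordinates covering both fragments."""
    return (min(bbox_a[0], bbox_b[0]), min(bbox_a[1], bbox_b[1]),
            max(bbox_a[2], bbox_b[2]), max(bbox_a[3], bbox_b[3]))

# e.g. "micro-" at the end of one line and "biology" starting the next:
print(merge_bboxes((100, 700, 180, 712), (50, 686, 110, 698)))
# -> (50, 686, 180, 712)
```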

Feedback API

As we put more verticals into production, we need to start collecting more robust feedback so we can improve our search and segmentation models.

This can help us track segmentation failures, search failures, and other failures, and iterate.

Exiting with code 137

When running the docker image in the build model phase, it may exit with code 137.

If this happens, increase the memory available to Docker in its settings/preferences. The default is 2GB; increase it to at least 4GB, and if the issue persists, keep increasing it.

(visual of how to change memory in preferences below)
https://www.petefreitag.com/item/848.cfm

"Section" class seems to be the only label for body text

Using this response as guide:
https://xdddev.chtc.io/sets/xdd-covid-19/api/v2_beta/search?query=Remdesivir,MERS&inclusive=true&ignore_bytes=true

There are a ton of text blocks returned here with cls=section that are clearly not sections.

Trying to be explicit:
https://xdddev.chtc.io/sets/xdd-covid-19/api/v2_beta/search?query=Remdesivir,MERS&inclusive=true&ignore_bytes=true&type=Body%20Text

All of the cls values are still "Section".

Making the search more general:
https://xdddev.chtc.io/sets/xdd-covid-19/api/v2_beta/search?query=Remdesivir&inclusive=true&ignore_bytes=true
There are some cls="equation" objects that seem to be large text blocks too, along with abundant cls="Section".

Ingest Pipeline fails - 'PSKeyword' object is not iterable

Code: branch apiv1
Re-create: run on cosmos0003, with two gpus in dask cluster (spawn_dask_cluster_2_gpu.sh), run ingestion script (ingest_documents_timing.sh)
Result: Ingestion fails with error on these documents: 5e7a4b17998e17af82661ba6.pdf
593b59cdcf58f13177abb084.pdf
5e79e8ce998e17af82660f6f.pdf

Error:

ERROR :: 2020-08-12 13:05:58,334 :: 'PSKeyword' object is not iterable
Traceback (most recent call last):
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/ingest.py", line 198, in pdf_to_images
meta, limit = parse_pdf(filename)
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/utils/pdf_extractor.py", line 45, in parse_pdf
interpreter.process_page(page)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 895, in process_page
self.render_contents(page.resources, page.contents, ctm=ctm)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 908, in render_contents
self.execute(list_value(streams))
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 933, in execute
func(*args)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 803, in do_TJ
self.graphicstate.copy())
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfdevice.py", line 83, in render_string
graphicstate)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfdevice.py", line 96, in render_string_horizontal
for cid in font.decode(obj):
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdffont.py", line 523, in decode
return bytearray(bytes) # map(ord, bytes)
TypeError: 'PSKeyword' object is not iterable
distributed.worker - WARNING - Compute Failed
Function: pdf_to_images
args: ('/hdd/iain/covid_docs_25Mar_all/5e7a4b17998e17af82661ba6.pdf')
kwargs: {}
Exception: Exception('Parsing error', "'PSKeyword' object is not iterable")

multi-part figures often not merged

Much of the time, figures with multiple parts are segmented separately and not merged properly in post-processing, leaving multiple "figures" that cannot be matched to multiple "captions". This makes key figure parts impossible to retrieve (i.e., they have no text and are associated with no captions). It's also likely causing lower confidence on Figure proposals, because "chopped-up" figure and table elements are out of distribution (by definition, relative to our training data).

Revisiting the merging step for figures specifically is needed. Tables would also benefit from more sophisticated merging.
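
As a starting point for that revisit, a sketch of a proximity-based merge pass (the gap threshold and the stacked-panels assumption would need tuning):

```python
def maybe_merge(fig_a, fig_b, max_gap=25):
    """Merge two Figure bboxes (x0, y0, x1, y1) when they overlap
    horizontally and sit within max_gap pixels vertically, i.e. the
    typical layout of stacked multi-part figures. Returns the merged
    box, or None if the pair should stay separate."""
    ax0, ay0, ax1, ay1 = fig_a
    bx0, by0, bx1, by1 = fig_b
    h_overlap = min(ax1, bx1) - max(ax0, bx0) > 0
    # Vertical gap between the boxes (negative if they already overlap).
    v_gap = max(ay0, by0) - min(ay1, by1)
    if h_overlap and v_gap <= max_gap:
        return (min(ax0, bx0), min(ay0, by0), max(ax1, bx1), max(ay1, by1))
    return None
```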

Ingest pipeline fails with RecursionError

Code: branch apiv1
Re-create: run on cosmos0003, with two gpus in dask cluster (spawn_dask_cluster_2_gpu.sh), run ingestion script (ingest_documents_timing.sh)
Result: Ingestion fails with error on this document:
571f511a642f88805083527c.pdf

Error:

RecursionError: maximum recursion depth exceeded
distributed.worker - WARNING - Compute Failed
Function: pdf_to_images
args: ('/hdd/iain/covid_docs_25Mar_all/571f511a642f88805083527c.pdf')
kwargs: {}
Exception: Exception('Parsing error', 'maximum recursion depth exceeded')
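
pdfminer can recurse deeply on malformed or self-referential PDF object trees. One workaround to try (a sketch, with a guessed limit): temporarily raise the interpreter's recursion ceiling around parse_pdf and skip the document if it still blows up.

```python
import sys

from ingest.utils.pdf_extractor import parse_pdf  # path per the tracebacks above

def parse_pdf_deep(filename, limit=20000):
    """Temporarily raise the recursion limit for pathological PDFs, and
    give up on the document if it is still too deep."""
    old = sys.getrecursionlimit()
    sys.setrecursionlimit(limit)
    try:
        return parse_pdf(filename)
    except RecursionError:
        return None  # still too deep: skip this document
    finally:
        sys.setrecursionlimit(old)
```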

Obfuscate page_ids

Pages should have UUIDs (or similar) instead of an integer id. Right now, a user could trivially step through a range of page ids and recreate a full PDF.
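
The swap itself is small, e.g. with Python's uuid module:

```python
import uuid

# Instead of a guessable auto-increment integer id:
page_id = str(uuid.uuid4())
# e.g. '8c4f2f6e-0c6e-4f7e-9a6e-2b1f1a3d9b42': non-sequential,
# so neighboring pages can't be enumerated by incrementing the id.
```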

API route to summarize COSMOS vertical corpus

An API route containing details about the underlying corpus in a vertical set (e.g., COVID19) should be made available.
This route should include basic data size statistics as well as a breakdown of number of documents by publisher and journal title (when available). It would be wise to pre-compute this and store it in a lookup table that gets updated when new documents are added.
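
A sketch of the pre-computation step (the document field names are assumptions):

```python
from collections import Counter

def summarize_corpus(documents):
    """Build the lookup-table payload for a vertical's summary route.
    `documents` is an iterable of dicts with optional 'publisher',
    'journal', and 'n_bytes' keys (names are assumptions)."""
    summary = {
        "n_documents": 0,
        "total_bytes": 0,
        "by_publisher": Counter(),
        "by_journal": Counter(),
    }
    for doc in documents:
        summary["n_documents"] += 1
        summary["total_bytes"] += doc.get("n_bytes", 0)
        summary["by_publisher"][doc.get("publisher", "unknown")] += 1
        summary["by_journal"][doc.get("journal", "unknown")] += 1
    # Recompute (or update incrementally) when new documents are added.
    return summary
```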

Add augmentation for rotating tables.

One augmentation to consider is rotating docs. We might want to rotate only docs that contain just a table or just a figure, as those are the rotations most likely to occur in the wild. E.g.:

[attached image: example of a rotated table]
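
A sketch of such an augmentation with Pillow (the right-angle set and the table/figure-only gating are assumptions):

```python
import random

from PIL import Image

def rotate_page(path, angles=(90, 180, 270)):
    """Rotate a page image by a random right angle, padding with white.
    Intended only for pages containing just a table or just a figure;
    the corresponding annotation boxes would need the same transform."""
    img = Image.open(path)
    angle = random.choice(angles)
    # expand=True grows the canvas so nothing is cropped off.
    return img.rotate(angle, expand=True, fillcolor="white")
```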

Entity discovery

Using this issue to track an upcoming update where entities are discovered and clustered at the corpus level, with support for linking to existing knowledge bases.

Leverage in-text table mentions to expand context recall

(lots of development already done on this, just writing up a quick issue for tracking purposes)

Executive summary

COSMOS extracts tables, figures, and their associated mentions. However, in-text mentions of these objects could expand contextual windows for these, or provide context for cases where caption association is missed.
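
For context, the core of the mention-detection step looks something like this sketch (the regex and window size are assumptions, not the code already written):

```python
import re

MENTION = re.compile(r"\b(Table|Figure|Fig\.?)\s+(\d+[A-Za-z]?)", re.IGNORECASE)

def find_mentions(text, window=200):
    """Find in-text references like 'Table 3' or 'Fig. 2a' and return the
    surrounding context window for each, to attach to the object."""
    hits = []
    for m in MENTION.finditer(text):
        start, end = max(0, m.start() - window), m.end() + window
        hits.append({"label": m.group(0), "context": text[start:end]})
    return hits
```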

@ilmcconnell could you write up a brief summary of the current status and the lingering items before it's ready for a final release?
