uw-cosmos / cosmos
Knowledge base construction from raw scientific documents
Code: branch apiv1
Re-create: run on cosmos0003 with two GPUs in the dask cluster (spawn_dask_cluster_2_gpu.sh), then run the ingestion script (ingest_documents_timing.sh)
Result: Ingestion fails with error on this document: 5e78f996998e17af8265ef95.pdf
Error:
ERROR :: 2020-08-13 11:17:04,329 ::
Traceback (most recent call last):
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/ingest.py", line 198, in pdf_to_images
meta, limit = parse_pdf(filename)
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/utils/pdf_extractor.py", line 57, in parse_pdf
texts.append(line.get_text().strip())
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/utils/pdf_extractor.py", line 57, in parse_pdf
texts.append(line.get_text().strip())
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/bdb.py", line 88, in trace_dispatch
return self.dispatch_line(frame)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/bdb.py", line 113, in dispatch_line
if self.quitting: raise BdbQuit
bdb.BdbQuit
distributed.worker - WARNING - Compute Failed
Function: pdf_to_images
args: ('/hdd/iain/covid_docs_25Mar_all/5e78f996998e17af8265ef95.pdf')
kwargs: {}
Exception: Exception('Parsing error', '')
It would be super helpful to have an API route that took an extracted object id as argument and type as argument and returned spatially-adjacent objects matching type for context.
Example: get body text elements immediately above and below equation.
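A minimal, framework-agnostic sketch of what such a lookup could do; the object shape here (`id`, `cls`, `page`, and a `(x0, y0, x1, y1)` `bbox`) is an assumption, not the actual COSMOS schema:

```python
def spatially_adjacent(objects, target_id, cls):
    """Return the nearest object of class `cls` above and below the target,
    on the same page, based on vertical bbox position."""
    target = next(o for o in objects if o["id"] == target_id)
    same_page = [o for o in objects
                 if o["page"] == target["page"] and o["cls"] == cls and o["id"] != target_id]
    ty0, ty1 = target["bbox"][1], target["bbox"][3]
    above = [o for o in same_page if o["bbox"][3] <= ty0]  # fully above the target
    below = [o for o in same_page if o["bbox"][1] >= ty1]  # fully below the target
    return {
        "above": max(above, key=lambda o: o["bbox"][3], default=None),
        "below": min(below, key=lambda o: o["bbox"][1], default=None),
    }
```

Called with `cls="Body Text"` and an equation's id, this would return the body-text blocks immediately above and below the equation on the same page.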
Seen on this page with a few figures. I am uncertain how common this is, but it seems to be limited to cases with two figures on the same page.
Instead of having to run `make github` and commit the resultant `docs/` folder, there must be a GitHub Action to do it for us.
There are still relics (docker-compose files, Dockerfiles, scripts) from previous workflows scattered throughout. We need a thorough cleaning to make sure there aren't vestigial pieces around to confuse users (sorry, @ilmcconnell!)
Children now return `base_confidence` and `postprocessing_confidence` keys. The header should also return these values.
Also, since the shape of the header and each child item should be the same, it would be easier for the frontend if the API response returned a nested signature `{header: <extraction>, children: <extraction>[]}` instead of the current spread of the header props `{header_id, ..., children: <extraction>[]}`.
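To make the proposed shape concrete, a small sketch in Python; only the two confidence keys come from this issue, the other extraction fields are placeholders:

```python
def build_response(header, children):
    """Proposed nested shape: the header is a full extraction object with
    the same fields as each child, instead of header props spread at the
    top level of the response."""
    return {"header": header, "children": children}

def extraction(obj_id):
    """Hypothetical extraction shape; only the confidence keys are from the issue."""
    return {
        "id": obj_id,
        "content": "...",
        "base_confidence": 0.93,
        "postprocessing_confidence": 0.88,
    }

response = build_response(extraction("h1"), [extraction("c1"), extraction("c2")])
```

With this shape the frontend can render the header and each child with the same component, since both are `<extraction>` objects.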
Code: branch apiv1
Re-create: run on cosmos0003 with one GPU in the dask cluster (spawn_dask_cluster.sh), then run the ingestion script (ingest_documents_timing.sh)
Result: Ingestion fails with error on these documents:
5a15d754cf58f10f6558336a.pdf
5e7b01ae998e17af82663153.pdf
5e7a522d998e17af82661c81.pdf
5e7a522d998e17af82661c82.pdf
5e7ac2a2998e17af82662b2f.pdf
5cd955ca0b45c76caf8922bc.pdf
5e7900aa998e17af8265f0c8.pdf
5e7900aa998e17af8265f0c7.pdf
It looks like these docs are mostly indexes and lists, not really fitting scientific paper structure.
Dask Cluster Error:
Function: xgboost_postprocess
args: ('tmp/images/5a15d754cf58f10f6558336a.pdf_2.pkl')
kwargs: {}
Exception: ValueError('not enough values to unpack (expected 3, got 0)')
Python Error:
(cosmos) [imcconnell2@cosmos0003 Cosmos]$ bash cli/ingest_documents_timing.sh
DEBUG :: 2020-08-14 00:55:48,331 :: Using selector: EpollSelector
Traceback (most recent call last):
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/scripts/ingest_documents.py", line 43, in
ingest_documents()
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/click/core.py", line 829, in call
return self.main(*args, **kwargs)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/scripts/ingest_documents.py", line 39, in ingest_documents
aggregations=aggregation)
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/ingest.py", line 66, in ingest
aggregations=aggregations)
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/ingest.py", line 93, in _ingest_local
images = [i.result() for i in images]
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/ingest.py", line 93, in
images = [i.result() for i in images]
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/distributed/client.py", line 223, in result
raise exc.with_traceback(tb)
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/process_page.py", line 71, in xgboost_postprocess
objects = postprocess(dp.postprocess_model, dp.classes, objects)
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/process/postprocess/xgboost_model/inference.py", line 9, in run_inference
p_bb, _, texts = zip(*page_objs)
ValueError: not enough values to unpack (expected 3, got 0)
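The failure is consistent with `page_objs` being empty for these index/list pages, so `zip(*page_objs)` has nothing to unpack. A hedged sketch of a guard (not the actual `run_inference` implementation):

```python
def run_inference_safe(page_objs):
    """Skip pages that produced no proposals instead of crashing on
    `zip(*page_objs)` when the list is empty (e.g., index/list pages)."""
    if not page_objs:
        return []  # nothing detected on this page; skip rather than raise
    p_bb, _, texts = zip(*page_objs)
    return list(zip(p_bb, texts))
```

The caller would then treat an empty result as "no objects on this page" and move on, instead of failing the whole ingestion run.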
It would be useful to parse figures internally by part, for example when there are parts A and B representing separate graphs within the same figure with one caption.
See second example in #82
The leads should create project pages for each task in the proposal, to be done by 7/29/2020.
Code: branch apiv1
Re-create: run on cosmos0003 with two GPUs in the dask cluster (spawn_dask_cluster_2_gpu.sh), then run the ingestion script (ingest_documents_timing.sh)
Result: Ingestion fails with error on this document: 5e7a19df998e17af826615ac.pdf
Error:
ERROR :: 2020-08-11 17:56:01,387 :: can't concat int to bytes
Traceback (most recent call last):
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/ingest.py", line 198, in pdf_to_images
meta, limit = parse_pdf(filename)
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/utils/pdf_extractor.py", line 45, in parse_pdf
interpreter.process_page(page)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 895, in process_page
self.render_contents(page.resources, page.contents, ctm=ctm)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 908, in render_contents
self.execute(list_value(streams))
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 936, in execute
func()
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 506, in do_s
self.do_S()
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 499, in do_S
self.curpath)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/converter.py", line 115, in paint_path
pts.append(apply_matrix_pt(self.ctm, (p[i], p[i+1])))
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/utils.py", line 138, in apply_matrix_pt
return a * x + c * y + e, b * x + d * y + f
TypeError: can't concat int to bytes
distributed.worker - WARNING - Compute Failed
Function: pdf_to_images
args: ('/hdd/iain/covid_docs_25Mar_all/5e7a19df998e17af826615ac.pdf')
kwargs: {}
Exception: Exception('Parsing error', "can't concat int to bytes")
The Anserini API returns `page: 0` for each page of results returned, although the actual contents of the response do change. This is confusing infinite-scrolling pagination on the frontend.
When http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip is not working, running `docker-compose up` will get stuck at `step 39/51: RUN wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip`.
One way to solve this is to manually download the .zip file from the link above, unzip it, and place it in the same directory as the Dockerfile.
The downloading steps in the Dockerfile also need to be commented out:
108 USER user
109
110 # RUN wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip
111 # RUN unzip stanford-corenlp-full-2018-10-05.zip
112 # RUN rm stanford-corenlp-full-2018-10-05.zip
It would be very useful to be able to sort the response (currently returned by a combination of confidence and query matching) by other metadata. The big one would be publication date: it will be common for scientists to want to see the latest results first. Secondary filtering would be by journal. Publisher filtering would be a convenience for communication with publishers (mostly, though not exclusively; it could be useful for science too).
I'd like something like a debug decorator or more debug function parameters so we can toggle more of our debug output easily.
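One possible shape for this: a decorator that routes call/return tracing through the standard `logging` module, so debug output can be toggled globally via the logging config. This is a sketch, not an existing COSMOS utility:

```python
import functools
import logging

logger = logging.getLogger("cosmos.debug")

def debuggable(func):
    """Log arguments and return values when DEBUG logging is enabled,
    so debug output can be switched on/off in one place."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if logger.isEnabledFor(logging.DEBUG):
            logger.debug("calling %s args=%r kwargs=%r", func.__name__, args, kwargs)
        result = func(*args, **kwargs)
        if logger.isEnabledFor(logging.DEBUG):
            logger.debug("%s returned %r", func.__name__, result)
        return result
    return wrapper

@debuggable
def add(a, b):
    return a + b
```

Flipping `logging.getLogger("cosmos.debug").setLevel(logging.DEBUG)` would then enable tracing for every decorated function at once.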
This is currently due to the file size limit on BSON documents in MongoDB. We need to implement a workaround to handle PDFs larger than this size.
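For reference, the BSON cap is 16 MB per document, and GridFS is MongoDB's standard mechanism for larger payloads (it chunks files transparently). A sketch, with the `pdfs` collection name and payload shape as assumptions:

```python
BSON_LIMIT = 16 * 1024 * 1024  # MongoDB's 16 MB cap on a single BSON document

def needs_gridfs(payload_size: int) -> bool:
    """True when a parsed-PDF payload cannot fit in one BSON document."""
    return payload_size >= BSON_LIMIT

def store_pdf(db, name: str, data: bytes):
    """Store small payloads inline; route oversized ones through GridFS,
    which splits the file into chunks under the hood."""
    if needs_gridfs(len(data)):
        import gridfs  # ships with pymongo
        return gridfs.GridFS(db).put(data, filename=name)
    return db.pdfs.insert_one({"filename": name, "data": data}).inserted_id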
https://cosmos.wisc.edu/sets/covid/?backend=ElasticSearch&query=copyright&type=Figure
https://cosmos.wisc.edu/sets/covid/?backend=Anserini&query=copyright&type=Figure
These are very different, which raises the possibility of a bug in the ElasticSearch case.
http://cosmos3.chtc.wisc.edu:8081/prediction/page/546014
http://cosmos3.chtc.wisc.edu:8081/prediction/page/547262
http://cosmos3.chtc.wisc.edu:8081/prediction/page/545296
http://cosmos3.chtc.wisc.edu:8081/prediction/page/547836
http://cosmos3.chtc.wisc.edu:8081/prediction/page/541803
http://cosmos3.chtc.wisc.edu:8081/prediction/page/548586
http://cosmos3.chtc.wisc.edu:8081/prediction/page/547734
http://cosmos3.chtc.wisc.edu:8081/prediction/page/543877
http://cosmos3.chtc.wisc.edu:8081/prediction/page/545704 [eqn fails that should have been ok]
Text extracted from the PDF (or OCR) is currently used as-is, which propagates source inconsistencies into the COSMOS output. Normalization would improve recall and general utility.
To be added for the `context` field:
I did a search within the new IODP set for COSMOS, through the xDD API, for the “Maastrichtian” geologic period, which abuts the K-T boundary. Results can be viewed here. This brings back a lot of good data, but one interesting thing is that starting at about result #10 and continuing for 8 or so iterations, it appears that we are returning duplicate figures; these are attributed to different papers. For at least one of the papers, the figure is not included in the PDF version of the paper.
Perhaps things are getting shuffled around in the pipeline, or there's something weird going on with how IODP journal articles are being split for ingestion.
Here are API links for some of the offending objects:
https://xdd.wisc.edu/sets/iodp/cosmos/api/object/a74d37aff204b26ba74ffc3abd472ed67ffc1486
https://xdd.wisc.edu/sets/iodp/cosmos/api/object/0fa4bdb330241854711b83be10bb453fda026851
https://xdd.wisc.edu/sets/iodp/cosmos/api/object/0b115ea35ed2f41828ea857a2aff4dadf5ebf262
https://xdd.wisc.edu/sets/iodp/cosmos/api/object/24a4846d2fd59f7544db53bc06d7cbc7dd261ac5
This could be a frontend or backend issue. The question is how multiple search terms are handled when ElasticSearch is used:
https://cosmos.wisc.edu/sets/covid/?backend=ElasticSearch&query=IC50%20Ribavirin&type=Table
Example of the returned objects:
http://cosmos3.chtc.wisc.edu:8081/search?id=1007497
http://cosmos3.chtc.wisc.edu:8081/search?id=1584646
The term Ribavirin does not appear anywhere here.
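One plausible cause: if the backend folds the terms into a single `match` query, ElasticSearch defaults to OR semantics across the analyzed terms, so a document containing only `IC50` would still match. Requiring every term is a one-line change in the query DSL; the `content` field name here is an assumption:

```python
def build_query(terms, require_all=True):
    """Build an ElasticSearch match query. `match` defaults to OR across
    terms; setting `operator: "and"` requires every term to be present."""
    return {
        "query": {
            "match": {
                "content": {
                    "query": " ".join(terms),
                    "operator": "and" if require_all else "or",
                }
            }
        }
    }
```

If the intended semantics really are OR, the frontend should at least highlight which of the terms each result actually matched.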
Starting cosmos_redis_1 ... done
ERROR: for cosmos_cosmos_1 Cannot create container for service cosmos: Unknown runtime specified nvidia
ERROR: for cosmos Cannot create container for service cosmos: Unknown runtime specified nvidia
ERROR: Encountered errors while bringing up the project.
If you encounter this error when creating and starting the containers from docker-compose.yml, the file needs to be modified.
Comment out line 5 (`runtime: nvidia`) to fix the error and use the default runtime option:
1 version: "2.3"
2 services:
3 cosmos:
4 build: .
5 runtime: nvidia
6 ipc: host
7 volumes:
8 - .:/cosmos/
9 - ${INPUT_DIR}:/input/
10 - ${OUTPUT_DIR}:/output/
11 command: "python run.py /input -w torch_model/model_weights.pth -t 4 -o /output/ -d ${DEVICE} -k"
Here is an assessment of COSMOS returned results on the geothermal dataset (bigram model) for the search terms "thermal conductivity", "geochemistry", and "porosity" WITH the permalink to each success/failure included. Is there an ideal place to put this information?
Hi!
I am trying to get COSMOS running to extract text, tables, figures, etc. from PDFs by following the getting started instructions here. I would like to run the Docker containers on a CPU. I updated the .env file to include "-cpu" and switched every mention of DEVICE to 'cpu'. However, when running the `docker-compose -f deployment/docker-compose-ingest.yml -p cosmos up` command, I get the following error.
ERROR: pull access denied for uwcosmos/cosmos-base-cpu, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
Do you think that these images don't exist, or am I making some Docker usage mistakes? (I am a relative novice with Docker.)
Thank you!
To reproduce:
python run.py -h
Output:
usage: run.py [-h] [--rawfolder RAWFOLDER] [--outputfolder OUTPUTFOLDER]
optional arguments:
-h, --help show this help message and exit
--rawfolder RAWFOLDER
--outputfolder OUTPUTFOLDER
Expected output:
An accurate reflection of the arguments at the top of run.py
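Judging from the invocation `python run.py /input -w torch_model/model_weights.pth -t 4 -o /output/ -d ${DEVICE} -k` used elsewhere in the project, the parser presumably looks something like the sketch below; the flag meanings and help strings are guesses and should mirror whatever run.py actually defines:

```python
import argparse

def make_parser():
    """Sketch of a parser matching the observed invocation of run.py;
    all help text here is an assumption about what the flags mean."""
    p = argparse.ArgumentParser(description="COSMOS document ingestion")
    p.add_argument("input", help="directory of input PDFs")
    p.add_argument("-w", "--weights", help="path to model weights (.pth)")
    p.add_argument("-t", "--threads", type=int, default=1, help="number of workers")
    p.add_argument("-o", "--output", help="output directory")
    p.add_argument("-d", "--device", default="cpu", help="torch device, e.g. cpu or cuda")
    p.add_argument("-k", action="store_true", help="keep intermediate files (assumed)")
    return p
```

Whatever the real arguments are, `run.py -h` should be generated from the parser actually in use so the help text can't drift.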
In the interface over covid docs, there are two identical figures and captions associated with two different references. Only one of those references seems to actually have that figure and caption. We need to find the source of this error in the pipeline. It does not appear to be an interface-related bug.
https://www.sciencedirect.com/science/article/pii/S0163445320300906
https://www.sciencedirect.com/science/article/pii/S0163445320300955
For a single word split across two lines, we currently only keep the coordinate information of the first bounding box.
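A sketch of keeping the full extent instead: represent the word by the union of its fragments' boxes, assuming the usual `(x0, y0, x1, y1)` convention:

```python
def merge_bboxes(boxes):
    """Union of several (x0, y0, x1, y1) boxes, e.g. the two fragments
    of a word split across lines."""
    x0 = min(b[0] for b in boxes)
    y0 = min(b[1] for b in boxes)
    x1 = max(b[2] for b in boxes)
    y1 = max(b[3] for b in boxes)
    return (x0, y0, x1, y1)
```

Alternatively, the word object could carry the list of fragment boxes, since a single union box overstates the area when the fragments sit on different lines.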
As we put more verticals into production, we need to start collecting more robust feedback so we can improve our search and segmentation models.
This can help us track segmentation failures, search failures, or other failures and iterate.
When running the Docker image in the build model phase, it may exit with code 137.
If this happens, you should increase the memory allotted to Docker in its settings/preferences. The default is 2 GB; increase it to at least 4 GB, and if the issue persists, keep increasing it.
(A visual of how to change memory in preferences is linked below.)
https://www.petefreitag.com/item/848.cfm
Using this response as a guide:
https://xdddev.chtc.io/sets/xdd-covid-19/api/v2_beta/search?query=Remdesivir,MERS&inclusive=true&ignore_bytes=true
There are a ton of text blocks returned here with `cls=section` that are clearly not sections.
Trying to be explicit:
https://xdddev.chtc.io/sets/xdd-covid-19/api/v2_beta/search?query=Remdesivir,MERS&inclusive=true&ignore_bytes=true&type=Body%20Text
All of the cls values are still "Section".
Making the search more general:
https://xdddev.chtc.io/sets/xdd-covid-19/api/v2_beta/search?query=Remdesivir&inclusive=true&ignore_bytes=true
There are some cls="equation" objects that seem to be large text blocks too, along with abundant cls="Section".
Code: branch apiv1
Re-create: run on cosmos0003 with two GPUs in the dask cluster (spawn_dask_cluster_2_gpu.sh), then run the ingestion script (ingest_documents_timing.sh)
Result: Ingestion fails with error on these documents: 5e7a4b17998e17af82661ba6.pdf
593b59cdcf58f13177abb084.pdf
5e79e8ce998e17af82660f6f.pdf
Error:
ERROR :: 2020-08-12 13:05:58,334 :: 'PSKeyword' object is not iterable
Traceback (most recent call last):
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/ingest.py", line 198, in pdf_to_images
meta, limit = parse_pdf(filename)
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/utils/pdf_extractor.py", line 45, in parse_pdf
interpreter.process_page(page)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 895, in process_page
self.render_contents(page.resources, page.contents, ctm=ctm)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 908, in render_contents
self.execute(list_value(streams))
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 933, in execute
func(*args)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 803, in do_TJ
self.graphicstate.copy())
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfdevice.py", line 83, in render_string
graphicstate)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfdevice.py", line 96, in render_string_horizontal
for cid in font.decode(obj):
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdffont.py", line 523, in decode
return bytearray(bytes) # map(ord, bytes)
TypeError: 'PSKeyword' object is not iterable
distributed.worker - WARNING - Compute Failed
Function: pdf_to_images
args: ('/hdd/iain/covid_docs_25Mar_all/5e7a4b17998e17af82661ba6.pdf')
kwargs: {}
Exception: Exception('Parsing error', "'PSKeyword' object is not iterable")
In some cases (especially body text) where the ordering of "child" extractions is semantically meaningful, related and apparently sequential extractions are returned in a random order. Ordering these properly would produce a readable set of text extractions in the frontend in many cases.
http://cosmos.wisc.edu/sets/covid/?query=SARS&type=Body%20Text
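A reading-order sort would be a reasonable first pass: order children by page, then top edge, then left edge. The field names here are assumptions about the extraction schema:

```python
def reading_order(children):
    """Sort child extractions into approximate reading order:
    page number, then top edge (y0), then left edge (x0)."""
    return sorted(children, key=lambda c: (c["page"], c["bbox"][1], c["bbox"][0]))
```

A true multi-column layout needs column detection before sorting; this sketch only fixes the within-column shuffle.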
Much of the time, figures with multiple parts are segmented separately and not merged properly in post-processing, leaving multiple "figures" that cannot be matched to multiple "captions". This results in key figure parts being impossible to retrieve (i.e., they have no text and are associated with no captions). It's also likely that this is causing lower confidence on Figure proposals, because "chopped up" figure and table elements are out of distribution (by definition, relative to our training data).
Revisiting the merging step for figures specifically is needed. Tables would also benefit from more sophisticated merging.
Code: branch apiv1
Re-create: run on cosmos0003 with two GPUs in the dask cluster (spawn_dask_cluster_2_gpu.sh), then run the ingestion script (ingest_documents_timing.sh)
Result: Ingestion fails with error on this document:
571f511a642f88805083527c.pdf
Error:
RecursionError: maximum recursion depth exceeded
distributed.worker - WARNING - Compute Failed
Function: pdf_to_images
args: ('/hdd/iain/covid_docs_25Mar_all/571f511a642f88805083527c.pdf')
kwargs: {}
Exception: Exception('Parsing error', 'maximum recursion depth exceeded')
Pages should have UUIDs (or similar) instead of an integer id. Right now, a user could trivially step through a bunch of pages and recreate a full PDF.
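Generating a non-enumerable identifier is trivial with the standard library, e.g.:

```python
import uuid

def new_page_id() -> str:
    """Random, non-enumerable page identifier to replace sequential integer ids."""
    return uuid.uuid4().hex
```

Existing integer ids would need a one-time migration plus a lookup so old links don't break.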
An API route containing details about the underlying corpus in a vertical set (e.g., COVID19) should be made available.
This route should include basic data size statistics as well as a breakdown of number of documents by publisher and journal title (when available). It would be wise to pre-compute this and store it in a lookup table that gets updated when new documents are added.
We should be building pydocs based on function documentation.
See UW-COSMOS/cosmos-visualizer#35
Have this in a branch somewhere. Will be added back ASAP.
As title.
Right now postprocess.postprocess.py contains a function that will change the output label of a box if it contains some keywords. It's currently using the Tesseract output, but if the unicode is available, that should be used instead.
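A sketch of the preference order described above; the keyword set and relabel target are placeholders, not the actual contents of postprocess.py:

```python
KEYWORDS = {"figure", "fig.", "table"}  # illustrative only

def relabel(obj):
    """Relabel a box based on keywords, preferring the PDF's unicode text
    over the Tesseract OCR output when it is available."""
    text = obj.get("unicode") or obj.get("ocr_text") or ""
    words = text.lower().split()
    if any(k in words for k in KEYWORDS):
        obj = {**obj, "cls": "Figure Caption"}  # hypothetical relabel target
    return obj
```

The key line is the fallback chain: use `unicode` when the PDF provided it, and only fall back to OCR text when it didn't.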
ElasticSearch uses `results` to wrap the list of returned objects, while Anserini uses `objects`. Right now, there is a switch in the frontend to handle this, but it is pretty brittle.
Proposed solution: standardize on `objects` as the list container and fall back on `results` to keep current backends working.
Using this issue to track an upcoming update where entities are discovered and clustered at the corpus level, with support for linking to existing knowledge bases.
(lots of development already done on this, just writing up a quick issue for tracking purposes)
COSMOS extracts tables, figures, and their associated mentions. However, in-text mentions of these objects could expand contextual windows for these, or provide context for cases where caption association is missed.
@ilmcconnell could you write up a brief summary of the current status and the lingering items before it's ready for a final release?