
A multi-stage neural search engine for the COVID-19 Open Research Dataset

Home Page: https://covidex.ai

License: MIT License



Covidex: A Search Engine for the COVID-19 Open Research Dataset


This repository contains the API server, neural models, and UI client for Covidex, a neural search engine for the COVID-19 Open Research Dataset (CORD-19). For a description of our system, check out this paper: Covidex: Neural Ranking Models and Keyword Search Infrastructure for the COVID-19 Open Research Dataset.

We also provide neural search infrastructure for searching domain-specific scholarly literature via Cydex. This paper details the abstractions developed on top of Covidex to facilitate domain-specific search: Cydex: Neural Search Infrastructure for the Scholarly Literature.

Environment Setup

API Server

  1. Install CUDA 10.1
  • For Ubuntu, follow these instructions
  • For Debian, run sudo apt-get install nvidia-cuda-toolkit
  2. Install Anaconda (currently version 2020.02)
wget https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh
bash Anaconda3-2020.02-Linux-x86_64.sh
  3. Install Java 11 and Maven
sudo apt-get install openjdk-11-jre openjdk-11-jdk maven
  4. Create an Anaconda environment for Python 3.7
conda create -n covidex python=3.7
  5. Activate the Anaconda environment
conda activate covidex
  6. Install Python dependencies from inside api/
cd api
pip install -r requirements.txt
  7. Set up the index and environment variables

    • Build Anserini indices for your dataset. We provide instructions for setting up Covidex with both CORD-19 and the ACL Anthology. Instructions for adding support for new datasets are found under docs/adding-datasets.md.

    • Set up environment variables by copying the defaults from api/.env.sample into a new api/.env file, modifying as needed. This requires setting the correct index and schema locations and CUDA devices, and enabling/disabling various services (highlighting, related search, neural ranking, etc.). Set DEVELOPMENT=False for production deployments.
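As an illustration, a production api/.env might contain entries like the following. All variable names here except DEVELOPMENT are hypothetical; api/.env.sample has the authoritative list:

```
# Hypothetical keys -- copy api/.env.sample for the real ones
DEVELOPMENT=False
INDEX_PATH=/path/to/anserini/index
SCHEMA_PATH=/path/to/schema.json
CUDA_DEVICE=0
ENABLE_HIGHLIGHTING=True
ENABLE_RELATED_SEARCH=True
ENABLE_NEURAL_RANKING=True
```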

UI Client

  1. Install Node.js 14+ and Yarn.

  2. Install dependencies from inside /client

yarn install

Local Deployment

Serve the UI from inside /client. The client will be running at localhost:3000.

yarn start

Separately, run the API server from inside /api. The server will be running at localhost:8000.

uvicorn app.main:app --reload --port=8000

Production deployment

We provide a script under scripts/deploy-prod.sh to start the API server and serve the UI build files. This assumes the environment is set up correctly and api/.env contains DEVELOPMENT=False.

Start the server (deploys to port 8000 by default):

sh scripts/deploy-prod.sh

Optional: set the environment variable PORT to use a different port:

PORT=8080 sh scripts/deploy-prod.sh

Route port 80 to the deployment port (8000 by default):

sudo iptables -t nat -A PREROUTING -p tcp --dport 80 -j REDIRECT --to-port 8000

If we're having trouble accessing the service, check that there aren't any conflicting rules:

sudo iptables -t nat -L -n -v

If there are conflicting rules, we should delete them:

sudo iptables -t nat -D PREROUTING -p tcp --dport 80 -j REDIRECT --to-port UNWANTED_PORT

Log files are available under api/logs. New files are created daily based on UTC time. All filenames have the date appended, except for the current one, which will be named search.log or related.log.

Testing

Run all API tests:

TESTING=true pytest api

How do I cite this work?

@inproceedings{zhang2020covidex,
  title = "Covidex: Neural Ranking Models and Keyword Search Infrastructure for the {COVID}-19 Open Research Dataset",
  author = "Zhang, Edwin  and
    Gupta, Nikhil  and
    Tang, Raphael  and
    Han, Xiao  and
    Pradeep, Ronak  and
    Lu, Kuang  and
    Zhang, Yue  and
    Nogueira, Rodrigo  and
    Cho, Kyunghyun  and
    Fang, Hui  and
    Lin, Jimmy",
  booktitle = "Proceedings of the First Workshop on Scholarly Document Processing",
  month = nov,
  year = "2020",
  address = "Online",
  publisher = "Association for Computational Linguistics",
  url = "https://www.aclweb.org/anthology/2020.sdp-1.5",
  doi = "10.18653/v1/2020.sdp-1.5",
  pages = "31--41",
}

Contributors

audrey-siqueira, daemon, dependabot[bot], edwinzhng, lintool, nikhilro, rodrigonogueira4, ronakice, toluclassics, turnersr, x65han, zanezzephyrs


covidex's Issues

Add description of corpus in landing page

In the landing page, replace "(data release of April 10, 2020)." with:

"(data release of April 10, 2020), which currently contains over 47,000 scholarly articles, including over 36,000 with full text, about COVID-19 and coronavirus-related research, drawn from a variety of sources including PubMed, a curated list of articles from the WHO, as well as preprints from bioRxiv and medRxiv."

Rationale is that domain experts want to know what's in the collection, and we shouldn't force them to click out to find out.

Evaluate highlighter on SQuAD and BioASQ

The highlighting service is very similar to a Q&A system: given a question and a document, the model outputs a sentence span that might contain the answer.

Hence, we should evaluate different highlighter models (BioBERT, T5, sciT5) on Q&A datasets such as SQuAD and BioASQ.
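The SQuAD side of such an evaluation boils down to token-level F1 between the predicted span and the gold answer. A minimal sketch follows; the official SQuAD script additionally strips articles and punctuation before comparing:

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted span and a gold answer (SQuAD-style)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)  # multiset intersection
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("14 days", "approximately 14 days"))  # 0.8
```

BioASQ answers are often lists rather than single spans, so per-dataset adapters would still be needed on top of a metric like this.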

Add Apache 2 License

The team has agreed to release everything in this repo under the permissive Apache 2 License. Until we get the license in place in the repo, this issue can serve as the license.

Retrieve related articles metadata from Pyserini index

Currently, we download the metadata.csv file for related article search, and retrieve values from there. We should instead use Pyserini so that all article data is formatted in the same way and comes from the same place. This would require retrieving an article by ID from Pyserini, which I'm not sure is possible yet.

Add examples

Add examples, in a drop down box in the search bar. Some nice ones to start:

  • What is the incubation period of COVID-19?
  • What's the effectiveness of chloroquine for COVID-19?
  • What do we know about asymptomatic transmission of COVID-19?
  • How do weather conditions affect the transmission of COVID-19?

E tensorflow/core/platform/cloud/curl_http_request.cc:611]

When I run uvicorn app.main:app --reload --port=8000, it shows the following error:

2020-07-20 11:43:29.696584: E tensorflow/core/platform/cloud/curl_http_request.cc:611] The transmission of request 0x55ccb5e145b0 (URI: https://www.googleapis.com/storage/v1/b/neuralresearcher_data/o/covid%2Fdata%2Fmodel_exp304%2Fcheckpoint?fields=size%2Cgeneration%2Cupdated) has been stuck at 0 of 0 bytes for 61 seconds and will be aborted. CURL timing information: lookup time: 0.008897 (No error), connect time: 0.090673 (No error), pre-transfer time: 0 (No error), start-transfer time: 0 (No error)
2020-07-20 11:44:41.008628: E tensorflow/core/platform/cloud/curl_http_request.cc:611] The transmission of request 0x55ccb5e1d8b0 (URI: https://www.googleapis.com/storage/v1/b/neuralresearcher_data/o/covid%2Fdata%2Fmodel_exp304%2Fcheckpoint?fields=size%2Cgeneration%2Cupdated) has been stuck at 0 of 0 bytes for 61 seconds and will be aborted. CURL timing information: lookup time: 0.008684 (No error), connect time: 60.0513 (No error), pre-transfer time: 0 (No error), start-transfer time: 0 (No error)

Of course, I can add a proxy to the network to successfully complete the HTTP request, but after adding the proxy, using the local IP to access the service fails. Is there any way to solve this problem, for example, by downloading the relevant data locally in advance?

sorting

It is somewhat related to faceting (#46) but requires much less work and thinking.

One of the pieces of feedback we have received directly from the feedback form is the need to sort results by publication date. I agree this is an important feature, especially since our index is updated weekly; what people want is more of a differential.

Can we add it quickly?

cc @lintool @edwinzhng @nikhilro
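A sketch of what the API-side change could look like, sorting a reranked hit list by publication date; the hit records and the publish_time field name here are hypothetical stand-ins for whatever the ranking pipeline actually returns:

```python
from datetime import date

# Hypothetical hit records; the real objects come from the ranking pipeline.
hits = [
    {"id": "a", "score": 0.9, "publish_time": date(2020, 3, 1)},
    {"id": "b", "score": 0.7, "publish_time": date(2020, 4, 10)},
    {"id": "c", "score": 0.8, "publish_time": date(2020, 4, 10)},
]

# Newest first; break date ties by relevance score so ranking stays visible.
by_date = sorted(hits, key=lambda h: (h["publish_time"], h["score"]), reverse=True)
print([h["id"] for h in by_date])  # ['c', 'b', 'a']
```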

Use T5 as a highlighter

Now that we are using Huggingface's T5 reranker, we can try to replace the BioBERT highlighter with T5's context vectors. We would then run inference with only one model, which will decrease our latency and free up one GPU.

Note: we will need to evaluate this T5-based highlighter on BioASQ to see if it is actually better than BioBERT.
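The proposal can be sketched as follows with a pure-Python stand-in; in the real system the vectors would come from the T5 encoder, and cosine similarity is just one candidate scoring function to be validated on BioASQ:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Hypothetical context vectors for a query and three document sentences.
query_vec = [0.9, 0.1, 0.0]
sentence_vecs = {
    "sent0": [0.1, 0.9, 0.1],
    "sent1": [0.8, 0.2, 0.1],
    "sent2": [0.0, 0.1, 0.9],
}

# Highlight the sentence whose context vector is closest to the query's.
best = max(sentence_vecs, key=lambda s: cosine(query_vec, sentence_vecs[s]))
print(best)  # sent1
```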

Is covidex.ai down?

Hi,

Just wondering, is the covidex.ai platform down?

Thanks!

Regards,
Shreyas

Docker image + index modification request

I would like to request a Docker image with the environment already set up, plus advice on how to modify the API to use a custom index (some components, such as related articles, might have to be disabled), since the web client is very well made. I would like to use it for a topic-annotation project where the web UI with a centralized index would be shared across the team, ensuring consistency.

Mainly I would like the following feature:

  • docker image with configurable properties to point to a custom index and to disable gpu components.
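A minimal sketch of what such an image might look like; the base image tag, paths, and environment variable names are assumptions, not tested against this repo:

```dockerfile
# Hypothetical sketch -- not an official image for this repo.
FROM nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
RUN apt-get update && apt-get install -y python3.7 python3-pip openjdk-11-jdk maven
COPY api/ /app/api/
WORKDIR /app/api
RUN pip3 install -r requirements.txt
# Index location and GPU toggles supplied at run time, e.g.:
#   docker run -e INDEX_PATH=/indexes/custom -e CUDA_DEVICE=-1 ...
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```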

Same exact banner for Covidex and Neural Covidex

Both neural and non-neural versions should look exactly the same in the landing page, modulo differences in landing page text (and the neural vs. non-neural header).

That is, if I have both sites open in different tabs, I should be able to tab back and forth and see exactly the same location for the search box, the same height of the banner, etc. That way, we establish a consistent look and feel.

Discussion: faceting and pagination in multistage ranking

Here are some of my thoughts about faceting and pagination in multistage ranking.

tl;dr - it's not clear to me what the "correct" implementation is... see details below for full discussion.

The standard mental model of faceting is as a slice of the entire collection, i.e., how many documents contain that facet. This is the intuitive user expectation, and works "as expected" with pagination, i.e., when the user clicks "next page", the search engine fetches the next page of results that contain the facet, until we run out of results. This is exactly what Blacklight does.

However, when we move to a reranking architecture, it's a bit unclear what the system should do. The simplest implementation would be to provide faceted browsing on the initial candidate list. That is, the initial retrieval returns 1k hits, system reranks, and the faceted browsing is on those 1k hits.

This is fine, but problematic from both the perspective of faceted browsing and pagination:

  • From the perspective of pagination, users are accustomed to results pages of a fixed size: the first page shows hits 1-10, the second hits 11-20, etc. Since we're already retrieving and reranking the top 1k hits, it makes no sense for us to paginate. But what happens if the user wants more hits? Do we retrieve the next 1k raw hits, and then rerank those? This has the downside that each results page could contain a different number of hits. Also, under the paragraph condition, we'd have to dedup with respect to previous hits, which means keeping track of state, which means a more complex implementation, etc.

  • From the perspective of faceting, the implementation outlined above diverges from user expectations. Say we facet only on reranked results, and thus the interface shows only the matching hits. The user scrolls to the bottom and wants more hits. Obviously, we can go back and fetch more hits (with all the complexities above), but then the facet counts aren't accurate...

These are important considerations, since systematic reviews, one of the use cases for our system, need high recall and thus may require going deep into hit lists. For example, this metareview examines over 1300 articles. Faceted browsing, I imagine, would also be helpful for systematic reviews. We don't have an RCT facet right now, but if we had one, I think it would be used quite a bit.

So, it's not clear to me what the "correct" implementation is...

Thoughts?
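To make the tension concrete, here is a minimal sketch of the "facet on the candidate list" implementation; the counts it reports cover only the reranked candidates, not the whole collection, which is exactly the divergence from user expectations described above. The hit records and field names are hypothetical:

```python
from collections import Counter

# Hypothetical reranked candidates (in practice ~1k hits), each with facet values.
candidates = [
    {"id": 1, "year": 2020, "source": "medRxiv"},
    {"id": 2, "year": 2020, "source": "PubMed"},
    {"id": 3, "year": 2019, "source": "PubMed"},
]

def facet_counts(hits, field):
    """Facet counts over the candidate list only -- not over the full collection."""
    return Counter(h[field] for h in hits)

print(facet_counts(candidates, "source"))  # Counter({'PubMed': 2, 'medRxiv': 1})
```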

Change example

What do we know about asymptomatic transmission of COVID-19? -> Are there cases of asymptomatic transmission of COVID-19?

Is Covidex now Cydex?

In the following paper:

@inproceedings{ding2020cydex,
  title={Cydex: Neural Search Infrastructure for the Scholarly Literature},
  author={Ding, Shane and Zhang, Edwin and Lin, Jimmy},
  booktitle={Proceedings of the First Workshop on Scholarly Document Processing},
  pages={168--173},
  year={2020}
}

it describes Cydex as an extension of Covidex to decouple it from CORD. While the paper says Cydex is open source, there does not seem to be any URL to the source.

Is this repo for Cydex as well?

Retrieve metadata from Covidex

Hi, this is a really cool search engine tool on COVID-19!
However, in the search engine, is there any way we could get more metadata, such as the cord_id (document id) or the Impact Factor (IF) of the journal or conference?

Build HNSW index in covidex

At the moment, we build the HNSW index with https://github.com/x65han/ai2 on tuna with each data drop, then upload the resulting tar.gz to Dropbox, etc.

We can instead put the HNSW indexing logic in Covidex under a new directory, hnsw/, so we can build the HNSW index from Covidex without going through Dropbox.

With the data drop on May 1st, building the HNSW index took 7 minutes on tuna.

I will merge the HNSW index when the next data drop comes in a week.

a query works for basic but fails for neural

try "Suggestions for disinfection of ophthalmic examination equipment and protection of ophthalmologist against 2019 novel coronavirus infection (Chinese Journal of Ophthalmology)"
