
castorini / covidex

A multi-stage neural search engine for the COVID-19 Open Research Dataset

Home Page: https://covidex.ai

License: MIT License

Python 31.88% HTML 0.34% TypeScript 64.33% CSS 0.80% Shell 2.65%

covidex's People

Contributors

audrey-siqueira, daemon, dependabot[bot], edwinzhng, lintool, nikhilro, rodrigonogueira4, ronakice, toluclassics, turnersr, x65han, zanezzephyrs


covidex's Issues

Add Apache 2 License

The team has agreed to release everything in this repo under the permissive Apache 2 License. Until the license file is in place in the repo, this issue can serve as the license.

Change example

What do we know about asymptomatic transmission of COVID-19? -> Are there cases of asymptomatic transmission of COVID-19?

Is Covidex now Cydex?

In the following paper:

@inproceedings{ding2020cydex,
  title={Cydex: Neural Search Infrastructure for the Scholarly Literature},
  author={Ding, Shane and Zhang, Edwin and Lin, Jimmy},
  booktitle={Proceedings of the First Workshop on Scholarly Document Processing},
  pages={168--173},
  year={2020}
}

the paper describes Cydex as an extension of Covidex that decouples it from CORD-19. While the paper says Cydex is open source, there does not seem to be any URL to the source.

Is this repo for Cydex as well?

Retrieve metadata from Covidex

Hi, this is a really cool search-engine tool for COVID-19!
However, is there any way to get more metadata from the search engine, such as the cord_id (document ID) or the Impact Factor (IF) of the journal or conference?

Evaluate highlighter on SQuAD and BioASQ

The highlighting service is very similar to a Q&A system: given a question and a document, the model outputs a sentence span that might contain the answer.

Hence, we should evaluate different highlighter models (BioBERT, T5, sciT5) on Q&A datasets such as SQuAD and BioASQ.
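For the sentence-span setting, the standard SQuAD token-overlap F1 would apply directly. A minimal sketch (the function names are illustrative, not from this repo):

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation, and tokenize on whitespace."""
    return re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()

def token_f1(predicted_span, gold_span):
    """SQuAD-style token-overlap F1 between predicted and gold answer spans."""
    pred, gold = normalize(predicted_span), normalize(gold_span)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# 3 shared tokens out of 3 predicted and 5 gold -> P=1.0, R=0.6, F1=0.75
print(token_f1("14 days incubation", "incubation period of 14 days"))
```

The same metric works for both SQuAD and BioASQ factoid-style spans, so one harness could cover all three highlighter variants.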

sorting

It is somewhat related to faceting (#46) but requires much less work and thought.

One piece of feedback we have received directly through the feedback form is the need to sort results by publication date. I agree this is an important feature, especially since our index is updated weekly; what people want is more of a differential view.

Can we add it quickly?

cc @lintool @edwinzhng @nikhilro
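Since the reranked hit list is already materialized, date sorting could be a post-processing step. A sketch with hypothetical field names (not the actual hit schema):

```python
from datetime import date

# Hypothetical reranked hits; "publish_time" and "score" are illustrative names.
hits = [
    {"id": "a1", "score": 0.92, "publish_time": date(2020, 3, 14)},
    {"id": "b2", "score": 0.88, "publish_time": date(2020, 4, 28)},
    {"id": "c3", "score": 0.95, "publish_time": date(2020, 1, 5)},
]

# Newest first; ties broken by relevance score.
by_date = sorted(hits, key=lambda h: (h["publish_time"], h["score"]), reverse=True)
print([h["id"] for h in by_date])  # ['b2', 'a1', 'c3']
```

This keeps the reranker untouched and only reorders its output, which is why it is much less work than faceting.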

Use T5 as a highlighter

Now that we are using Huggingface's T5 reranker, we can try to replace the BioBERT highlighter with T5's context vectors. We would then run inference with only one model, which will decrease our latency and free up a GPU.

Note: we will need to evaluate this T5-based highlighter on BioASQ to see if it is actually better than BioBERT.
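One plausible way to use the encoder's context vectors for highlighting is to score each sentence by cosine similarity between a pooled sentence vector and a pooled query vector. A toy sketch with stand-in vectors (the pooling strategy and vectors here are assumptions, not the repo's implementation):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def best_sentence(query_vec, sentence_vecs):
    """Index of the sentence whose pooled vector is closest to the query's."""
    scores = [cosine(query_vec, s) for s in sentence_vecs]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy 3-d vectors standing in for pooled T5 encoder states.
q = [1.0, 0.0, 1.0]
sents = [[0.0, 1.0, 0.0], [1.0, 0.1, 0.9], [0.5, 0.5, 0.0]]
print(best_sentence(q, sents))  # 1
```

The BioASQ evaluation would then compare this similarity-based ranking of sentences against BioBERT's span predictions.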

Add description of corpus in landing page

On the landing page, extend "(data release of April 10, 2020)." to:

"(data release of April 10, 2020), which currently contains over 47,000 scholarly articles, including over 36,000 with full text, about COVID-19 and coronavirus-related research, drawn from a variety of sources including PubMed, a curated list of articles from the WHO, as well as preprints from bioRxiv and medRxiv."

The rationale is that domain experts want to know what's in the collection, and we shouldn't force them to click elsewhere to find out.

Retrieve related articles metadata from Pyserini index

Currently, we download the metadata.csv file for related article search, and retrieve values from there. We should instead use Pyserini so that all article data is formatted in the same way and comes from the same place. This would require retrieving an article by ID from Pyserini, which I'm not sure is possible yet.
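The current metadata.csv approach amounts to an in-memory lookup keyed on the article ID. A minimal sketch (column names follow the CORD-19 metadata.csv layout; verify against the release in use):

```python
import csv
import io

# Stand-in for the downloaded metadata.csv; the row content is a sample only.
sample = io.StringIO(
    "cord_uid,title,publish_time\n"
    "ug7v899j,Clinical features of culture-proven Mycoplasma pneumoniae,2001-07-04\n"
)

# Build cord_uid -> row lookup once at startup, then serve related-article requests.
metadata = {row["cord_uid"]: row for row in csv.DictReader(sample)}
print(metadata["ug7v899j"]["publish_time"])  # 2001-07-04
```

Moving this to Pyserini would replace the dict lookup with a by-ID fetch from the index, so the stored fields and the search results share one source of truth.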

Discussion: faceting and pagination in multistage ranking

Here are some of my thoughts about faceting and pagination in multistage ranking.

tl;dr - it's not clear to me what the "correct" implementation is... see details below for full discussion.

The standard mental model of faceting is as a slice of the entire collection, i.e., how many documents contain that facet. This is the intuitive user expectation, and works "as expected" with pagination, i.e., when the user clicks "next page", the search engine fetches the next page of results that contain the facet, until we run out of results. This is exactly what Blacklight does.

However, when we move to a reranking architecture, it's a bit unclear what the system should do. The simplest implementation would be to provide faceted browsing on the initial candidate list. That is, the initial retrieval returns 1k hits, system reranks, and the faceted browsing is on those 1k hits.

This is fine, but problematic from both the perspective of faceted browsing and pagination:

  • From the perspective of pagination, users are accustomed to results pages of fixed size: the first page shows 10 hits, the next page 10 more, etc. Since we're already retrieving and reranking the top 1k hits, it makes no sense for us to paginate. But what happens if the user wants more hits? Do we retrieve the next 1k raw hits and then rerank those? This has the downside that each results page would contain a different number of hits. Also, under the paragraph condition, we'd have to dedup with respect to previous hits, which means keeping track of state, which means a more complex implementation, etc.

  • From the perspective of faceting, the implementation outlined above diverges from user expectations. Say we facet only on reranked results, and thus the interface shows only the matching hits. The user scrolls to the bottom and wants more hits. Obviously, we can go back and fetch more hits (with all the complexities above), but then the facet count isn't accurate...
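The "facet on the candidate list" option reduces to counting facet values over the reranked hits, which is also where the inaccuracy comes from. A sketch with hypothetical fields:

```python
from collections import Counter

# Hypothetical reranked candidate list; "year" stands in for any facet field.
reranked = [
    {"docid": "d1", "year": 2020},
    {"docid": "d2", "year": 2019},
    {"docid": "d3", "year": 2020},
]

# Counts are computed only over the reranked candidates, not the whole
# collection, so they undercount as soon as the user pages past the
# initial retrieval depth.
facet_counts = Counter(hit["year"] for hit in reranked)
print(facet_counts[2020])  # 2
```

Collection-wide counts (the Blacklight behavior) would instead require a separate unreranked query per facet, which is exactly the tension described above.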

These are important considerations, since systematic reviews, one of the use cases for our system, need high recall and thus may require going deep into hit lists. For example, this metareview examines over 1300 articles. Faceted browsing, I imagine, would also be helpful for systematic reviews. We don't have an RCT facet right now, but if we had one, I think it would be used quite a bit.

So, it's not clear to me what the "correct" implementation is...

Thoughts?

Docker image + index modification request

I would like to request a Docker image with the environment set up, plus advice on how to modify the API to use a custom index (some components, such as related articles, might have to be disabled), since the web client is very well made. I would like to use it for a topic-annotation project where the web UI with a centralised index would be shared across the team, ensuring consistency.

Mainly I would like the following feature:

  • docker image with configurable properties to point to a custom index and to disable gpu components.
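The configurable properties could be plain environment variables read at startup, which a Docker image would then pass through with -e flags. A sketch (the variable names here are illustrative, not the repo's actual config):

```python
import os

# Hypothetical settings; COVIDEX_INDEX_PATH and COVIDEX_DISABLE_GPU are
# illustrative names, not variables the repo actually defines.
INDEX_PATH = os.environ.get("COVIDEX_INDEX_PATH", "indexes/lucene-index-cord19")
DISABLE_GPU = os.environ.get("COVIDEX_DISABLE_GPU", "false").lower() == "true"

if DISABLE_GPU:
    # Hiding all CUDA devices forces frameworks like TensorFlow and PyTorch
    # to fall back to the CPU.
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

print(INDEX_PATH, DISABLE_GPU)
```

A run would then look like: docker run -e COVIDEX_INDEX_PATH=/indexes/my-index -e COVIDEX_DISABLE_GPU=true ... (flags assumed, matching the sketch above).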

a query works for basic but fails for neural

try "Suggestions for disinfection of ophthalmic examination equipment and protection of ophthalmologist against 2019 novel coronavirus infection (Chinese Journal of Ophthalmology)"

Same exact banner for Covidex and Neural Covidex

Both the neural and non-neural versions should look exactly the same on the landing page, modulo differences in landing-page text (and the neural vs. non-neural header).

That is, if I have both sites open in different tabs, I should be able to tab back and forth and see exactly the same location for the search box, the same banner height, etc. That way, we establish a consistent look and feel.

Build HNSW index in covidex

At the moment, we build the HNSW index with https://github.com/x65han/ai2 on tuna with each data drop, then upload the resulting tar.gz to Dropbox, etc.

We can instead put the HNSW indexing logic in Covidex under a new hnsw/ directory, so we can build the HNSW index from Covidex without going through Dropbox.

With the data drop on May 1st, building the HNSW index took 7 minutes on tuna.

I will merge the HNSW index when the next data drop comes in a week.

E tensorflow/core/platform/cloud/curl_http_request.cc:611]

When I run uvicorn app.main:app --reload --port=8000, it shows the following error:

2020-07-20 11:43:29.696584: E tensorflow/core/platform/cloud/curl_http_request.cc:611] The transmission of request 0x55ccb5e145b0 (URI: https://www.googleapis.com/storage/v1/b/neuralresearcher_data/o/covid%2Fdata%2Fmodel_exp304%2Fcheckpoint?fields=size%2Cgeneration%2Cupdated) has been stuck at 0 of 0 bytes for 61 seconds and will be aborted. CURL timing information: lookup time: 0.008897 (No error), connect time: 0.090673 (No error), pre-transfer time: 0 (No error), start-transfer time: 0 (No error)
2020-07-20 11:44:41.008628: E tensorflow/core/platform/cloud/curl_http_request.cc:611] The transmission of request 0x55ccb5e1d8b0 (URI: https://www.googleapis.com/storage/v1/b/neuralresearcher_data/o/covid%2Fdata%2Fmodel_exp304%2Fcheckpoint?fields=size%2Cgeneration%2Cupdated) has been stuck at 0 of 0 bytes for 61 seconds and will be aborted. CURL timing information: lookup time: 0.008684 (No error), connect time: 60.0513 (No error), pre-transfer time: 0 (No error), start-transfer time: 0 (No error)

Of course, I can add a proxy so the HTTP request completes successfully, but with the proxy in place, accessing the service via the local IP fails. Is there any way to solve this problem, for example by downloading the relevant data locally in advance?

Is covidex.ai down?

Hi,

Just wondering, is the covidex.ai platform down?

Thanks!

Regards,
Shreyas

Add examples

Add examples, in a drop down box in the search bar. Some nice ones to start:

  • What is the incubation period of COVID-19?
  • What's the effectiveness of chloroquine for COVID-19?
  • What do we know about asymptomatic transmission of COVID-19?
  • How do weather conditions affect the transmission of COVID-19?
