
A multi-stage neural search engine for the COVID-19 Open Research Dataset

Home Page: https://covidex.ai

License: MIT License



Covidex: A Search Engine for the COVID-19 Open Research Dataset


This repository contains the API server, neural models, and UI client for Covidex, a neural search engine for the COVID-19 Open Research Dataset (CORD-19). For a description of our system, check out this paper: Covidex: Neural Ranking Models and Keyword Search Infrastructure for the COVID-19 Open Research Dataset.

We also provide neural search infrastructure for searching domain-specific scholarly literature via Cydex. This paper details the abstractions developed on top of Covidex to facilitate domain-specific search: Cydex: Neural Search Infrastructure for the Scholarly Literature.

Environment Setup

API Server

  1. Install CUDA 10.1
  • For Ubuntu, follow these instructions
  • For Debian, run sudo apt-get install nvidia-cuda-toolkit
  2. Install Anaconda (currently version 2020.02)
wget https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh
bash Anaconda3-2020.02-Linux-x86_64.sh
  3. Install Java 11 and Maven
sudo apt-get install openjdk-11-jre openjdk-11-jdk maven
  4. Create an Anaconda environment for Python 3.7
conda create -n covidex python=3.7
  5. Activate the Anaconda environment
conda activate covidex
  6. Install Python dependencies from inside api/
cd api
pip install -r requirements.txt
  7. Set up the index and environment variables

    • Build Anserini indices for your dataset. We provide instructions for setting up Covidex with both CORD-19 and the ACL Anthology. Instructions for adding support for new datasets are found under docs/adding-datasets.md.

    • Set up environment variables by copying the defaults from api/.env.sample into a new api/.env file, modifying as needed. This requires setting the correct index and schema locations and CUDA devices, and enabling/disabling various services (highlighting, related search, neural ranking, etc.). Set DEVELOPMENT=False for production deployments.
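As an illustration, a production api/.env might contain entries like the following. All variable names here except DEVELOPMENT are hypothetical; api/.env.sample has the authoritative list:

```
# Hypothetical keys -- copy api/.env.sample for the real ones
DEVELOPMENT=False
INDEX_PATH=/path/to/anserini/index
SCHEMA_PATH=/path/to/schema.json
CUDA_DEVICE=0
ENABLE_HIGHLIGHTING=True
ENABLE_RELATED_SEARCH=True
ENABLE_NEURAL_RANKING=True
```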

UI Client

  1. Install Node.js 14+ and Yarn.

  2. Install dependencies from inside /client

yarn install

Local Deployment

Serve the UI from inside /client. The client will be running at localhost:3000.

yarn start

Separately, run the API server from inside /api. The server will be running at localhost:8000.

uvicorn app.main:app --reload --port=8000

Production deployment

We provide a script under scripts/deploy-prod.sh to start the API server and serve the UI build files. This assumes the environment is set up correctly and api/.env contains DEVELOPMENT=False.

Start the server (deploys to port 8000 by default):

sh scripts/deploy-prod.sh

Optional: set the environment variable PORT to use a different port:

PORT=8080 sh scripts/deploy-prod.sh

Route port 80 to the deployment port (8000 by default):

sudo iptables -t nat -A PREROUTING -p tcp --dport 80 -j REDIRECT --to-port 8000

If we're having trouble accessing the service, check that there aren't any conflicting rules:

sudo iptables -t nat -L -n -v

If there are conflicting rules, we should delete them:

sudo iptables -t nat -D PREROUTING -p tcp --dport 80 -j REDIRECT --to-port UNWANTED_PORT

Log files are available under api/logs. New files are created daily based on UTC time. All filenames have the date appended, except for the current one, which will be named search.log or related.log.

Testing

Run all API tests:

TESTING=true pytest api

How do I cite this work?

@inproceedings{zhang2020covidex,
  title = "Covidex: Neural Ranking Models and Keyword Search Infrastructure for the {COVID}-19 Open Research Dataset",
  author = "Zhang, Edwin  and
    Gupta, Nikhil  and
    Tang, Raphael  and
    Han, Xiao  and
    Pradeep, Ronak  and
    Lu, Kuang  and
    Zhang, Yue  and
    Nogueira, Rodrigo  and
    Cho, Kyunghyun  and
    Fang, Hui  and
    Lin, Jimmy",
  booktitle = "Proceedings of the First Workshop on Scholarly Document Processing",
  month = nov,
  year = "2020",
  address = "Online",
  publisher = "Association for Computational Linguistics",
  url = "https://www.aclweb.org/anthology/2020.sdp-1.5",
  doi = "10.18653/v1/2020.sdp-1.5",
  pages = "31--41",
}

Contributors

audrey-siqueira, daemon, dependabot[bot], edwinzhng, lintool, nikhilro, rodrigonogueira4, ronakice, toluclassics, turnersr, x65han, zanezzephyrs


covidex's Issues

Add description of corpus in landing page

In the landing page, replace "(data release of April 10, 2020)." with:

"(data release of April 10, 2020), which currently contains over 47,000 scholarly articles, including over 36,000 with full text, about COVID-19 and coronavirus-related research, drawn from a variety of sources including PubMed, a curated list of articles from the WHO, as well as preprints from bioRxiv and medRxiv."

Rationale is that domain experts want to know what's in the collection, and we shouldn't force them to click out to find out.

Evaluate highlighter on SQuAD and BioASQ

The highlighting service is very similar to a Q&A system: given a question and a document, the model outputs a sentence span that might contain the answer.

Hence, we should evaluate different highlighter models (BioBERT, T5, sciT5) on Q&A datasets such as SQuAD and BioASQ.
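The SQuAD side of such an evaluation boils down to token-level F1 between the predicted span and the gold answer. A minimal sketch follows; the official SQuAD script additionally strips articles and punctuation before comparing:

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted span and a gold answer (SQuAD-style)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)  # multiset intersection
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("14 days", "approximately 14 days"))  # 0.8
```

BioASQ answers are often lists rather than single spans, so per-dataset adapters would still be needed on top of a metric like this.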

Add Apache 2 License

The team has agreed to release everything in this repo under the permissive Apache 2 License. Until we get the license in place in the repo, this issue can serve as the license.

Retrieve related articles metadata from Pyserini index

Currently, we download the metadata.csv file for related article search, and retrieve values from there. We should instead use Pyserini so that all article data is formatted in the same way and comes from the same place. This would require retrieving an article by ID from Pyserini, which I'm not sure is possible yet.

Add examples

Add examples, in a drop down box in the search bar. Some nice ones to start:

  • What is the incubation period of COVID-19?
  • What's the effectiveness of chloroquine for COVID-19?
  • What do we know about asymptomatic transmission of COVID-19?
  • How do weather conditions affect the transmission of COVID-19?

E tensorflow/core/platform/cloud/curl_http_request.cc:611]

When I run uvicorn app.main:app --reload --port=8000, it shows the following error:

2020-07-20 11:43:29.696584: E tensorflow/core/platform/cloud/curl_http_request.cc:611] The transmission of request 0x55ccb5e145b0 (URI: https://www.googleapis.com/storage/v1/b/neuralresearcher_data/o/covid%2Fdata%2Fmodel_exp304%2Fcheckpoint?fields=size%2Cgeneration%2Cupdated) has been stuck at 0 of 0 bytes for 61 seconds and will be aborted. CURL timing information: lookup time: 0.008897 (No error), connect time: 0.090673 (No error), pre-transfer time: 0 (No error), start-transfer time: 0 (No error)
2020-07-20 11:44:41.008628: E tensorflow/core/platform/cloud/curl_http_request.cc:611] The transmission of request 0x55ccb5e1d8b0 (URI: https://www.googleapis.com/storage/v1/b/neuralresearcher_data/o/covid%2Fdata%2Fmodel_exp304%2Fcheckpoint?fields=size%2Cgeneration%2Cupdated) has been stuck at 0 of 0 bytes for 61 seconds and will be aborted. CURL timing information: lookup time: 0.008684 (No error), connect time: 60.0513 (No error), pre-transfer time: 0 (No error), start-transfer time: 0 (No error)

Of course, I can add a proxy to the network to successfully complete the HTTP request, but after adding the proxy, using the local IP to access the service fails. Is there any way to solve this problem, for example, by downloading the relevant data locally in advance?

sorting

It is somewhat related to faceting (#46) but requires much less work and thinking.

One of the pieces of feedback we have received directly from the feedback form is the need to sort results by publication date. I agree this is an important feature, especially since our index is updated weekly; what people want is more of a differential.

Can we add it quickly?

cc @lintool @edwinzhng @nikhilro
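A sketch of what the API-side change could look like, sorting a reranked hit list by publication date; the hit records and the publish_time field name here are hypothetical stand-ins for whatever the ranking pipeline actually returns:

```python
from datetime import date

# Hypothetical hit records; the real objects come from the ranking pipeline.
hits = [
    {"id": "a", "score": 0.9, "publish_time": date(2020, 3, 1)},
    {"id": "b", "score": 0.7, "publish_time": date(2020, 4, 10)},
    {"id": "c", "score": 0.8, "publish_time": date(2020, 4, 10)},
]

# Newest first; break date ties by relevance score so ranking stays visible.
by_date = sorted(hits, key=lambda h: (h["publish_time"], h["score"]), reverse=True)
print([h["id"] for h in by_date])  # ['c', 'b', 'a']
```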

Use T5 as a highlighter

Now that we are using Huggingface's T5 reranker, we can try to replace the BioBERT highlighter with T5's context vectors. We would then run inference with only one model, which will decrease our latency and free up one GPU.

Note: we will need to evaluate this T5-based highlighter on BioASQ to see if it is actually better than BioBERT.
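The proposal can be sketched as follows with a pure-Python stand-in; in the real system the vectors would come from the T5 encoder, and cosine similarity is just one candidate scoring function to be validated on BioASQ:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Hypothetical context vectors for a query and three document sentences.
query_vec = [0.9, 0.1, 0.0]
sentence_vecs = {
    "sent0": [0.1, 0.9, 0.1],
    "sent1": [0.8, 0.2, 0.1],
    "sent2": [0.0, 0.1, 0.9],
}

# Highlight the sentence whose context vector is closest to the query's.
best = max(sentence_vecs, key=lambda s: cosine(query_vec, sentence_vecs[s]))
print(best)  # sent1
```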

Is covidex.ai down?

Hi,

Just wondering, is the covidex.ai platform down?

Thanks!

Regards,
Shreyas

Docker image + index modification request

I would like to request a Docker image with the environment already set up, plus advice on how to modify the API to use a custom index (some components, such as related articles, might have to be disabled), since the web client is very well made. I would like to use it for a topic-annotation project where the web UI with a centralized index would be shared across the team, ensuring consistency.

Mainly I would like the following feature:

  • docker image with configurable properties to point to a custom index and to disable gpu components.
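A minimal sketch of what such an image might look like; the base image tag, paths, and environment variable names are assumptions, not tested against this repo:

```dockerfile
# Hypothetical sketch -- not an official image for this repo.
FROM nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
RUN apt-get update && apt-get install -y python3.7 python3-pip openjdk-11-jdk maven
COPY api/ /app/api/
WORKDIR /app/api
RUN pip3 install -r requirements.txt
# Index location and GPU toggles supplied at run time, e.g.:
#   docker run -e INDEX_PATH=/indexes/custom -e CUDA_DEVICE=-1 ...
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```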

Same exact banner for Covidex and Neural Covidex

Both neural and non-neural versions should look exactly the same in the landing page, modulo differences in landing page text (and the neural vs. non-neural header).

That is, if I have both sites open in different tabs, I should be able to tab back and forth and see exactly the same location for the search box, the same height of the banner, etc. That way, we establish a consistent look and feel.

Discussion: faceting and pagination in multistage ranking

Here are some of my thoughts about faceting and pagination in multistage ranking.

tl;dr - it's not clear to me what the "correct" implementation is... see details below for full discussion.

The standard mental model of faceting is as a slice of the entire collection, i.e., how many documents contain that facet. This is the intuitive user expectation, and works "as expected" with pagination, i.e., when the user clicks "next page", the search engine fetches the next page of results that contain the facet, until we run out of results. This is exactly what Blacklight does.

However, when we move to a reranking architecture, it's a bit unclear what the system should do. The simplest implementation would be to provide faceted browsing on the initial candidate list. That is, the initial retrieval returns 1k hits, system reranks, and the faceted browsing is on those 1k hits.

This is fine, but problematic from both the perspective of faceted browsing and pagination:

  • From the perspective of pagination, users are accustomed to results pages of a fixed size: the first page shows hits 1-10, the second hits 11-20, etc. Since we're already retrieving and reranking the top 1k hits, it makes no sense for us to paginate. But what happens if the user wants more hits? Do we retrieve the next 1k raw hits, and then rerank those? This has the downside that each results page could contain a different number of hits. Also, under the paragraph condition, we'd have to dedup with respect to previous hits, which means keeping track of state, which means a more complex implementation, etc.

  • From the perspective of faceting, the implementation outlined above diverges from user expectations. Say we facet only on reranked results, and thus the interface shows only the matching hits. The user scrolls to the bottom and wants more hits. Obviously, we can go back and fetch more hits (with all the complexities above), but then the facet counts aren't accurate...

These are important considerations, since systematic reviews, one of the use cases for our system, need high recall and thus may require going deep into hit lists. For example, this metareview examines over 1300 articles. Faceted browsing, I imagine, would also be helpful for systematic reviews. We don't have an RCT facet right now, but if we had one, I think it would be used quite a bit.

So, it's not clear to me what the "correct" implementation is...

Thoughts?
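To make the tension concrete, here is a minimal sketch of the "facet on the candidate list" implementation; the counts it reports cover only the reranked candidates, not the whole collection, which is exactly the divergence from user expectations described above. The hit records and field names are hypothetical:

```python
from collections import Counter

# Hypothetical reranked candidates (in practice ~1k hits), each with facet values.
candidates = [
    {"id": 1, "year": 2020, "source": "medRxiv"},
    {"id": 2, "year": 2020, "source": "PubMed"},
    {"id": 3, "year": 2019, "source": "PubMed"},
]

def facet_counts(hits, field):
    """Facet counts over the candidate list only -- not over the full collection."""
    return Counter(h[field] for h in hits)

print(facet_counts(candidates, "source"))  # Counter({'PubMed': 2, 'medRxiv': 1})
```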

Change example

What do we know about asymptomatic transmission of COVID-19? -> Are there cases of asymptomatic transmission of COVID-19?

Is Covidex now Cydex?

In the following paper:

@inproceedings{ding2020cydex,
  title={Cydex: Neural Search Infrastructure for the Scholarly Literature},
  author={Ding, Shane and Zhang, Edwin and Lin, Jimmy},
  booktitle={Proceedings of the First Workshop on Scholarly Document Processing},
  pages={168--173},
  year={2020}
}

it describes Cydex as an extension of Covidex to decouple it from CORD. While the paper says Cydex is open source, there does not seem to be any URL to the source.

Is this repo for Cydex as well?

Retrieve metadata from Covidex

Hi, this is a really cool search engine tool on COVID-19!
However, in the search engine, is there any way we could get more metadata, such as the cord_id (document id) or the Impact Factor (IF) of the journal or conference?

Build HNSW index in covidex

At the moment, we build the HNSW index with https://github.com/x65han/ai2 on tuna with each data drop, then upload the resulting tar.gz to Dropbox, etc.

We can instead put the HNSW indexing logic in Covidex under a new directory, hnsw/, so we can build the HNSW index from Covidex without going through Dropbox.

With the data drop on May 1st, building the HNSW index took 7 minutes on tuna.

I will merge the HNSW index when the next data drop comes in a week.

a query works for basic but fails for neural

try "Suggestions for disinfection of ophthalmic examination equipment and protection of ophthalmologist against 2019 novel coronavirus infection (Chinese Journal of Ophthalmology)"
