CS 586 Final Project

COVID-19: Understanding the range of incubation periods and how long individuals are contagious after recovery.

Overview

We utilized the Semantic Scholar COVID-19 Open Research Dataset (CORD-19) as well as the COVID-19 Literature Knowledge Graph from Steenwinckel et al. derived from the CORD-19 dataset to extract information regarding the COVID-19 incubation and contagious periods.

Our implementation consists of the following steps:

  1. Identify keywords associated with the incubation period
  2. Extract the relevant papers from the CORD-19 dataset using a regex query
  3. Use the CDQA library and spaCy's NER model to find the 'answer' to the query 'what is the incubation period'
  4. Filter out extreme values and unrecognizable characters
  5. Repeat the third step
  6. Generate the PageRank scores
  7. Use the PageRank values to weight the number of days suggested by each paper

Results: The incubation period is 3.5 to 13.5 days (mean = 8.41 days, SD = 5); the weighted incubation period is 8.38 days. The contagious period is 3 to 7 days (mean = 5.43 days, SD = 2.12); the weighted contagious period is 5.38 days.

The implementation Directory

This directory contains all of our implementation (code) files along with the additional assets detailed below.

assets Directory

The assets directory houses all of the relevant .csv and .tsv files, as well as the cleaned COVID-19 knowledge graph.

final_ib.csv contains information about the literature from CORD-19 that pertains to incubation/contagious periods.

final_ib_pagerankings_title.tsv contains the page rankings produced by running the PageRank algorithm with the NetworkX library, along with each paper's title and DOI.

Utilizing the COVID-19 Literature Knowledge Graph from data.zip

This compressed file contains our modified knowledge graph in N-Triples format. Note that the knowledge graph was modified because of parsing errors in the original literature knowledge graph.

Generating the CSV file for relevant sentences in the papers with sentence_extraction.ipynb

Out of all the papers and their texts, we have to find the sentences that mention the incubation period. The search cannot be straightforward, because searching for just "incubation period" produces false matches, such as the incubation period of a different disease or a mention of the incubation period with no number of days, among many others. We therefore used several regex patterns to find the sentences that are relevant to our search. We apply the same method to find the contagious period as well.
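The filtering idea described above can be sketched roughly as follows. This is only an illustration: the patterns, the list of other diseases, and the function name relevant_sentences are assumptions made for this example and are not taken from sentence_extraction.ipynb.

```python
import re

# Hypothetical patterns; the notebook's actual expressions may differ.
KEYWORD = re.compile(r"\bincubation\s+period\b", re.IGNORECASE)
# A number or a range ("5 days", "2 to 14 days", "3-7 days").
DAYS = re.compile(
    r"\b\d+(?:\.\d+)?(?:\s*(?:-|to)\s*\d+(?:\.\d+)?)?\s*days?\b",
    re.IGNORECASE,
)
# Illustrative list of diseases whose incubation periods we want to skip.
OTHER_DISEASE = re.compile(r"\b(?:influenza|measles|ebola)\b", re.IGNORECASE)


def relevant_sentences(text):
    """Return sentences that mention the keyword together with a day count."""
    # Naive sentence split on '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [
        s for s in sentences
        if KEYWORD.search(s) and DAYS.search(s) and not OTHER_DISEASE.search(s)
    ]


sample = (
    "The incubation period of COVID-19 is 2 to 14 days. "
    "The incubation period is still unknown. "
    "The incubation period of influenza is 2 days. "
    "Patients recovered in 10 days."
)
print(relevant_sentences(sample))
```

Requiring both the keyword and a day count in the same sentence removes mentions without a number, and the exclusion pattern removes sentences about other diseases. The same structure would apply to the contagious period with a different keyword pattern.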

Generating the Page Rankings with page_rank.ipynb

Because we wanted to give more weight to the most credible papers, we used the PageRank algorithm to rank each paper based on how it is cited by other papers. These rankings were then used as weights in determining the incubation and contagious periods.

page_rank.ipynb can be opened in either Google Colab or Jupyter Notebook. Note that we ran this file in Google Colab and accessed files in our own Google Drive, so the paths will need to be updated. The .ipynb contains instructions as well as additional explanations of the implementation. The first cell can be run to install the dependency libraries: rdflib, networkx, tqdm. The role of each library is listed below:

  • rdflib: used to load the literature knowledge graph and generate a citation subgraph.
  • networkx: used to run the PageRank algorithm
  • tqdm: this library is purely optional, but it is helpful for displaying progress bars.
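To show what the PageRank step computes, here is a dependency-free sketch that uses power iteration over a small citation mapping. It is a stand-in for illustration only: the notebook relies on NetworkX for this computation, and the function name pagerank, the damping value of 0.85, the iteration count, and the toy paper IDs are all assumptions made for this example.

```python
def pagerank(citations, damping=0.85, iterations=100):
    """Power-iteration PageRank over a {citing_paper: [cited_papers]} mapping."""
    # Collect every paper that appears as a citing or cited node.
    nodes = set(citations)
    for cited in citations.values():
        nodes.update(cited)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Rank held by papers that cite nothing is spread evenly over all papers.
        dangling = sum(rank[node] for node in nodes if not citations.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        # Each paper passes an equal share of its rank to every paper it cites.
        for citing, cited_list in citations.items():
            if cited_list:
                share = damping * rank[citing] / len(cited_list)
                for cited in cited_list:
                    new_rank[cited] += share
        rank = new_rank
    return rank


# Toy citation graph with made-up paper IDs: A and B both cite C.
toy = {"A": ["C"], "B": ["C"], "C": []}
scores = pagerank(toy)
for paper in sorted(scores, key=scores.get, reverse=True):
    print(paper, round(scores[paper], 4))
```

In this toy graph the paper cited by both others receives the highest score, and the scores sum to 1, which is what makes them usable as weights later on.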

Extracting incubation/contagious period from Papers with day_extraction.ipynb

Once we have the CSV file of sentences, we clean out some basic symbols. We then apply the CDQA library and find the POS tags of the sentences. Using the NER model and the CDQA library, we extract the number of days from each sentence of each paper. After manual inspection, we identify abnormal and extreme values and clean the data again. Once we have the number of days suggested by each paper, we multiply it by the paper's PageRank value as a weight to get the weighted average.
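The final weighting step amounts to a weighted mean, where each paper's suggested number of days is weighted by its PageRank value. A minimal sketch follows, assuming the days and ranks are already available as parallel lists; the function name weighted_mean and the sample numbers are illustrative and are not taken from day_extraction.ipynb or from our results.

```python
def weighted_mean(days, weights):
    """Return sum(d * w) / sum(w) for parallel lists of days and weights."""
    if len(days) != len(weights):
        raise ValueError("days and weights must have the same length")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(d * w for d, w in zip(days, weights)) / total


# Made-up values for illustration: days suggested by three papers and
# the PageRank value of each paper.
suggested_days = [5, 7, 14]
page_ranks = [0.5, 0.3, 0.2]
print(weighted_mean(suggested_days, page_ranks))
```

Dividing by the sum of the weights keeps the result in days even when the PageRank values of the selected papers do not add up to 1.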

Accuracy

We randomly sampled 100 papers and noted their suggested incubation period, then sampled another 100 papers for the contagious period. The average incubation period from the sample was 8.5 days, which is very close to the 8.8 days that we found, and the average contagious period was 4.5 days, which is also close to the 5.2 days that we found.

Final Report and Presentation

  • The presentation directory holds our presentation slides in .ppt format and the report directory holds our final report in .pdf format.
