
alexhan46 / studyfind-ai-web-crawler

studyfind-ai-web-crawler's Introduction

AI Web Crawler 1.0

AI Web Crawler is a Python tool that downloads data about available research studies, formats it, and uploads it to a database.

Coded in Python v3.8.5

Release Notes

v1.0.0 (20/11/2020)

New Features

#7 #8 #9 #15 #25 Download research studies by crawling clinicaltrials.gov

#6 #26 Schedule crawling tasks to run on a recurring basis

#27 #33 Use NLP to create a brief summary and a list of keywords for each crawled study

Automatically upload the data to the specified Firebase database

#34 Use multithreading for superior performance (up to 7 studies per sec)
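
As an illustration of the multithreading feature listed above, here is a minimal sketch of downloading several studies concurrently with a thread pool. It is not the project's actual code: `fetch_study`, `crawl_studies`, and the `max_workers` value are hypothetical names and settings chosen for this example, and the fetch step is a stub so the sketch runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_study(study_id):
    """Hypothetical stand-in for the real download step.

    A real crawler would request the study from clinicaltrials.gov here;
    this stub returns a placeholder record so the sketch runs offline.
    """
    return {"id": study_id, "title": f"Study {study_id}"}


def crawl_studies(study_ids, fetch=fetch_study, max_workers=8):
    """Fetch many studies concurrently; return (results, errors) keyed by ID."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one task per study and remember which future belongs to which ID.
        futures = {pool.submit(fetch, sid): sid for sid in study_ids}
        for future in as_completed(futures):
            sid = futures[future]
            try:
                results[sid] = future.result()
            except Exception as exc:
                # A single failed study should not stop the rest of the batch.
                errors[sid] = exc
    return results, errors


if __name__ == "__main__":
    done, failed = crawl_studies(["NCT00000001", "NCT00000002"])
    print(len(done), "downloaded,", len(failed), "failed")
```

Threads suit this kind of workload because each download spends most of its time waiting on the network, so several requests can be in flight at once.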

Bug Fixes

The crawler failed to run when HTTP request errors occurred (fixed)
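
One common way to keep a crawler running despite HTTP request errors is to retry a failed request a few times and then skip it. The sketch below shows that pattern using only the standard library. It is an assumption about how such handling could look, not the fix actually applied in this project; `download`, `fetch_with_retry`, `retries`, and `delay` are hypothetical names.

```python
import time
import urllib.error
import urllib.request


def download(url, timeout=10):
    """Fetch a URL and return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_with_retry(url, fetch=download, retries=3, delay=1.0):
    """Return the body for `url`, or None if every attempt fails."""
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except OSError as exc:
            # URLError, HTTPError, and socket timeouts are all OSError subclasses.
            print(f"attempt {attempt}/{retries} failed for {url}: {exc}")
            if attempt < retries:
                time.sleep(delay)  # brief pause before trying again
    return None
```

Returning `None` instead of raising lets the caller log the failure and move on to the next study.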

Installation Guide

Requirements

Python 3 (recommended 3.8.5)
Pip (included with Python 3)
The Firebase JSON provided by the development team

Installation

Download the code using git or straight from GitHub
To install dependencies, run the following command in the folder where the code was downloaded

   pip install -r requirements.txt

Run the following commands to install other required dependencies

  python -m nltk.downloader stopwords
  python -m nltk.downloader universal_tagset
  python -m spacy download en

Place the Firebase JSON into the same folder

Usage

There are two ways to execute the crawler:

A. Using the admin panel to schedule recurring crawls

python manage.py runserver

To use the admin panel, you must be an authorized user with access to the login information.

B. Manually executing the crawler

python crawler.py

Known Limitations

When updating studies that have already been crawled, a limit imposed by clinicaltrials.gov can cause some studies to be left out. In our testing, we never came close to this limit as long as the crawler was executed daily.

Contributors ✨

Meet our teammates

Jonathon Sisson

studyfind-ai-web-crawler's Issues

Test pysummarization library

Test Gensim TextRank library

Test and evaluate rake-nltk python library

Create crawling engine

Lay foundation for the rest of the tasks involving the crawler

Finalize architectural design

The team will work together to go over and finalize the architectural design of the product.

Test polyglot library

Create Admin Panel

Create admin panel using Django which allows user to run and schedule the crawler, as well as configure the websites.

Get complete list of updated studies

Implement and trial NLP library

test

Implement and trial text summarization NLP algorithm

Get ID numbers of studies

Find a way to get a list of ID numbers for all new studies from clinicaltrials.gov

Test spaCy library

Test TextBlob library

Implement better error handling

Test pke library

Manual testing

Evaluate pyate NLP library

Scrape only newly added studies without repeatedly re-searching previously crawled ones

Download and format data

Given an ID number of a study, download the data from clinicaltrials.gov, and format/prune the data to fit the client's specifications

Connect crawler to admin panel

Test sumy python library

test issue

create boilerplates for our project

Research libraries

Since the team is fairly new to NLP, we will spend some time researching appropriate NLP and web crawling libraries so we can implement those features in later sprints.

Create readme file to have installation guide

Implement unit tests

Create database and export data

Create database and upload formatted data to the database.

Test nltk library

Implement multithreading
