Diseases-Tewiki

This repository contains all the work done to enrich the Tewiki Hindi Wikipedia in the Diseases and Conditions domain.

Stages of the project

The following ordered list gives an idea of which stage the project is currently in:

  • Scrape Data from Web sources
  • Clean and Format the data
  • Scrape image URLs from Wiki commons
  • Scrape infobox information from Wikipedia
  • Create a sample article
  • Review of the sample article
  • Work on feedback from review of sample article
  • Review of the dataset
  • Create template for article generation
  • Review of the template
  • Work on feedback from review of template
  • Create the XML dump for all the diseases to be published

Folders

  • Datasets: Contains all the dataset (csv) files that have been used for this project.
  • Code: Contains all the code that has been used in the project.
  • Template: Contains the jinja template for the XML generation of a Disease article.

Datasets

This folder contains the final dataset as well as the other datasets that have been used in the project:

  • FinalDiseasesHindi: the final dataset that will be used for article generation. It contains data on 1157 diseases and has 24 attributes. Its Excel version is here.
  • Others/ScrapedEnglish: the original data on diseases and conditions scraped from Mayo Clinic and the NHS.
  • Others/InfoboxEnglish: the infobox attributes scraped from the diseases' English Wikipedia pages.
  • Others/MergedEnglish: the merged dataset combining the original scraped data and the infobox data. Its Excel version can be found here.
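The merge of scraped data with infobox data described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the repository's actual merge code. The column names (`disease`, `symptoms`, `specialty`) and the `merge_on_key` helper are assumptions made for the example.

```python
import csv
import io


def merge_on_key(scraped_rows, infobox_rows, key="disease"):
    """Left-join infobox attributes onto scraped rows by a shared key column."""
    infobox_by_key = {row[key]: row for row in infobox_rows}
    # Columns contributed by the infobox file (hypothetical names).
    extra_columns = [c for c in infobox_rows[0] if c != key] if infobox_rows else []
    merged = []
    for row in scraped_rows:
        match = infobox_by_key.get(row[key], {})
        combined = dict(row)
        for column in extra_columns:
            # Diseases without an infobox get empty strings, so every row
            # ends up with the same set of columns.
            combined[column] = match.get(column, "")
        merged.append(combined)
    return merged


# Tiny in-memory stand-ins for the two CSV files (illustrative data only).
scraped_csv = "disease,symptoms\nAsthma,Wheezing\nGout,Joint pain\n"
infobox_csv = "disease,specialty\nAsthma,Pulmonology\n"

scraped = list(csv.DictReader(io.StringIO(scraped_csv)))
infobox = list(csv.DictReader(io.StringIO(infobox_csv)))
merged = merge_on_key(scraped, infobox)
```

A left join keeps every scraped disease even when no infobox was found for it, which seems to match the idea that the infobox only adds attributes to the scraped data.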

Code

This folder contains various files:

  • scrape: contains the code for scraping the NHS website links available for the diseases. These links can then be used to scrape information on diseases for which Mayo Clinic had little or no information. It also has code for generating a Sweetviz report on the dataset.
  • format: contains the code for turning the original unstructured dataset into a more structured form. It removes sentences that are too informal, casual, or directional from the attributes, as well as website boilerplate sentences that offer no information about the disease.
  • check: contains the code for checking which rows of the dataset are valid for inclusion in the final dataset. It highlights diseases with too little information to generate an article, which reduced the number of possible articles from 1183 to 1157.
  • images: contains the code for scraping images of diseases from Wikimedia Commons. The scraped image is the one used on the disease's English Wikipedia page.
  • infobox: contains the code for scraping the infobox data available on the diseases' English Wikipedia pages. As a result, our infobox follows the same template as the disease infoboxes on Wikipedia.
  • translate: contains the code for translating the scraped English dataset into Hindi using Google Translate. It also contains code for transliteration.
  • translateInfobox: contains the code for translating the scraped English infobox information.
  • merge: contains the code for merging the translated files of the website-scraped data and the infobox data.
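To make the format and check steps above more concrete, here is a minimal sketch of sentence filtering and row validation. It is an assumption about how such steps could look, not the repository's code: the noise patterns, the column names (`symptoms`, `causes`), the word threshold, and both function names are hypothetical.

```python
import re

# Hypothetical markers of website boilerplate or directional language.
NOISE_PATTERNS = [
    re.compile(r"\bclick here\b", re.IGNORECASE),
    re.compile(r"\bsee your (doctor|gp)\b", re.IGNORECASE),
    re.compile(r"\bthis (page|website)\b", re.IGNORECASE),
]


def clean_text(text):
    """Drop sentences that match any noise pattern and keep the rest."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = [
        s for s in sentences
        if s and not any(p.search(s) for p in NOISE_PATTERNS)
    ]
    return " ".join(kept)


def has_enough_content(row, required=("symptoms", "causes"), min_words=5):
    """Return True if the required columns together hold enough words."""
    total = sum(len(row.get(col, "").split()) for col in required)
    return total >= min_words


raw = (
    "Asthma causes wheezing and shortness of breath. "
    "Click here to find a clinic. "
    "See your GP if symptoms persist."
)
cleaned = clean_text(raw)
```

Rows for which `has_enough_content` returns `False` would be the ones flagged as having too little information for an article.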

Generating articles

  • template: contains the template for the structure of a disease and condition article.
  • genXML: contains the backbone structure for conversion to XML.
  • render: contains the code for generating the XML dump of articles from the template and the dataset.
  • diseases: the final XML dump file of all 1157 diseases and conditions.
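The flow from dataset row to template to XML dump can be sketched as follows. The project uses a Jinja template. To keep this example free of dependencies, it swaps in the standard library's `string.Template` instead, and it builds a simplified MediaWiki-style `<page>` structure. The template text, the field names, and the `build_dump` helper are all assumptions for illustration.

```python
from string import Template
import xml.etree.ElementTree as ET

# Hypothetical article template; the real project uses a Jinja template.
ARTICLE_TEMPLATE = Template(
    "'''$name''' is a condition handled by $specialty.\n\n"
    "== Symptoms ==\n$symptoms\n"
)


def build_dump(rows):
    """Render each row with the template and wrap it in a page element."""
    root = ET.Element("mediawiki")
    for row in rows:
        page = ET.SubElement(root, "page")
        ET.SubElement(page, "title").text = row["name"]
        revision = ET.SubElement(page, "revision")
        # ElementTree escapes characters such as & and < in the text.
        ET.SubElement(revision, "text").text = ARTICLE_TEMPLATE.substitute(row)
    return ET.tostring(root, encoding="unicode")


# Illustrative data only; the real dataset has 24 attributes per disease.
rows = [
    {"name": "Asthma", "specialty": "Pulmonology", "symptoms": "Wheezing & coughing"},
]
xml_dump = build_dump(rows)
```

Building the dump through an XML library rather than by string concatenation means that special characters in the dataset are escaped automatically.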

Contributors

  • panwarjayant
