
kasyapwiki / cricket-players


This project forked from indicwiki-iiit/cricket-players


Implementation to generate 10k+ Telugu articles on Cricket players

Python 0.82% HTML 98.65% Jupyter Notebook 0.22% Jinja 0.31%

cricket-players's Introduction

Cricket-players

Cricket-players is one of the domains of the IndicWiki Project.

Description

The aim of this domain is to create a large number of articles (about 10,000) on notable cricket players from around the world. The domain has potential because of the interest in and passion for cricket and cricketers in our country, and the Telugu-speaking states are no exception. We therefore generate data-rich articles in Telugu for about 10,000 notable players and upload them to Wikipedia, so that people who can read only in their native language (here, Telugu) can benefit from this information.

Installation

Create a virtual environment in the project folder using the following commands.

$ pip install virtualenv
$ virtualenv -p python3.7 venv

Once the virtual environment (venv) has been created, clone the repository, or download the zip archive of the project and extract it into the project folder.

Activate the virtual environment and install the dependencies with the following command.

$ pip install -r requirements.txt

requirements.txt is included in the project directory.

Guide to generating the XML dump and articles for different cricketers

  • Clone the repository to your local system.
  • Generating articles requires the folders data and templates, and the files render.py and genXML.py. Ensure that these files and folders are available.
  • In the file 'render.py', update the all_ids list so that it contains the Cricinfo IDs of all the required players. If necessary, also update the lists split_ids and file_names (their purpose is described in comments in the file).
  • Execute 'render.py' with the command: python3.7 render.py. This generates the XML dump for the given list of player IDs and stores it in the corresponding XML files named in the file_names list (described in comments in render.py).
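The relationship between all_ids, split_ids and file_names can be pictured with a small sketch. This is not the code in render.py: it assumes that split_ids holds the indices in all_ids at which a new output file begins, and the function name group_ids_by_file, the IDs and the file names are hypothetical. The comments in render.py remain the authoritative description.

```python
from typing import Dict, List


def group_ids_by_file(
    all_ids: List[int], split_ids: List[int], file_names: List[str]
) -> Dict[str, List[int]]:
    """Assign each player ID to one output file.

    Assumption (not taken from render.py): split_ids[i] is the index in
    all_ids where file i + 1 starts, so there is exactly one more file
    name than there are split points.
    """
    if len(file_names) != len(split_ids) + 1:
        raise ValueError("expected one more file name than split points")
    # Turn the split points into (start, end) slice boundaries.
    bounds = [0, *split_ids, len(all_ids)]
    return {
        name: all_ids[start:end]
        for name, start, end in zip(file_names, bounds, bounds[1:])
    }


# Made-up IDs and file names, for illustration only.
groups = group_ids_by_file(
    all_ids=[101, 102, 103, 104, 105],
    split_ids=[3],
    file_names=["part-1.xml", "part-2.xml"],
)
for name, ids in groups.items():
    print(name, ids)
```

Under this assumption, a single split point yields two files, which matches the two-part dump described later in this document.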

Github Structure

data

Github folder Link: https://github.com/indicwiki-iiit/Cricket-players/tree/main/data

  • This folder contains the penultimate and final versions of our datasets, along with the data-cleaning implementation and a Sweetviz report for analyzing the dataset.
    • final_cricket_players_translated_dataset_with_images.csv -> This is the csv format of the final version of the dataset obtained after merging, cleaning and translation/transliteration.
    • final_cricket_players_translated_dataset_with_images.pkl -> This is the pickle file of the final version of the dataset obtained after merging, cleaning and translation/transliteration.
    • final_cricket_players_translated_dataset_with_images.xlsx -> This is the xlsx format of the final version of the dataset obtained after merging, cleaning and translation/transliteration.
    • generate_report.py -> This script generates the Sweetviz report for the dataset, for detailed analysis.
    • SWEETVIZ_REPORT.html -> This is a brief report on the dataset, generated using the Sweetviz library, for better analysis of the data.

data_cleaning

Github folder Link: https://github.com/indicwiki-iiit/Cricket-players/tree/main/data/data_cleaning

  • This folder contains the complete implementation for data cleaning.
    • initial_cleaning.py -> This file performs the initial data cleaning, based on defects observed in the Sweetviz report.
    • symbol_replacement.py -> This file performs the secondary data cleaning, based on defects observed in the dataset obtained after the initial cleaning.
    • final_cleaning.py -> This file performs the final data cleaning, based on defects observed in the dataset obtained after the secondary cleaning.

templates

Github folder Link: https://github.com/indicwiki-iiit/Cricket-players/tree/main/templates

  • This folder contains the implementations of templates required for rendering articles.
    • categories.j2 -> This file contains the jinja2 template corresponding to categories of a cricket player in the article.
    • info.j2 -> This file contains the jinja2 template corresponding to infobox and overview of a cricket player in the article.
    • life.j2 -> This file contains the jinja2 template corresponding to professional life section of a cricket player in the article.
    • personal_life.j2 -> This file contains the jinja2 template corresponding to personal life section of a cricket player in the article.
    • player_statistical_analysis.j2 -> This file contains the jinja2 template corresponding to statistical analysis sub-section of a cricket player in the article.
    • records.j2 -> This file contains the jinja2 template corresponding to records, awards and references of a cricket player in the article.
    • render_categories.py -> This file contains implementation which displays relevant categories for a given player based on his/her information.
    • render_info.py -> This file contains implementation which displays infobox and overview for a given player based on his/her information.
    • render_life.py -> This file contains implementation which displays relevant professional life details for a given player based on his/her information.
    • render_personal_life_and_statistics.py -> This file contains implementation which displays personal life and statistical analysis for a given player based on his/her information.
    • render_records.py -> This file contains implementation which displays relevant records, awards and references for a given player based on his/her information.

images

Github folder Link: https://github.com/indicwiki-iiit/Cricket-players/tree/main/images

  • This folder contains the implementation for scraping images from Wikidata and English Wikipedia articles, and the dataset obtained by doing so.
    • collect_image_links.py -> This file contains the implementation for extracting image links of players with a valid English Wikipedia article.
    • get_images.py -> This file contains the implementation for extracting the English Wikipedia article URL for players having a valid Wikidata ID.
    • cricket_player_images.csv -> This file contains the dataset with the Wikipedia article URL, the Wikipedia article infobox image link, the Wikidata ID, etc. for each player (the key information used for extracting images).

records

Github folder Link: https://github.com/indicwiki-iiit/Cricket-players/tree/main/records

  • This folder contains the implementation for translating all records attributes (for which online libraries didn't work).
    • records_processing.py -> This file contains the script for identifying the count and type of unique sentence structures in the records attribute of the dataset.
    • fix_records.py -> This file contains the script for translating the records attribute of the dataset.
    • Part-1_records_translation.xlsx -> This file contains the first split of the dataset of unique sentence structures for records, with their corresponding translations (done manually).
    • Part-2_records_translation.xlsx -> This file contains the second split of the dataset of unique sentence structures for records, with their corresponding translations (done manually).
    • Part-3_records_translation.xlsx -> This file contains the third split of the dataset of unique sentence structures for records, with their corresponding translations (done manually).
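The idea of translating records by sentence structure can be sketched as follows. This is a minimal illustration, not the logic of records_processing.py or fix_records.py: it assumes that two records share a structure when they differ only in their numbers, and the names sentence_structure, count_structures and translate_record, the sample records and the stand-in translation are all hypothetical.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Optional

# Integers and decimals are treated as the variable part of a record.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def sentence_structure(record: str) -> str:
    """Replace every number with a placeholder, leaving the fixed wording."""
    return NUMBER.sub("<NUM>", record.strip())


def count_structures(records: Iterable[str]) -> Counter:
    """Count how many records share each sentence structure."""
    return Counter(sentence_structure(r) for r in records)


def translate_record(record: str, translations: Dict[str, str]) -> Optional[str]:
    """Fill a manually translated structure with the record's own numbers.

    Returns None when the structure has no translation. Numbers are
    re-inserted in their original order, which assumes the translated
    template keeps the same number of placeholders in the same order.
    """
    template = translations.get(sentence_structure(record))
    if template is None:
        return None
    numbers = iter(NUMBER.findall(record))
    return re.sub(r"<NUM>", lambda _match: next(numbers), template)


# Made-up records; the "translation" is an English stand-in for a
# manually written Telugu sentence.
records = [
    "Most runs in a series: 974",
    "Most runs in a series: 810",
    "Fastest fifty: 12 balls",
]
translations = {"Most runs in a series: <NUM>": "[te] most runs in a series: <NUM>"}

print(count_structures(records))
print(translate_record(records[0], translations))
```

Grouping by structure means each distinct wording is translated once by hand, rather than once per record.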

scraping

Github folder Link: https://github.com/indicwiki-iiit/Cricket-players/tree/main/scraping

  • This folder contains the implementation of data scraping for obtaining dataset.
    • scrape.ipynb -> This file contains the script for scraping notable players' data from the official Cricinfo site.
    • Stats_JSON_Data Scraper.ipynb -> This file contains the script for scraping additional stat details for notable players from the official Cricinfo site, in JSON format.

translating_data

Github folder Link: https://github.com/indicwiki-iiit/Cricket-players/tree/main/translating_data

  • This folder contains the implementation for obtaining and storing a translated (and transliterated) dataset, which avoids the overhead of calling translation libraries while generating articles via the XML file. It also contains the intermediate datasets obtained in the process.
    • awards.csv -> This file contains the dataset of translated values for the awards attribute.
    • cricket_players_records.xlsx -> This file contains the dataset of translated values for the records attributes (translated with Excel, which didn't produce desirable output).
    • handle_debuts.py -> This file contains the implementation that rectifies mistakes in the existing translations of debut strings and handles abbreviations in those sentences.
    • info_overview.py -> This file contains the implementation for obtaining Telugu content for all attributes associated with the infobox and overview of a player's article.
    • modified_info_overview.csv -> This file contains the dataset of translated values for the attributes of the infobox and overview.
    • Personal_life_stats_translated.csv -> This file contains the dataset of translated values for the attributes of the personal life section and the statistical analysis sub-section.
    • professional_life_trans.csv -> This file contains the dataset of translated values for the attributes of the professional life section.
    • professional_life.py -> This file contains the implementation for obtaining Telugu content for all attributes associated with the professional life of a player.

Report

You can find the detailed report here

Sample Article

You can find the sample article here

cricket_players(part-1).xml

Github file Link: https://github.com/indicwiki-iiit/Cricket-players/tree/main/cricket_players(part-1).xml

  • This file contains the XML dump with the articles for the first 5000 of the 9953 players whose data has been collected.

cricket_players(part-2).xml

Github file Link: https://github.com/indicwiki-iiit/Cricket-players/tree/main/cricket_players(part-2).xml

  • This file contains the XML dump with the articles for the last 4953 of the 9953 players whose data has been collected.

duplicates_to_consider.json

Github file Link: https://github.com/indicwiki-iiit/Cricket-players/tree/main/duplicates_to_consider.json

  • This file contains a dictionary specifying which player is to be considered when duplicate names are encountered.
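A dictionary like this could be applied as in the sketch below. The actual layout of duplicates_to_consider.json is not described here, so the sketch assumes a mapping from a player name to the one ID that should be kept. The function name resolve_duplicates, the names and the IDs are hypothetical, as is the choice to keep the first occurrence of a name that the dictionary does not list.

```python
import json
from typing import Dict, List, Tuple


def resolve_duplicates(
    players: List[Tuple[str, int]], preferred: Dict[str, int]
) -> List[Tuple[str, int]]:
    """Keep one (name, ID) pair per player name.

    Assumption: `preferred` maps a duplicated name to the ID to keep.
    Names missing from `preferred` keep their first occurrence.
    """
    kept: Dict[str, int] = {}
    for name, player_id in players:
        if name in preferred:
            # Only the explicitly preferred ID survives for this name.
            if player_id == preferred[name]:
                kept[name] = player_id
        else:
            kept.setdefault(name, player_id)
    return list(kept.items())


# Stand-in for the contents of the JSON file (made-up name and ID).
preferred = json.loads('{"A. Example": 202}')
players = [("A. Example", 201), ("A. Example", 202), ("B. Sample", 301)]
print(resolve_duplicates(players, preferred))
```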

genXML.py

Github file Link: https://github.com/indicwiki-iiit/Cricket-players/tree/main/genXML.py

  • This file contains the code for generating an XML file which has the data for rendering an article.
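As a rough picture of what an importable XML dump looks like, the sketch below wraps article text in a simplified version of the MediaWiki export layout (a mediawiki root with one page element per article). It is not the code in genXML.py, which may emit more fields; the function name build_dump and the sample article are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple


def build_dump(articles: Iterable[Tuple[str, str]]) -> ET.Element:
    """Build a simplified MediaWiki-style dump from (title, wikitext) pairs."""
    root = ET.Element("mediawiki")
    for title, wikitext in articles:
        page = ET.SubElement(root, "page")
        ET.SubElement(page, "title").text = title
        ET.SubElement(page, "ns").text = "0"  # namespace 0 = main articles
        revision = ET.SubElement(page, "revision")
        ET.SubElement(revision, "text").text = wikitext
    return root


# Placeholder article, for illustration only.
dump = build_dump(
    [("Example Player", "'''Example Player''' is a placeholder article.")]
)
xml_text = ET.tostring(dump, encoding="unicode")
print(xml_text)
```

ElementTree escapes characters such as & and < in the article text, which is safer than assembling the XML by string concatenation.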

render.py

Github file Link: https://github.com/indicwiki-iiit/Cricket-players/tree/main/render.py

  • This is the code used for rendering the cricket player articles using the jinja2 templates from the templates folder. It generates the XML dump based on the parameters set in the file (as described in its comments).

requirements.txt

Github file Link: https://github.com/indicwiki-iiit/Cricket-players/tree/main/requirements.txt

  • This file lists all the packages and libraries required to build this project.

cricket-players's People

Contributors

tgv2002, hrudaikoda, vkk5, sowmyavarakala, saitejakondapalli
