Cricket-players is one of the domains of the IndicWiki Project.
The aim of this domain is to create a large number of articles (about 10,000) about notable cricket players from across the world. The domain has potential because of the widespread interest in and passion for cricket and cricketers in our country, and the Telugu-speaking states are no exception. Hence, we are generating data-rich articles in Telugu for about 10,000 notable players and uploading them to Wikipedia, so that people who can read only in their native language (here, Telugu) can benefit from this information.
Create a virtual environment in the project folder using the following commands.
$ pip install virtualenv
$ virtualenv -p python3.7 venv
After the virtual environment (venv) has been created, clone the repository, or download the project as a zip archive and extract it into the project folder.
Activate the virtual environment and install the dependencies with the following command.
$ pip install -r requirements.txt
requirements.txt is included in the project directory.
- Clone the repository into the local system.
- For generating articles, one needs the folders: data, templates; and files: render.py, genXML.py. Ensure that these files and folders are available.
- In the file 'render.py', update the all_ids list such that it contains all the cricinfo IDs of the required players. Also update the lists split_ids and file_names based on requirements if necessary (their purpose is described in comments).
- Execute 'render.py' with the command: python3.7 render.py. This will generate the XML dump for the given list of player IDs and store it in the corresponding XML files, as named in the file_names list (described in comments in the render file).
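The exact meaning of all_ids, split_ids and file_names is documented only in the comments of render.py, so the following is a minimal sketch of one plausible arrangement, not the project's actual code. It assumes that split_ids holds ascending cut positions into all_ids and that file_names supplies one output name per resulting chunk; the helper chunk_ids and the sample IDs are hypothetical.

```python
def chunk_ids(all_ids, split_ids, file_names):
    """Map each output file name to its slice of all_ids.

    Assumption: split_ids are ascending cut positions into all_ids.
    """
    bounds = [0] + list(split_ids) + [len(all_ids)]
    chunks = [all_ids[start:end] for start, end in zip(bounds, bounds[1:])]
    if len(chunks) != len(file_names):
        raise ValueError("need exactly one file name per chunk")
    return dict(zip(file_names, chunks))


# Made-up IDs, for illustration only.
all_ids = [101, 102, 103, 104, 105]
plan = chunk_ids(all_ids, split_ids=[3], file_names=["part-1.xml", "part-2.xml"])
```

Under these assumptions, each file name would receive the articles for its own slice of player IDs, which is how a large run could be spread across several XML files.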
Github folder Link: https://github.com/indicwiki-iiit/Cricket-players/tree/main/data
- This folder contains the penultimate and final versions of our datasets, along with implementation for data cleaning and sweetviz report for analyzing the dataset.
- final_cricket_players_translated_dataset_with_images.csv -> This is the csv format of the final version of the dataset obtained after merging, cleaning and translation/transliteration.
- final_cricket_players_translated_dataset_with_images.pkl -> This is the pickle file of the final version of the dataset obtained after merging, cleaning and translation/transliteration.
- final_cricket_players_translated_dataset_with_images.xlsx -> This is the xlsx format of the final version of the dataset obtained after merging, cleaning and translation/transliteration.
- generate_report.py -> This script generates a sweetviz report of the dataset for detailed analysis.
- SWEETVIZ_REPORT.html -> This is a brief report of the dataset, generated using sweetviz library, for better analysis of data.
Github folder Link: https://github.com/indicwiki-iiit/Cricket-players/tree/main/data/data_cleaning
- This folder contains the complete implementation for data cleaning.
- initial_cleaning.py -> This file contains implementation which performs an initial level data cleaning based on defects observed in sweetviz report.
- symbol_replacement.py -> This file contains implementation which performs a secondary level data cleaning based on defects observed on dataset obtained after initial cleaning.
- final_cleaning.py -> This file contains implementation which performs a final level data cleaning based on defects observed on dataset obtained after secondary level cleaning.
Github folder Link: https://github.com/indicwiki-iiit/Cricket-players/tree/main/templates
- This folder contains the implementations of templates required for rendering articles.
- categories.j2 -> This file contains the jinja2 template corresponding to categories of a cricket player in the article.
- info.j2 -> This file contains the jinja2 template corresponding to infobox and overview of a cricket player in the article.
- life.j2 -> This file contains the jinja2 template corresponding to professional life section of a cricket player in the article.
- personal_life.j2 -> This file contains the jinja2 template corresponding to personal life section of a cricket player in the article.
- player_statistical_analysis.j2 -> This file contains the jinja2 template corresponding to statistical analysis sub-section of a cricket player in the article.
- records.j2 -> This file contains the jinja2 template corresponding to records, awards and references of a cricket player in the article.
- render_categories.py -> This file contains implementation which displays relevant categories for a given player based on his/her information.
- render_info.py -> This file contains implementation which displays infobox and overview for a given player based on his/her information.
- render_life.py -> This file contains implementation which displays relevant professional life details for a given player based on his/her information.
- render_personal_life_and_statistics.py -> This file contains implementation which displays personal life and statistical analysis for a given player based on his/her information.
- render_records.py -> This file contains implementation which displays relevant records, awards and references for a given player based on his/her information.
Github folder Link: https://github.com/indicwiki-iiit/Cricket-players/tree/main/images
- This folder contains the implementation for scraping images from Wikidata and English Wikipedia articles, and the dataset obtained by doing so.
- collect_image_links.py -> This file contains the implementation for extracting image links of players with a valid English Wikipedia article.
- get_images.py -> This file contains the implementation for extracting the English Wikipedia article URL for players having a valid Wikidata ID.
- cricket_player_images.csv -> This file contains the dataset with the Wikipedia article URL, Wikipedia article infobox image link, Wikidata ID etc. for each player (the key information used for extracting images).
Github folder Link: https://github.com/indicwiki-iiit/Cricket-players/tree/main/records
- This folder contains the implementation for translation of all records attributes (for which online libraries didn't work).
- records_processing.py -> This file contains the script for identifying the count and type of unique sentence structures in records attribute of dataset.
- fix_records.py -> This file contains the script for translating records attribute of dataset.
- Part-1_records_translation.xlsx -> This file contains the first split of the dataset comprising unique sentence structures for records, and their corresponding translation (done manually).
- Part-2_records_translation.xlsx -> This file contains the second split of the dataset comprising unique sentence structures for records, and their corresponding translation (done manually).
- Part-3_records_translation.xlsx -> This file contains the third split of the dataset comprising unique sentence structures for records, and their corresponding translation (done manually).
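As an illustration of what "identifying the count and type of unique sentence structures" can look like, here is a minimal sketch. It is not taken from records_processing.py: it assumes that two record sentences share a structure when they differ only in their numbers, and the function names and sample sentences are invented for the example.

```python
import re
from collections import Counter


def sentence_structure(sentence):
    """Reduce a record sentence to its structure by masking numbers."""
    return re.sub(r"\d+(?:\.\d+)?", "<NUM>", sentence.strip())


def count_structures(sentences):
    """Count how often each masked structure occurs."""
    return Counter(sentence_structure(s) for s in sentences)


# Invented record sentences, for illustration only.
records = [
    "Most runs in a series: 974",
    "Most runs in a series: 810",
    "Best bowling average: 10.75",
]
structures = count_structures(records)
```

With such a grouping, only one translation per structure has to be written by hand, and the numbers can be substituted back afterwards.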
Github folder Link: https://github.com/indicwiki-iiit/Cricket-players/tree/main/scraping
- This folder contains the implementation of data scraping for obtaining dataset.
- scrape.ipynb -> This file contains the script for scraping notable players' data from cricinfo official site.
- Stats_JSON_Data Scraper.ipynb -> This file contains script for scraping additional stat details from cricinfo official site, in json format (for notable players).
Github folder Link: https://github.com/indicwiki-iiit/Cricket-players/tree/main/translating_data
- This folder contains the implementation for obtaining and storing a translated (and transliterated) dataset, which avoids the overhead of calling translation libraries while generating articles via the XML file. It also contains the intermediate datasets obtained in the process.
- awards.csv -> This file contains dataset corresponding to translated values for awards attribute.
- cricket_players_records.xlsx -> This file contains dataset corresponding to translated values for records attributes (with excel translation - which didn't produce desirable output).
- handle_debuts.py -> This file contains implementation which rectifies mistakes in existing debut strings translation, and handles abbreviations in those sentences.
- info_overview.py -> This file contains the implementation for obtaining Telugu contents for all attributes associated with the infobox and overview of a player's article.
- modified_info_overview.csv -> This file contains dataset corresponding to translated values for attributes of infobox and overview.
- Personal_life_stats_translated.csv -> This file contains dataset corresponding to translated values for attributes of personal life section and statistical analysis sub-section.
- professional_life_trans.csv -> This file contains dataset corresponding to translated values for attributes of professional life section.
- professional_life.py -> This file contains the implementation for obtaining Telugu contents for all attributes associated with the professional life of a player.
You can find the detailed report here
You can find the sample article here
Github file Link: https://github.com/indicwiki-iiit/Cricket-players/tree/main/cricket_players(part-1).xml
- This file contains the XML dump with the articles of the first 5000 players (of a total of 9953 players) whose data has been collected.
Github file Link: https://github.com/indicwiki-iiit/Cricket-players/tree/main/cricket_players(part-2).xml
- This file contains the XML dump with the articles of the remaining 4953 players (of a total of 9953 players) whose data has been collected.
Github file Link: https://github.com/indicwiki-iiit/Cricket-players/tree/main/duplicates_to_consider.json
- This file contains a dictionary regarding which players are to be considered when duplicate names are encountered.
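The layout of this dictionary is not described here, so the sketch below only illustrates the idea under a stated assumption: each key is a duplicated player name and each value is the cricinfo ID of the player to keep. The sample data and the helper keep_player are hypothetical.

```python
import json

# Hypothetical content; the real file's keys and values may differ.
raw = '{"John Smith": 1002}'
duplicates_to_consider = json.loads(raw)


def keep_player(name, player_id, preferred):
    """Keep a player unless the name is duplicated and another ID is preferred."""
    return preferred.get(name, player_id) == player_id


# Made-up (name, ID) pairs, for illustration only.
players = [("John Smith", 1001), ("John Smith", 1002), ("Jane Doe", 2001)]
kept = [p for p in players if keep_player(p[0], p[1], duplicates_to_consider)]
```

Names that do not appear in the dictionary pass through unchanged, so only the ambiguous cases need an entry.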
Github file Link: https://github.com/indicwiki-iiit/Cricket-players/tree/main/genXML.py
- This file contains the code for generating an XML file which has the data for rendering an article.
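For orientation, a MediaWiki-style XML dump wraps each article in a page element holding a title and a revision with the wikitext. The following is a minimal sketch of that shape using only the standard library; it is not the code in genXML.py. Real dumps carry further elements (namespace declarations, site info, IDs, timestamps) that are omitted here, and the function build_dump and the sample page are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_dump(pages):
    """Build a minimal MediaWiki-style dump from (title, wikitext) pairs."""
    root = ET.Element("mediawiki")
    for title, wikitext in pages:
        page = ET.SubElement(root, "page")
        ET.SubElement(page, "title").text = title
        ET.SubElement(page, "ns").text = "0"  # 0 is the main article namespace
        revision = ET.SubElement(page, "revision")
        ET.SubElement(revision, "text").text = wikitext
    return ET.tostring(root, encoding="unicode")


# Invented page, for illustration only.
xml_dump = build_dump([("Example Player", "'''Example Player''' is a cricketer.")])
```

Building the tree with ElementTree rather than string concatenation means characters such as & and < in the article text are escaped automatically.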
Github file Link: https://github.com/indicwiki-iiit/Cricket-players/tree/main/render.py
- This is the code used for rendering the cricket player articles using jinja2 templates from templates folder. It generates XML dump based on parameters provided in implementation (as described in comments of the file).
Github file Link: https://github.com/indicwiki-iiit/Cricket-players/tree/main/requirements.txt
- This contains all the packages and libraries that are necessary for building this project.