tanishqchamoli / newspaper_mining

Newspaper mining and analysis of the results using Python, with OCR used to extract and clean the text.

Python 100.00%
data-science newspaper mining tool newspaper-mining ocr python3 wget webcrawling pdf2text

Newspaper Mining

This project first collects data through web scraping. The files are downloaded using
the wget module, which reads the API links stored in text files. The data is then cleaned
to improve the quality of the results: the resulting text files are free of special UTF-8
characters and contain only plain text. These cleaned files are passed on for data processing.
For every newspaper, the number of words and sentences related to COVID-19 is counted
against the total number of words and sentences, respectively. The resulting percentages
show whether COVID-19 related coverage is rising or declining. The data is visualized in
the form of graphs to reveal trends, outliers, and patterns.
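
The percentage calculation described above can be sketched roughly as follows. This is a minimal illustration, not the code from the repository: the keyword set, the function name `keyword_percentages`, and the sentence-splitting rule are all assumptions made for the example.

```python
import re

# Hypothetical keyword set; the project's actual search terms live in its scripts.
KEYWORDS = {"covid", "covid-19", "coronavirus"}

def keyword_percentages(text, keywords=KEYWORDS):
    """Return (word_pct, sentence_pct) for keyword matches in text."""
    # Split into sentences after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Tokens are runs of letters, digits, and hyphens, so "covid-19" stays whole.
    words = re.findall(r"[A-Za-z0-9-]+", text.lower())

    word_hits = sum(1 for w in words if w in keywords)
    sentence_hits = sum(
        1 for s in sentences
        if any(w in keywords for w in re.findall(r"[A-Za-z0-9-]+", s.lower()))
    )

    # Guard against empty input to avoid dividing by zero.
    word_pct = 100.0 * word_hits / len(words) if words else 0.0
    sentence_pct = 100.0 * sentence_hits / len(sentences) if sentences else 0.0
    return word_pct, sentence_pct
```

Running this once per newspaper issue and plotting the two percentages over time would give the kind of trend graph the project describes.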

This project makes it easy to obtain graphs and data on how often given words occur in newspapers. Our motive was simple and straightforward: extract the data from the PDFs into text files, and then reuse that data as many times as we want for analysis. The program was written to do exactly this.

We noticed that some newspapers don't support PDF-to-text conversion and give incomplete data. For those cases we also provide an OCR converter that extracts the text from the PDF using image text recognition.

NOTE

Every program has a variable that holds the number of PDFs you have downloaded.
PLEASE CHANGE IT WHEN YOU HAVE DOWNLOADED MORE FILES OR CLEANED MORE FILES!
Make sure to update the variables called "downloaded" and "max1".
We suggest that you keep our directory structure so that you don't have to change anything in the code
and can start using it straight away.
Thank you!

Conversions Supported:

- OCR conversion using the pdf2image, pytesseract, and PIL libraries

- Converting PDF to text using the Pdf2Text library
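
As a rough sketch of the OCR path, the function below renders each PDF page to an image with `pdf2image.convert_from_path` and reads it with `pytesseract.image_to_string`. The function name `ocr_pdf_to_text` and the optional `convert` and `recognize` parameters are assumptions for this example, not part of the repository. It also assumes that Poppler and the Tesseract binary are installed on the system.

```python
def ocr_pdf_to_text(pdf_path, convert=None, recognize=None):
    """Extract text from a PDF by rendering its pages and running OCR on them.

    convert and recognize are hypothetical hooks so the pipeline can be
    exercised without the OCR libraries; by default the real ones are used.
    """
    if convert is None:
        # pdf2image needs the Poppler utilities installed on the system.
        from pdf2image import convert_from_path
        convert = convert_from_path
    if recognize is None:
        # pytesseract needs the Tesseract OCR binary installed on the system.
        import pytesseract
        recognize = pytesseract.image_to_string

    pages = convert(pdf_path)  # one PIL image per PDF page
    # OCR each page and join the page texts with newlines.
    return "\n".join(recognize(page) for page in pages)
```

A call such as `ocr_pdf_to_text("Nespaper_PDF/1.pdf")` (a hypothetical path) would return the recognized text, which can then be written to a text file for cleaning.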

Downloading the Dataset:

We have already provided a rar file containing the cleaned data from the newspaper "THE HINDU" from March to June, so you can extract the dataset and directly run the word-search programs.

Alternatively, we also provide our own link catcher and downloader:

Steps to follow:

  Run Link_catcher.py and wait for it to complete.
  
  Then run Download_files.py, which takes the links caught by the
  above program and uses the wget function to download the files.
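
The download step could look roughly like the sketch below. It is only an illustration under stated assumptions: the function names, the numbered output file names, and the `fetch` parameter are hypothetical and not taken from Download_files.py. By default it calls `wget.download` from the Python wget package.

```python
import os

def read_links(path):
    """Return the non-empty lines of a links file, stripped of whitespace."""
    with open(path, encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]

def download_all(links, out_dir, fetch=None):
    """Download every link into out_dir and return the saved file paths.

    fetch is a hypothetical hook taking (url, target_path); by default the
    wget package is used.
    """
    if fetch is None:
        import wget  # third-party package: pip install wget

        def fetch(url, target):
            wget.download(url, out=target)

    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for index, url in enumerate(links, start=1):
        # Assumed naming scheme: 1.pdf, 2.pdf, ... in download order.
        target = os.path.join(out_dir, f"{index}.pdf")
        fetch(url, target)
        saved.append(target)
    return saved
```

For example, `download_all(read_links("links.txt"), "Nespaper_PDF")` would fetch every link listed in a hypothetical `links.txt` file.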

Programs to run on the Dataset:

  • Count_occurences.py and Count_occurences_multiple.py count the occurrences of a single word or of a set of words, respectively.

  • Delimeter_checker.py finds the number of sentences that contain the word or set of words specified in the code.

  • Bad_word_removal.py removes the common filler words that only serve to complete a sentence, which gives a more meaningful count.
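
The filler-word removal can be illustrated with a small sketch. The stop-word list and the function name `remove_stop_words` are assumptions for the example; Bad_word_removal.py may use a different list and approach.

```python
import re

# Small illustrative stop-word list (an assumption, not the project's own list).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "to", "and"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return the lowercased words of text with the stop words filtered out."""
    # Tokens are runs of letters, digits, apostrophes, and hyphens.
    words = re.findall(r"[A-Za-z0-9'-]+", text.lower())
    return [w for w in words if w not in stop_words]
```

Because the filtered list is shorter than the original word list, a keyword's share of the total word count is computed against meaningful words only.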

Folder Structure:

--FOLDER HAVING THE CODE
Folders to create inside the above one ->
-- --Combined_Dataset
-- --Newspaper_Cleaned
-- --Nespaper_PDF
-- --Better_cleaned
And then paste our code in it.
-- --OUR CODE FROM GITHUB
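
If you prefer not to create the folders by hand, a short script along these lines can do it. This is a convenience sketch rather than part of the repository; the function name `create_folders` is hypothetical, and the folder names are copied exactly as they are written above.

```python
from pathlib import Path

# Folder names copied verbatim from the structure listed above.
FOLDERS = ["Combined_Dataset", "Newspaper_Cleaned", "Nespaper_PDF", "Better_cleaned"]

def create_folders(base_dir="."):
    """Create the expected data folders inside base_dir and return their paths."""
    base = Path(base_dir)
    created = []
    for name in FOLDERS:
        path = base / name
        # exist_ok=True makes the script safe to run more than once.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Run it from the folder that holds the code, for example with `create_folders()`, before starting the downloader.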

Authors

Tanishq Chamoli

https://github.com/TanishqChamoli

Sonam Garg

https://github.com/CO18350

Shriya Verma

https://github.com/CO18347

Mentor

Dr. Ankit Gupta
