
A Python package that helps scrape news details from any news website.

License: MIT License

Topics: newspaper3k, google-search-using-python, news, scraper, scraper-engine, news-details, python, news-website, felix, extracts

news-fetch's Introduction


news-fetch

news-fetch is an open-source, easy-to-use news crawler that extracts structured information from almost any news website. It can recursively follow internal hyperlinks and read RSS feeds to fetch both the most recent and older, archived articles. You only need to provide the root URL of the news website to crawl it completely. news-fetch combines the power of multiple state-of-the-art libraries and tools, such as news-please by Felix Hamborg and Newspaper3k by Lucas (欧阳象) Ou-Yang, and exposes features from both projects.

I built this package to reduce the NaN, '', [], and 'None' values that often appear when scraping certain news websites. It is platform-independent and written in Python 3, so programmers and developers can easily use it to pull news data into their own programs.

Source Links

  • PyPI: https://pypi.org/project/news-fetch/
  • Repository: https://santhoshse7en.github.io/news-fetch/
  • Documentation: https://santhoshse7en.github.io/news-fetch_doc/ (not yet created)


Extracted information

news-fetch extracts the following attributes from news articles. Also have a look at an exemplary JSON file extracted by news-please.

  • headline
  • name(s) of author(s)
  • publication date
  • publication
  • category
  • source_domain
  • article
  • summary
  • keyword
  • url
  • language
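For a rough sense of the record these attributes form, here is an illustrative sketch built around the BBC article used in the Usage section below. Only headline, source_domain, and url reflect real values from this page; every other field is a made-up placeholder, and the actual attribute names and output format in news-fetch may differ.

# Illustrative only: keys follow the attribute list above.
article_record = {
    'headline': 'g20 summit: trump and xi agree to restart us china trade talks',
    'authors': ['BBC News'],           # placeholder value
    'publication_date': '2019-06-29',  # placeholder value
    'publication': 'BBC News',         # placeholder value
    'category': 'world',               # placeholder value
    'source_domain': 'www.bbc.co.uk',
    'article': '<full article text>',  # placeholder value
    'summary': '<short summary>',      # placeholder value
    'keyword': ['g20', 'trade'],       # placeholder value
    'url': 'https://www.bbc.co.uk/news/world-48810070',
    'language': 'en',
}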

Dependencies Installation

Use the package manager pip to install the dependencies:

pip install -r requirements.txt
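The package itself is also published on PyPI (see the Source Links above), so if you only need the library, installing it directly should pull in its dependencies:

pip install news-fetch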

Usage

Download it by clicking the green download button here on GitHub. To extract URLs from a targeted website, call the google_search function. You only need to pass the keyword and the newspaper link as arguments.

>>> from newsfetch.google import google_search
>>> google = google_search('Alcoholics Anonymous', 'https://timesofindia.indiatimes.com/')

Use the URLs attribute to get the links of all the news articles scraped.

>>> google.urls

[Screenshot: directory of google search result URLs]

To scrape all the news details, call the newspaper function:

>>> from newsfetch.news import newspaper
>>> news = newspaper('https://www.bbc.co.uk/news/world-48810070')

[Screenshot: directory of news attributes]

>>> news.headline

'g20 summit: trump and xi agree to restart us china trade talks'
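The remaining attributes from the Extracted information list are read the same way. Here is a short end-to-end sketch combining both steps; attribute names such as summary and keyword are assumed to match that list:

from newsfetch.google import google_search
from newsfetch.news import newspaper

# Collect article URLs for a keyword on one news site, then scrape each one.
google = google_search('Alcoholics Anonymous', 'https://timesofindia.indiatimes.com/')
articles = [newspaper(url) for url in google.urls]

for news in articles:
    print(news.headline)
    print(news.summary)   # assumed attribute, per the list above
    print(news.keyword)   # assumed attribute, per the list above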

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

License

MIT

news-fetch's People

Contributors: imakashsahu, prakharrathi25, sahilbawa7777, santhoshse7en


news-fetch's Issues

"Special letters" are being converted to regular ones

Hello,

Is it possible to specify which language the news is in, so that it can be fetched correctly? I used the library for news in Portuguese, but it converted accented characters ("special letters") to their unaccented equivalents. This severely compromises NLP procedures that deal with syntax, context, etc.

example: "àáéóíúâôêãõç" is converted to "aaeiuaoeaoc"

from newsfetch.news import newspaper
news = newspaper('https://g1.globo.com/sc/santa-catarina/noticia/2021/01/20/greve-na-comcap-coleta-feita-por-empresa-privada-em-florianopolis-vai-abranger-35percent-do-roteiro-diz-prefeitura.ghtml')

I saw that the class uses the Newspaper3k scraper internally, and if I enforce the right language it returns the correct text:

from newspaper import Article
article = Article(url, language='pt')  # url: the G1 article link above

thank you
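A minimal workaround sketch using Newspaper3k directly with the language forced to Portuguese (this bypasses news-fetch entirely, so you get Newspaper3k's attributes rather than news-fetch's):

from newspaper import Article

url = 'https://g1.globo.com/sc/santa-catarina/noticia/2021/01/20/greve-na-comcap-coleta-feita-por-empresa-privada-em-florianopolis-vai-abranger-35percent-do-roteiro-diz-prefeitura.ghtml'
article = Article(url, language='pt')  # per the report above, forcing 'pt' preserves accents
article.download()
article.parse()
print(article.text)  # accented characters preserved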

Encountered error while generating package metadata.

Hello, I can't install requirements.txt on macOS (Apple M1).

Requirement already satisfied: beautifulsoup4 in /Users/tangquanzhong/miniforge3/lib/python3.9/site-packages (from -r requirements.txt (line 1)) (4.11.1)
Requirement already satisfied: selenium in /Users/tangquanzhong/miniforge3/lib/python3.9/site-packages (from -r requirements.txt (line 2)) (4.1.3)
Collecting chromedriver-binary
Using cached chromedriver-binary-101.0.4951.15.0.tar.gz (4.9 kB)
Preparing metadata (setup.py) ... done
Requirement already satisfied: pandas in /Users/tangquanzhong/miniforge3/lib/python3.9/site-packages (from -r requirements.txt (line 4)) (1.3.4)
Collecting pattern
Using cached Pattern-3.6.0.tar.gz (22.2 MB)
Preparing metadata (setup.py) ... done
Collecting fake_useragent
Using cached fake-useragent-0.1.11.tar.gz (13 kB)
Preparing metadata (setup.py) ... done
Requirement already satisfied: setuptools in /Users/tangquanzhong/miniforge3/lib/python3.9/site-packages (from -r requirements.txt (line 7)) (58.0.4)
Collecting twine
Using cached twine-4.0.0-py3-none-any.whl (36 kB)
Collecting unidecode
Using cached Unidecode-1.3.4-py3-none-any.whl (235 kB)
Requirement already satisfied: soupsieve>1.2 in /Users/tangquanzhong/miniforge3/lib/python3.9/site-packages (from beautifulsoup4->-r requirements.txt (line 1)) (2.3.2.post1)
Requirement already satisfied: urllib3[secure,socks]~=1.26 in /Users/tangquanzhong/miniforge3/lib/python3.9/site-packages (from selenium->-r requirements.txt (line 2)) (1.26.7)
Requirement already satisfied: trio~=0.17 in /Users/tangquanzhong/miniforge3/lib/python3.9/site-packages (from selenium->-r requirements.txt (line 2)) (0.20.0)
Requirement already satisfied: trio-websocket~=0.9 in /Users/tangquanzhong/miniforge3/lib/python3.9/site-packages (from selenium->-r requirements.txt (line 2)) (0.9.2)
Requirement already satisfied: python-dateutil>=2.7.3 in /Users/tangquanzhong/miniforge3/lib/python3.9/site-packages (from pandas->-r requirements.txt (line 4)) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /Users/tangquanzhong/miniforge3/lib/python3.9/site-packages (from pandas->-r requirements.txt (line 4)) (2021.3)
Requirement already satisfied: numpy>=1.20.0 in /Users/tangquanzhong/miniforge3/lib/python3.9/site-packages (from pandas->-r requirements.txt (line 4)) (1.21.3)
Requirement already satisfied: future in /Users/tangquanzhong/miniforge3/lib/python3.9/site-packages (from pattern->-r requirements.txt (line 5)) (0.18.2)
Collecting backports.csv
Using cached backports.csv-1.0.7-py2.py3-none-any.whl (12 kB)
Collecting mysqlclient
Using cached mysqlclient-2.1.0.tar.gz (87 kB)
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [16 lines of output]
/bin/sh: mysql_config: command not found
/bin/sh: mariadb_config: command not found
/bin/sh: mysql_config: command not found
Traceback (most recent call last):
  File "<string>", line 2, in <module>
  File "<pip-setuptools-caller>", line 34, in <module>
  File "/private/var/folders/b3/923q90gn6p14m7qtmgjz4z140000gn/T/pip-install-uw_1rtd5/mysqlclient_e107c4fc41db45b1bb9ce0e7250d32be/setup.py", line 15, in <module>
    metadata, options = get_config()
  File "/private/var/folders/b3/923q90gn6p14m7qtmgjz4z140000gn/T/pip-install-uw_1rtd5/mysqlclient_e107c4fc41db45b1bb9ce0e7250d32be/setup_posix.py", line 70, in get_config
    libs = mysql_config("libs")
  File "/private/var/folders/b3/923q90gn6p14m7qtmgjz4z140000gn/T/pip-install-uw_1rtd5/mysqlclient_e107c4fc41db45b1bb9ce0e7250d32be/setup_posix.py", line 31, in mysql_config
    raise OSError("{} not found".format(_mysql_config_path))
OSError: mysql_config not found
mysql_config --version
mariadb_config --version
mysql_config --libs
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
(base) tangquanzhong@tangquanzhongdeMacBook-Air news-fetch-master %
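For context, the failing step is building mysqlclient (apparently pulled in while resolving pattern's dependencies), which needs the mysql_config tool on PATH. A likely fix on an Apple Silicon Mac, assuming Homebrew is installed, is:

brew install mysql-client
export PATH="/opt/homebrew/opt/mysql-client/bin:$PATH"
pip install -r requirements.txt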

ImportError: cannot import name 'get_chrome_web_driver' from 'newsfetch.helpers'

Python 3.9.1 (default, Dec 13 2020, 11:55:53)
[GCC 10.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from newsfetch.google import google_search
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zeka/ZEKA/pythonProjects/news-fetch/newsfetch/google.py", line 1, in <module>
    from newsfetch.helpers import (get_chrome_web_driver, get_web_driver_options,
ImportError: cannot import name 'get_chrome_web_driver' from 'newsfetch.helpers' (/home/zeka/ZEKA/pythonProjects/news-fetch/newsfetch/helpers.py)
>>> from newsfetch.news import newspaper
>>>  news = newspaper('https://www.bbc.co.uk/news/world-48810070')
  File "<stdin>", line 1
    news = newspaper('https://www.bbc.co.uk/news/world-48810070')
IndentationError: unexpected indent
>>> from newsfetch.google import google_search
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zeka/ZEKA/pythonProjects/news-fetch/newsfetch/google.py", line 1, in <module>
    from newsfetch.helpers import (get_chrome_web_driver, get_web_driver_options,
ImportError: cannot import name 'get_chrome_web_driver' from 'newsfetch.helpers' (/home/zeka/ZEKA/pythonProjects/news-fetch/newsfetch/helpers.py)

I could not find get_chrome_web_driver, get_web_driver_options, set_automation_as_head_less, set_browser_as_incognito, or set_ignore_certificate_error in newsfetch.helpers.

Does not fetch Arabic news

Hello,
I tried it, but it did not fetch Arabic news from sites such as https://www.alarabiya.net/; I got zero articles.

My code:

import newspaper as newspaper3k                    # Newspaper3k
from newsfetch.news import newspaper as newsfetch  # news-fetch's article scraper

news_paper = newspaper3k.build('https://www.alarabiya.net/', language='ar', memoize_articles=False)
for article in news_paper.articles:
    article_url = article.url
    news = newsfetch(article_url)

Any idea?

using selenium-hub?

I'm trying to set up news-fetch on my server, where I already have two Selenium Docker containers [1], hub and standalone-chrome. I can get a working selenium.webdriver object like this:

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

huburl = 'http://seleniumhub_docker_container:4444/wd/hub'
driver = webdriver.Remote(command_executor=huburl, desired_capabilities=DesiredCapabilities.CHROME)

However, I haven't been able to tell newsfetch to use this driver.

Is this at all possible, or are there plans to accommodate this use case?

[1] https://github.com/SeleniumHQ/docker-selenium
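A hypothetical workaround sketch, not an official API: assuming your installed newsfetch.helpers still exposes get_chrome_web_driver (as in the traceback in the "Chrome version 91" issue below), you could patch that factory before newsfetch.google is imported, so google_search connects to the hub:

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

import newsfetch.helpers as helpers

HUB_URL = 'http://seleniumhub_docker_container:4444/wd/hub'

def remote_chrome_driver(options):
    # Ignore the locally built options and hand back a Remote driver instead.
    return webdriver.Remote(command_executor=HUB_URL,
                            desired_capabilities=DesiredCapabilities.CHROME)

helpers.get_chrome_web_driver = remote_chrome_driver

# Import after patching so google.py binds to the patched factory.
from newsfetch.google import google_search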

Chrome version 91

I am using a Mac and I'm getting the error below:

SessionNotCreatedException                Traceback (most recent call last)
<ipython-input> in <module>
----> 1 google = google_search('ASX', 'https://au.news.yahoo.com/')

~/opt/anaconda3/lib/python3.8/site-packages/newsfetch/google.py in __init__(self, keyword, newspaper_url)
     26         set_ignore_certificate_error(options)
     27         set_browser_as_incognito(options)
---> 28         driver = get_chrome_web_driver(options)
     29         driver.get(url)
     30

~/opt/anaconda3/lib/python3.8/site-packages/newsfetch/helpers.py in get_chrome_web_driver(options)
     32
     33 def get_chrome_web_driver(options):
---> 34     return webdriver.Chrome(chrome_options=options)
     35
     36

~/opt/anaconda3/lib/python3.8/site-packages/selenium/webdriver/chrome/webdriver.py in __init__(self, executable_path, port, options, service_args, desired_capabilities, service_log_path, chrome_options, keep_alive)
     74
     75         try:
---> 76             RemoteWebDriver.__init__(
     77                 self,
     78                 command_executor=ChromeRemoteConnection(

~/opt/anaconda3/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py in __init__(self, command_executor, desired_capabilities, browser_profile, proxy, keep_alive, file_detector, options)
    155             warnings.warn("Please use FirefoxOptions to set browser profile",
    156                           DeprecationWarning, stacklevel=2)
--> 157         self.start_session(capabilities, browser_profile)
    158         self._switch_to = SwitchTo(self)
    159         self._mobile = Mobile(self)

~/opt/anaconda3/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py in start_session(self, capabilities, browser_profile)
    250         parameters = {"capabilities": w3c_caps,
    251                       "desiredCapabilities": capabilities}
--> 252         response = self.execute(Command.NEW_SESSION, parameters)
    253         if 'sessionId' not in response:
    254             response = response['value']

~/opt/anaconda3/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py in execute(self, driver_command, params)
    319         response = self.command_executor.execute(driver_command, params)
    320         if response:
--> 321             self.error_handler.check_response(response)
    322             response['value'] = self._unwrap_value(
    323                 response.get('value', None))

~/opt/anaconda3/lib/python3.8/site-packages/selenium/webdriver/remote/errorhandler.py in check_response(self, response)
    240                 alert_text = value['alert'].get('text')
    241             raise exception_class(message, screen, stacktrace, alert_text)
--> 242         raise exception_class(message, screen, stacktrace)
    243
    244     def _value_or_default(self, obj, key, default):

SessionNotCreatedException: Message: session not created: This version of ChromeDriver only supports Chrome version 91
Current browser version is 90.0.4430.212 with binary path /Applications/Google Chrome.app/Contents/MacOS/Google Chrome
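The mismatch is between the ChromeDriver installed by chromedriver-binary (built for Chrome 91) and the local Chrome 90. Either update Chrome to 91, or pin a matching driver release; a likely pip invocation, assuming chromedriver-binary publishes a 90.x build for your browser version, is:

pip install "chromedriver-binary==90.*"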

Connection timeout that breaks entire code

Hi everyone,

First of all, great package!

I'm running into a problem with connection timeouts on some URLs. I make a request like so:

for href in hrefs:
    try:
        news = newspaper(href)
    except:
        print('im here')
        continue

It works! However, in some cases I get:

connection/timeout error: https://www.nasdaq.com/articles/why-these-real-estate-stocks-are-crashing-today-2020-06-11 HTTPSConnectionPool(host='www.nasdaq.com', port=443): Read timed out. (read timeout=6)

That is also fine; if the website won't let me retrieve its content, I just want to skip it. However, my code breaks completely on this latter error and hence does not continue with the other calls to newspaper().

I did some digging and traced it to this code in simple_crawler.py:

try:
    # read by streaming chunks (stream=True, iter_content=xx)
    # so we can stop downloading as soon as MAX_FILE_SIZE is reached
    response = requests.get(url, timeout=timeout, verify=False, allow_redirects=True, headers=HEADERS)
except (requests.exceptions.MissingSchema, requests.exceptions.InvalidURL):
    LOGGER.error('malformed URL: %s', url)
except requests.exceptions.TooManyRedirects:
    LOGGER.error('too many redirects: %s', url)
except requests.exceptions.SSLError as err:
    LOGGER.error('SSL: %s %s', url, err)
except (
    socket.timeout, requests.exceptions.ConnectionError,
    requests.exceptions.Timeout, socket.error, socket.gaierror
) as err:
    print('arrrhh')
    LOGGER.error('connection/timeout error: %s %s', url, err)  # <-- the message I see
    return ''
else:

It seems that I can't catch exceptions in my own code where I call newspaper().

Any ideas on how to fix this?
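Not a definitive fix, but one defensive sketch: since the quoted simple_crawler code logs the timeout and returns '' instead of re-raising, no exception reaches the caller, so validating the returned object may work better than catching. The article attribute name comes from the Extracted information list above:

from newsfetch.news import newspaper

hrefs = ['https://www.bbc.co.uk/news/world-48810070']  # your list of URLs

results = []
for href in hrefs:
    try:
        news = newspaper(href)
    except Exception as err:  # anything that does propagate
        print(f'skipping {href}: {err}')
        continue
    # A timed-out fetch comes back with empty content rather than raising.
    if not getattr(news, 'article', None):
        print(f'no content for {href}, skipping')
        continue
    results.append(news)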
