
A Python package that helps scrape news details from any news website.

License: MIT License

Topics: newspaper3k, google-search-using-python, news, scraper, scraper-engine, news-details, python, news-website, felix, extracts

news-fetch's Introduction


news-fetch

news-fetch is an open-source, easy-to-use news crawler that extracts structured information from almost any news website. It can recursively follow internal hyperlinks and read RSS feeds to fetch both the most recent and older, archived articles. You only need to provide the root URL of the news website to crawl it completely. news-fetch combines the power of multiple state-of-the-art libraries and tools, such as news-please by Felix Hamborg and Newspaper3k by Lucas (欧阳象) Ou-Yang, and exposes features from both projects.

I built this package to reduce the NaN, '', [], and 'None' values that often appear when scraping certain news websites. It is platform-independent and written in Python 3, so programmers and developers can easily use it to pull news data into their own programs.

Source Links

  • PyPI: https://pypi.org/project/news-fetch/
  • Repository: https://santhoshse7en.github.io/news-fetch/
  • Documentation: https://santhoshse7en.github.io/news-fetch_doc/ (not yet created)


Extracted information

news-fetch extracts the following attributes from news articles. Also have a look at an exemplary JSON file extracted by news-please.

  • headline
  • name(s) of author(s)
  • publication date
  • publication
  • category
  • source_domain
  • article
  • summary
  • keyword
  • url
  • language
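For a rough sense of the record these attributes form, here is an illustrative sketch built around the BBC article used in the Usage section below. Only headline, source_domain, and url reflect real values from this page; every other field is a made-up placeholder, and the actual attribute names and output format in news-fetch may differ.

# Illustrative only: keys follow the attribute list above.
article_record = {
    'headline': 'g20 summit: trump and xi agree to restart us china trade talks',
    'authors': ['BBC News'],           # placeholder value
    'publication_date': '2019-06-29',  # placeholder value
    'publication': 'BBC News',         # placeholder value
    'category': 'world',               # placeholder value
    'source_domain': 'www.bbc.co.uk',
    'article': '<full article text>',  # placeholder value
    'summary': '<short summary>',      # placeholder value
    'keyword': ['g20', 'trade'],       # placeholder value
    'url': 'https://www.bbc.co.uk/news/world-48810070',
    'language': 'en',
}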

Dependencies Installation

Use the package manager pip to install the dependencies:

pip install -r requirements.txt
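The package itself is also published on PyPI (see the Source Links above), so if you only need the library, installing it directly should pull in its dependencies:

pip install news-fetch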

Usage

Download it by clicking the green download button here on GitHub. To extract URLs from a targeted website, call the google_search function. You only need to pass the keyword and the newspaper link as arguments.

>>> from newsfetch.google import google_search
>>> google = google_search('Alcoholics Anonymous', 'https://timesofindia.indiatimes.com/')

Use the URLs attribute to get the links of all the news articles scraped.

>>> google.urls

[Screenshot: directory of google search result URLs]

To scrape all the news details, call the newspaper function:

>>> from newsfetch.news import newspaper
>>> news = newspaper('https://www.bbc.co.uk/news/world-48810070')

[Screenshot: directory of news attributes]

>>> news.headline

'g20 summit: trump and xi agree to restart us china trade talks'
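The remaining attributes from the Extracted information list are read the same way. Here is a short end-to-end sketch combining both steps; attribute names such as summary and keyword are assumed to match that list:

from newsfetch.google import google_search
from newsfetch.news import newspaper

# Collect article URLs for a keyword on one news site, then scrape each one.
google = google_search('Alcoholics Anonymous', 'https://timesofindia.indiatimes.com/')
articles = [newspaper(url) for url in google.urls]

for news in articles:
    print(news.headline)
    print(news.summary)   # assumed attribute, per the list above
    print(news.keyword)   # assumed attribute, per the list above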

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

License

MIT

news-fetch's People

Contributors: imakashsahu, prakharrathi25, sahilbawa7777, santhoshse7en


news-fetch's Issues

"Special letters" are being converted to regular ones

Hello,

Is it possible to specify which language the news is in, so that it can be fetched correctly? I used the library for news in Portuguese, but it converted accented characters ("special letters") to their unaccented equivalents. This severely compromises NLP procedures that deal with syntax, context, etc.

example: "àáéóíúâôêãõç" is converted to "aaeiuaoeaoc"

from newsfetch.news import newspaper
news = newspaper('https://g1.globo.com/sc/santa-catarina/noticia/2021/01/20/greve-na-comcap-coleta-feita-por-empresa-privada-em-florianopolis-vai-abranger-35percent-do-roteiro-diz-prefeitura.ghtml')

I saw that the class uses the Newspaper3k scraper internally, and if I enforce the right language it returns the correct text:

from newspaper import Article
article = Article(url, language='pt')  # url: the G1 article link above

thank you
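A minimal workaround sketch using Newspaper3k directly with the language forced to Portuguese (this bypasses news-fetch entirely, so you get Newspaper3k's attributes rather than news-fetch's):

from newspaper import Article

url = 'https://g1.globo.com/sc/santa-catarina/noticia/2021/01/20/greve-na-comcap-coleta-feita-por-empresa-privada-em-florianopolis-vai-abranger-35percent-do-roteiro-diz-prefeitura.ghtml'
article = Article(url, language='pt')  # per the report above, forcing 'pt' preserves accents
article.download()
article.parse()
print(article.text)  # accented characters preserved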

Encountered error while generating package metadata.

Hello, I can't install requirements.txt on macOS (Apple M1).

Requirement already satisfied: beautifulsoup4 in /Users/tangquanzhong/miniforge3/lib/python3.9/site-packages (from -r requirements.txt (line 1)) (4.11.1)
Requirement already satisfied: selenium in /Users/tangquanzhong/miniforge3/lib/python3.9/site-packages (from -r requirements.txt (line 2)) (4.1.3)
Collecting chromedriver-binary
Using cached chromedriver-binary-101.0.4951.15.0.tar.gz (4.9 kB)
Preparing metadata (setup.py) ... done
Requirement already satisfied: pandas in /Users/tangquanzhong/miniforge3/lib/python3.9/site-packages (from -r requirements.txt (line 4)) (1.3.4)
Collecting pattern
Using cached Pattern-3.6.0.tar.gz (22.2 MB)
Preparing metadata (setup.py) ... done
Collecting fake_useragent
Using cached fake-useragent-0.1.11.tar.gz (13 kB)
Preparing metadata (setup.py) ... done
Requirement already satisfied: setuptools in /Users/tangquanzhong/miniforge3/lib/python3.9/site-packages (from -r requirements.txt (line 7)) (58.0.4)
Collecting twine
Using cached twine-4.0.0-py3-none-any.whl (36 kB)
Collecting unidecode
Using cached Unidecode-1.3.4-py3-none-any.whl (235 kB)
Requirement already satisfied: soupsieve>1.2 in /Users/tangquanzhong/miniforge3/lib/python3.9/site-packages (from beautifulsoup4->-r requirements.txt (line 1)) (2.3.2.post1)
Requirement already satisfied: urllib3[secure,socks]~=1.26 in /Users/tangquanzhong/miniforge3/lib/python3.9/site-packages (from selenium->-r requirements.txt (line 2)) (1.26.7)
Requirement already satisfied: trio~=0.17 in /Users/tangquanzhong/miniforge3/lib/python3.9/site-packages (from selenium->-r requirements.txt (line 2)) (0.20.0)
Requirement already satisfied: trio-websocket~=0.9 in /Users/tangquanzhong/miniforge3/lib/python3.9/site-packages (from selenium->-r requirements.txt (line 2)) (0.9.2)
Requirement already satisfied: python-dateutil>=2.7.3 in /Users/tangquanzhong/miniforge3/lib/python3.9/site-packages (from pandas->-r requirements.txt (line 4)) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /Users/tangquanzhong/miniforge3/lib/python3.9/site-packages (from pandas->-r requirements.txt (line 4)) (2021.3)
Requirement already satisfied: numpy>=1.20.0 in /Users/tangquanzhong/miniforge3/lib/python3.9/site-packages (from pandas->-r requirements.txt (line 4)) (1.21.3)
Requirement already satisfied: future in /Users/tangquanzhong/miniforge3/lib/python3.9/site-packages (from pattern->-r requirements.txt (line 5)) (0.18.2)
Collecting backports.csv
Using cached backports.csv-1.0.7-py2.py3-none-any.whl (12 kB)
Collecting mysqlclient
Using cached mysqlclient-2.1.0.tar.gz (87 kB)
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [16 lines of output]
/bin/sh: mysql_config: command not found
/bin/sh: mariadb_config: command not found
/bin/sh: mysql_config: command not found
Traceback (most recent call last):
  File "<string>", line 2, in <module>
  File "<pip-setuptools-caller>", line 34, in <module>
  File "/private/var/folders/b3/923q90gn6p14m7qtmgjz4z140000gn/T/pip-install-uw_1rtd5/mysqlclient_e107c4fc41db45b1bb9ce0e7250d32be/setup.py", line 15, in <module>
    metadata, options = get_config()
  File "/private/var/folders/b3/923q90gn6p14m7qtmgjz4z140000gn/T/pip-install-uw_1rtd5/mysqlclient_e107c4fc41db45b1bb9ce0e7250d32be/setup_posix.py", line 70, in get_config
    libs = mysql_config("libs")
  File "/private/var/folders/b3/923q90gn6p14m7qtmgjz4z140000gn/T/pip-install-uw_1rtd5/mysqlclient_e107c4fc41db45b1bb9ce0e7250d32be/setup_posix.py", line 31, in mysql_config
    raise OSError("{} not found".format(_mysql_config_path))
OSError: mysql_config not found
mysql_config --version
mariadb_config --version
mysql_config --libs
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
(base) tangquanzhong@tangquanzhongdeMacBook-Air news-fetch-master %
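For context, the failing step is building mysqlclient (apparently pulled in while resolving pattern's dependencies), which needs the mysql_config tool on PATH. A likely fix on an Apple Silicon Mac, assuming Homebrew is installed, is:

brew install mysql-client
export PATH="/opt/homebrew/opt/mysql-client/bin:$PATH"
pip install -r requirements.txt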

ImportError: cannot import name 'get_chrome_web_driver' from 'newsfetch.helpers'

Python 3.9.1 (default, Dec 13 2020, 11:55:53)
[GCC 10.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from newsfetch.google import google_search
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zeka/ZEKA/pythonProjects/news-fetch/newsfetch/google.py", line 1, in <module>
    from newsfetch.helpers import (get_chrome_web_driver, get_web_driver_options,
ImportError: cannot import name 'get_chrome_web_driver' from 'newsfetch.helpers' (/home/zeka/ZEKA/pythonProjects/news-fetch/newsfetch/helpers.py)
>>> from newsfetch.news import newspaper
>>>  news = newspaper('https://www.bbc.co.uk/news/world-48810070')
  File "<stdin>", line 1
    news = newspaper('https://www.bbc.co.uk/news/world-48810070')
IndentationError: unexpected indent
>>> from newsfetch.google import google_search
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zeka/ZEKA/pythonProjects/news-fetch/newsfetch/google.py", line 1, in <module>
    from newsfetch.helpers import (get_chrome_web_driver, get_web_driver_options,
ImportError: cannot import name 'get_chrome_web_driver' from 'newsfetch.helpers' (/home/zeka/ZEKA/pythonProjects/news-fetch/newsfetch/helpers.py)

I could not find get_chrome_web_driver, get_web_driver_options, set_automation_as_head_less, set_browser_as_incognito, or set_ignore_certificate_error in newsfetch.helpers.

Does not fetch Arabic news

Hello,
I tried it, but it did not fetch Arabic news from sites such as https://www.alarabiya.net/; I got zero articles.

My code:

import newspaper as newspaper3k                    # Newspaper3k
from newsfetch.news import newspaper as newsfetch  # news-fetch's article scraper

news_paper = newspaper3k.build('https://www.alarabiya.net/', language='ar', memoize_articles=False)
for article in news_paper.articles:
    article_url = article.url
    news = newsfetch(article_url)

Any idea?

using selenium-hub?

I'm trying to set up news-fetch on my server, where I already have two Selenium Docker containers [1], hub and standalone-chrome. I can get a working selenium.webdriver object like this:

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

huburl = 'http://seleniumhub_docker_container:4444/wd/hub'
driver = webdriver.Remote(command_executor=huburl, desired_capabilities=DesiredCapabilities.CHROME)

However, I haven't been able to tell newsfetch to use this driver.

Is this at all possible, or are there plans to accommodate this use case?

[1] https://github.com/SeleniumHQ/docker-selenium
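A hypothetical workaround sketch, not an official API: assuming your installed newsfetch.helpers still exposes get_chrome_web_driver (as in the traceback in the "Chrome version 91" issue below), you could patch that factory before newsfetch.google is imported, so google_search connects to the hub:

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

import newsfetch.helpers as helpers

HUB_URL = 'http://seleniumhub_docker_container:4444/wd/hub'

def remote_chrome_driver(options):
    # Ignore the locally built options and hand back a Remote driver instead.
    return webdriver.Remote(command_executor=HUB_URL,
                            desired_capabilities=DesiredCapabilities.CHROME)

helpers.get_chrome_web_driver = remote_chrome_driver

# Import after patching so google.py binds to the patched factory.
from newsfetch.google import google_search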

Chrome version 91

I am using a Mac and I'm getting the error below:

SessionNotCreatedException                Traceback (most recent call last)
<ipython-input> in <module>
----> 1 google = google_search('ASX', 'https://au.news.yahoo.com/')

~/opt/anaconda3/lib/python3.8/site-packages/newsfetch/google.py in __init__(self, keyword, newspaper_url)
     26         set_ignore_certificate_error(options)
     27         set_browser_as_incognito(options)
---> 28         driver = get_chrome_web_driver(options)
     29         driver.get(url)
     30

~/opt/anaconda3/lib/python3.8/site-packages/newsfetch/helpers.py in get_chrome_web_driver(options)
     32
     33 def get_chrome_web_driver(options):
---> 34     return webdriver.Chrome(chrome_options=options)
     35
     36

~/opt/anaconda3/lib/python3.8/site-packages/selenium/webdriver/chrome/webdriver.py in __init__(self, executable_path, port, options, service_args, desired_capabilities, service_log_path, chrome_options, keep_alive)
     74
     75         try:
---> 76             RemoteWebDriver.__init__(
     77                 self,
     78                 command_executor=ChromeRemoteConnection(

~/opt/anaconda3/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py in __init__(self, command_executor, desired_capabilities, browser_profile, proxy, keep_alive, file_detector, options)
    155             warnings.warn("Please use FirefoxOptions to set browser profile",
    156                           DeprecationWarning, stacklevel=2)
--> 157         self.start_session(capabilities, browser_profile)
    158         self._switch_to = SwitchTo(self)
    159         self._mobile = Mobile(self)

~/opt/anaconda3/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py in start_session(self, capabilities, browser_profile)
    250         parameters = {"capabilities": w3c_caps,
    251                       "desiredCapabilities": capabilities}
--> 252         response = self.execute(Command.NEW_SESSION, parameters)
    253         if 'sessionId' not in response:
    254             response = response['value']

~/opt/anaconda3/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py in execute(self, driver_command, params)
    319         response = self.command_executor.execute(driver_command, params)
    320         if response:
--> 321             self.error_handler.check_response(response)
    322             response['value'] = self._unwrap_value(
    323                 response.get('value', None))

~/opt/anaconda3/lib/python3.8/site-packages/selenium/webdriver/remote/errorhandler.py in check_response(self, response)
    240                 alert_text = value['alert'].get('text')
    241             raise exception_class(message, screen, stacktrace, alert_text)
--> 242         raise exception_class(message, screen, stacktrace)
    243
    244     def _value_or_default(self, obj, key, default):

SessionNotCreatedException: Message: session not created: This version of ChromeDriver only supports Chrome version 91
Current browser version is 90.0.4430.212 with binary path /Applications/Google Chrome.app/Contents/MacOS/Google Chrome
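The mismatch is between the ChromeDriver installed by chromedriver-binary (built for Chrome 91) and the local Chrome 90. Either update Chrome to 91, or pin a matching driver release; a likely pip invocation, assuming chromedriver-binary publishes a 90.x build for your browser version, is:

pip install "chromedriver-binary==90.*"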

Connection timeout that breaks entire code

Hi everyone,

First of all, great package!

I'm running into a problem with connection timeouts on some URLs. I make a request like so:

for href in hrefs:
    try:
        news = newspaper(href)
    except:
        print('im here')
        continue

It works! However, in some cases I get:

connection/timeout error: https://www.nasdaq.com/articles/why-these-real-estate-stocks-are-crashing-today-2020-06-11 HTTPSConnectionPool(host='www.nasdaq.com', port=443): Read timed out. (read timeout=6)

That is also fine; if the website won't let me retrieve its content, I just want to skip it. However, my code breaks completely on this latter error and hence does not continue with the other calls to newspaper().

I did some digging and traced it to this code in simple_crawler.py:

try:
    # read by streaming chunks (stream=True, iter_content=xx)
    # so we can stop downloading as soon as MAX_FILE_SIZE is reached
    response = requests.get(url, timeout=timeout, verify=False, allow_redirects=True, headers=HEADERS)
except (requests.exceptions.MissingSchema, requests.exceptions.InvalidURL):
    LOGGER.error('malformed URL: %s', url)
except requests.exceptions.TooManyRedirects:
    LOGGER.error('too many redirects: %s', url)
except requests.exceptions.SSLError as err:
    LOGGER.error('SSL: %s %s', url, err)
except (
    socket.timeout, requests.exceptions.ConnectionError,
    requests.exceptions.Timeout, socket.error, socket.gaierror
) as err:
    print('arrrhh')
    LOGGER.error('connection/timeout error: %s %s', url, err)  # <-- the message I see
    return ''
else:

It seems that I can't catch exceptions in my own code where I call newspaper().

Any ideas on how to fix this?
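Not a definitive fix, but one defensive sketch: since the quoted simple_crawler code logs the timeout and returns '' instead of re-raising, no exception reaches the caller, so validating the returned object may work better than catching. The article attribute name comes from the Extracted information list above:

from newsfetch.news import newspaper

hrefs = ['https://www.bbc.co.uk/news/world-48810070']  # your list of URLs

results = []
for href in hrefs:
    try:
        news = newspaper(href)
    except Exception as err:  # anything that does propagate
        print(f'skipping {href}: {err}')
        continue
    # A timed-out fetch comes back with empty content rather than raising.
    if not getattr(news, 'article', None):
        print(f'no content for {href}, skipping')
        continue
    results.append(news)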
