
newsgrabber's Introduction

NewsGrabber

THIS IS STILL IN BETA

An Archive Team project to save every news article from every news website. The dashboard of this grab can be viewed here: http://newsgrabber.harrycross.me:29000 and historical lists of grabbed URLs can be seen here: http://newsgrabber.harrycross.me . The IRC channel for NewsGrabber is #newsgrabber on irc.efnet.org .

How does this work?

In /services/ there is a list of Python files. Each of these files is for an individual news website. The files contain the seed URLs from which new URLs are discovered and then matched against regexes given in the same service file. Newly matched URLs are added to a list. The URLs are then grabbed, and the requests and responses are saved into WARC files, which are uploaded to the Internet Archive, where they can be downloaded directly and browsed in the Wayback Machine.

A website is rechecked for new URLs at the interval set by its refresh value (see below). All newly matched URLs are downloaded on the hour.
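In outline, the discovery step for a single service looks roughly like the sketch below. This is a simplified illustration, not the actual NewsGrabber code; it assumes the requests package, and the names are made up.

import re
import requests

def discover(service):
    """Fetch each seed URL and collect links that match the service's regexes."""
    found = set()
    for seed in service.urls:
        try:
            html = requests.get(seed, timeout=30).text
        except requests.RequestException:
            continue  # a failing seed should not abort the whole service
        for link in re.findall(r"https?://[^\s<>'\"]+", html):
            if any(re.search(pattern, link) for pattern in service.regex):
                found.add(link)
    return found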

Add a new website

Every new website that is added requires a Python file in /services/ in order to be grabbed. This Python file should be laid out as follows:

Filename

The name of the new Python file should start with web__ and end with .py. The name should contain the name of the website or a description of which part of the website it covers. The filename should only contain the following characters: 0123456789, abcdefghijklmnopqrstuvwxyz, ABCDEFGHIJKLMNOPQRSTUVWXYZ and _. For example: web__skynews_com.py or web__rtlnieuws_nl_videos.py.
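For reference, these naming rules can be expressed as a single regular expression; this is just a sketch for checking names, not code from the project.

import re

SERVICE_FILENAME = re.compile(r'^web__[0-9A-Za-z_]+\.py$')

SERVICE_FILENAME.match('web__skynews_com.py')    # matches
SERVICE_FILENAME.match('web__sky-news_com.py')   # None: '-' is not an allowed character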

refresh

This is a number indicating how often the URLs in urls should be recrawled for new URLs. When refresh = 4, the URLs in urls will be redownloaded and checked for new URLs every 300 seconds. For example:

refresh = 6

Refresh can be any number from 1 to 11 where:

1 = 5 seconds
2 = 30 seconds
3 = 60 seconds - 1 minute
4 = 300 seconds - 5 minutes
5 = 1800 seconds - 30 minutes
6 = 3600 seconds - 60 minutes - 1 hour
7 = 7200 seconds - 120 minutes - 2 hours
8 = 21600 seconds - 360 minutes - 6 hours
9 = 43200 seconds - 720 minutes - 12 hours
10 = 86400 seconds - 1,440 minutes - 24 hours - 1 day
11 = 172800 seconds - 2,880 minutes - 48 hours - 2 days
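The same mapping written out as a Python dictionary, for quick reference (derived from the table above, not taken from the project's source):

REFRESH_SECONDS = {
    1: 5,
    2: 30,
    3: 60,
    4: 300,
    5: 1800,
    6: 3600,
    7: 7200,
    8: 21600,
    9: 43200,
    10: 86400,
    11: 172800,
}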

version

This is the version number of the Python script. It should be the date plus a count of the updates made on that day, for example:

version = 20151215.01

urls

This is a list of URLs that will be checked for new links. These URLs should be pages that list the newest articles, such as RSS feeds and/or front pages linking to the newest articles. As few links as possible should be added, while still ensuring that all new articles are found. For example:

urls = ['http://www.theguardian.com/uk/rss']

regex

This is a list of regex patterns that will be matched against the links found in the downloaded URLs from urls. Links that match one or more of these regex patterns will be added to the list to be downloaded. Often the regexes will match the main domain of the news articles. For example:

regex = [r'^https?:\/\/[^\/]*theguardian\.com']

videoregex

This is a list of regex patterns that will be matched against the links found in the downloaded URLs from urls and that already match one or more regexes from regex. If a URL matches one or more of these patterns, it will be downloaded with youtube-dl. For example:

videoregex = [r'\/video\/']

If the website contains no videos, put an empty list, like this:

videoregex = []

liveregex

This is a list of regex patterns that will be matched against the links found in the downloaded URLs from urls and that already match one or more regexes from regex. If a URL matches one or more of these patterns, it will not be added to the list of already downloaded URLs once it has been grabbed. This means these URLs will be downloaded again every time they are found. This is intended for live pages that are repeatedly updated. For example:

liveregex = [r'\/liveblog\/']

If the website contains no live pages, put an empty list, like this:

liveregex = []

wikidata (optional, only add if known & available)

This is the ID of the Wikidata entry for the news website. It is optional and should only be included if it is available. It will be used to link the news site to Wikidata, so additional metadata can be referenced (e.g. geographical area, other identifiers, dates of publication). If the news website does not (yet) have an entry on Wikidata, feel free to create one (along with appropriate sources to verify it is suitable for inclusion) and add the new ID here. For example, the Wikidata URL for timesofisrael.com is https://www.wikidata.org/wiki/Q6449319; the ID part is Q6449319, the last segment of that URL. Only the ID should be added as the value of the wikidata variable, and it should be quoted. For example:

wikidata = 'Q6449319'
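Putting the fields above together, a complete service file, say web__theguardian_com.py, might look like the following. The values are simply the illustrative examples used above; in particular the wikidata ID is a placeholder, not the correct ID for this site.

# web__theguardian_com.py -- illustrative example assembled from the field descriptions above
refresh = 6
version = 20151215.01
urls = ['http://www.theguardian.com/uk/rss']
regex = [r'^https?:\/\/[^\/]*theguardian\.com']
videoregex = [r'\/video\/']
liveregex = [r'\/liveblog\/']
wikidata = 'Q6449319'  # placeholder: use the Wikidata ID of the actual site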

The IRC bot

NewsGrabber has an IRC bot, newsbuddy, which can be found in #newsgrabberbot at irc.efnet.org. The commands for the bot are:

!imgrab <SERVICE>, !immediate-grab <SERVICE>, !immediate_grab <SERVICE>: Grab URLs immediately after they're found. Add remove, rem or r to stop URLs from being grabbed immediately after they're found.

!help: View list of commands.

!stop: Write the lists of URLs, finish currently running grabs and do not start new grabs.

!start: Undo !stop and start new grabs.

!version: Get the currently used version of the script.

!writefiles: Write lists of URLs.

!move: Move WARC files.

!upload: Upload WARC files.

!info <SERVICE>, !information <SERVICE>: Get information about a specific service.

!EMERGENCY_STOP: Stop the scripts immediately.

newsgrabber's People

Contributors

arkiver2, atluxity, bna-robin, djsmiley2k, dopefishjustin, ersi, espes, harricross, harryc145, jesseweinstein, lagittaja, lucasrolff, phuzion, pressstartandselect, sollidius, sstollenwerk


newsgrabber's Issues

Handling of paywalled sites

How are they handled? To what degree can they be bypassed without getting in trouble?

Some sites have paywalls that rely on JS. They can either "fail open" or "fail closed". For example, if JS is disabled on dn.se all articles render fine, but if it is enabled, "locked articles" will be hidden (after a short delay during which the JS loads).

Example of "fail open": https://www.dn.se/nyheter/nyheter-hem/infrastrukturministern-kommer-traffa-generaldirektor-snarast/

I presume the first type is okay to scrape, since you're not required to run JS. Would it be illegal/taint the results to bypass the latter type?

Some sites have server-side paywalls, but you can get one month/day/week for free by registering an account without providing payment info. Are these okay to scrape?

Some sites have server-side paywalls where you have to provide payment info to register an account, but it can be blatantly fake (card number 1234123412341234, phone number 123123, asdasd goes in the other fields). Are these okay to scrape?

Some sites have server-side paywalls where you have to provide payment info to register an account, and rudimentary validation (luhn number on cc, dob has to be valid date). Are these okay?

Some sites require valid (eg. functioning) payment info, and perform some small $0.01 transaction to verify it. Is it okay to register valid accounts with real info and then use them for scraping?

Some sites have poor login security. Is it okay to get a list of logins and use them? This is blatantly illegal and hard to decentralize, but would information scraped through methods like this (e.g. using Tor/a proxy and submitting anonymously) be accepted?

A lot of news sites have paywalls, and it seems like a shame to not scrape them. Disregarding the issue of technical possibility, what would be acceptable to scrape? Also, some news sites provide .pdf downloads of their paper issues. Is there any project to scrape these?

Australian news websites

Add some Polish websites

Waiting to be submitted as a pull request.

WP
https://wiadomosci.wp.pl/ --- general news
https://sportowefakty.wp.pl/ --- sport news and facts
https://kobieta.wp.pl/ --- women's design, fit, trends, stars
https://facet.wp.pl/ --- news for men
https://gwiazdy.wp.pl/ --- celebrity news
https://moto.wp.pl/ --- moto news
https://tech.wp.pl/ --- tech news
https://opinie.wp.pl/ --- news opinions, looks like normal news
https://turystyka.wp.pl/ --- tourism news
https://finanse.wp.pl/ --- finance news
https://dom.wp.pl/ --- home and interior; more articles than news
https://film.wp.pl/ --- film/movies news, trailers, catalog of all movies ....
https://ksiazki.wp.pl/ --- books news, articles
https://kuchnia.wp.pl/ --- kitchen recipes, diets, cooking
https://fitness.wp.pl/ --- fitness articles, no obvious news
https://komorkomania.pl/ --- tech news
https://fotoblogia.pl/ --- photo/tech news
https://portal.abczdrowie.pl/ --- it's about health, but also contains information about pharmacy
https://parenting.pl/ --- news about kids and parents
https://www.money.pl/ --- business news
http://biztok.money.pl/ --- business news
https://kafeteria.pl/ --- general news
https://www.dobreprogramy.pl/ --- a very big site about hardware/software, including free/non-free software to download; thousands of files and probably millions of pieces to download via ArchiveBot

--- the sites above are affiliated with wp.pl, one of the biggest news sites in Poland ---
--- some WP sites / WP-affiliated sites are listed in the menu, but we cannot connect to them; they are not included in this list ---

http://www.nowosci.com.pl/
http://www.polsatnews.pl/
http://www.rp.pl/
https://wiadomosci.onet.pl/

This issue is a WIP.
This issue needs review of what should be in NewsGrabber and what should not. Ready for review:

  • WP

Bot shouldn't crash if a badly named service file is committed

According to @HarryC145 in IRC, the entire bot apparently crashes if someone commits a badly-named file.

For example, if someone were to make a file named web_importantnewssite_com.py instead of web__importantnewssite_com.py, the bot would crash.

That shouldn't happen. The logical fix would be to print an error to stderr (and perhaps IRC) and skip the file.
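A minimal sketch of that fix, assuming the services are loaded by scanning a /services/ directory; the function name and directory handling here are illustrative, not the bot's actual code.

import os
import re
import sys

VALID_SERVICE = re.compile(r'^web__[0-9A-Za-z_]+\.py$')  # naming rules from the Filename section

def load_service_files(services_dir='services'):
    """Return valid service filenames, skipping (instead of crashing on) badly named ones."""
    valid = []
    for filename in sorted(os.listdir(services_dir)):
        if VALID_SERVICE.match(filename):
            valid.append(filename)
        else:
            print('Skipping badly named service file: {}'.format(filename), file=sys.stderr)
    return valid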

Skip a site if it's having technical issues

If a site is having technical issues, say it has gone down, the site should be skipped in that grab so time is not wasted. This could be implemented as "if x out of y URLs are not responding, then stop".
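One way to express that rule, as a sketch; the threshold, names and the shape of the results are assumptions, not project code.

def should_skip(responses, max_failure_ratio=0.9):
    """Skip a service for this round if most of its seed URLs did not respond."""
    if not responses:
        return False
    failures = sum(1 for ok in responses if not ok)
    return failures / len(responses) >= max_failure_ratio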

Add an option for special URL extraction rules for seed URLs of a service

Add an option for special URL extraction rules for the seed URLs of a service. Sometimes the URLs on a website aren't static, but are generated by more complex code. With this option, a special rule for the seed URLs of a service could be added to also extract the URLs that don't appear in the page source as static URLs.

Use sitemap.xml

Use the sitemap by default to find more URLs. The sitemap format should be implemented, recursing into sitemaps that refer to other, more detailed sitemaps. Inspection of the sitemap should be possible to disable via an option in the service itself.
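A sketch of recursive sitemap discovery, assuming the requests package for fetching; real sitemaps may also be gzipped, which this sketch does not handle.

import requests
import xml.etree.ElementTree as ET

NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

def sitemap_urls(sitemap_url, seen=None):
    """Return all page URLs from a sitemap, following nested sitemap indexes."""
    seen = seen if seen is not None else set()
    if sitemap_url in seen:
        return set()
    seen.add(sitemap_url)
    root = ET.fromstring(requests.get(sitemap_url, timeout=30).content)
    urls = set()
    for loc in root.findall('.//sm:loc', NS):
        if root.tag.endswith('sitemapindex'):
            urls |= sitemap_urls(loc.text.strip(), seen)
        else:
            urls.add(loc.text.strip())
    return urls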

Support selenium for better video and application extraction

Selenium or other automated browsers can help download videos and applications by loading pages like a normal browser would and clicking on the video or application to play it. This would also extract and download videos from webpages that are currently not supported by youtube-dl.
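As a rough illustration of the idea, the sketch below renders a page in a real browser and reads the src attributes of any video elements that scripts have inserted into the DOM; it assumes Selenium with a Firefox/geckodriver setup and is not part of the project.

from selenium import webdriver
from selenium.webdriver.common.by import By

def rendered_video_sources(url):
    """Load the page in a real browser so script-generated video elements appear in the DOM."""
    driver = webdriver.Firefox()
    try:
        driver.get(url)
        return [el.get_attribute('src') for el in driver.find_elements(By.TAG_NAME, 'video')]
    finally:
        driver.quit()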

Support ignoreregex for each site

Sites should have an ignoreregex parameter to avoid regrabbing pixel trackers and other unwanted files (such as files that consistently give 4xx errors).
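Conceptually, the new parameter would add one more filter to the discovery step. A sketch, noting that ignoreregex does not exist yet (that is what this issue asks for):

import re

def wanted(url, service):
    """Keep a URL only if it matches the service regexes and no ignore pattern."""
    if not any(re.search(p, url) for p in service.regex):
        return False
    ignore = getattr(service, 'ignoreregex', [])  # proposed optional field
    return not any(re.search(p, url) for p in ignore)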

Adding newspapers

I remember a discussion on IRC about adding the archiving of newspapers to the bot. Just opening an issue so that this is recorded and noted somewhere.

Use Thread/Process pools

Using thread or process pools and queues, instead of starting a new thread for each grab (or subtask), would limit the number of jobs running at the same time and thus keep the server load at a manageable level.
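A minimal sketch of that approach with concurrent.futures; the grab callable and the worker count are assumptions, not values from the project.

from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 20  # cap on simultaneous grabs, instead of one unbounded thread per grab

def run_grabs(services, grab):
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        futures = [pool.submit(grab, service) for service in services]
        for future in futures:
            future.result()  # surface any exception raised inside a grab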

rsync concurrency limit

We need a way to limit the rsync concurrency from the workers to the master server. At the moment, the master server has close to 1000 open rsync processes and is struggling under the load.
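Two obvious routes: cap connections on the master via the max connections setting in rsyncd.conf, or throttle uploads on each worker. Below is a worker-side sketch using a semaphore; the destination and the limit are placeholders, not the project's values.

import subprocess
import threading

RSYNC_SLOTS = threading.BoundedSemaphore(4)  # at most 4 simultaneous uploads from this worker

def upload(warc_path, destination='rsync://example.org/newsbuddy/'):
    with RSYNC_SLOTS:
        subprocess.check_call(['rsync', '--partial', warc_path, destination])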

Ubuntu 16.04 gs-venv

worker.py won't work on Ubuntu 16.04 with gs-venv.
This is because it looks for the path when starting the crawl, and that path is hardcoded.

A sketch of the fix in Python:

import os.path

grab_site_dir = os.path.expanduser('~/.local/bin/')
if not os.path.isdir(grab_site_dir):
    grab_site_dir = os.path.expanduser('~/gs-venv/bin/')

Warrior UI restarts in a loop

I just tried pointing a warrior, which I'm running with Docker, at the NewsGrabber project. When I select the project, I see some (unhelpful) output in the debug log, and then this message:

There is no connection with the warrior.

The warrior software seems to be restarting in a loop. If I select another project, this stops. Is this a known issue? Is NewsGrabber still an active project?

2019-01-24 15:26:13,761 - seesaw.warrior - DEBUG - Update warrior hq.
2019-01-24 15:26:13,761 - seesaw.warrior - DEBUG - Warrior ID '20829'.
2019-01-24 15:26:14,259 - seesaw.warrior - DEBUG - Select project newsgrabber
2019-01-24 15:26:14,260 - seesaw.warrior - DEBUG - Start selected project newsgrabber (reinstall=False)
2019-01-24 15:26:14,262 - seesaw.warrior - DEBUG - Install project newsgrabber
2019-01-24 15:26:14,272 - seesaw.warrior - DEBUG - git pull from https://github.com/ArchiveTeam/NewsGrabber-Warrior/
2019-01-24 15:26:14,796 - seesaw.warrior - DEBUG - git operation: Already up-to-date.

German-language news sites

In no particular order.

Public broadcasters:

  • tagesschau.de
  • heute.de
  • dradio.de
  • dw.com

Newspapers

  • faz.net
  • taz.de
  • zeit.de
  • spiegel.de
  • welt.de
  • focus.de
  • tagesspiegel.de
  • sueddeutsche.de
  • stern.de

Assign a name to a worker

Something like
rsync://1.3.3.7/newsbuddy #CrossToTheRescue

So that if there is a problem with a worker, we can identify whose it is.

youtube-dl issue

ERROR Proxy error
Traceback (most recent call last):
  File "/home/box/wpull/freezer/pyinstaller/wpull_env/lib/python3.4/site-packages/wpull/proxy/server.py", line 55, in __call__
  File "/home/box/wpull/freezer/pyinstaller/wpull_env/lib/python3.4/site-packages/wpull/proxy/server.py", line 88, in __init__
AssertionError: /home/tanner/warrior/NewsGrabber-Warrior/proxy/proxy.crt
Deduplicating digest sha1:BZ2DZM6QWJSMLE5SJSCZMBTGYZNQKUGQ, url https://static.xx.fbcdn.net/rsrc.php/v3/y_/r/WTKa-3xI5b1.js
ERROR Proxy error
Traceback (most recent call last):
  File "/home/box/wpull/freezer/pyinstaller/wpull_env/lib/python3.4/site-packages/wpull/proxy/server.py", line 55, in __call__
  File "/home/box/wpull/freezer/pyinstaller/wpull_env/lib/python3.4/site-packages/wpull/proxy/server.py", line 88, in __init__
AssertionError: /home/tanner/warrior/NewsGrabber-Warrior/proxy/proxy.crt
WARNING WARNING: Could not send HEAD request to https://static.xx.fbcdn.net/images/icons/down_arrow_blue.gif: ''
INFO [generic] down_arrow_blue: Downloading webpage
WARNING ERROR: Unable to download webpage: '' (caused by BadStatusLine("''",)); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
WARNING Could not find external process metadata file: ./tmp-wpull-youtubedllycto27f/tmp*.info.json
INFO youtube-dl fetched \u2018https://static.xx.fbcdn.net/images/icons/down_arrow_blue.gif\u2019.
INFO Fetching \u2018https://static.xx.fbcdn.net/rsrc.php/v3ibfr4/yG/l/en_US/AdsPlacePageSetDataManager.js\u2019.
  1,033,306,112  30%    2.12MB/s    0:18:19  Deduplicating digest sha1:KCZQ7GBGK7DZ3JCQFZO26252O2BG3NKW, url https://static.xx.fbcdn.net/rsrc.php/v3ilgX4/yr/l/en_US/IuXRtSiO7qk.js
INFO Fetched \u2018https://static.xx.fbcdn.net/rsrc.php/v3ibfr4/yG/l/en_US/AdsPlacePageSetDataManager.js\u2019: 400 Bad Request. Length: 0 [text/html; charset="utf-8"].
Deduplicating digest sha1:7M7FGQFRL6EU6EBWRBUULTM7JU4TSMR4, url https://static.xx.fbcdn.net/rsrc.php/v3/y9/r/Zoqj4493zU6.js
WARNING WARNING: Assuming --restrict-filenames since file system encoding cannot encode all characters. Set the LC_ALL environment variable to fix this.
INFO Fetching \u2018https://static.xx.fbcdn.net/rsrc.php/v3iKY-4/yu/l/en_US/AdsLWIDialogUtils.js\u2019.
INFO Fetching \u2018https://static.xx.fbcdn.net/rsrc.php/v3iKY-4/yu/l/en_US/AdsLeadGenFormPreviewDetailsDataManager.js\u2019.
Deduplicating digest sha1:7BLPOOVPFAJLQ4TCGEKK7SN2IAPKWGZN, url https://static.xx.fbcdn.net/rsrc.php/v3/yi/r/GjW2KSaaE5I.js
INFO [generic] photo: Requesting header
ERROR Proxy error
Traceback (most recent call last):
  File "/home/box/wpull/freezer/pyinstaller/wpull_env/lib/python3.4/site-packages/wpull/proxy/server.py", line 55, in __call__
  File "/home/box/wpull/freezer/pyinstaller/wpull_env/lib/python3.4/site-packages/wpull/proxy/server.py", line 88, in __init__
AssertionError: /home/tanner/warrior/NewsGrabber-Warrior/proxy/proxy.crt
WARNING WARNING: Could not send HEAD request to https://static.xx.fbcdn.net/images/icons/photo.gif: ''
INFO [generic] photo: Downloading webpage
ERROR Proxy error
Traceback (most recent call last):
  File "/home/box/wpull/freezer/pyinstaller/wpull_env/lib/python3.4/site-packages/wpull/proxy/server.py", line 55, in __call__
  File "/home/box/wpull/freezer/pyinstaller/wpull_env/lib/python3.4/site-packages/wpull/proxy/server.py", line 88, in __init__
AssertionError: /home/tanner/warrior/NewsGrabber-Warrior/proxy/proxy.crt
WARNING ERROR: Unable to download webpage: '' (caused by BadStatusLine("''",)); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.

I keep getting this in my console.
