
newsgrabber's Introduction

NewsGrabber

THIS IS STILL IN BETA

An Archive Team project to save every news article from every news website. The dashboard of this grab can be viewed here: http://newsgrabber.harrycross.me:29000 and historical lists of grabbed URLs can be seen here: http://newsgrabber.harrycross.me . The IRC channel for NewsGrabber is #newsgrabber on irc.efnet.org .

How does this work?

In /services/ there is a list of Python files. Each of these files is for an individual news website. The files contain the seed URLs from which new URLs are discovered and then matched against regexes given in the same service file. Newly matched URLs are added to a list. The URLs are then grabbed, and the requests and responses are saved into WARC files, which are uploaded to the Internet Archive, where they can be downloaded directly and browsed in the Wayback Machine.

A website is rechecked for new URLs at the interval set by its refresh value (see below). All newly matched URLs are downloaded on the hour.
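In outline, the discovery step for a single service looks roughly like the sketch below. This is a simplified illustration, not the actual NewsGrabber code; it assumes the requests package, and the names are made up.

import re
import requests

def discover(service):
    """Fetch each seed URL and collect links that match the service's regexes."""
    found = set()
    for seed in service.urls:
        try:
            html = requests.get(seed, timeout=30).text
        except requests.RequestException:
            continue  # a failing seed should not abort the whole service
        for link in re.findall(r"https?://[^\s<>'\"]+", html):
            if any(re.search(pattern, link) for pattern in service.regex):
                found.add(link)
    return found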

Add a new website

Every new website that is added requires a Python file in /services/ in order to be grabbed. This Python file should be laid out as follows:

Filename

The name of the new Python file should start with web__ and end with .py. The name should contain the name of the website or a description of which part of the website it covers. The filename should only contain the following characters: 0123456789, abcdefghijklmnopqrstuvwxyz, ABCDEFGHIJKLMNOPQRSTUVWXYZ and _. For example: web__skynews_com.py or web__rtlnieuws_nl_videos.py.
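For reference, these naming rules can be expressed as a single regular expression; this is just a sketch for checking names, not code from the project.

import re

SERVICE_FILENAME = re.compile(r'^web__[0-9A-Za-z_]+\.py$')

SERVICE_FILENAME.match('web__skynews_com.py')    # matches
SERVICE_FILENAME.match('web__sky-news_com.py')   # None: '-' is not an allowed character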

refresh

This is a number indicating how often the URLs in urls should be recrawled for new URLs. When refresh = 4, the URLs in urls will be redownloaded and checked for new URLs every 300 seconds. For example:

refresh = 6

Refresh can be any number from 1 to 11 where:

1 = 5 seconds
2 = 30 seconds
3 = 60 seconds - 1 minute
4 = 300 seconds - 5 minutes
5 = 1800 seconds - 30 minutes
6 = 3600 seconds - 60 minutes - 1 hour
7 = 7200 seconds - 120 minutes - 2 hours
8 = 21600 seconds - 360 minutes - 6 hours
9 = 43200 seconds - 720 minutes - 12 hours
10 = 86400 seconds - 1,440 minutes - 24 hours - 1 day
11 = 172800 seconds - 2,880 minutes - 48 hours - 2 days
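The same mapping written out as a Python dictionary, for quick reference (derived from the table above, not taken from the project's source):

REFRESH_SECONDS = {
    1: 5,
    2: 30,
    3: 60,
    4: 300,
    5: 1800,
    6: 3600,
    7: 7200,
    8: 21600,
    9: 43200,
    10: 86400,
    11: 172800,
}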

version

This is the version number of the Python script. It should be the date plus a count of the updates made on that day, for example:

version = 20151215.01

urls

This is a list of URLs that will be checked for new links. These URLs should be pages that list the newest articles, such as RSS feeds and/or front pages linking to the newest articles. As few links as possible should be added, while still ensuring that all new articles are found. For example:

urls = ['http://www.theguardian.com/uk/rss']

regex

This is a list of regex patterns that will be matched against the links found in the downloaded URLs from urls. Links that match one or more of these regex patterns will be added to the list to be downloaded. Often the regexes will match the main domain of the news articles. For example:

regex = [r'^https?:\/\/[^\/]*theguardian\.com']

videoregex

This is a list of regex patterns that will be matched against the links found in the downloaded URLs from urls and that already match one or more regexes from regex. If a URL matches one or more of these patterns, it will be downloaded with youtube-dl. For example:

videoregex = [r'\/video\/']

If the website contains no videos, put an empty list, like this:

videoregex = []

liveregex

This is a list of regex patterns that will be matched against the links found in the downloaded URLs from urls and that already match one or more regexes from regex. If a URL matches one or more of these patterns, it will not be added to the list of already downloaded URLs once it has been grabbed. This means these URLs will be downloaded again every time they are found. This is intended for live pages that are repeatedly updated. For example:

liveregex = [r'\/liveblog\/']

If the website contains no live pages, put an empty list, like this:

liveregex = []

wikidata (optional, only add if known & available)

This is the ID of the Wikidata entry for the news website. It is optional and should only be included if it is available. It will be used to link the news site to Wikidata, so additional metadata can be referenced (e.g. geographical area, other identifiers, dates of publication). If the news website does not (yet) have an entry on Wikidata, feel free to create one (along with appropriate sources to verify it is suitable for inclusion) and add the new ID here. For example, the Wikidata URL for timesofisrael.com is https://www.wikidata.org/wiki/Q6449319; the ID part is Q6449319, the last segment of that URL. Only the ID should be added as the value of the wikidata variable, and it should be quoted. For example:

wikidata = 'Q6449319'
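Putting the fields above together, a complete service file, say web__theguardian_com.py, might look like the following. The values are simply the illustrative examples used above; in particular the wikidata ID is a placeholder, not the correct ID for this site.

# web__theguardian_com.py -- illustrative example assembled from the field descriptions above
refresh = 6
version = 20151215.01
urls = ['http://www.theguardian.com/uk/rss']
regex = [r'^https?:\/\/[^\/]*theguardian\.com']
videoregex = [r'\/video\/']
liveregex = [r'\/liveblog\/']
wikidata = 'Q6449319'  # placeholder: use the Wikidata ID of the actual site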

The IRC bot

NewsGrabber has an IRC bot, newsbuddy, which can be found in #newsgrabberbot at irc.efnet.org. The commands for the bot are:

!imgrab <SERVICE>, !immediate-grab <SERVICE>, !immediate_grab <SERVICE>: Grab URLs immediately after they're found. Add remove, rem or r to stop URLs from being grabbed immediately after they're found.

!help: View list of commands.

!stop: Write the lists of URLs, finish currently running grabs and do not start new grabs.

!start: Undo !stop and start new grabs.

!version: Get the currently used version of the script.

!writefiles: Write lists of URLs.

!move: Move WARC files.

!upload: Upload WARC files.

!info <SERVICE>, !information <SERVICE>: Get information about a specific service.

!EMERGENCY_STOP: Stop the scripts immediately.

newsgrabber's People

Contributors

arkiver2, atluxity, bna-robin, djsmiley2k, dopefishjustin, ersi, espes, harricross, harryc145, jesseweinstein, lagittaja, lucasrolff, phuzion, pressstartandselect, sollidius, sstollenwerk


newsgrabber's Issues

Handling of paywalled sites

How are they handled? To what degree can they be bypassed without getting in trouble?

Some sites have paywalls that rely on JS. They can either "fail open" or "fail closed". For example, if JS is disabled on dn.se all articles render fine, but if it is enabled, "locked articles" will be hidden (after a short delay during which the JS loads).

Example of "fail open": https://www.dn.se/nyheter/nyheter-hem/infrastrukturministern-kommer-traffa-generaldirektor-snarast/

I presume the first type is okay to scrape, since you're not required to run JS. Would it be illegal/taint the results to bypass the latter type?

Some sites have server-side paywalls, but you can get one month/day/week for free by registering an account without providing payment info. Are these okay to scrape?

Some sites have server-side paywalls where you have to provide payment info to register an account, but it can be blatantly fake (card number 1234123412341234, phone number 123123, asdasd goes in the other fields). Are these okay to scrape?

Some sites have server-side paywalls where you have to provide payment info to register an account, and rudimentary validation (luhn number on cc, dob has to be valid date). Are these okay?

Some sites require valid (eg. functioning) payment info, and perform some small $0.01 transaction to verify it. Is it okay to register valid accounts with real info and then use them for scraping?

Some sites have poor login security. Is it okay to get a list of logins and use them? This is blatantly illegal and hard to decentralize, but would information scraped through methods like this (e.g. using Tor/a proxy and submitting anonymously) be accepted?

A lot of news sites have paywalls, and it seems like a shame to not scrape them. Disregarding the issue of technical possibility, what would be acceptable to scrape? Also, some news sites provide .pdf downloads of their paper issues. Is there any project to scrape these?

Australian news websites

Add some Polish websites

Waiting to be submitted as a pull request.

WP
https://wiadomosci.wp.pl/ --- general news
https://sportowefakty.wp.pl/ --- sport news and facts
https://kobieta.wp.pl/ --- women's design, fit, trends, stars
https://facet.wp.pl/ --- news for men
https://gwiazdy.wp.pl/ --- celebrity news
https://moto.wp.pl/ --- moto news
https://tech.wp.pl/ --- tech news
https://opinie.wp.pl/ --- news opinions, looks like normal news
https://turystyka.wp.pl/ --- tourism news
https://finanse.wp.pl/ --- finance news
https://dom.wp.pl/ --- home and interior; more articles than news
https://film.wp.pl/ --- film/movies news, trailers, catalog of all movies ....
https://ksiazki.wp.pl/ --- books news, articles
https://kuchnia.wp.pl/ --- kitchen recipes, diets, cooking
https://fitness.wp.pl/ --- fitness articles, no obvious news
https://komorkomania.pl/ --- tech news
https://fotoblogia.pl/ --- photo/tech news
https://portal.abczdrowie.pl/ --- it's about health, but also contains information about pharmacy
https://parenting.pl/ --- news about kids and parents
https://www.money.pl/ --- business news
http://biztok.money.pl/ --- business news
https://kafeteria.pl/ --- general news
https://www.dobreprogramy.pl/ --- a very big site about hardware/software, including free/non-free software to download; thousands of files and probably millions of pieces to download via ArchiveBot

--- the sites above are affiliated with wp.pl, one of the biggest news sites in Poland ---
--- some WP sites / WP-affiliated sites are listed in the menu, but we cannot connect to them; they are not included in this list ---

http://www.nowosci.com.pl/
http://www.polsatnews.pl/
http://www.rp.pl/
https://wiadomosci.onet.pl/

This issue is a WIP.
This issue needs review of what should be in NewsGrabber and what should not. Ready for review:

  • WP

Bot shouldn't crash if a badly named service file is committed

According to @HarryC145 in IRC, the entire bot apparently crashes if someone commits a badly-named file.

For example, if someone were to make a file named web_importantnewssite_com.py instead of web__importantnewssite_com.py, the bot would crash.

That shouldn't happen. The logical fix would be to print an error to stderr (and perhaps IRC) and skip the file.
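A minimal sketch of that fix, assuming the services are loaded by scanning a /services/ directory; the function name and directory handling here are illustrative, not the bot's actual code.

import os
import re
import sys

VALID_SERVICE = re.compile(r'^web__[0-9A-Za-z_]+\.py$')  # naming rules from the Filename section

def load_service_files(services_dir='services'):
    """Return valid service filenames, skipping (instead of crashing on) badly named ones."""
    valid = []
    for filename in sorted(os.listdir(services_dir)):
        if VALID_SERVICE.match(filename):
            valid.append(filename)
        else:
            print('Skipping badly named service file: {}'.format(filename), file=sys.stderr)
    return valid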

Skip a site if it's having technical issues

If a site is having technical issues, say it has gone down, the site should be skipped in that grab so time is not wasted. This could be implemented as "if x out of y URLs are not responding, then stop".
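One way to express that rule, as a sketch; the threshold, names and the shape of the results are assumptions, not project code.

def should_skip(responses, max_failure_ratio=0.9):
    """Skip a service for this round if most of its seed URLs did not respond."""
    if not responses:
        return False
    failures = sum(1 for ok in responses if not ok)
    return failures / len(responses) >= max_failure_ratio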

Add an option for special URL extraction rules for seed URLs of a service

Add an option for special URL extraction rules for the seed URLs of a service. Sometimes the URLs on a website aren't static, but are generated by more complex code. With this option, a special rule for the seed URLs of a service could be added to also extract the URLs that don't appear in the page source as static URLs.

Use sitemap.xml

Use the sitemap by default to find more URLs. The sitemap format should be implemented, recursing into sitemaps that refer to other, more detailed sitemaps. Inspection of the sitemap should be possible to disable via an option in the service itself.
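A sketch of recursive sitemap discovery, assuming the requests package for fetching; real sitemaps may also be gzipped, which this sketch does not handle.

import requests
import xml.etree.ElementTree as ET

NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

def sitemap_urls(sitemap_url, seen=None):
    """Return all page URLs from a sitemap, following nested sitemap indexes."""
    seen = seen if seen is not None else set()
    if sitemap_url in seen:
        return set()
    seen.add(sitemap_url)
    root = ET.fromstring(requests.get(sitemap_url, timeout=30).content)
    urls = set()
    for loc in root.findall('.//sm:loc', NS):
        if root.tag.endswith('sitemapindex'):
            urls |= sitemap_urls(loc.text.strip(), seen)
        else:
            urls.add(loc.text.strip())
    return urls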

Support selenium for better video and application extraction

Selenium or other automated browsers can help download videos and applications by loading pages like a normal browser would and clicking on the video or application to play it. This would also extract and download videos from webpages that are currently not supported by youtube-dl.
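As a rough illustration of the idea, the sketch below renders a page in a real browser and reads the src attributes of any video elements that scripts have inserted into the DOM; it assumes Selenium with a Firefox/geckodriver setup and is not part of the project.

from selenium import webdriver
from selenium.webdriver.common.by import By

def rendered_video_sources(url):
    """Load the page in a real browser so script-generated video elements appear in the DOM."""
    driver = webdriver.Firefox()
    try:
        driver.get(url)
        return [el.get_attribute('src') for el in driver.find_elements(By.TAG_NAME, 'video')]
    finally:
        driver.quit()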

Support ignoreregex for each site

Sites should have an ignoreregex parameter to avoid regrabbing pixel trackers and other unwanted files (such as files that consistently give 4xx errors).
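Conceptually, the new parameter would add one more filter to the discovery step. A sketch, noting that ignoreregex does not exist yet (that is what this issue asks for):

import re

def wanted(url, service):
    """Keep a URL only if it matches the service regexes and no ignore pattern."""
    if not any(re.search(p, url) for p in service.regex):
        return False
    ignore = getattr(service, 'ignoreregex', [])  # proposed optional field
    return not any(re.search(p, url) for p in ignore)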

Adding newspapers

I remember a discussion on IRC about adding the archiving of newspapers to the bot. Just opening an issue so that this is recorded and noted somewhere.

Use Thread/Process pools

Using thread or process pools and queues, instead of starting a new thread for each grab (or subtask), would limit the number of jobs running at the same time and thus keep the server load at a manageable level.
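A minimal sketch of that approach with concurrent.futures; the grab callable and the worker count are assumptions, not values from the project.

from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 20  # cap on simultaneous grabs, instead of one unbounded thread per grab

def run_grabs(services, grab):
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        futures = [pool.submit(grab, service) for service in services]
        for future in futures:
            future.result()  # surface any exception raised inside a grab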

rsync concurrency limit

We need a way to limit the rsync concurrency from the workers to the master server. At the moment, the master server has close to 1000 open rsync processes and is struggling under the load.
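Two obvious routes: cap connections on the master via the max connections setting in rsyncd.conf, or throttle uploads on each worker. Below is a worker-side sketch using a semaphore; the destination and the limit are placeholders, not the project's values.

import subprocess
import threading

RSYNC_SLOTS = threading.BoundedSemaphore(4)  # at most 4 simultaneous uploads from this worker

def upload(warc_path, destination='rsync://example.org/newsbuddy/'):
    with RSYNC_SLOTS:
        subprocess.check_call(['rsync', '--partial', warc_path, destination])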

Ubuntu 16.04 gs-venv

worker.py won't work on Ubuntu 16.04 with gs-venv.
This is because it looks for the path when starting the crawl, and that path is hardcoded.

A sketch of the fix in Python:

import os.path

grab_site_dir = os.path.expanduser('~/.local/bin/')
if not os.path.isdir(grab_site_dir):
    grab_site_dir = os.path.expanduser('~/gs-venv/bin/')

Warrior UI restarts in a loop

I just tried pointing a warrior, which I'm running with Docker, at the NewsGrabber project. When I select the project, I see some (unhelpful) output in the debug log, and then this message:

There is no connection with the warrior.

The warrior software seems to be restarting in a loop. If I select another project, this stops. Is this a known issue? Is NewsGrabber still an active project?

2019-01-24 15:26:13,761 - seesaw.warrior - DEBUG - Update warrior hq.
2019-01-24 15:26:13,761 - seesaw.warrior - DEBUG - Warrior ID '20829'.
2019-01-24 15:26:14,259 - seesaw.warrior - DEBUG - Select project newsgrabber
2019-01-24 15:26:14,260 - seesaw.warrior - DEBUG - Start selected project newsgrabber (reinstall=False)
2019-01-24 15:26:14,262 - seesaw.warrior - DEBUG - Install project newsgrabber
2019-01-24 15:26:14,272 - seesaw.warrior - DEBUG - git pull from https://github.com/ArchiveTeam/NewsGrabber-Warrior/
2019-01-24 15:26:14,796 - seesaw.warrior - DEBUG - git operation: Already up-to-date.

German-language news sites

In no particular order.

Public broadcasters:

  • tagesschau.de
  • heute.de
  • dradio.de
  • dw.com

Newspapers

  • faz.net
  • taz.de
  • zeit.de
  • spiegel.de
  • welt.de
  • focus.de
  • tagesspiegel.de
  • sueddeutsche.de
  • stern.de

Assign a name to a worker

Something like
rsync://1.3.3.7/newsbuddy #CrossToTheRescue

So that if there is a problem with a worker, we can identify whose it is.

youtube-dl issue

ERROR Proxy error
Traceback (most recent call last):
  File "/home/box/wpull/freezer/pyinstaller/wpull_env/lib/python3.4/site-packages/wpull/proxy/server.py", line 55, in __call__
  File "/home/box/wpull/freezer/pyinstaller/wpull_env/lib/python3.4/site-packages/wpull/proxy/server.py", line 88, in __init__
AssertionError: /home/tanner/warrior/NewsGrabber-Warrior/proxy/proxy.crt
Deduplicating digest sha1:BZ2DZM6QWJSMLE5SJSCZMBTGYZNQKUGQ, url https://static.xx.fbcdn.net/rsrc.php/v3/y_/r/WTKa-3xI5b1.js
ERROR Proxy error
Traceback (most recent call last):
  File "/home/box/wpull/freezer/pyinstaller/wpull_env/lib/python3.4/site-packages/wpull/proxy/server.py", line 55, in __call__
  File "/home/box/wpull/freezer/pyinstaller/wpull_env/lib/python3.4/site-packages/wpull/proxy/server.py", line 88, in __init__
AssertionError: /home/tanner/warrior/NewsGrabber-Warrior/proxy/proxy.crt
WARNING WARNING: Could not send HEAD request to https://static.xx.fbcdn.net/images/icons/down_arrow_blue.gif: ''
INFO [generic] down_arrow_blue: Downloading webpage
WARNING ERROR: Unable to download webpage: '' (caused by BadStatusLine("''",)); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
WARNING Could not find external process metadata file: ./tmp-wpull-youtubedllycto27f/tmp*.info.json
INFO youtube-dl fetched \u2018https://static.xx.fbcdn.net/images/icons/down_arrow_blue.gif\u2019.
INFO Fetching \u2018https://static.xx.fbcdn.net/rsrc.php/v3ibfr4/yG/l/en_US/AdsPlacePageSetDataManager.js\u2019.
  1,033,306,112  30%    2.12MB/s    0:18:19  Deduplicating digest sha1:KCZQ7GBGK7DZ3JCQFZO26252O2BG3NKW, url https://static.xx.fbcdn.net/rsrc.php/v3ilgX4/yr/l/en_US/IuXRtSiO7qk.js
INFO Fetched \u2018https://static.xx.fbcdn.net/rsrc.php/v3ibfr4/yG/l/en_US/AdsPlacePageSetDataManager.js\u2019: 400 Bad Request. Length: 0 [text/html; charset="utf-8"].
Deduplicating digest sha1:7M7FGQFRL6EU6EBWRBUULTM7JU4TSMR4, url https://static.xx.fbcdn.net/rsrc.php/v3/y9/r/Zoqj4493zU6.js
WARNING WARNING: Assuming --restrict-filenames since file system encoding cannot encode all characters. Set the LC_ALL environment variable to fix this.
INFO Fetching \u2018https://static.xx.fbcdn.net/rsrc.php/v3iKY-4/yu/l/en_US/AdsLWIDialogUtils.js\u2019.
INFO Fetching \u2018https://static.xx.fbcdn.net/rsrc.php/v3iKY-4/yu/l/en_US/AdsLeadGenFormPreviewDetailsDataManager.js\u2019.
Deduplicating digest sha1:7BLPOOVPFAJLQ4TCGEKK7SN2IAPKWGZN, url https://static.xx.fbcdn.net/rsrc.php/v3/yi/r/GjW2KSaaE5I.js
INFO [generic] photo: Requesting header
ERROR Proxy error
Traceback (most recent call last):
  File "/home/box/wpull/freezer/pyinstaller/wpull_env/lib/python3.4/site-packages/wpull/proxy/server.py", line 55, in __call__
  File "/home/box/wpull/freezer/pyinstaller/wpull_env/lib/python3.4/site-packages/wpull/proxy/server.py", line 88, in __init__
AssertionError: /home/tanner/warrior/NewsGrabber-Warrior/proxy/proxy.crt
WARNING WARNING: Could not send HEAD request to https://static.xx.fbcdn.net/images/icons/photo.gif: ''
INFO [generic] photo: Downloading webpage
ERROR Proxy error
Traceback (most recent call last):
  File "/home/box/wpull/freezer/pyinstaller/wpull_env/lib/python3.4/site-packages/wpull/proxy/server.py", line 55, in __call__
  File "/home/box/wpull/freezer/pyinstaller/wpull_env/lib/python3.4/site-packages/wpull/proxy/server.py", line 88, in __init__
AssertionError: /home/tanner/warrior/NewsGrabber-Warrior/proxy/proxy.crt
WARNING ERROR: Unable to download webpage: '' (caused by BadStatusLine("''",)); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.

I keep getting this in my console.
