itkach / mwscrape
Download rendered articles from MediaWiki API to CouchDB
License: Mozilla Public License 2.0
Is Python 3 support planned? AFAIK mwclient already supports Python 3 in its recent versions, so all of mwscrape's dependencies already support Python 3.
This issue is very important for environments such as Buildroot that support only one Python instance (either Python 2 or Python 3) at a time.
The 2to3 tool found several needed changes.
I'm not a user of mwscrape but rather a co-maintainer of various Python packages in Buildroot, so it would be great if you could fix this issue.
Thanks.
Using
time mwscrape de.m.wikipedia.org --delete-not-found --changes-since 20150802
or
time mwscrape de.m.wikipedia.org --delete-not-found --recent --recent-days 14
does not provide any feedback (screen listing) of the collected articles.
The output
Starting session de-m-wikipedia-org-1439950680-485
does not create a document in mwscrape, and the scrape stops after approximately 2 hours.
Therefore I guess no data is collected.
The CouchDB documentation says it would be "wise" to create an admin to restrict full access to CouchDB, and a fresh CouchDB installation (Futon on Apache CouchDB 1.6.1) shows at the bottom right:
Welcome to Admin Party!
Everyone is admin. Fix this
But it seems mwscrape has no parameters to log in to a secured CouchDB.
Without a login I get an error:
couchdb.http.Unauthorized: (u'unauthorized', u'You are not a server admin.')
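As a workaround, CouchDB accepts HTTP basic-auth credentials embedded in the server URL, which can then be passed to mwscrape's -c option (that usage appears elsewhere in this thread). A minimal sketch of building such a URL with the standard library, with placeholder credentials, quoting any special characters in the password:

```python
from urllib.parse import quote, urlsplit

def couch_url(user, password, host='localhost', port=5984):
    """Build a CouchDB server URL with basic-auth credentials embedded."""
    # quote() protects characters like '@' or '/' inside the password
    return 'http://%s:%s@%s:%d' % (
        quote(user, safe=''), quote(password, safe=''), host, port)

# Example: pass the result to mwscrape's -c option
url = couch_url('admin', 'p@ss/word')
parts = urlsplit(url)  # recovers username/host, confirming the URL is well-formed
```

This sidesteps the missing login parameters as long as the URL form of the -c option is honored.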
Since CouchDB is a rather big dependency that requires installation as a system-wide service, would it be possible to add support for an SQLite database as well? I would imagine that SQLite would be able to cope with all MediaWiki sites except Wikipedia, and it could make mwscrape much easier to set up for new users, since it would simply write to a single local file.
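A minimal sketch of what such a backend could look like (the table layout is an assumption, not mwscrape's actual schema): one table keyed by title, storing the rendered HTML and revision, so re-scrapes can skip up-to-date articles much like the CouchDB backend does.

```python
import sqlite3

def open_db(path):
    """Open (or create) a single-file article store."""
    conn = sqlite3.connect(path)
    conn.execute('''CREATE TABLE IF NOT EXISTS articles (
                        title    TEXT PRIMARY KEY,
                        revision INTEGER NOT NULL,
                        html     TEXT NOT NULL)''')
    return conn

def save_article(conn, title, revision, html):
    # INSERT OR REPLACE mimics CouchDB's update-by-id semantics
    conn.execute('INSERT OR REPLACE INTO articles VALUES (?, ?, ?)',
                 (title, revision, html))
    conn.commit()

def stored_revision(conn, title):
    """Return the stored revision for a title, or None if not scraped yet."""
    row = conn.execute('SELECT revision FROM articles WHERE title = ?',
                       (title,)).fetchone()
    return row[0] if row else None
```

Everything lives in one file (or in memory with ':memory:'), so there is no service to install or secure.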
Hi,
mwclient/mwclient#213 changed the URL handling, causing these warnings when calling scrape.py:
/usr/lib/python2.7/site-packages/mwclient/client.py:377: DeprecationWarning: Specifying host as a tuple is deprecated as of mwclient 0.10.0. Please use the new scheme argument instead.
Why scrape an online instance of MediaWiki when you can do it offline?
Hello, please help me solve this problem:
(env-mwscrape)root@nikitozzz:~# mwscrape http://sportwiki.to/ --site-path=/
Starting session sportwiki-to-1439500632-631
Starting at None
0 5-HTP
5-HTP is up to date (rev. 60706), skipping
Traceback (most recent call last):
File "/root/env-mwscrape/bin/mwscrape", line 9, in <module>
load_entry_point('mwscrape==1.0', 'console_scripts', 'mwscrape')()
File "/root/env-mwscrape/local/lib/python2.7/site-packages/mwscrape/scrape.py", line 489, in main
for page in ipages(pages):
File "/root/env-mwscrape/local/lib/python2.7/site-packages/mwscrape/scrape.py", line 472, in ipages
print('%7s %s' % (index, title))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-12: ordinal not in range(128)
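The crash comes from Python 2's print writing through an ASCII-only stream (common when stdout is piped); a possible workaround, an assumption rather than the project's actual fix, is to encode the formatted line to UTF-8 before printing, or to set PYTHONIOENCODING=utf-8 in the environment. A sketch mirroring the '%7s %s' pattern from scrape.py:

```python
# -*- coding: utf-8 -*-
import sys

def format_line(index, title):
    """Format one progress line like scrape.py's '%7s %s' pattern."""
    return u'%7s %s' % (index, title)

def safe_print(index, title):
    line = format_line(index, title)
    if sys.version_info[0] == 2:
        # Encode explicitly so an ASCII-only stdout cannot raise
        # UnicodeEncodeError on non-Latin titles under Python 2
        line = line.encode('utf-8')
    print(line)
```

Under Python 3 strings print as-is, so the same helper works in both interpreters.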
I updated mwscrape to the version including the --speed parameter. Using
mwscrape de.m.wikipedia.org --speed 5 --delete-not-found
to replace ten individual scrapes resulted in mwscrape.db growing to 72.5 GB (!) after a couple of days of running. Compacting manually resolved the size issue once.
I guess the compaction of mwscrape.db does not work properly with --speed, although it worked correctly without --speed (sometimes quite aggressively, though).
Perhaps allow growth of mwscrape.db up to a specific (adjustable?) value like 256 MB. This would compact mwscrape.db after around 5000 scrapes, and the compaction load on CouchDB would not be excessive. With an adjustable value the hard disk space could be tuned.
This would give the option to tune the system according to space and speed capabilities.
There is an option within the CouchDB settings for automatic compaction; however, this seems to run system-wide and not for a specific database.
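The suggestion above could be sketched as a periodic check against a configurable size threshold (hypothetical helper names, not part of mwscrape; the couchdb package exposes the database size via db.info() and triggers compaction via db.compact()):

```python
def should_compact(disk_size, threshold=256 * 1024 ** 2):
    """True once the database file has grown past the threshold (bytes)."""
    return disk_size > threshold

def maybe_compact(db, threshold=256 * 1024 ** 2):
    # Call this every N scraped articles; compaction is asynchronous
    # on the CouchDB side, so this only requests it
    if should_compact(db.info()['disk_size'], threshold):
        db.compact()
```

With a 256 MB default, compaction would fire roughly every few thousand scraped articles rather than letting the file grow unbounded.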
Scraping enwiktionary to CouchDB with
lang=en
mwscrape -c http://admin:password@localhost:5984 https://$lang.wiktionary.org --db $lang-wiktionary-org --speed 5
I got 2,404,262 articles in CouchDB after a week of scraping.
According to https://en.wikipedia.org/wiki/Wiktionary there should be 7.5 million articles. I know a lot of that content is Chinese and so on, but the discrepancy is significant.
dewiktionary is pretty close, with 1.09 million in CouchDB and online;
elwiktionary has 1,226,405 in CouchDB and 1,318,825 online;
frwiktionary has 4,709,048 in CouchDB and 4,798,530 online;
which seems fine to me. There are always deviations depending on the timestamp and the different counting methods of Wikimedia itself.
Where does the difference in enwiktionary come from?
Is there any filter selecting English language entries only?
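One likely source of such gaps is which counter is being compared: the MediaWiki API's siteinfo statistics distinguish "articles" (content pages only) from "pages" (everything, including redirects and non-content namespaces), and headline figures often quote the larger number. A sketch of reading the relevant fields from a response shaped like action=query&meta=siteinfo&siprop=statistics&format=json (the numbers here are illustrative, not real enwiktionary figures):

```python
import json

# Illustrative payload in the shape the siteinfo API returns
sample = json.loads(
    '{"query": {"statistics": {"pages": 7500000, "articles": 2400000}}}')

stats = sample['query']['statistics']
articles = stats['articles']  # content pages only
pages = stats['pages']        # also counts redirects, talk and other namespaces
```

Comparing the CouchDB document count against "articles" rather than "pages" (or against a wiki's marketing figure) should narrow the apparent discrepancy considerably.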
A user came up with the following finding in dewiki:
Article Westerwelle Guido
His death was entered in Wikipedia on the same day, March 18, 2016. However, it showed up neither in the April nor in the May compilation of the dewiki data. The log files I create do not show any change to Westerwelle.
Article Genscher Hans-Dietrich
He died on March 31, 2016, and his data was in the April and May dewiki. The log files do not show any change to Genscher.
The scrape of dewiki runs as a cron job once a day:
mwscrape $couchdb --delete-not-found --recent --recent-days 21 2>&1 | tee /home/guest/dewi-$Datum.log
where $couchdb is de-m-wikipedia-org and $Datum is the current date.
The 2>&1 | tee redirects the output of the scrape both onto the screen and into the logfile.
This way I have an overlap of 21 days to catch items I did not capture the first time, or any broken connections during scraping.
Nevertheless, Genscher was updated correctly but Westerwelle was not.
It is virtually impossible to check whether the changes made are completely scraped.
Any thoughts?
Would it not be useful to have a parameter --end, like --endkey in mwscrape2slob, to have better control over the scrape process?
Especially when scraping not with one process but with many scrapes at the same time, this ends up in unnecessary requests once one scrape reaches the starting point of the next.
With a parameter --end I could start one scrape at, e.g., AAA and end it at BZZZ, etc.
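The proposed behaviour could be sketched as a simple title-range filter (a hypothetical helper, not part of mwscrape): each worker processes only titles in its half-open [start, end) range, so parallel scrapes never overlap.

```python
def in_range(title, start=None, end=None):
    """True when title falls in the half-open range [start, end).

    A None bound leaves that side open, matching the current behaviour
    of --start with no end point.
    """
    if start is not None and title < start:
        return False
    if end is not None and title >= end:
        return False
    return True

# Partitioning the title space across two parallel workers
worker1 = lambda t: in_range(t, 'AAA', 'BZZZ')
worker2 = lambda t: in_range(t, 'BZZZ', None)
```

Because the ranges share a boundary but do not overlap, no title is requested twice and no worker runs past its neighbour's starting point.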