itkach / mwscrape
Download rendered articles from MediaWiki API to CouchDB
License: Mozilla Public License 2.0
Is Python 3 support planned? AFAIK mwclient already supports Python 3 in its recent versions, so all of mwscrape's dependencies already support Python 3.
This issue is very important for environments such as Buildroot that support only one Python instance (either Python 2 or Python 3) at a time.
The 2to3 tool found several needed changes.
I'm not a user of mwscrape but rather a co-maintainer of various Python packages in Buildroot, so it would be great if you could fix this issue.
Thanks.
Using
time mwscrape de.m.wikipedia.org --delete-not-found --changes-since 20150802
or
time mwscrape de.m.wikipedia.org --delete-not-found --recent --recent-days 14
does not provide any feedback (screen listing) of the collected articles.
The output
Starting session de-m-wikipedia-org-1439950680-485
does not create a document in mwscrape, and the scrape stops after approximately 2 hours.
Therefore I guess no data is collected.
The CouchDB documentation says it would be "wise" to create an admin to restrict full access to CouchDB, and a fresh CouchDB installation (Futon on Apache CouchDB 1.6.1) shows at the bottom right:
Welcome to Admin Party!
Everyone is admin. Fix this
But it seems mwscrape has no parameters to log in to a secured CouchDB.
Without a login I get an error:
couchdb.http.Unauthorized: (u'unauthorized', u'You are not a server admin.')
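As a workaround, CouchDB accepts HTTP basic-auth credentials embedded in the server URL, which can then be passed to mwscrape's -c option (that usage appears elsewhere in this thread). A minimal sketch of building such a URL with the standard library, with placeholder credentials, quoting any special characters in the password:

```python
from urllib.parse import quote, urlsplit

def couch_url(user, password, host='localhost', port=5984):
    """Build a CouchDB server URL with basic-auth credentials embedded."""
    # quote() protects characters like '@' or '/' inside the password
    return 'http://%s:%s@%s:%d' % (
        quote(user, safe=''), quote(password, safe=''), host, port)

# Example: pass the result to mwscrape's -c option
url = couch_url('admin', 'p@ss/word')
parts = urlsplit(url)  # recovers username/host, confirming the URL is well-formed
```

This sidesteps the missing login parameters as long as the URL form of the -c option is honored.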
Since CouchDB is a rather big dependency that requires installation as a system-wide service, would it be possible to add support for an SQLite database as well? I would imagine that SQLite would be able to cope with all MediaWiki sites except Wikipedia, and it could make mwscrape much easier to set up for new users, since it would simply write to a single local file.
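A minimal sketch of what such a backend could look like (the table layout is an assumption, not mwscrape's actual schema): one table keyed by title, storing the rendered HTML and revision, so re-scrapes can skip up-to-date articles much like the CouchDB backend does.

```python
import sqlite3

def open_db(path):
    """Open (or create) a single-file article store."""
    conn = sqlite3.connect(path)
    conn.execute('''CREATE TABLE IF NOT EXISTS articles (
                        title    TEXT PRIMARY KEY,
                        revision INTEGER NOT NULL,
                        html     TEXT NOT NULL)''')
    return conn

def save_article(conn, title, revision, html):
    # INSERT OR REPLACE mimics CouchDB's update-by-id semantics
    conn.execute('INSERT OR REPLACE INTO articles VALUES (?, ?, ?)',
                 (title, revision, html))
    conn.commit()

def stored_revision(conn, title):
    """Return the stored revision for a title, or None if not scraped yet."""
    row = conn.execute('SELECT revision FROM articles WHERE title = ?',
                       (title,)).fetchone()
    return row[0] if row else None
```

Everything lives in one file (or in memory with ':memory:'), so there is no service to install or secure.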
Hi,
mwclient/mwclient#213 changed the URL handling, causing these warnings when calling scrape.py:
/usr/lib/python2.7/site-packages/mwclient/client.py:377: DeprecationWarning: Specifying host as a tuple is deprecated as of mwclient 0.10.0. Please use the new scheme argument instead.
Why scrape an online instance of MediaWiki when you can do it offline?
Hello, please help me solve this problem:
(env-mwscrape)root@nikitozzz:~# mwscrape http://sportwiki.to/ --site-path=/
Starting session sportwiki-to-1439500632-631
Starting at None
0 5-HTP
5-HTP is up to date (rev. 60706), skipping
Traceback (most recent call last):
File "/root/env-mwscrape/bin/mwscrape", line 9, in <module>
load_entry_point('mwscrape==1.0', 'console_scripts', 'mwscrape')()
File "/root/env-mwscrape/local/lib/python2.7/site-packages/mwscrape/scrape.py", line 489, in main
for page in ipages(pages):
File "/root/env-mwscrape/local/lib/python2.7/site-packages/mwscrape/scrape.py", line 472, in ipages
print('%7s %s' % (index, title))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-12: ordinal not in range(128)
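The crash comes from Python 2's print writing through an ASCII-only stream (common when stdout is piped); a possible workaround, an assumption rather than the project's actual fix, is to encode the formatted line to UTF-8 before printing, or to set PYTHONIOENCODING=utf-8 in the environment. A sketch mirroring the '%7s %s' pattern from scrape.py:

```python
# -*- coding: utf-8 -*-
import sys

def format_line(index, title):
    """Format one progress line like scrape.py's '%7s %s' pattern."""
    return u'%7s %s' % (index, title)

def safe_print(index, title):
    line = format_line(index, title)
    if sys.version_info[0] == 2:
        # Encode explicitly so an ASCII-only stdout cannot raise
        # UnicodeEncodeError on non-Latin titles under Python 2
        line = line.encode('utf-8')
    print(line)
```

Under Python 3 strings print as-is, so the same helper works in both interpreters.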
I updated mwscrape to the version including the --speed parameter. Using
mwscrape de.m.wikipedia.org --speed 5 --delete-not-found
to replace ten individual scrapes resulted in mwscrape.db growing to 72.5 GB (!) after a couple of days of running. Compacting manually resolved the size issue once.
I guess the compaction of mwscrape.db does not work properly with --speed, although it worked correctly without --speed (sometimes quite aggressively, though).
Perhaps allow growth of mwscrape.db up to a specific (adjustable?) value like 256 MB. This would compact mwscrape.db after around 5000 scrapes, and the compaction load on CouchDB would not be excessive. With an adjustable value the hard disk space could be tuned.
This would give the option to tune the system according to space and speed capabilities.
There is an option within the CouchDB settings for automatic compaction; however, this seems to run system-wide and not for a specific database.
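The suggestion above could be sketched as a periodic check against a configurable size threshold (hypothetical helper names, not part of mwscrape; the couchdb package exposes the database size via db.info() and triggers compaction via db.compact()):

```python
def should_compact(disk_size, threshold=256 * 1024 ** 2):
    """True once the database file has grown past the threshold (bytes)."""
    return disk_size > threshold

def maybe_compact(db, threshold=256 * 1024 ** 2):
    # Call this every N scraped articles; compaction is asynchronous
    # on the CouchDB side, so this only requests it
    if should_compact(db.info()['disk_size'], threshold):
        db.compact()
```

With a 256 MB default, compaction would fire roughly every few thousand scraped articles rather than letting the file grow unbounded.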
Scraping enwiktionary to CouchDB with
lang=en
mwscrape -c http://admin:password@localhost:5984 https://$lang.wiktionary.org --db $lang-wiktionary-org --speed 5
I got 2,404,262 articles in CouchDB after a week of scraping.
According to https://en.wikipedia.org/wiki/Wiktionary there should be 7.5 million articles. I know a lot of that content is Chinese and so on, but the discrepancy is significant.
dewiktionary is pretty close, with 1.09 million in CouchDB and online;
elwiktionary has 1,226,405 in CouchDB and 1,318,825 online;
frwiktionary has 4,709,048 in CouchDB and 4,798,530 online;
which seems fine to me. There are always deviations depending on the timestamp and the different counting methods of Wikimedia itself.
Where does the difference in enwiktionary come from?
Is there any filter selecting English language entries only?
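One likely source of such gaps is which counter is being compared: the MediaWiki API's siteinfo statistics distinguish "articles" (content pages only) from "pages" (everything, including redirects and non-content namespaces), and headline figures often quote the larger number. A sketch of reading the relevant fields from a response shaped like action=query&meta=siteinfo&siprop=statistics&format=json (the numbers here are illustrative, not real enwiktionary figures):

```python
import json

# Illustrative payload in the shape the siteinfo API returns
sample = json.loads(
    '{"query": {"statistics": {"pages": 7500000, "articles": 2400000}}}')

stats = sample['query']['statistics']
articles = stats['articles']  # content pages only
pages = stats['pages']        # also counts redirects, talk and other namespaces
```

Comparing the CouchDB document count against "articles" rather than "pages" (or against a wiki's marketing figure) should narrow the apparent discrepancy considerably.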
A user came up with the following finding in dewiki:
Article Westerwelle Guido
His death was entered in Wikipedia on the same day, March 18, 2016. However, it showed up neither in the April nor in the May compilation of the dewiki data. The log files I create do not show any change to Westerwelle.
Article Genscher Hans-Dietrich
He died on March 31, 2016, and his data was in the April and May dewiki. The log files do not show any change to Genscher.
The scrape of dewiki runs as a cron job once a day:
mwscrape $couchdb --delete-not-found --recent --recent-days 21 2>&1 | tee /home/guest/dewi-$Datum.log
where $couchdb is de-m-wikipedia-org and $Datum is the current date.
The 2>&1 | tee redirects the output of the scrape both onto the screen and into the logfile.
This way I have an overlap of 21 days to catch items I did not capture the first time, or any broken connections during scraping.
Nevertheless, Genscher was updated correctly but Westerwelle was not.
It is virtually impossible to check whether the changes made are completely scraped.
Any thoughts?
Would it not be useful to have a parameter --end, like --endkey in mwscrape2slob, to have better control over the scrape process?
Especially when scraping not with one process but with many scrapes at the same time, this ends up in unnecessary requests once one scrape reaches the starting point of the next.
With a parameter --end I could start one scrape at, e.g., AAA and end it at BZZZ, etc.
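The proposed behaviour could be sketched as a simple title-range filter (a hypothetical helper, not part of mwscrape): each worker processes only titles in its half-open [start, end) range, so parallel scrapes never overlap.

```python
def in_range(title, start=None, end=None):
    """True when title falls in the half-open range [start, end).

    A None bound leaves that side open, matching the current behaviour
    of --start with no end point.
    """
    if start is not None and title < start:
        return False
    if end is not None and title >= end:
        return False
    return True

# Partitioning the title space across two parallel workers
worker1 = lambda t: in_range(t, 'AAA', 'BZZZ')
worker2 = lambda t: in_range(t, 'BZZZ', None)
```

Because the ranges share a boundary but do not overlap, no title is requested twice and no worker runs past its neighbour's starting point.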