Comments (3)
Would it not be useful to have an --end parameter?
It may be useful in some scenarios, but it only works when the list of titles is sorted. That's the case when scraping everything, but not when downloading recent changes, and it may not be the case when downloading titles specified in a file.

There's pretty much one reason to start multiple processes though: to download faster. Downloading with a single-threaded process is too slow to be practical for large MediaWiki sites like enwiki, and everybody seems to be firing up multiple processes anyway, so I added options to download in several parallel threads (see 977c469). This is better than starting multiple processes - easier on both the local system and the target site - and, combined with --recent, it is probably a better way in general to keep existing mwscrape databases up to date than checking all titles all the time. (It may not seem "fast" enough, but let's not forget that we don't want to attack or harm Wikipedia.)

I'm considering adding a "forever" mode which would download recent changes and then start over, requesting changes since its previous run. It just needs to be fast enough to finish one run within the 30-day recent changes window provided by the Wikipedia API.
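As a rough illustration of the threaded approach (this is not the code from 977c469), the sketch below fans a list of titles out to a small pool of worker threads. `fetch_page` is a hypothetical stand-in for whatever actually requests a page, and the function names and the default worker count are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_page(title):
    """Hypothetical placeholder: a real scraper would request the page here."""
    return {"title": title, "text": "content of " + title}


def scrape_titles(titles, fetch=fetch_page, workers=4):
    """Fetch every title using a bounded pool of threads.

    Results are keyed by title, so the order in which titles arrive
    (sorted or not) does not matter.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each future back to its title so results can be stored by name.
        futures = {pool.submit(fetch, title): title for title in titles}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results
```

Keeping the pool small is deliberate: a fixed `workers` value caps the number of simultaneous requests, which fits the concern about not putting undue load on the target site.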
from mwscrape.
This would be a considerable improvement in scraping. I am currently running tests on the same issue with the --recent-days option in a single thread. It looks like the 30-day window works for dewiki with one thread (the test has not finished and is still running).
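The "forever" mode discussed in this thread might look roughly like the sketch below. It is only an assumption about how such a loop could be structured: `fetch_changes_since` and `handle_title` are hypothetical callables, and the loop is bounded by `rounds` so that the example terminates.

```python
from datetime import datetime, timedelta, timezone

# Recent changes window mentioned in the thread.
WINDOW = timedelta(days=30)


def follow_changes(fetch_changes_since, handle_title, start, rounds, now=None):
    """Process changes repeatedly, each round resuming from the previous run's start.

    fetch_changes_since(ts) is a hypothetical callable that returns the
    titles changed since timestamp ts.
    """
    now = now or (lambda: datetime.now(timezone.utc))
    since = start
    for _ in range(rounds):
        run_started = now()
        # If the previous run began more than 30 days ago, some changes
        # have already dropped out of the window and cannot be recovered.
        if run_started - since > WINDOW:
            raise RuntimeError("previous run is older than the recent changes window")
        for title in fetch_changes_since(since):
            handle_title(title)
        # Resume from the moment this run began, so changes made while it
        # was running are picked up by the next round.
        since = run_started
    return since
```

Recording the start of each run, rather than its end, means some titles may be fetched twice, but none that changed during a run are skipped.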
Reconsidered the proposal: it does not make sense, as the articles do not appear in a sorted list. --end BZZZ could stop the process, and after BZZZ some articles starting with BAAA may still appear. These articles would be lost because the process has already ended. Sometimes a considerable overlap is needed to fetch all articles in a range, and how much overlap is needed fluctuates heavily within Wikipedia for some reason. I recommend closing the issue.
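A small sketch of the problem, assuming the proposed --end would stop at the first title that sorts after the given value (that behaviour is an assumption; the option was never implemented). The function names and sample titles are made up for illustration.

```python
def take_until_end(titles, end):
    """Stop at the first title that sorts after `end` (assumed --end behaviour)."""
    taken = []
    for title in titles:
        if title > end:
            break
        taken.append(title)
    return taken


def missed_titles(titles, end):
    """Titles inside the range that the early stop never reached."""
    taken = set(take_until_end(titles, end))
    return [t for t in titles if t <= end and t not in taken]


# "BAAA" arrives after a title past the cutoff, so it is lost.
unsorted_titles = ["BA", "BZ", "CA", "BAAA"]
print(missed_titles(unsorted_titles, "BZZZ"))          # ['BAAA']
print(missed_titles(sorted(unsorted_titles), "BZZZ"))  # []
```

With a sorted list nothing is missed, which is why the cutoff would only be safe when scraping everything in order.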