Parameter --end missing (mwscrape, closed)

francwalter commented on September 15, 2024
Parameter --end missing

from mwscrape.

Comments (3)

itkach commented on September 15, 2024

"Would it not be useful to have a parameter --end?"

It may be useful in some scenarios, but it only works when the list of titles is sorted. That's the case when scraping everything, but not when downloading recent changes, and it may not be the case when downloading titles specified in a file.

There's pretty much one reason to start multiple processes, though: to download faster. Downloading with a single-threaded process is too slow to be practical for large MediaWiki sites like enwiki, and everybody seems to be firing up multiple processes anyway, so I added options to download in several parallel threads (see 977c469). This is better than starting multiple processes, since it is easier on both the local system and the target site. Combined with --recent, it is probably a better way in general to keep existing mwscrape databases up to date than checking all titles all the time. (It may not seem "fast" enough, but let's not forget that we don't want to attack or harm Wikipedia.)

I'm considering adding a "forever" mode which would download recent changes and then start over, requesting changes since its previous run. It just needs to be fast enough to finish one run within the 30-day recent changes window provided by the Wikipedia API.
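For readers unfamiliar with the approach, downloading in several parallel threads inside one process can be sketched roughly as follows. This is only an illustration and not mwscrape's actual code: `fetch_page`, `scrape`, and `num_threads` are hypothetical names, and the fetch function is a stand-in that makes no network request.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_page(title):
    # Hypothetical stand-in for a real MediaWiki API request.
    # A real implementation would perform an HTTP call here.
    return {"title": title, "text": "content of " + title}


def scrape(titles, num_threads=4):
    # One process, several worker threads: downloads overlap while
    # each thread waits on I/O. pool.map returns results in the same
    # order as the input titles.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(fetch_page, titles))


pages = scrape(["Alpha", "Beta", "Gamma"], num_threads=2)
```

Keeping the thread count small is in line with the concern above about not putting too much load on the target site.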

MHBraun commented on September 15, 2024

This would be a considerable improvement in scraping. I am currently running tests on the same issue with the --recent-days option in a single thread. For dewiki, the 30-day window looks workable with one thread (the test has not finished yet and is still running).

MHBraun commented on September 15, 2024

Reconsidered the proposal: it does not make sense, as the articles do not appear in a sorted list. With --end BZZZ the process could stop, and after BZZZ some articles starting with BAAA may still appear. These articles would be lost because the process has already ended. Sometimes a considerable overlap is needed to fetch all articles in a range, and how much counts as "considerable" fluctuates heavily within Wikipedia for some reason. I recommend closing the issue.
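The problem can be illustrated with a small sketch. It assumes a hypothetical --end that stops at the first title past the cutoff, and the titles and arrival order are made up for the example; nothing here is taken from mwscrape itself.

```python
def stop_at_end(titles, end):
    # Hypothetical naive --end: stop as soon as a title sorts after `end`.
    seen = []
    for title in titles:
        if title > end:
            break
        seen.append(title)
    return seen


def filter_by_end(titles, end):
    # Alternative: read every title and keep those up to `end`.
    return [t for t in titles if t <= end]


# Made-up arrival order that is not fully sorted.
arrival_order = ["BXXX", "BZZZ", "C001", "BAAA"]

stopped = stop_at_end(arrival_order, "BZZZ")
filtered = filter_by_end(arrival_order, "BZZZ")
lost = [t for t in filtered if t not in stopped]
```

Here `lost` contains "BAAA", which arrived after the process had already stopped at "C001". Filtering avoids the loss, but only by reading the whole list, which gives up the time saving an --end parameter was meant to provide.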
