This project forked from openzim/mwoffliner

MWoffliner scrapes any remote MediaWiki instance (such as Wikipedia) to the local filesystem.

License: GNU General Public License v3.0

Languages: JavaScript 11.57%, Shell 2.19%, Dockerfile 0.25%, CSS 4.86%, HTML 2.63%, TypeScript 78.50%

mwoffliner

mwoffliner is a tool for making a local HTML snapshot of any online (recent) MediaWiki instance. It goes through all articles (or a selection, if specified) and writes the HTML and images to a local directory. It has mainly been tested against Wikimedia projects such as Wikipedia and Wiktionary, but it should also work with any recent MediaWiki.

Prerequisites

  • *NIX Operating System (Linux/macOS)
  • NodeJS
  • Redis
  • libzim (on Linux, binaries are downloaded automatically)
  • Various build tools that are probably already installed on your machine (libjpeg, gcc)

Setup

MacOS

NodeJS

curl -o- https://raw.githubusercontent.com/creationix/nvm/v0.33.11/install.sh | bash && \
source ~/.bashrc && \
nvm install stable && \
node --version

Redis

> brew install redis

LibZim

See instructions here: https://github.com/openzim/libzim

Linux (Debian)

NodeJS

curl -o- https://raw.githubusercontent.com/creationix/nvm/v0.33.11/install.sh | bash && \
source ~/.bashrc && \
nvm install stable && \
node --version

Redis

> sudo apt-get install redis-server

Usage

Command Line

> npm i -g mwoffliner
> mwoffliner --help

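# --format=nozim skips building the final ZIM file.
# --articleList limits the run to the articles listed in ./articleList (one article in this example).
> mwoffliner \
    --mwUrl=https://es.wikipedia.org \
    --adminEmail=[email protected] \
    --verbose \
    --format=nozim \
    --articleList=./articleList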
Programmatic API

const mwoffliner = require('mwoffliner');
const parameters = {
    mwUrl: "https://es.wikipedia.org",
    adminEmail: "[email protected]",
    verbose: true,
    format: "nozim",
    articleList: "./articleList"
};
mwoffliner.execute(parameters); // returns a Promise
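
The keys of the parameters object mirror the command-line flags shown earlier. As a rough illustration of that correspondence, the sketch below converts a parameters object into flag strings. `toCliArgs` is a hypothetical helper, not part of mwoffliner, and the rule that every key maps to `--key=value` (or a bare `--key` for `true`) is an assumption drawn only from the two examples in this README. The `adminEmail` entry is left out because its value is redacted in the source.

```javascript
// Hypothetical helper (not part of mwoffliner): turn a parameters object
// into the equivalent command-line flags.
function toCliArgs(parameters) {
  return Object.entries(parameters).flatMap(([key, value]) => {
    if (value === true) return [`--${key}`];          // boolean switch, e.g. --verbose
    if (value === false || value == null) return [];  // omit disabled or unset options
    return [`--${key}=${value}`];                     // key/value option
  });
}

const parameters = {
  mwUrl: 'https://es.wikipedia.org',
  verbose: true,
  format: 'nozim',
  articleList: './articleList',
};

// Print the command line that corresponds to the object above.
console.log(['mwoffliner', ...toCliArgs(parameters)].join(' '));
```

Keeping the options in one object makes it easier to reuse the same configuration from a script and from the shell.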

Development

Please see CONTRIBUTING.md

git clone https://github.com/openzim/mwoffliner.git
cd mwoffliner

npm i
./watch.sh # Watch for changes in "src/*"

Code Style

We follow the tslint:recommended rule set almost exactly; see ./tslint.json for the details.

It's best to check your code with TSLint as you develop. This project is pre-configured for development with VSCode and the TSLint plugin.

Debugging

There is a pre-configured debug configuration for VSCode; open the debugging tab to use it.

Make sure you read CONTRIBUTING.md for tips on how to best debug and submit issues.

Publishing

To publish, it's best to use a clean clone of the project:

git clone https://github.com/openzim/mwoffliner.git
cd mwoffliner
npm i # required for Snyk checks
./build.sh
npm publish # you must be logged in already (npm login)

Background

There are two Wikitext parsers. mwoffliner uses Parsoid.

  • Wikitext is the markup language that Wikipedia uses.
  • MediaWiki is the PHP software that runs wikis, including Wikipedia.
  • MediaWiki includes a parser that converts Wikitext to HTML; this parser currently generates Wikipedia's pages.
  • A second Wikitext parser, Parsoid, is implemented in JavaScript (Node.js).
  • Parsoid is planned to eventually become the main parser for Wikipedia.
  • mwoffliner calls Parsoid and then post-processes the results into an offline format.
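
To give a feel for what post-processing for offline use can involve, here is a minimal sketch of one such step: rewriting links between articles so they point at local files. It is an illustration only, not mwoffliner's actual code. It assumes Parsoid-style relative links of the form `href="./Title"` and a hypothetical local naming scheme of `Title.html`; the function name `rewriteWikiLinks` is made up for this example.

```javascript
// Hypothetical post-processing step (not mwoffliner's implementation):
// rewrite relative wiki links ("./Title") to local HTML file names.
function rewriteWikiLinks(html) {
  // Capture the title and an optional "#fragment"; external links are untouched
  // because they do not start with "./".
  return html.replace(/href="\.\/([^"#]+)(#[^"]*)?"/g, (match, title, fragment = '') => {
    return `href="${title}.html${fragment}"`;
  });
}

const sample = '<a rel="mw:WikiLink" href="./Madrid#Historia">Madrid</a>';
console.log(rewriteWikiLinks(sample));
```

A regular expression keeps the sketch short; a real pipeline would more likely work on a parsed DOM and would also need to handle images, stylesheets, and titles containing characters that are not valid in file names.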

Contributors

automactic, baturin, bradyhunsaker, bshishov, cscott, fattredd, isnit0, kelson42, legoktm, mdholloway, skylsmoi, snyk-bot, subbuss, tamasfabi, vss-devel
