openzim / mwoffliner
Mediawiki scraper: all your wiki articles in one highly compressed ZIM file
Home Page: https://www.npmjs.com/package/mwoffliner
License: GNU General Public License v3.0
This would avoid requiring the remote wiki to have the visual editor installed. Theoretically this might work as long as the remote MediaWiki provides the API.
Hello, I am running Windows 10 64bit, so I wanted to give the virtual machine from http://www.openzim.org/wiki/Build_your_ZIM_file#MWoffliner a try first.
I would like to dump a MediaWiki installation inside my company's intranet.
Calling ./mwoffliner.js produced an error (I did not call npm install before that, though):
File versions are from December 2015:
zimmaker@zimmaker:~/mwoffliner$ ll
total 132
drwxrwxr-x 4 zimmaker zimmaker 4096 Feb 1 15:53 ./
drwxrwxr-x 13 zimmaker zimmaker 4096 Dec 30 2015 ../
-rwxrwxr-x 1 zimmaker zimmaker 11019 Dec 30 2015 mwmatrixoffliner.js*
-rwxrwxr-x 1 zimmaker zimmaker 88083 Dec 30 2015 mwoffliner.js*
drwxrwxr-x 51 zimmaker zimmaker 4096 Feb 1 15:53 node_modules/
-rw-rw-r-- 1 zimmaker zimmaker 977 Dec 30 2015 package.json
-rw-rw-r-- 1 zimmaker zimmaker 910 Dec 30 2015 README
drwxrwxr-x 2 zimmaker zimmaker 4096 Dec 30 2015 servers/
-rwxrwxr-x 1 zimmaker zimmaker 8188 Dec 30 2015 wpselectionsoffliner.js*
So I downloaded the current version of mwoffliner and called npm install, which failed with some errors (I forgot to save them).
As a next step I updated nodejs to a recent version: http://askubuntu.com/a/480642/491867
Now I am stuck at the following error when calling npm install:
zimmaker@zimmaker:~/mwoffliner-master$ npm install
> [email protected] install /home/zimmaker/mwoffliner-master/node_modules/contextify
> node-gyp rebuild
make: Entering directory `/home/zimmaker/mwoffliner-master/node_modules/contextify/build'
CXX(target) Release/obj.target/contextify/src/contextify.o
../src/contextify.cc: In static member function ‘static v8::Local<v8::Context> ContextWrap::createV8Context(v8::Local<v8::Object>)’:
../src/contextify.cc:131:68: warning: ‘v8::Local<v8::Object> v8::Function::NewInstance() const’ is deprecated (declared at /home/zimmaker/.node-gyp/7.5.0/include/node/v8.h:3292): Use maybe version [-Wdeprecated-declarations]
Local<Object> wrapper = Nan::New(constructor)->NewInstance();
^
../src/contextify.cc:150:16: error: ‘class v8::ObjectTemplate’ has no member named ‘SetAccessCheckCallbacks’
otmpl->SetAccessCheckCallbacks(GlobalPropertyNamedAccessCheck,
^
../src/contextify.cc: In static member function ‘static void ContextWrap::GlobalPropertyGetter(v8::Local<v8::String>, const Nan::PropertyCallbackInfo<v8::Value>&)’:
../src/contextify.cc:182:80: warning: ‘v8::Local<v8::Value> v8::Object::GetRealNamedProperty(v8::Local<v8::String>)’ is deprecated (declared at /home/zimmaker/.node-gyp/7.5.0/include/node/v8.h:2948): Use maybe version [-Wdeprecated-declarations]
Local<Value> rv = Nan::New(ctx->sandbox)->GetRealNamedProperty(property);
^
../src/contextify.cc: In static member function ‘static void ContextWrap::GlobalPropertyQuery(v8::Local<v8::String>, const Nan::PropertyCallbackInfo<v8::Integer>&)’:
../src/contextify.cc:209:67: warning: ‘v8::Local<v8::Value> v8::Object::GetRealNamedProperty(v8::Local<v8::String>)’ is deprecated (declared at /home/zimmaker/.node-gyp/7.5.0/include/node/v8.h:2948): Use maybe version [-Wdeprecated-declarations]
if (!Nan::New(ctx->sandbox)->GetRealNamedProperty(property).IsEmpty() ||
^
../src/contextify.cc:210:71: warning: ‘v8::Local<v8::Value> v8::Object::GetRealNamedProperty(v8::Local<v8::String>)’ is deprecated (declared at /home/zimmaker/.node-gyp/7.5.0/include/node/v8.h:2948): Use maybe version [-Wdeprecated-declarations]
!Nan::New(ctx->proxyGlobal)->GetRealNamedProperty(property).IsEmpty()) {
^
make: *** [Release/obj.target/contextify/src/contextify.o] Error 1
make: Leaving directory `/home/zimmaker/mwoffliner-master/node_modules/contextify/build'
gyp ERR! build error
gyp ERR! stack Error: `make` failed with exit code: 2
gyp ERR! stack at ChildProcess.onExit (/usr/local/lib/node_modules/npm/node_modules/node-gyp/lib/build.js:276:23)
gyp ERR! stack at emitTwo (events.js:106:13)
gyp ERR! stack at ChildProcess.emit (events.js:192:7)
gyp ERR! stack at Process.ChildProcess._handle.onexit (internal/child_process.js:215:12)
gyp ERR! System Linux 3.19.0-42-generic
gyp ERR! command "/usr/local/bin/node" "/usr/local/lib/node_modules/npm/node_modules/node-gyp/bin/node-gyp.js" "rebuild"
gyp ERR! cwd /home/zimmaker/mwoffliner-master/node_modules/contextify
gyp ERR! node -v v7.5.0
gyp ERR! node-gyp -v v3.5.0
gyp ERR! not ok
npm ERR! Linux 3.19.0-42-generic
npm ERR! argv "/usr/local/bin/node" "/usr/local/bin/npm" "install"
npm ERR! node v7.5.0
npm ERR! npm v4.1.2
npm ERR! code ELIFECYCLE
npm ERR! [email protected] install: `node-gyp rebuild`
npm ERR! Exit status 1
npm ERR!
npm ERR! Failed at the [email protected] install script 'node-gyp rebuild'.
npm ERR! Make sure you have the latest version of node.js and npm installed.
npm ERR! If you do, this is most likely a problem with the contextify package,
npm ERR! not with npm itself.
npm ERR! Tell the author that this fails on your system:
npm ERR! node-gyp rebuild
npm ERR! You can get information on how to open an issue for this project with:
npm ERR! npm bugs contextify
npm ERR! Or if that isn't available, you can get their info via:
npm ERR! npm owner ls contextify
npm ERR! There is likely additional logging output above.
npm ERR! Please include the following file with any support request:
npm ERR! /home/zimmaker/mwoffliner-master/npm-debug.log
Versions are, as you see in the log: node v7.5.0, npm v4.1.2, node-gyp v3.5.0.
I have been looking at lots of issues showing the error Failed at the [email protected] install script 'node-gyp rebuild'.
and tried many steps so far, such as updating, removing and reinstalling some packages (I am sorry I cannot replicate them all now).
I understand the error seems to be quite generic and more nodejs-related; lots of users are hitting it with totally different packages. The log file npm-debug.log is attached:
npm-debug.log.txt
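The compile error in the log comes from contextify calling a V8 method (SetAccessCheckCallbacks) that the Node v7.5.0 headers no longer declare, so the native module cannot build on that Node release. A pre-flight check along the following lines might catch this before npm install runs. It is only a sketch: the supported range is an assumption based on the Node v4.4.7 build used in the reproduction steps further down this page, not something read from mwoffliner's package.json, and the helper names are hypothetical.

```javascript
// Hypothetical pre-flight check. SUPPORTED_MAJORS is an assumption taken
// from the Node v4.4.7 build steps on this page, not from package.json.
const SUPPORTED_MAJORS = [4];

// Accepts "v7.5.0" or "7.5.0" and returns the major version number.
function parseMajor(version) {
  const match = /^v?(\d+)\.\d+\.\d+/.exec(version);
  if (!match) {
    throw new Error("Unrecognised Node.js version string: " + version);
  }
  return Number(match[1]);
}

// True when the given version falls inside the assumed supported range.
function isSupported(version) {
  return SUPPORTED_MAJORS.includes(parseMajor(version));
}

if (!isSupported(process.version)) {
  console.warn(
    "Node " + process.version +
    " is outside the assumed supported range; native modules may fail to build."
  );
}
```

Running this on the VM from the report (node v7.5.0) would print the warning instead of letting the build fail deep inside node-gyp.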
Maybe someone has some experience or can explain how to set up a working configuration of mwoffliner? Is there a more recent version of the virtual machine?
Lastly, I found the following instructions, https://labtestwikitech.wikimedia.org/wiki/Nova_Resource_Talk:Mwoffliner, but will need time to get through them.
Some more information and references on http://www.openzim.org/wiki/ about working with the virtual machine would be nice. I would be glad to contribute my experience if I achieve a working configuration.
Examples:
sudo apt-get install openssh-server (which is not included) for more convenient access through the host's ssh client
A Bollywood app would include articles listed in both WikiProject Film and WikiProject India (see, e.g. https://en.wikipedia.org/wiki/Talk:List_of_Bollywood_films_of_2016).
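Selecting articles that belong to two WikiProjects at once amounts to intersecting two title lists. A minimal sketch, assuming each project's article list is already available as an array of titles (how those lists are fetched is left out, and the sample titles are made up for illustration):

```javascript
// Returns the titles present in both lists, keeping the order of the first.
// Titles are normalised to the underscore form so "A b" and "A_b" match.
function intersectTitles(listA, listB) {
  const normalise = (title) => title.trim().replace(/ /g, "_");
  const inB = new Set(listB.map(normalise));
  return listA.map(normalise).filter((title) => inB.has(title));
}

// Illustrative data only, not real WikiProject listings.
const filmProject = ["Dangal (film)", "Casablanca (film)"];
const indiaProject = ["Taj Mahal", "Dangal_(film)"];
console.log(intersectTitles(filmProject, indiaProject));
```

The resulting list could then be fed to the scraper as an article list, one title per line.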
http://download.kiwix.org/portable/vikidia/kiwix-0.9+vikidia_en_all_2015-11.zip
has a number of links back to en.vikidia.org, mostly js and a css:
Uncaught SyntaxError: Unexpected end of input
head.js:17821 No found, inserting dynamically
https://download.vikidia.org/en.vikidia.org/extensions/VisualEditor/lib/ve/src/ve.track.js Failed to load resource: the server responded with a status of 503 (Service Unavailable)
https://download.vikidia.org/en.vikidia.org/skins/Vector/collapsibleTabs.js Failed to load resource: the server responded with a status of 503 (Service Unavailable)
https://download.vikidia.org/en.vikidia.org/extensions/VisualEditor/lib/ve/src/ve.js Failed to load resource: the server responded with a status of 503 (Service Unavailable)
https://download.vikidia.org/en.vikidia.org/extensions/VisualEditor/modules/ve-mw/init/targets/ve.init.mw.ViewPageTarget.init.js Failed to load resource: the server responded with a status of 503 (Service Unavailable)
https://download.vikidia.org/en.vikidia.org/extensions/VisualEditor/modules/ve-mw/init/styles/ve.init.mw.ViewPageTarget.init.css Failed to load resource: the server responded with a status of 503 (Service Unavailable)
https://en.vikidia.org/w/resources/src/jquery/jquery.byteLength.js Failed to load resource: net::ERR_CONNECTION_REFUSED
https://en.vikidia.org/w/resources/lib/jquery.client/jquery.client.js Failed to load resource: net::ERR_CONNECTION_REFUSED
https://en.vikidia.org/w/resources/src/jquery/jquery.mwExtension.js Failed to load resource: net::ERR_CONNECTION_REFUSED
https://en.vikidia.org/w/resources/src/jquery/jquery.accessKeyLabel.js Failed to load resource: net::ERR_CONNECTION_REFUSED
https://en.vikidia.org/w/resources/src/jquery/jquery.tabIndex.js Failed to load resource: net::ERR_CONNECTION_REFUSED
https://en.vikidia.org/w/resources/lib/jquery/jquery.ba-throttle-debounce.js Failed to load resource: net::ERR_CONNECTION_REFUSED
https://en.vikidia.org/w/resources/src/mediawiki/mediawiki.notify.js Failed to load resource: net::ERR_CONNECTION_REFUSED
https://en.vikidia.org/w/resources/src/mediawiki/mediawiki.util.js Failed to load resource: net::ERR_CONNECTION_REFUSED
https://en.vikidia.org/w/resources/src/mediawiki/mediawiki.Title.js Failed to load resource: net::ERR_CONNECTION_REFUSED
https://en.vikidia.org/w/resources/src/mediawiki/mediawiki.Uri.js Failed to load resource: net::ERR_CONNECTION_REFUSED
https://en.vikidia.org/w/resources/src/mediawiki.legacy/wikibits.js Failed to load resource: net::ERR_CONNECTION_REFUSED
https://en.vikidia.org/w/resources/src/mediawiki.legacy/ajax.js Failed to load resource: net::ERR_CONNECTION_REFUSED
https://en.vikidia.org/w/resources/src/mediawiki.page/mediawiki.page.startup.js Failed to load resource: net::ERR_CONNECTION_REFUSED
vector.js:8 Uncaught TypeError: $(...).lastTabIndex is not a function
https://download.vikidia.org/en.vikidia.org/extensions/VisualEditor/modules/ve-mw/init/styles/ve.init.mw.ViewPageTarget.init.css Failed to load resource: the server responded with a status of 503 (Service Unavailable)
Just a suggestion.
I've concocted a sort of nodejs replacement for zimwriterfs which doesn't depend on libzim (that's why I made it). I'd be happy if it were of any use to you as well.
Feel free to pop in to zimmer.
Hi @kelson42,
Following up from our IRC conversation, thanks again for offering to take a look at our largish .zim file creation request. I tried to include all necessary information but please let me know if I missed anything.
Attached is the desired article list from en.wikipedia.org (one article per line, 649008 lines, utf-8, with underscores instead of spaces).
Article list:
code_7370_article_list.txt
More info:
Small (48x48) and large (if helpful) icons. These are free to use & public domain. Feel free to convert/resize if helpful.
Steps to reproduce:
sudo apt-get update
sudo apt-get dist-upgrade
sudo reboot
sudo apt-get install build-essential module-assistant git redis-server redis-tools jpegoptim advancecomp gifsicle pngquant imagemagick curl liblzma-dev libmagic-dev zlib1g-dev libgumbo-dev libtool automake libicu-dev uuid uuid-dev libzim-dev
sudo m-a prepare
cp -r /media/cdrom0 ~/
sudo ~/cdrom0/VBoxLinuxAdditions.run
sudo reboot
wget https://nodejs.org/download/release/v4.4.7/node-v4.4.7.tar.gz
tar xf node-v4.4.7.tar.gz
cd node-v4.4.7
./configure
make
sudo make install
cd
git clone https://github.com/kiwix/mwoffliner.git
cd mwoffliner
npm install
sudo systemctl restart redis-server.service
cd
npm install parsoid
cp ~/node_modules/parsoid/localsettings.js.example ~/node_modules/parsoid/localsettings.js
~/node_modules/parsoid/bin/server.js
git clone https://github.com/wikimedia/openzim.git
cd openzim/zimlib
./autogen.sh
./configure
make
cd
wget http://download.kiwix.org/dev/xapian-core-1.4.1-git.tar.xz
tar xf xapian-core-1.4.1-git.tar.xz
cd xapian-core-1.4.0
./configure
make
sudo make install
cd ~/openzim/zimwriterfs
./autogen.sh
./configure CXXFLAGS=-I../zimlib/include LDFLAGS=-L../zimlib/src/.libs
make
sudo make install
cd
mkdir wikis
cd wikis
mkdir puella-magi
cd puella-magi
~/mwoffliner/mwoffliner.js --verbose --mwUrl="http://wiki.puella-magi.net/" --adminEmail="[email]" --mwWikiPath="" --mwApiPath="api.php" --parsoidUrl="http://localhost:8000"
Results in the following error:
Saving favicon.png...
Downloading http://wiki.puella-magi.net/api.php?action=query&meta=siteinfo&format=json...
TypeError: Parameter 'url' must be a string, not undefined
at Url.parse (url.js:90:11)
at Object.urlParse [as parse] (url.js:84:5)
at /home/mediawiki/mwoffliner/mwoffliner.js:2435:26
at /home/mediawiki/mwoffliner/mwoffliner.js:2193:6
at /home/mediawiki/mwoffliner/node_modules/async/lib/async.js:676:51
at /home/mediawiki/mwoffliner/node_modules/async/lib/async.js:726:13
at /home/mediawiki/mwoffliner/node_modules/async/lib/async.js:52:16
at /home/mediawiki/mwoffliner/node_modules/async/lib/async.js:264:21
at /home/mediawiki/mwoffliner/node_modules/async/lib/async.js:44:16
at /home/mediawiki/mwoffliner/node_modules/async/lib/async.js:723:17
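The trace shows url.parse being handed undefined; which value was missing cannot be told from the log alone. A guard of the following shape would turn the crash into a checkable result. This is a sketch with a hypothetical helper name, not code from mwoffliner, and it uses the WHATWG URL class available in current Node.js releases rather than the legacy url.parse seen in the trace.

```javascript
// Returns a URL object, or null when the value is missing or is not a
// valid absolute URL. Hypothetical helper, not part of mwoffliner.
function parseUrlOrNull(value) {
  if (typeof value !== "string" || value.length === 0) {
    return null;
  }
  try {
    return new URL(value);
  } catch (err) {
    return null;
  }
}

const parsed = parseUrlOrNull(undefined);
if (parsed === null) {
  console.warn("Expected a URL but got nothing; skipping instead of crashing.");
}
```

A caller can then decide whether a missing URL is fatal or can be skipped, and report which field was empty.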
In the English Wiktionary ZIM, links do not point to the specific language section of the word. This is most noticeable and problematic when looking at etymologies.
Take, for instance, the French word falloir. If I wish to follow its etymology from Latin by clicking on "fallo", it goes to the entry for fallo but not to the Latin subheading. This is not a big problem in this case, as "fallo" only appears in three languages, but when an entry has many more languages, it can be bothersome.
This does not occur in the online wiktionary as each link points directly to the subheading.
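One way to keep the language anchor is to carry the "#Section" fragment over when an online link is rewritten to a local one. The sketch below assumes online hrefs of the form /wiki/Title#Section and a local link format of Title#Section; both the helper name and the local format are assumptions, not mwoffliner's actual rewriting code.

```javascript
// Hypothetical helper: turns an online article href into a local one
// while preserving the "#Section" fragment, if any.
function toLocalHref(href) {
  const hashIndex = href.indexOf("#");
  const path = hashIndex === -1 ? href : href.slice(0, hashIndex);
  const fragment = hashIndex === -1 ? "" : href.slice(hashIndex);
  const title = path.replace(/^\/wiki\//, "");
  return title + fragment;
}

console.log(toLocalHref("/wiki/fallo#Latin"));
```

With the fragment preserved, a click on "fallo" in the falloir etymology would land on the Latin subheading, provided the target heading carries a matching id in the ZIM file.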
Should we add a tag in ZIM files indicating whether the book has an embedded index?
We now have nopic indicating that a ZIM file does not have pictures; should the same be done for the embedded index?
The benefit:
The disadvantage:
/Z/fulltextIndex/xapian, then we will know)
We currently have too many people complaining that the zim files are old because they get confused with the current wording "version of dd/mm/yyyy" (it is not clear that it actually is "last edited on dd/mm/yyyy").
Since we can't list authors but have a link to oldid, and based on Creative Commons' best practices could we replace the wording with
"[Pagename] is licensed under a Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files."
Whereby [Pagename] links to the permanent link with oldid, and we don't indicate a date anymore (the link does that).
Also: license terms link needs to be upgraded from 3.0 to 4.0.
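The proposed footer could be generated from the page name and its oldid. The sketch below uses MediaWiki's index.php?title=...&oldid=... permanent-link form; the /w/ script path is an assumption that matches Wikimedia wikis but not every MediaWiki, the helper names are hypothetical, and the licence sentence is the wording suggested above.

```javascript
// Escapes the characters that matter inside HTML text and attributes.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Builds the footer line. wikiBase is e.g. "https://en.wikipedia.org";
// the "/w/index.php" script path is an assumption.
function buildAttribution(wikiBase, pageName, oldid) {
  const permalink =
    wikiBase + "/w/index.php?title=" +
    encodeURIComponent(pageName.replace(/ /g, "_")) +
    "&oldid=" + oldid;
  return (
    '<a href="' + escapeHtml(permalink) + '">' + escapeHtml(pageName) + "</a>" +
    " is licensed under a Creative Commons - Attribution - Sharealike." +
    " Additional terms may apply for the media files."
  );
}

console.log(buildAttribution("https://en.wikipedia.org", "European Union", 123));
```

Because the link carries the oldid, the footer no longer needs to show a date at all.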
Reported by tim-moody on 2015-10-30 20:02 UTC
view/discuss links should not be displayed
Reported here first https://sourceforge.net/p/kiwix/bugs/920/
It seems that something goes wrong if the homepage points to a redirect (to a namespace which is not mirrored)
Source code of guetzli is here:
https://github.com/google/guetzli
Probably in addition to jpegoptim.
We need to find the best combination.
It would be great to also provide a copy of the referenced web pages.
From OTRS:
Hi,
there seems to be a problem with the February portable dumps
of the German Wikipedia, see
http://download.kiwix.org/portable/wikipedia/?C=M;O=D
The non-portable, zim versions in zim/ directory are fine, though.
Because I suspect that they have been automatically generated
by a cron job or similar, there might be a bug in the toolchain and
hence some probability that the next dump may fail as well (which
is why I'm writing this..)
Regards,
Jim
Problem:
In Vikidia ZIM files, h2 and h3 elements don't have an id. The table-of-contents system needs an id to scroll an element into the viewport.
Example:
In vikidia_en_all_2016-09.zim, article European Union, the header <h2> Member countries </h2> doesn't have an id. JavaScript cannot scroll to this element if the user selects this header in the table of contents.
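If the scraped HTML lacks heading ids, they could be derived from the heading text at scrape time. The sketch below loosely mimics the MediaWiki convention of replacing spaces with underscores and appending a counter for duplicates; it is a hypothetical helper, not the anchor algorithm MediaWiki itself uses, and it leaves out the DOM step that would assign the id to each h2/h3.

```javascript
// Derives a unique id from heading text. "used" is a Set of ids already
// handed out in the current article; it is updated in place.
function headingId(text, used) {
  const base = text.trim().replace(/\s+/g, "_");
  let id = base;
  let counter = 2;
  while (used.has(id)) {
    id = base + "_" + counter;
    counter += 1;
  }
  used.add(id);
  return id;
}

const usedIds = new Set();
console.log(headingId(" Member countries ", usedIds));
```

A table-of-contents script can then scroll to the heading by looking up that id.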
Here:
https://ar.wikipedia.org/wiki/%D9%85%D8%B3%D8%AA%D8%AE%D8%AF%D9%85:Stephane_(Kiwix)/Landing
Same request as issue #26
Notes are like references and usually appear at the end of the article:
(that's the view on es.wikipedia.org)
But in the ZIM file a (truncated) note appears directly in the article:
(some parts are missing; it starts much earlier than "y diversas publicaciones gubernamentales"). There are other notes in the article but they do not seem to have that issue.
"You need also to install all necessary nodejs "packages with "npm install" in this directory."
None are present or named.
"You need also a redis server correctly configured and listening to /dev/shm/redis.sock."
What constitutes "correctly configured" is not mentioned.
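One plausible reading of "correctly configured" is that redis.conf enables a Unix socket at that path through the unixsocket directive (with unixsocketperm set so that the scraper's user may connect). Whether that is all the README means is an assumption. A quick check that the socket actually exists before starting the scraper could look like this sketch, with a hypothetical helper name:

```javascript
const fs = require("fs");

// True only when the path exists and is a Unix domain socket.
function isUnixSocket(path) {
  try {
    return fs.statSync(path).isSocket();
  } catch (err) {
    // Missing path or insufficient permissions: treat as "not a socket".
    return false;
  }
}

if (!isUnixSocket("/dev/shm/redis.sock")) {
  console.warn("No Redis socket at /dev/shm/redis.sock; check the unixsocket setting in redis.conf.");
}
```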
Trying to grab fr.wikiquote.org, mwoffliner fails with :
Getting article from https://fr.wikiquote.org/w/api.php?action=visualeditor&format=json&paction=parse&page=S%C3%A9rie_B&oldid=243531
RangeError: Maximum call stack size exceeded
at RegExp.[Symbol.replace] (native)
at RegExp.[Symbol.replace] (native)
at String.replace (native)
at Object.exports.toASCIILowerCase (/home/mgautier/Project/KIWIX/node_modules/domino/lib/utils.js:72:12)
at HTMLAnchorElement.getAttribute (/home/mgautier/Project/KIWIX/node_modules/domino/lib/Element.js:371:36)
at rewriteUrl (/home/mgautier/Project/KIWIX/node_modules/mwoffliner/lib/mwoffliner.lib.js:1061:40)
at /home/mgautier/Project/KIWIX/node_modules/async/lib/async.js:181:20
at replenish (/home/mgautier/Project/KIWIX/node_modules/async/lib/async.js:319:21)
at /home/mgautier/Project/KIWIX/node_modules/async/lib/async.js:326:29
at /home/mgautier/Project/KIWIX/node_modules/async/lib/async.js:44:16
The command I run is :
node node_modules/mwoffliner/bin/mwoffliner.script.js --mwUrl https://fr.wikiquote.org/ --adminEmail [email protected] --outputDirectory ~/Project/KIWIX/wikiquote.fr --redisSocket /tmp/redis.sock --keepHtml --verbose --cacheDirectory ~/Project/KIWIX/wikiquote.fr.cache --tmpDirectory ~/Project/KIWIX/wikiquote.fr.tmp
mwoffliner was installed with npm and is in version 1.1.3.
It's currently in English
It is important to have web fonts for certain languages like Burmese or Parsi. The reason is that we have no control over the default operating system/browser fonts, and they are often of poor quality.
That is why we need custom web fonts. The good news is that MediaWiki already defines them correctly on these Wikipedias. We just need to scrape them and reload them correctly.
This should work out of the box if resourceLoader works offline... but this remains to be checked. See #18
First reported at https://sourceforge.net/p/kiwix/feature-requests/333/
Here it is : https://tr.wikipedia.org/wiki/Kullanıcı:Stephane_(Kiwix)/Landing
(watch out: the i's in Kullanıcı are specific Turkish characters). This is rather urgent in light of the current Wikipedia ban in Turkey.
At least try to see if this is not too big.
It seems that in a few cases we generate wrong redirects. In wikipedia_en_all_2016-02.zim, "badness" redirects to "National Diet Library", which is wrong. The "redirects" file is wrong, so this is an error either in the part of mwoffliner that retrieves redirects from the API or in the part that writes the "redirects" file.
We only have the English version, but we should offer at least the next 5-6 largest Kiwix user bases.
More and more users access our MediaWiki ZIM files on mobile. We need to create files which look good on both desktop and mobile, something similar to http://en.m.wikipedia.org/
MWoffliner should be able, for any MediaWiki, to transform the DOM/CSS in a way which makes it more mobile-friendly.
Here are a few pointers:
The eswp/nopic zim opens on !!! rather than the planned landing page
The problem is that many of these pages are simply including templates.. so just reuse always the same cache content.
A module with the core functions to dump a MediaWiki should be created, split from the runnable scripts, and published on npmjs.org.
SVG is now supported everywhere and is a lot smaller. Also take care that zimwriterfs compresses it correctly.
The front page should be changed to this one:
https://ar.wikipedia.org/wiki/user:Stephane_(Kiwix)/Landing
(or https://ar.wikipedia.org/wiki/مستخدم:Stephane_(Kiwix)/Landing)
They do not seem to use the visual editor, this problem should be fixed first.
At least on last Wikipedia DE. See for example the "cosinus" article.
I've prepared dedicated, simple landing pages for
Spanish wikipedia
https://es.wikipedia.org/wiki/user:Popo_le_Chien/Kiwix
English Wikipedia
https://en.wikipedia.org/wiki/user:Popo_le_Chien/Kiwix
French Wikipedia
https://fr.wikipedia.org/wiki/user:Popo_le_Chien/Kiwix
This happens because "Vikidia:Accueil" is not mirrored, as the "Vikidia" namespace is not among the visual editor ones.
In order to drastically diminish zim size (particularly for mobile storage), can we generate zim files that only take the infobox and intro paragraphs?
An article's structure normally is:
Banners
Infobox
intro text (lead)
==section title==
and so on. The idea would be to take everything (lead+infobox) that's above the first section title. If we can do without the banners that's even better.
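On the wikitext level, "everything above the first section title" can be cut at the first level-2 heading line. The sketch below assumes wikitext input and a hypothetical helper name; mwoffliner works on rendered HTML, where the equivalent cut would be made at the first h2, and removing the banners would need a further step not shown here.

```javascript
// Returns the part of the wikitext that precedes the first "==Title=="
// line. If the article has no section heading, the whole text is returned.
function leadSection(wikitext) {
  const firstHeading = /^==[^=].*==\s*$/m.exec(wikitext);
  const lead = firstHeading ? wikitext.slice(0, firstHeading.index) : wikitext;
  return lead.trimEnd();
}

const sample = "{{Infobox}}\nIntro text.\n\n==History==\nMore text.";
console.log(leadSection(sample));
```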
The resourceLoader is the MediaWiki sub-system which allows loading, per article, JavaScript/CSS dependencies. The documentation is here: https://www.mediawiki.org/wiki/ResourceLoader
MWoffliner should per article:
We probably need to download each resourceLoader module separately and store them separately in the ZIM file. Then each article should know which ones are needed and reload them.
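Storing each module once and letting articles refer to it boils down to computing the union of the per-article module lists. A minimal sketch, assuming those lists have already been collected into a plain object (how they are obtained is left out, and the module names are placeholders):

```javascript
// Given { articleTitle: [moduleName, ...] }, returns the sorted list of
// distinct modules that need to be downloaded and stored once each.
function planModules(modulesByArticle) {
  const unique = new Set();
  for (const modules of Object.values(modulesByArticle)) {
    for (const name of modules) {
      unique.add(name);
    }
  }
  return Array.from(unique).sort();
}

// Placeholder module names for illustration only.
const modulesByArticle = {
  "European Union": ["site", "ext.cite"],
  "Paris": ["site"],
};
console.log(planModules(modulesByArticle));
```

Each article then only needs to keep its own list of names, which the offline page can use to load the stored modules.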
A first attempt to provide a solution to this problem has been done here, it needs review:
https://phabricator.wikimedia.org/T114788
For now it is only based on https://www.mediawiki.org/wiki/Manual:$wgContentNamespace
It seems that since MediaWiki 1.27 there are new APIs for login. Could you please add support for those new APIs?
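Since MediaWiki 1.27, a login token is fetched with action=query&meta=tokens&type=login and then sent back as lgtoken with action=login (the release also introduced action=clientlogin for interactive logins). The sketch below only builds the two request payloads; the helper names are hypothetical, and sending the requests and keeping the session cookies between them is left out.

```javascript
// Query string for the token request (GET on api.php).
function loginTokenQuery() {
  return new URLSearchParams({
    action: "query",
    meta: "tokens",
    type: "login",
    format: "json",
  }).toString();
}

// Form body for the login request (POST on api.php), using the token
// returned by the first request.
function loginBody(username, password, token) {
  return new URLSearchParams({
    action: "login",
    lgname: username,
    lgpassword: password,
    lgtoken: token,
    format: "json",
  }).toString();
}

console.log(loginTokenQuery());
```

URLSearchParams takes care of escaping characters such as "+" and "\" that appear in tokens.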
"This article is issued by WikiMed Encydlopedia" but it should "This article is issued by Wikipedia"
With data from Wikidata
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?item ?itemLabel ?logotype WHERE {
?item wdt:P154 ?logotype.
?item wdt:P856 <https://fr.wikivoyage.org/> .
SERVICE wikibase:label {
bd:serviceParam wikibase:language "fr"
}
}
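The query above can be sent to the Wikidata Query Service by URL-encoding it into the query parameter of https://query.wikidata.org/sparql. The sketch below only builds that URL, with the official website (P856) and label language as parameters; the helper name is hypothetical, and no escaping is applied to the website value, so it should only be used with trusted input.

```javascript
// Builds the Wikidata Query Service URL for the logo (P154) of the item
// whose official website (P856) is the given URL.
function buildLogoQueryUrl(officialWebsite, language) {
  const sparql = [
    "PREFIX wdt: <http://www.wikidata.org/prop/direct/>",
    "SELECT ?item ?itemLabel ?logotype WHERE {",
    "  ?item wdt:P154 ?logotype.",
    "  ?item wdt:P856 <" + officialWebsite + "> .",
    "  SERVICE wikibase:label {",
    '    bd:serviceParam wikibase:language "' + language + '"',
    "  }",
    "}",
  ].join("\n");
  return "https://query.wikidata.org/sparql?format=json&query=" + encodeURIComponent(sparql);
}

console.log(buildLogoQueryUrl("https://fr.wikivoyage.org/", "fr"));
```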