outofcontrol / mediawiki-to-gfm
Converts Mediawiki format to Github Flavoured Markdown format
With PHP 8, it doesn't work at the moment, unfortunately:
composer update --no-dev
Composer is operating significantly slower than normal because you do not have the PHP curl extension enabled.
Loading composer repositories with package information
Updating dependencies
Your requirements could not be resolved to an installable set of packages.
Problem 1
- phpunit/phpunit[6.0.0, ..., 6.5.14] require php ^7.0 -> your php version (8.1.2) does not satisfy that requirement.
- Root composer.json requires phpunit/phpunit ~6 -> satisfiable by phpunit/phpunit[6.0.0, ..., 6.5.14].
Running update with --no-dev does not mean require-dev is ignored, it just means the packages will not be installed. If dev requirements are blocking the update you have to resolve those problems.
Thanks in advance.
By default, the script will stop once there's an error with one of the pages that you are trying to convert. I'm referring to errors like - "Unexpected "{" character, expecting "|-"...
If you wish the script to keep on running and converting the rest of the pages and ignoring these issues, you can remove (or comment out) the below portion from "vendor\ryakad\pandoc-php\src\Pandoc\Pandoc.php"
else
{
throw new PandocException(
sprintf('Pandoc could not convert successfully, error code: %s. Tried to run the following command: %s', $returnval, $command)
);
}
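Instead of editing files under vendor/, it may be enough to catch the exception around each page. This is only a minimal sketch, not the project's code: `convertAll` and the `$convertPage` callable are hypothetical names, and a generic `RuntimeException` stands in for `Pandoc\PandocException`.

```php
<?php
/**
 * Convert every page, collecting failures instead of stopping at the first one.
 *
 * @param array    $pages       page title => wiki text
 * @param callable $convertPage hypothetical converter; assumed to throw on failure
 * @return array   ['converted' => title => markdown, 'failed' => title => message]
 */
function convertAll(array $pages, callable $convertPage): array
{
    $converted = [];
    $failed = [];
    foreach ($pages as $title => $wikiText) {
        try {
            $converted[$title] = $convertPage($wikiText);
        } catch (RuntimeException $e) {
            // Remember which page failed and why, then move on to the next one.
            $failed[$title] = $e->getMessage();
        }
    }
    return ['converted' => $converted, 'failed' => $failed];
}
```

The `failed` list can be printed at the end of the run, so the problem pages can be fixed by hand afterwards.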
Does this also work for DokuWiki instances? (including the steps for exporting the site as XML)
D:\mediawiki-to-gfm>php convert.php --filename=wiki.xml --output=wiki
'which' is not recognized as an internal or external command,
operable program or batch file.
Unable to locate pandoc
These are all reported when run from the same location, so I don't believe this is a path/env issue.
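The `'which' is not recognized` message suggests the lookup relies on the Unix `which` command, which Windows does not ship; `where` is the closest built-in equivalent. A rough sketch of an OS-aware lookup follows. `locatorCommand` and `findExecutable` are hypothetical helpers, and `PHP_OS_FAMILY` needs PHP 7.2 or later. I have not checked how the bundled pandoc-php library would accept the resulting path.

```php
<?php
/** Pick the shell command that locates executables on the given OS family. */
function locatorCommand(string $osFamily = PHP_OS_FAMILY): string
{
    return $osFamily === 'Windows' ? 'where' : 'which';
}

/** Return the first path reported for $name, or null when nothing is found. */
function findExecutable(string $name): ?string
{
    $output = [];
    $status = 1;
    exec(locatorCommand() . ' ' . escapeshellarg($name), $output, $status);
    return ($status === 0 && isset($output[0])) ? trim($output[0]) : null;
}
```

For example, `findExecutable('pandoc')` would be expected to return the full path when pandoc is on the PATH, and null otherwise.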
When there are multiple pages with the same name that differs only in letter case, e.g.
the content of one page is lost on case-insensitive filesystems (e.g. on macOS).
I happened to have such pages with redirects, e.g. Mysql was redirected to MySQL.
I fixed that for my use case with the following code in saveFile():
$name = $fileMeta['directory'] . $fileMeta['filename'] . '.md';
if (file_exists($name)) {
$name = $fileMeta['directory'] . $fileMeta['filename'] . ' (2).md';
}
$file = fopen($name, 'w');
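The snippet above only handles a single collision. A slightly more general variant, as a sketch: `uniqueMarkdownPath` is a made-up helper name, and it relies on `file_exists()` matching names case-insensitively on such filesystems, which is what the fix above already depends on.

```php
<?php
/**
 * Return a path under $directory that does not exist yet, appending
 * " (2)", " (3)", ... to the file name when needed.
 */
function uniqueMarkdownPath(string $directory, string $filename): string
{
    $path = $directory . $filename . '.md';
    $counter = 2;
    while (file_exists($path)) {
        $path = $directory . $filename . ' (' . $counter . ').md';
        $counter++;
    }
    return $path;
}
```

Links pointing at the renamed page would still need to be adjusted by hand.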
I get this error trying to run it
$ ./convert.php --filename=Wikipedia-20210326223156.xml --output=CONV
pandoc: Unknown writer: gfm
Pandoc could not convert successfully, error code: 9. Tried to run the following command: /usr/bin/pandoc --from=mediawiki --to=gfm /tmp/pandoc605e68074a512
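As far as I know, the gfm writer was added in pandoc 2.0, so an older distribution package would produce exactly this message. A small sketch for checking the installed version up front; both function names are hypothetical, and the input is assumed to be the text printed by `pandoc --version`.

```php
<?php
/** Extract the version number from the first "pandoc X.Y.Z" line, if any. */
function parsePandocVersion(string $versionOutput): ?string
{
    if (preg_match('/^pandoc(?:\.exe)?\s+(\d+(?:\.\d+)*)/mi', $versionOutput, $m) === 1) {
        return $m[1];
    }
    return null;
}

/** True when the reported version is at least 2.0 (assumed minimum for gfm). */
function supportsGfmWriter(string $versionOutput): bool
{
    $version = parsePandocVersion($versionOutput);
    return $version !== null && version_compare($version, '2.0', '>=');
}
```

It could be fed with something like `supportsGfmWriter((string) shell_exec('pandoc --version'))`.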
I propose enhancing the existing software to function as a GitHub Actions workflow, facilitating the direct conversion of MediaWiki pages to GitHub-flavored Markdown within a GitHub repository. This integration aims to simplify the migration process, providing users with an automated solution for maintaining synchronized documentation between MediaWiki and GitHub repositories.
When pandoc fails, the faulty page isn't displayed.
See my proposed change in 5f270a41e2e30440ee6a06c0075bee92c34c403f that handles Pandoc exceptions, to report the name of the faulty page, and also adds an option to skip errors, and thus allow skipping to the next files.
Hope this helps,
Whenever pandoc cannot convert the contents (which I'm experiencing right now with strange tables in the pages), there's no other solution than to process the content manually.
It could be handy to have an option that saves the original content in a "raw" source code block in an otherwise empty page. That would allow the process to complete and let the user fix the output manually, instead of having to fix the pages in the input.
Hope this makes sense.
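To illustrate the idea, here is a rough sketch of what such a fallback page could look like. `rawFallbackPage` is a hypothetical function, not something the tool provides, and the notice text is made up.

```php
<?php
/**
 * Build a Markdown page that embeds the unconverted wiki text in a
 * fenced code block, so the run can continue and the page be fixed later.
 */
function rawFallbackPage(string $title, string $wikiText): string
{
    $fence = str_repeat('`', 3);
    // Lengthen the fence if the page itself already contains one.
    while (strpos($wikiText, $fence) !== false) {
        $fence .= '`';
    }
    return '# ' . $title . "\n\n"
        . "Conversion failed; original MediaWiki source follows.\n\n"
        . $fence . "\n" . rtrim($wikiText, "\r\n") . "\n" . $fence . "\n";
}
```

The caller would write this string to the output file whenever the pandoc call fails for a page.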
sudo apt install php7.3
(a chance to test whether PHP 7.3 handles the script properly)
sudo apt install pandoc
sudo apt install git
sudo apt install composer
sudo apt install phpunit
While converting, using
user@xat:~/work/projects/markdown/mediawiki-to-gfm$ ./convert.php --filename=./sysadmin-20190408122828.xml --output=./sortie
I get this error:
Converted: ./sortie/AD_Role
[...]
Converted: ./sortie/Creation_compte_Windows
Error at "source" (line 213, column 1):
unexpected end of input
expecting "<"^
Pandoc could not convert successfully, error code: 65. Tried to run the following command: /usr/bin/pandoc --from=mediawiki --to=gfm /tmp/pandoc5cab3f8624803
I'll try to use php 7.2
In my case, the tool exported the first version of each page instead of the last one, so old versions of pages were exported.
I made the mediawiki.xml export from MediaWiki 1.29.
To fix the problem I changed the following line in convertData():
$text = $this->cleanText($text[0], $fileMeta);
to:
$text = $this->cleanText(end($text), $fileMeta);
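Using `end()` assumes the revisions appear oldest-first in the export. If that order cannot be relied on, the newest revision can be picked by timestamp instead. This is a sketch only: `latestRevisionText` is a hypothetical helper, and it assumes each `<revision>` carries a `<timestamp>` in the same ISO 8601 UTC format, so that plain string comparison orders them correctly.

```php
<?php
/** Return the text of the revision with the newest timestamp, or null. */
function latestRevisionText(SimpleXMLElement $page): ?string
{
    $latest = null;
    $latestTime = '';
    foreach ($page->revision as $revision) {
        $time = (string) $revision->timestamp;
        // Same-format ISO 8601 timestamps sort correctly as strings.
        if ($latest === null || strcmp($time, $latestTime) > 0) {
            $latest = (string) $revision->text;
            $latestTime = $time;
        }
    }
    return $latest;
}
```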
I get a "Pandoc executable is not executable" error, but Pandoc is working fine. Not sure what to do from here.
Error at "/tmp/pandoc661fbeeccf423" (line 43, column 3):
unexpected '-'
|-<div style="text-align: left;">
^
Pandoc\PandocException: Pandoc could not convert successfully, error code: 64. Tried to run the following command: /usr/bin/pandoc --from=mediawiki --to=gfm /tmp/pandoc661fbeeccf423 in /mediawiki-to-gfm/vendor/ryakad/pandoc-php/src/Pandoc/Pandoc.php:287
Stack trace:
#0 /mediawiki-to-gfm/app/src/Convert.php(194): Pandoc\Pandoc->runWith('{{DISPLAYTITLE:...', Array)
#1 /mediawiki-to-gfm/app/src/Convert.php(149): App\Convert->runPandoc('{{DISPLAYTITLE:...')
#2 /mediawiki-to-gfm/app/src/Convert.php(117): App\Convert->convertData()
#3 /mediawiki-to-gfm/convert.php(50): App\Convert->run()
#4 {main}
Latest, fresh built docker image
I don't have strong knowledge of Docker, so maybe I missed something.
But I tried to do what is explained at: https://github.com/outofcontrol/mediawiki-to-gfm#run-with-docker
Run with docker
Create a new directory and put filename.xml into the new directory:
mkdir my_wiki
mv filename.xml my_wiki/
cd my_wiki
Now you can convert filename.xml using docker. Note: do not use the output parameter. The output will always be written into the subdirectory output of the current path. (hence the creation of a new directory). This is necessary, because the docker container does not have access to your filesystem except for the current directory (because of the -v $PWD:/app parameter for docker)
docker run -v $PWD:/app outofcontrol/mediawiki-to-gfm --filename=filename.xml
But doing that I get :
Unable to find image 'outofcontrol/mediawiki-to-gfm:latest' locally
docker: Error response from daemon: pull access denied for outofcontrol/mediawiki-to-gfm, repository does not exist or may require 'docker login': denied: requested access to the resource is denied.
See 'docker run --help'.
I found this PR : #22
and so I tried to use docker run -v $PWD:/app thawn/mediawiki-to-gfm --filename=filename.xml
and it seems to work.
Is there something obvious I missed ?
Seems like carriage returns are kept (and after forking and removing them, I now recall/see that git automatically adds CRLF even to LF files).
I ran a quick perl -pi -e 's/\r/\n/g' convert.php to remove the CR (\r) characters and everything ran fine.
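Note that replacing every \r with \n turns each CRLF pair into two line feeds. That is harmless for a PHP script, but it doubles the line breaks. A sketch of a normalisation that keeps one line break per line; `normalizeLineEndings` is a made-up name:

```php
<?php
/** Convert CRLF and lone CR line endings to LF. */
function normalizeLineEndings(string $text): string
{
    // CRLF is replaced first so a Windows line break becomes a single LF;
    // any remaining lone CR is converted afterwards.
    return str_replace(["\r\n", "\r"], "\n", $text);
}
```

It could be applied with `file_put_contents('convert.php', normalizeLineEndings(file_get_contents('convert.php')))`.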
hi,
I tried the script with the latest pandoc binary, but everything between <pre> bla </pre>
is lost. We use a lot of them for showing commands on a shell. I've found jgm/pandoc#5333 but don't find any workaround for it.
cu denny
Source document: This file from https://forgottenrealms.fandom.com/wiki/Special:Statistics
unexpected end of input
expecting white space
Pandoc\PandocException: Pandoc could not convert successfully, error code: 64. Tried to run the following command: /opt/homebrew/bin/pandoc --from=mediawiki --to=gfm /var/folders/7x/hf32blrs4cj6r0vmvl85fgy80000gn/T/pandoc6470f6dd2d79e in /Users/joshuaziggas/Downloads/mediawiki-to-gfm/vendor/ryakad/pandoc-php/src/Pandoc/Pandoc.php:287
Stack trace:
#0 /Users/j/Downloads/mediawiki-to-gfm/app/src/Convert.php(194): Pandoc\Pandoc->runWith('{{FA}}\n{{otheru...', Array)
#1 /Users/j/Downloads/mediawiki-to-gfm/app/src/Convert.php(149): App\Convert->runPandoc('{{FA}}\n{{otheru...')
#2 /Users/j/Downloads/mediawiki-to-gfm/app/src/Convert.php(117): App\Convert->convertData()
#3 /Users/j/Downloads/mediawiki-to-gfm/convert.php(50): App\Convert->run()
#4 {main}
PHP 8.2.6
The offending line(s) in the XML seem to be:
<format>text/x-wiki</format>
<text bytes="12522" sha1="8k8blefnyku1lbh8tnrm419ffv95n5r" xml:space="preserve">{{FA}}
{{otheruses4|the [[Dwarf|dwarven]] script|the dwarven language|Dwarvish}}
{{Item
| image = Runestone dwarf.png
| caption = A dwarven [[runestone]], stating "This place is Dhurri's Bridge.<br /> Here 42 of the best warriors of the House of Helmung fell,<br /> to keep orcs from the Halls. We slew 608. [Day] 218,<br /> [year since the founding of the House] 377.<br /> [[Nain]], warrior of the [[House of Helmung]]"<ref name="DD">{{Cite book/Dwarves Deep|inside cover}}</ref>
Any thoughts? There are a few more areas in the file like this one.
Describe the bug
Trying to run the image on WSL2 (Ubuntu on Windows) fails: Docker cannot find a matching image for the platform.
To Reproduce
Steps to reproduce the behavior:
docker run --platform linux/amd64 -v $PWD:/app oooc/mediawiki-to-gfm --filename=filename.xml
Unable to find image 'oooc/mediawiki-to-gfm:latest' locally
latest: Pulling from oooc/mediawiki-to-gfm
Digest: sha256:ef61d222c8b025c23c8f574f40d2d8e680a4fead90eb47042a76429e482dbc48
Status: Image is up to date for oooc/mediawiki-to-gfm:latest
docker: image with reference docker.io/oooc/mediawiki-to-gfm:latest was found but does not match the specified platform: wanted linux/amd64, actual: linux/arm64/v8.
See 'docker run --help'.
Expected behavior
I've run Docker on WSL2 and it works fine.
Example:
docker run --platform linux/amd64 hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
c1ec31eb5944: Pull complete
Digest: sha256:d000bc569937abbe195e20322a0bde6b2922d805332fd6d8a68b19f524b7d21d
Status: Downloaded newer image for hello-world:latest
Hello from Docker!
This message shows that your installation appears to be working correctly.
To generate this message, Docker took the following steps:
1. The Docker client contacted the Docker daemon.
2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
(amd64)
3. The Docker daemon created a new container from that image which runs the
executable that produces the output you are currently reading.
4. The Docker daemon streamed that output to the Docker client, which sent it
to your terminal.
To try something more ambitious, you can run an Ubuntu container with:
$ docker run -it ubuntu bash
Share images, automate workflows, and more with a free Docker ID:
https://hub.docker.com/
For more examples and ideas, visit:
https://docs.docker.com/get-started/
It crashes when I try the conversion. I don't know where the reported markup comes from; I checked the MediaWiki dump, and it's not there. The temporary file /tmp/pandoc638e46d7bac03 doesn't exist anymore.
I use pandoc 2.18 and mediawiki-to-gfm as a git checkout from today.
web-php ~/clones/mediawiki-to-gfm # ./convert.php --filename=/tmp/mediawiki.dump.xml --output=converted
Error at "/tmp/pandoc638e46d7bac03" (line 3, column 1):
unexpected '<'
<table> <tr> <td>
^
Pandoc\PandocException: Pandoc could not convert successfully, error code: 65. Tried to run the following command: /usr/bin/pandoc --from=mediawiki --to=gfm /tmp/pandoc638e46d7bac03 in /root/clones/mediawiki-to-gfm/vendor/ryakad/pandoc-php/src/Pandoc/Pandoc.php:287
Stack trace:
#0 /root/clones/mediawiki-to-gfm/app/src/Convert.php(194): Pandoc\Pandoc->runWith()
#1 /root/clones/mediawiki-to-gfm/app/src/Convert.php(149): App\Convert->runPandoc()
#2 /root/clones/mediawiki-to-gfm/app/src/Convert.php(117): App\Convert->convertData()
#3 /root/clones/mediawiki-to-gfm/convert.php(50): App\Convert->run()
#4 {main}
The README says: Tested in PHP 7.0 and 7.1
I'm happy to confirm that the script runs on PHP 7.2 too. I ran it on Ubuntu 18.04 under Windows Subsystem for Linux (WSL).
php --version
PHP 7.2.5-0ubuntu0.18.04.1 (cli) (built: May 9 2018 17:21:02) ( NTS )
Copyright (c) 1997-2018 The PHP Group
Zend Engine v3.2.0, Copyright (c) 1998-2018 Zend Technologies
with Zend OPcache v7.2.5-0ubuntu0.18.04.1, Copyright (c) 1999-2018, by Zend Technologies
On Ubuntu Bionic, packages offer pandoc 1.19, so it is necessary to install the newer binary of pandoc 2.x from the official package at https://github.com/jgm/pandoc/releases/tag/2.2.1
sudo dpkg -i pandoc-2.2.1-1-amd64.deb
I also had to install phpunit from Bionic packages:
sudo apt install phpunit
Otherwise, composer update --no-dev throws this:
$ composer update --no-dev
Loading composer repositories with package information
Updating dependencies
Your requirements could not be resolved to an installable set of packages.
Problem 1
- phpunit/phpunit 6.5.8 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- phpunit/phpunit 6.5.7 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- phpunit/phpunit 6.5.6 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- phpunit/phpunit 6.5.5 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- phpunit/phpunit 6.5.4 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- phpunit/phpunit 6.5.3 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- phpunit/phpunit 6.5.2 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- phpunit/phpunit 6.5.1 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- phpunit/phpunit 6.5.0 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- phpunit/phpunit 6.4.4 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- phpunit/phpunit 6.4.3 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- phpunit/phpunit 6.4.2 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- phpunit/phpunit 6.4.1 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- phpunit/phpunit 6.4.0 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- phpunit/phpunit 6.3.1 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- phpunit/phpunit 6.3.0 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- phpunit/phpunit 6.2.4 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- phpunit/phpunit 6.2.3 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- phpunit/phpunit 6.2.2 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- phpunit/phpunit 6.2.1 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- phpunit/phpunit 6.2.0 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- phpunit/phpunit 6.1.4 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- phpunit/phpunit 6.1.3 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- phpunit/phpunit 6.1.2 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- phpunit/phpunit 6.1.1 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- phpunit/phpunit 6.1.0 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- phpunit/phpunit 6.0.9 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- phpunit/phpunit 6.0.8 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- phpunit/phpunit 6.0.7 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- phpunit/phpunit 6.0.6 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- phpunit/phpunit 6.0.5 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- phpunit/phpunit 6.0.4 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- phpunit/phpunit 6.0.3 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- phpunit/phpunit 6.0.2 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- phpunit/phpunit 6.0.13 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- phpunit/phpunit 6.0.12 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- phpunit/phpunit 6.0.11 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- phpunit/phpunit 6.0.10 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- phpunit/phpunit 6.0.1 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- phpunit/phpunit 6.0.0 requires ext-dom * -> the requested PHP extension dom is missing from your system.
- Installation request for phpunit/phpunit ~6 -> satisfiable by phpunit/phpunit[6.0.0, 6.0.1, 6.0.10, 6.0.11, 6.0.12, 6.0.13, 6.0.2, 6.0.3, 6.0.4, 6.0.5, 6.0.6, 6.0.7, 6.0.8, 6.0.9, 6.1.0, 6.1.1, 6.1.2, 6.1.3, 6.1.4, 6.2.0, 6.2.1, 6.2.2, 6.2.3, 6.2.4, 6.3.0, 6.3.1, 6.4.0, 6.4.1, 6.4.2, 6.4.3, 6.4.4, 6.5.0, 6.5.1, 6.5.2, 6.5.3, 6.5.4, 6.5.5, 6.5.6, 6.5.7, 6.5.8].
To enable extensions, verify that they are enabled in your .ini files:
- /etc/php/7.2/cli/php.ini
- /etc/php/7.2/cli/conf.d/10-opcache.ini
- /etc/php/7.2/cli/conf.d/10-pdo.ini
- /etc/php/7.2/cli/conf.d/20-calendar.ini
- /etc/php/7.2/cli/conf.d/20-ctype.ini
- /etc/php/7.2/cli/conf.d/20-exif.ini
- /etc/php/7.2/cli/conf.d/20-fileinfo.ini
- /etc/php/7.2/cli/conf.d/20-ftp.ini
- /etc/php/7.2/cli/conf.d/20-gettext.ini
- /etc/php/7.2/cli/conf.d/20-iconv.ini
- /etc/php/7.2/cli/conf.d/20-json.ini
- /etc/php/7.2/cli/conf.d/20-phar.ini
- /etc/php/7.2/cli/conf.d/20-posix.ini
- /etc/php/7.2/cli/conf.d/20-readline.ini
- /etc/php/7.2/cli/conf.d/20-shmop.ini
- /etc/php/7.2/cli/conf.d/20-sockets.ini
- /etc/php/7.2/cli/conf.d/20-sysvmsg.ini
- /etc/php/7.2/cli/conf.d/20-sysvsem.ini
- /etc/php/7.2/cli/conf.d/20-sysvshm.ini
- /etc/php/7.2/cli/conf.d/20-tokenizer.ini
You can also run `php --ini` inside terminal to see which files are used by PHP in CLI mode.
Running update with --no-dev does not mean require-dev is ignored, it just means the packages will not be installed. If dev requirements are blocking the update you have to resolve those problems.
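The composer output above complains about the missing dom extension. On Debian/Ubuntu that extension is, as far as I know, shipped in the php-xml package rather than with phpunit itself. A quick preflight check can be sketched as follows; `missingExtensions` is a hypothetical helper, and the list of extensions to check is up to the caller.

```php
<?php
/**
 * Return the names from $required that are not loaded in this PHP runtime.
 *
 * @param string[] $required extension names, e.g. ['dom', 'simplexml']
 * @return string[]
 */
function missingExtensions(array $required): array
{
    return array_values(array_filter(
        $required,
        function (string $name): bool {
            return !extension_loaded($name);
        }
    ));
}
```

Running `print_r(missingExtensions(['dom', 'simplexml']));` before `composer update` shows what still has to be installed.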
Thanks for the script. It allowed me to quickly convert ~400 MediaWiki pages.
Hi Team,
I am getting the error named in the subject. Below is the code that is available by default after installing with composer. Kindly help me troubleshoot this issue.
#!/usr/bin/env php
 * @link https://github.com/outofcontrol/mediawiki-to-gfm
 * Original Author
 * @author Philip Ashlock
 * @link https://github.com/philipashlock/mediawiki-to-markdown
 * @license MIT License https://opensource.org/licenses/MIT
 */
if (is_file(__DIR__.'/vendor/autoload.php') === true) {
    require_once 'vendor/autoload.php';
} else {
    exit("Please run 'composer update --no-dev' first." . PHP_EOL);
}
$args = getopt(
    '',
    [
        'filename:',
        'output::',
        'format::',
        'addmeta::',
        'flatten::',
        'indexes::',
        'version::',
        'help::'
    ]
);
$convert = new App/Convert($args);
if (isset($args['help'])) {
    $convert->help();
    exit;
}
if (isset($args['version'])) {
    $convert->getVersion();
    exit;
}
try {
    $convert->run();
} catch (Exception $e) {
    echo $e->getMessage() . PHP_EOL;
    exit(1);
}
I'm trying to convert the entire Eclipse wiki, exported to a single XML file. One of the errors I got was apparently due to the file being too big:
Warning: SimpleXMLElement::__construct(): Entity: line 2518746: parser error : internal error: Huge input lookup in /mediawiki-to-gfm/app/src/Convert.php on line 346
This error seems to be gone with this change:
diff --git a/app/src/Convert.php b/app/src/Convert.php
index fc00640..dea6812 100644
--- a/app/src/Convert.php
+++ b/app/src/Convert.php
@@ -343,7 +343,7 @@ class Convert
*/
public function loadData($xmlData)
{
- if (($xml = new \SimpleXMLElement($xmlData)) === false) {
+ if (($xml = new \SimpleXMLElement($xmlData, LIBXML_PARSEHUGE)) === false) {
throw new \Exception('Invalid XML File.');
}
$this->dataToConvert = $xml->xpath('page');
I invoke the converter with:
php -d memory_limit=4G ./convert.php --filename=/wikiexport/Eclipsepedia-20230228212752.xml --output=/wikiexport/markdown/
The XML file is about 130 MB. Since this is an extreme case (from my POV), I'm not sure fixing the bug is worth the effort.
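One side note on the change discussed above: the `SimpleXMLElement` constructor throws an exception on unparsable input rather than returning false, so the `=== false` check never fires. A sketch of how the loading could look with `LIBXML_PARSEHUGE` and explicit error handling. `loadPages` is a hypothetical stand-in for `loadData()`, and the XML is assumed to have no default namespace, so that `xpath('page')` matches.

```php
<?php
/**
 * Parse an export and return its <page> elements.
 *
 * @return SimpleXMLElement[]
 * @throws RuntimeException when the XML cannot be parsed
 */
function loadPages(string $xmlData): array
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    try {
        // LIBXML_PARSEHUGE lifts libxml's built-in size limits for large dumps.
        $xml = new SimpleXMLElement($xmlData, LIBXML_PARSEHUGE);
    } catch (Exception $e) {
        throw new RuntimeException('Invalid XML file: ' . $e->getMessage(), 0, $e);
    } finally {
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
    }
    $pages = $xml->xpath('page');
    return is_array($pages) ? $pages : [];
}
```

For dumps much larger than this one, a streaming parser such as `XMLReader` would probably be the better route, since `SimpleXMLElement` keeps the whole document in memory.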
AFAIU, the documents attached to pages aren't handled by this tool.
Maybe I'm wrong.
In either case, it could be helpful to document this limitation, and maybe provide a hint on how to migrate these.
Thanks in advance.
(initially filed at philipashlock/mediawiki-to-markdown#21 FWIW)