
earwig / mwparserfromhell


A Python parser for MediaWiki wikicode

Home Page: https://mwparserfromhell.readthedocs.io/

License: MIT License

Python 66.81% Shell 0.74% C 32.12% Batchfile 0.33%
python parser mediawiki wikipedia

mwparserfromhell's Introduction

mwparserfromhell


mwparserfromhell (the MediaWiki Parser from Hell) is a Python package that provides an easy-to-use and outrageously powerful parser for MediaWiki wikicode. It supports Python 3.8+.

Developed by Earwig with contributions from Σ, Legoktm, and others. Full documentation is available on ReadTheDocs. Development occurs on GitHub.

Installation

The easiest way to install the parser is through the Python Package Index; you can install the latest release with pip install mwparserfromhell (get pip). Make sure your pip is up-to-date first, especially on Windows.

Alternatively, get the latest development version:

git clone https://github.com/earwig/mwparserfromhell.git
cd mwparserfromhell
python setup.py install

The comprehensive unit testing suite requires pytest (pip install pytest) and can be run with python -m pytest.

Usage

Normal usage is rather straightforward (where text is page text):

>>> import mwparserfromhell
>>> wikicode = mwparserfromhell.parse(text)

wikicode is an mwparserfromhell.Wikicode object, which acts like an ordinary str object with some extra methods. For example:

>>> text = "I has a template! {{foo|bar|baz|eggs=spam}} See it?"
>>> wikicode = mwparserfromhell.parse(text)
>>> print(wikicode)
I has a template! {{foo|bar|baz|eggs=spam}} See it?
>>> templates = wikicode.filter_templates()
>>> print(templates)
['{{foo|bar|baz|eggs=spam}}']
>>> template = templates[0]
>>> print(template.name)
foo
>>> print(template.params)
['bar', 'baz', 'eggs=spam']
>>> print(template.get(1).value)
bar
>>> print(template.get("eggs").value)
spam

Since nodes can contain other nodes, getting nested templates is trivial:

>>> text = "{{foo|{{bar}}={{baz|{{spam}}}}}}"
>>> mwparserfromhell.parse(text).filter_templates()
['{{foo|{{bar}}={{baz|{{spam}}}}}}', '{{bar}}', '{{baz|{{spam}}}}', '{{spam}}']

You can also pass recursive=False to filter_templates() and explore templates manually. This is possible because nodes can contain additional Wikicode objects:

>>> code = mwparserfromhell.parse("{{foo|this {{includes a|template}}}}")
>>> print(code.filter_templates(recursive=False))
['{{foo|this {{includes a|template}}}}']
>>> foo = code.filter_templates(recursive=False)[0]
>>> print(foo.get(1).value)
this {{includes a|template}}
>>> print(foo.get(1).value.filter_templates()[0])
{{includes a|template}}
>>> print(foo.get(1).value.filter_templates()[0].get(1).value)
template

Templates can be easily modified to add, remove, or alter params. Wikicode objects can be treated like lists, with append(), insert(), remove(), replace(), and more. They also have a matches() method for comparing page or template names, which takes care of capitalization and whitespace:

>>> text = "{{cleanup}} '''Foo''' is a [[bar]]. {{uncategorized}}"
>>> code = mwparserfromhell.parse(text)
>>> for template in code.filter_templates():
...     if template.name.matches("Cleanup") and not template.has("date"):
...         template.add("date", "July 2012")
...
>>> print(code)
{{cleanup|date=July 2012}} '''Foo''' is a [[bar]]. {{uncategorized}}
>>> code.replace("{{uncategorized}}", "{{bar-stub}}")
>>> print(code)
{{cleanup|date=July 2012}} '''Foo''' is a [[bar]]. {{bar-stub}}
>>> print(code.filter_templates())
['{{cleanup|date=July 2012}}', '{{bar-stub}}']

You can then convert code back into a regular str object (for saving the page!) by calling str() on it:

>>> text = str(code)
>>> print(text)
{{cleanup|date=July 2012}} '''Foo''' is a [[bar]]. {{bar-stub}}
>>> text == code
True

Limitations

While the MediaWiki parser generates HTML and has access to the contents of templates, among other things, mwparserfromhell acts as a direct interface to the source code only. This has several implications:

  • Syntax elements produced by a template transclusion cannot be detected. For example, imagine a hypothetical page "Template:End-bold" that contained the text </b>. While MediaWiki would correctly understand that <b>foobar{{end-bold}} translates to <b>foobar</b>, mwparserfromhell has no way of examining the contents of {{end-bold}}. Instead, it would treat the bold tag as unfinished, possibly extending further down the page.

  • Templates adjacent to external links, as in http://example.com{{foo}}, are considered part of the link. In reality, this would depend on the contents of the template.

  • When different syntax elements cross over each other, as in {{echo|''Hello}}, world!'', the parser gets confused because this cannot be represented by an ordinary syntax tree. Instead, the parser will treat the first syntax construct as plain text. In this case, only the italic tag would be properly parsed.

    Workaround: Since this commonly occurs with text formatting and text formatting is often not of interest to users, you may pass skip_style_tags=True to mwparserfromhell.parse(). This treats '' and ''' as plain text.

    A future version of mwparserfromhell may include multiple parsing modes to get around this restriction more sensibly.

Additionally, the parser lacks awareness of certain wiki-specific settings:

  • Word-ending links are not supported, since the linktrail rules are language-specific.
  • Localized namespace names aren't recognized, so file links (such as [[File:...]]) are treated as regular wikilinks.
  • Anything that looks like an XML tag is treated as a tag, even if it is not a recognized tag name, since the list of valid tags depends on loaded MediaWiki extensions.

Integration

mwparserfromhell is used by and originally developed for EarwigBot; Page objects have a parse method that essentially calls mwparserfromhell.parse() on page.get().

If you're using Pywikibot, your code might look like this:

import mwparserfromhell
import pywikibot

def parse(title):
    site = pywikibot.Site()
    page = pywikibot.Page(site, title)
    text = page.get()
    return mwparserfromhell.parse(text)

If you're not using a library, you can parse any page with the following Python 3 code (using the API and the requests library):

import mwparserfromhell
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def parse(title):
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "rvlimit": 1,
        "titles": title,
        "format": "json",
        "formatversion": "2",
    }
    headers = {"User-Agent": "My-Bot-Name/1.0"}
    req = requests.get(API_URL, headers=headers, params=params)
    res = req.json()
    revision = res["query"]["pages"][0]["revisions"][0]
    text = revision["slots"]["main"]["content"]
    return mwparserfromhell.parse(text)

mwparserfromhell's People

Contributors

andrewwang43, anticompositenumber, bryghtshadow, davidebbo, davidswinegar, declerambaul, earwig, eseiver, hugovk, jayvdb, kishorkunal-raj, lahwaacz, larivact, legoktm, nyurik, odidev, r-barnes, reinerh, ricordisamoa, valhallasw, yuvipanda


mwparserfromhell's Issues

List tags should include the item text in their contents

When parsing ordered or unordered lists, the actual text in the list item is returned as a series of separate nodes of various types following the "Tag" node, all belonging to the same parent. For example:

# this {{is}}
# a [[test]]

filter(recursive=False) will give these node types (text value in brackets): Tag (#), Text ( this ), Template ({{is}}), Text (\n), Tag (#), Text ( a ), Wikilink ([[test]]), Text (\n)

Note that the last Text node will contain the newline that actually ends the list item. The parser should parse this the same way, by including everything on that line as part of the Tag node's .contents property. So this should really return: Tag (# this {{is}}\n), Tag (# a [[test]]\n)

Similar behaviour should also apply to unordered lists (*) and to definition lists (; and :).

Allow some methods of Wikicode (like remove()) to be passed other Wikicode objects

Reported by @HazardSJ:

a.remove(b) (given a and b as Wikicode objects, b as a subset of a):

  • expected: all nodes within b are removed from a
  • actual: ValueError is raised

Note: the docs are clear that remove() doesn't accept a Wikicode object, but the functionality is useful when sections are passed. Therefore, the method should be careful to iterate correctly over b's nodes.

Not detecting all templates on [[Talk:Lost Girls]]

legoktm@localhost:~$ python
Python 2.7.3 (default, Jul 29 2012, 23:31:23) 
[GCC 4.2.1 Compatible Apple Clang 4.0 ((tags/Apple/clang-421.0.57))] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pywikibot, mwparserfromhell
>>> p=pywikibot.Page(pywikibot.Site(), 'Talk:Lost Girls')
>>> text = p.get()
>>> code = mwparserfromhell.parse(text)
>>> code.filter_templates()
[u'{{Oz-project}}', u'{{t1|Supercbbox}}']

It's missed the following template:
{{comicsproj}}

I'm using this version of the page. And I'm using mwparserfromhell 0.1 from pypi.

Templates inside section headings

Given the following code:

import mwparserfromhell as mwp
wikitext = u"== {{User:SuggestBot/suggest}} =="
code = mwp.parse(wikitext)
templates = code.filter_templates()

I then find that templates is an empty list.

I would suspect the template inside a heading to be valid markup, but maybe I'm mistaken? Looking around in various MediaWiki help pages did not suggest it wasn't OK though.

C tokenizer: emit tokens simpler than the expensive PyObject* kind

Allocating and filling the slots of PyObject*s every time we create a token (even if it is later discarded) is a large overhead; ideally, we use custom structs for each token that have the appropriate attributes.

The parser will then have to either (1) convert these tokens to PyObject*s at the end of parsing, (2) wrap them in some kind of capsule, or (3) write a C port of the builder too that uses these new tokens.

(3) is ultimately the fastest solution, but it's the most work and, since regular Python tokens are never generated, we will need a new way to run C tokenizer test cases.

Ignores outer nested template in presence of http://

This simple example (based on an actual page I was troubleshooting) fails:

{{test
| website             = {{URL|http://x.com}}
}}

filter_templates() produces:

[u'{{URL|http://x.com}}\n}}']

For a very similar example (lacking only the 'http://'):

{{test
| website             = {{URL|x.com}}
}}

filter_templates() correctly produces

[u'{{test\n| website             = {{URL|x.com}}\n}}', u'{{URL|x.com}}']

HTML entities should be valid within certain parser-blacklisted tags

  • Given: <pre>&nbsp;</pre>
  • Expected:
    [TagOpenOpen(showtag=True), Text(text="pre"), TagCloseOpen(), HTMLEntityStart(), Text(text="nbsp"), HTMLEntityEnd(), TagOpenClose(), Text(text="pre"), TagCloseClose()]
  • Actual:
    [TagOpenOpen(showtag=True), Text(text="pre"), TagCloseOpen(), Text(text="&nbsp;"), TagOpenClose(), Text(text="pre"), TagCloseClose()]

py3k stuff

Step 1: from __future__ import unicode_literals in all files.

Step 2: from . import compatibility (or from . import compat) – uses str and bytes on Python 3, and str = unicode and bytes = str on Python 2.

setup.py AttributeError with python-2.6.8

Similar to issue #44 it seems that 37003d2 changes line 13 of compat.py

causing

python setup.py install
Traceback (most recent call last):
  File "setup.py", line 26, in <module>
    from mwparserfromhell import __version__
  File "/home/ec2-user/codetree/master/rrb/snpedia/bots/mwparserfromhell/mwparserfromhell/__init__.py", line 37, in <module>
    from . import (compat, definitions, nodes, parser, smart_list, string_mixin,
  File "/home/ec2-user/codetree/master/rrb/snpedia/bots/mwparserfromhell/mwparserfromhell/compat.py", line 13, in <module>
    py3k = sys.version_info.major == 3
AttributeError: 'tuple' object has no attribute 'major'

under

Python 2.6.8 (unknown, Mar 14 2013, 09:31:22)
[GCC 4.6.2 20111027 (Red Hat 4.6.2-2)] on linux2

It seemed to like the previous structure
sys.version_info[0]
found in 53c2658
but not the
sys.version_info.major
found in 37003d2
as that named syntax was introduced in 2.7

segmentation fault (Python Interpreter Crashes)

Hello Everyone,

I am getting a segmentation fault while parsing some wiki pages.
It is very easily reproducible.
I am using Python 2.6.6.
To reproduce this, fetch the wiki page of 'Albert Einstein' and call mwparserfromhell.parse() with the text of that page.
I have observed the segmentation fault with some other pages as well.
I use the wiki API to fetch the page of Albert Einstein in XML format, then parse the XML content to get the
text of the page, and then call mwparserfromhell.parse with that text.
I have successfully used mwparserfromhell to parse hundreds of pages.
Let me know if you cannot reproduce this.

Thanks,
Ashwin

parse() crashes python

text = page.get()
wikicode = mwparserfromhell.parse(text)
references = wikicode.filter_templates()

I get a popup message at the parse() line stating that python has stopped working.
Python: 2.7.3 (default, Apr 10 2012, 23:31:26) [MSC v.1500 32 bit (Intel)]

I just did a pip uninstall and reinstalled with the most current version, and it is now non-functioning.

Template rejected by parser

I've come across this little gem here:

{{Infobox Platz
| Name=Strausberger Platz
| Alternativnamen=
| Stadtwappen=Coat of arms of Berlin.svg
| Kategorie=Platz in Berlin
| Bild=Strausberger Platz Berlin April 2006 109.jpg|miniatur
| Bild zeigt=Der Platz in Richtung Westen gesehen
| Ort=Berlin
| Ortsteil=[[Berlin-Friedrichshain]]
| Angelegt=1967
| Neugestaltet=
| Straßen=<br />Lichtenberger Straße, [[Karl-Marx-Allee]]
| Bauwerke=„Haus Berlin“
| Nutzergruppen=[[Fußgänger]], [[Radfahrer]], [[Auto]]
| Platzgestaltung=
| Baukosten=
}}

It's from here: http://de.wikipedia.org/w/index.php?title=Strausberger_Platz&oldid=112496475

As soon as the line
Bild=Strausberger Platz Berlin April 2006 109.jpg|miniatur
is added to the template, it is rejected and a Text node is created instead.
Even though this markup doesn't make sense, I would expect a Template node with a value-less parameter named "miniatur" instead of a Text node.

Template problems with empty parameters

This one is a real issue, I promise. Occasionally you run across a template with an empty final unnamed parameter, e.g.

{{foo|}}

The Template node handles empty params oddly:

>>> wikicode = mwparserfromhell.parse('{{foo|}}')
>>> tmpls = wikicode.filter_templates()
>>> tmpls[0]
u'{{foo|}}'
>>> tmpls[0].get(1)
u''
>>> tmpls[0].add(1, 'bar')
u'1=bar'
>>> tmpls[0]
u'{{foo||1=bar}}'

You can work around this by explicitly removing the empty param:

>>> wikicode = mwparserfromhell.parse('{{foo|}}')
>>> tmpls = wikicode.filter_templates()
>>> tmpls[0].remove(1)
>>> tmpls[0].add(1, 'bar')
u'bar'
>>> tmpls[0]
u'{{foo|bar}}'

Similar things happen with named params with empty values, e.g. {{foo|baz=}}. This seems to me like a bug. If it is, I can certainly take a crack at a fix.

mwparserfromhell version 0.3.2 installed via pip on Ubuntu 12.04.3.

HTML Tags: Doesn't parse ref tag name parameters with double-quotes and hyphen

Hey Earwig, try testing this:
wikicode = Builder().build(Tokenizer().tokenize('<ref name="a-b">'))
Error I get is: in my local tokenizer.py, line 472, in _actually_close_tag_opening:
if isinstance(self._stack[-1], tokens.TagAttrStart): IndexError: list index out of range
Only seems to occur when: 1) It's a ref tag, 2) name parameter is specified and has a value with certain characters in it, like - (hyphen) or = (equals), 3) the name parameter value is in double-quote. Bug? Zad68 14:10, 14 March 2013 (UTC)


Problem with iteration over slice of nodes

>>> wikicode = mwparserfromhell.parse(text)
>>> x = [node for node in wikicode.nodes[:10]]
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "<input>", line 1, in <listcomp>
  File "/home/bunyk/python/wikibot/lib/python3.2/site-packages/mwparserfromhell/smart_list.py", line 257, in __iter__
    while i < self._stop:
TypeError: unorderable types: NoneType() < int()

Issue with templates

Using 0.1.1 with python 2.7.
With this markup (simplified from http://en.wikipedia.org/w/index.php?action=raw&title=Applied%20Materials) :

{{Infobox company
| revenue          = {{nowrap|{{decrease}}
<ref name="Applied Materials, Inc.-10-K">{{cite web|url=http://pdf.secdatabase.com/1278/0000006951-12-000018.pdf |title=Applied Materials, Inc. Inc 2012 Annual Report, Form 10-K, Filing Date December 5, 2012 |publisher=secdatabase.com |accessdate =Dec 26, 2012}}</ref>}}
}}

I cannot get the template "Infobox company" from
mwparserfromhell.parse(test).filter_templates()[0].name
If I remove the newline between the {{decrease}} and the ref, it works.

C tokenizer: use Python 3's new Unicode APIs

Instead of turning self->text into a tuple of one-char Unicode strings, we should keep the original Unicode object and use the various (new) accessor methods to work with it. This is a fairly fundamental change to the way C parsing works, so I'm leaving it for v0.4.

LaTex in Wikitext, ignore <math> tags?

I seem to encounter performance issues on pages that have math in them. For instance: http://en.wikipedia.org/wiki/Spence%27s_function

Lines like

 :<math>\operatorname{Li}_2(z)+\operatorname{Li}_2(1-z)=\frac{{\pi}^2}{6}-\ln z \cdot\ln(1-z) </math>

correctly aren't recognized as templates, but take a very long time to not be recognized.

Best way forward: institute an ignoreMath=True option on filter_templates()?

Speed up parsing by storing intermediate tokens

There are certain situations where we rapidly construct and destroy and then have to reconstruct the same token stack when hitting bad routes. These are rare, but act as a major slowdown when they do occur, potentially hitting the parse cycle limit. Ideally, when failing a route, we store the tokenization of complete nodes in some kind of cache which associates them with the starting/ending head locations and the context. This cache can then be popped from if we reach that head location again with the same context, allowing us to bypass regular parsing.

Installation error

As I mentioned on IRC, I had a problem installing via the version from git (installed successfully via pip). While attempting to use the feature/html_tags branch, I ran setup.py install, but among other things, it returned:
mwparserfromhell error

IIRC, this was the same error I got before after doing a simple clone and attempting an install. Until this is fixed, back to pip I go (too bad tag filtering doesn't work there).

Strip whitespace from template names

Given the following MediaWiki code:

{{User:Foo/Bar
|param1=foo|param2=bar|param3=baz}}

The given template's name is "u'User:Foo/Bar\n'"

Is that a feature or a bug? I would have expected the newline to be stripped from the name.

Crashing on some malformed data

Hello,

I get an AttributeError that 'NoneType' object has no attribute '__iternodes__' [1].
Which is fair, but why isn't there a check for the None type here? What is the most elegant way to handle this?

I should note Wikicode that this is being called on is malformed as well [2]. It's from a wikipedia dump.

[1]
  File "xmlToOCLCNum.py", line 38, in findOCLCNums
    templates = wikicode.filter_templates(recursive=True)
  File "/usr/local/lib/python2.7/dist-packages/mwparserfromhell/wikicode.py", line 360, in filter_templates
    return list(self.ifilter_templates(recursive, matches, flags))
  File "/usr/local/lib/python2.7/dist-packages/mwparserfromhell/wikicode.py", line 320, in ifilter_templates
    return self.filter(recursive, matches, flags, forcetype=Template)
  File "/usr/local/lib/python2.7/dist-packages/mwparserfromhell/wikicode.py", line 344, in filter
    return list(self.ifilter(recursive, matches, flags, forcetype))
  File "/usr/local/lib/python2.7/dist-packages/mwparserfromhell/wikicode.py", line 301, in ifilter
    for node in nodes:
  File "/usr/local/lib/python2.7/dist-packages/mwparserfromhell/wikicode.py", line 82, in _get_all_nodes
    for child in self._get_children(node):
  File "/usr/local/lib/python2.7/dist-packages/mwparserfromhell/wikicode.py", line 59, in _get_children
    for context, child in node.__iternodes__(self._get_all_nodes):
AttributeError: 'NoneType' object has no attribute '__iternodes__'

[2]" #REDIRECT [[Folklore]] {{R with old history|printworthy}}

{{This is a redirect|with possibilities
A folktale is a story passed down from generaqtions."

Fix parsing of {{{params}}}

Currently, {{{param}}} will be parsed as a template of {param followed by the text }. This is a problem that should be fixed before releasing the parser into the wild.

git install error

C:\Python27\setups\mwparserfromhell>setup.py install
running install
running bdist_egg
running egg_info
writing mwparserfromhell.egg-info\PKG-INFO
writing top-level names to mwparserfromhell.egg-info\top_level.txt
writing dependency_links to mwparserfromhell.egg-info\dependency_links.txt
reading manifest file 'mwparserfromhell.egg-info\SOURCES.txt'
writing manifest file 'mwparserfromhell.egg-info\SOURCES.txt'
installing library code to build\bdist.win32\egg
running install_lib
running build_py
running build_ext
building 'mwparserfromhell.parser._tokenizer' extension
error: Unable to find vcvarsall.bat

Changelog

Start a changelog covering relevant updates (bugfixes, new/modified APIs) to be read when upgrading. First cover old changes from v0.1.1, then for the new v0.2.

HTML Tags: self-closing tags not handled properly

  $ python
  Python 2.7.3 (default, Aug  1 2012, 05:14:39)
  [GCC 4.6.3] on linux2
  Type "help", "copyright", "credits" or "license" for more information.
  >>> import mwparserfromhell
  >>> from mwparserfromhell.parser.tokenizer import Tokenizer
  >>> from mwparserfromhell.parser.builder import Builder

  # Without self-closing ref tag, works
  >>> wikicode = Builder().build(Tokenizer().tokenize('I has a template!<ref name=foo>{{bar}}</ref>'))
  >>> wikicode.filter_tags()
  [u'<ref name=foo>{{bar}}</ref>']
  >>> wikicode.filter_tags(recursive=True)
  [u'<ref name=foo>{{bar}}</ref>']

  # With self-closing tag, doesn't work
  >>> wikicode = Builder().build(Tokenizer().tokenize('I has a template!<ref name=foo>{{bar}}</ref><ref name=baz/>'))
  >>> wikicode.filter_tags()
  []
  >>> wikicode.filter_text()
  [u'baz']
  >>> wikicode.filter_tags(recursive=True)
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/home/zad68/.local/lib/python2.7/site-packages/mwparserfromhell-0.2.dev-py2.7-linux-x86_64.egg/mwparserfromhell/wikicode.py", line 376, in filter_tags
      return list(self.ifilter_tags(recursive, matches, flags))
    File "/home/zad68/.local/lib/python2.7/site-packages/mwparserfromhell-0.2.dev-py2.7-linux-x86_64.egg/mwparserfromhell/wikicode.py", line 301, in ifilter
      for node in nodes:
    File "/home/zad68/.local/lib/python2.7/site-packages/mwparserfromhell-0.2.dev-py2.7-linux-x86_64.egg/mwparserfromhell/wikicode.py", line 82, in _get_all_nodes
      for child in self._get_children(node):
    File "/home/zad68/.local/lib/python2.7/site-packages/mwparserfromhell-0.2.dev-py2.7-linux-x86_64.egg/mwparserfromhell/wikicode.py", line 59, in _get_children
      for context, child in node.__iternodes__(self._get_all_nodes):
  AttributeError: 'NoneType' object has no attribute '__iternodes__'

  # Edge case with self-closing tag only:
  >>> wikicode = Builder().build(Tokenizer().tokenize('<ref name=foo/>'))
  >>> wikicode.filter_tags()
  []
  >>> wikicode.filter_text()
  [u'foo']

  # If the tag isn't "ref", different but still incorrect behavior:
  # it doesn't stack trace but doesn't work either...
  >>> wikicode = Builder().build(Tokenizer().tokenize('I has<bloop name=baz/> a template!'))
  >>> wikicode.filter_tags()
  []
  >>> wikicode.filter_tags(recursive=True)
  []
  >>>
  wikicode = Builder().build(Tokenizer().tokenize("==Epidemiology==\nFoo.<ref>hi<br />there</ref>"))
  # this looks OK:
  >>> wikicode.filter_tags()
  [u'<ref>hi<br />there</ref>']
  # but doing it recursively yields slightly different stack trace
  >>> wikicode.filter_tags(recursive=True)
  Traceback (most recent call last):
  ...
  AttributeError: 'NoneType' object has no attribute 'nodes'


template.add() malfunction?

(link to source where problem occurs)

I'm running a script that adds archiveurl and archivedate parameters to templates (linked above).

When I ran the script on a page that had multiple dead links needing fixing, this (diff) happened: everything looks fine for the first reference fixed, but in the second one, you'll notice that instead of adding parameters named "archiveurl" and "archivedate", the names of the parameters were the same as their values. This isn't right.

Was my template instance somehow corrupted? Or is there a bug?

is strip_code() implemented?

>>> import mwparserfromhell
>>> expected = "bold text"
>>> actual = mwparserfromhell.parse("'''bold text'''").strip_code(True, True)
>>> assert expected == actual, actual
Traceback (most recent call last):
  File "<input>", line 1, in <module>
AssertionError: '''bold text'''

Am I doing something wrong?
I haven't found __strip__ implemented for any of the Node subclasses.

Make pattern matches on filter functions more useful

I've found it a real problem that the matches= parameter only does a regular expression match on the whole wikitext. It is rather limited, but I don't know if proper node selection like XPath or CSS selectors is feasible.

So instead I propose using a custom filter function. The matches= would accept a function that takes the Wikicode object, and returns true if it's to be included in the returned list. This lets you build your own condition, like this:

filter_templates(matches = lambda node: node.name in ["foo", "bar"] and not node.has("lang"))

If backwards compatibility is a problem, the function can check the type and either do the function-based filtering or the regex-based filtering.

setup.py AttributeError with python-3.2

With Debian wheezy amd64 and Python 3.2, I receive an AttributeError when attempting to install. Or, is Python 3.2 not supported?

$ git clone https://github.com/earwig/mwparserfromhell.git
$ cd mwparserfromhell
$ python3 setup.py install
Traceback (most recent call last):
  File "setup.py", line 26, in <module>
    from mwparserfromhell import __version__
  File "/home/user/mwparserfromhell/mwparserfromhell/__init__.py", line 37, in <module>
    from . import (compat, definitions, nodes, parser, smart_list, string_mixin,
  File "/home/user/mwparserfromhell/mwparserfromhell/nodes/__init__.py", line 35, in <module>
    from ..string_mixin import StringMixIn
  File "/home/user/mwparserfromhell/mwparserfromhell/string_mixin.py", line 43, in <module>
    class StringMixIn(object):
  File "/home/user/mwparserfromhell/mwparserfromhell/string_mixin.py", line 129, in StringMixIn
    @inheritdoc
  File "/home/user/mwparserfromhell/mwparserfromhell/string_mixin.py", line 40, in inheritdoc
    method.__doc__ = getattr(str, method.__name__).__doc__
AttributeError: type object 'str' has no attribute 'casefold'

Templates need to distinguish between unnamed params and named param '1'

The method of addressing the first unnamed parameter with template.get(1) leads to an ambiguity when trying to parse a template that includes a named parameter "1", e.g.:

>>> x = mwparserfromhell.parse('{{foo|bar|baz|1=quux}}')
>>> x.filter_templates()
[u'{{foo|bar|baz|1=quux}}']
>>> tmpl = x.filter_templates()[0]
>>> tmpl.name
u'foo'
>>> tmpl.params
[u'bar', u'baz', u'1=quux']
>>> tmpl.get(1)
u'1=quux'
>>> tmpl.get(2)
u'baz'

This becomes an actual problem in practice when trying to parse WikiProjectBannerShell templates, which include both unnamed params and a mandatory parameter named '1'.

Great work on the parser, by the way. Legoktm put me on to it earlier this evening, and it has already made some of my bot code much clearer and more robust.

The tokenizer incorrectly handles some difficult tag-related markup

  1. Bold and italics that cross contexts are handled incorrectly, because the tree structure does not support overlapping nodes (for example, ''foo'''bar''baz''', or ''foo{{bar|baz''}}). Fixing this will probably be very difficult.
  2. Open tags that do not have a close tag before the parser reaches EOF are ignored, whereas some of them should be parsed (like bold and italics) and have some kind of "hidden close" flag set.
  3. MediaWiki counts the occurrences of ; in the block before any text and uses this as the maximum number of parsable :s after. The current implementation only allows one : regardless of how many ;s there are.
  4. MediaWiki prevents some tags from crossing certain contexts (italics and bold can't cross headings, for example) but this implementation has no such restriction.
  5. The parser only recognizes a space as the separator character between the URL and its link title in [ ] tags, but MediaWiki also accepts some other syntax (e.g. [http://example.com/''Example''] is valid).

1, 4, and 5 are high priority, whereas 2 is mid and 3 is low.

C tokenizer: use goto for error handling

The Python C API docs recommend goto (gasp) for error handling in C functions, where there's a block at the end of each function that calls Py_XDECREF() on every Python object used in that function and then returns a pre-set value (0 or 1 depending on success or failure). This is better than our current system of a bunch of Py_DECREF() calls every time a function raises an error.

Receiving RuntimeError: maximum recursion depth exceeded

I'm not sure if it's this page (it does have a lot of templates), or the fact that I'm running it multiprocessed over 32 cores (see my project https://github.com/notconfusing/mwparameterhell).

But this is what I see if I catch the runtime error:

try:
    wikicode = mwparserfromhell.parse(pagetext)
    templates = wikicode.filter_templates(recursive=True)
except RuntimeError:
    print pagetext

this is the pagetext:

<noinclude>{{Template sandbox notice}}</noinclude>
<div class="boilerplate metadata rfa" style="background-color:#FFFFF5; margin: 2em 0 0 0; padding: 0 10px 0 10px; border: 1px solid #AAAAAA;">The [[Qur'an]], [[sura|chapter]] {{#ifexpr: ({{{1|1}}} = 1) | 1 ([[Al-Fatiha]]) | 
{{#ifexpr: ({{{1|1}}} = 2) | 2 ([[Al-Baqara]]) |
{{#ifexpr: ({{{1|1}}} = 3) | 3 ([[Ali Imran]]) |
{{#ifexpr: ({{{1|1}}} = 4) | 4 ([[An-Nisa]]) |
{{#ifexpr: ({{{1|1}}} = 5) | 5 ([[Al-Ma'ida]]) |
{{#ifexpr: ({{{1|1}}} = 6) | 6 ([[Al-An'am]]) |
{{#ifexpr: ({{{1|1}}} = 7) | 7 ([[Al-A'raf]]) |
{{#ifexpr: ({{{1|1}}} = 8) | 8 ([[Al-Anfal]]) |
{{#ifexpr: ({{{1|1}}} = 9) | 9 ([[At-Tawba]]) |
{{#ifexpr: ({{{1|1}}} = 10) | 10 ([[Yunus (sura)|Yunus]]) |
{{#ifexpr: ({{{1|1}}} = 11) | 11 ([[Hud (sura)|Hud]]) |
{{#ifexpr: ({{{1|1}}} = 12) | 12 ([[Yusuf (sura)|Yusuf]]) |
{{#ifexpr: ({{{1|1}}} = 13) | 13 ([[Ar-Ra'd]]) |
{{#ifexpr: ({{{1|1}}} = 14) | 14 ([[Ibrahim (sura)|Ibrahim]]) |
{{#ifexpr: ({{{1|1}}} = 15) | 15 ([[Al-Hijr]]) |
{{#ifexpr: ({{{1|1}}} = 16) | 16 ([[An-Nahl]]) |
{{#ifexpr: ({{{1|1}}} = 17) | 17 ([[Al-Isra]]) |
{{#ifexpr: ({{{1|1}}} = 18) | 18 ([[Al-Kahf]]) |
{{#ifexpr: ({{{1|1}}} = 19) | 19 ([[Maryam (sura)|Maryam]]) |
{{#ifexpr: ({{{1|1}}} = 20) | 20 ([[Ta-Ha]]) |
{{#ifexpr: ({{{1|1}}} = 21) | 21 ([[Al-Anbiya]]) |
{{#ifexpr: ({{{1|1}}} = 22) | 22 ([[Al-Hajj]]) |
{{#ifexpr: ({{{1|1}}} = 23) | 23 ([[Al-Muminun]]) |
{{#ifexpr: ({{{1|1}}} = 24) | 24 ([[An-Noor]]) |
{{#ifexpr: ({{{1|1}}} = 25) | 25 ([[Al-Furqan]]) |
{{#ifexpr: ({{{1|1}}} = 26) | 26 ([[Ash-Shu'ara]]) |
{{#ifexpr: ({{{1|1}}} = 27) | 27 ([[An-Naml]]) |
{{#ifexpr: ({{{1|1}}} = 28) | 28 ([[Al-Qisas]]) |
{{#ifexpr: ({{{1|1}}} = 29) | 29 ([[Al-Ankabut]]) |
{{#ifexpr: ({{{1|1}}} = 30) | 30 ([[Ar-Rum]]) |
{{#ifexpr: ({{{1|1}}} = 31) | 31 ([[Luqman (sura)|Luqman]]) |
{{#ifexpr: ({{{1|1}}} = 32) | 32 ([[As-Sajda]]) |
{{#ifexpr: ({{{1|1}}} = 33) | 33 ([[Al-Ahzab]]) |
{{#ifexpr: ({{{1|1}}} = 34) | 34 ([[Saba (sura)|Saba]]) |
{{#ifexpr: ({{{1|1}}} = 35) | 35 ([[Fatir]]) |
{{#ifexpr: ({{{1|1}}} = 36) | 36 ([[Ya-Seen]]) |
{{#ifexpr: ({{{1|1}}} = 37) | 37 ([[As-Saaffat]]) |
{{#ifexpr: ({{{1|1}}} = 38) | 38 ([[Sad (sura)|Sad]]) |
{{#ifexpr: ({{{1|1}}} = 39) | 39 ([[Az-Zumar]]) |
{{#ifexpr: ({{{1|1}}} = 40) | 40 ([[Ghafir]]) |
{{#ifexpr: ({{{1|1}}} = 41) | 41 ([[Fussilat]]) |
{{#ifexpr: ({{{1|1}}} = 42) | 42 ([[Ash-Shura]]) |
{{#ifexpr: ({{{1|1}}} = 43) | 43 ([[Az-Zukhruf]]) |
{{#ifexpr: ({{{1|1}}} = 44) | 44 ([[Ad-Dukhan]]) |
{{#ifexpr: ({{{1|1}}} = 45) | 45 ([[Al-Jathiya]]) |
{{#ifexpr: ({{{1|1}}} = 46) | 46 ([[Al-Ahqaf]]) |
{{#ifexpr: ({{{1|1}}} = 47) | 47 ([[Muhammad (sura)|Muhammad]]) |
{{#ifexpr: ({{{1|1}}} = 48) | 48 ([[Al-Fath]]) |
{{#ifexpr: ({{{1|1}}} = 49) | 49 ([[Al-Hujraat]]) |
{{#ifexpr: ({{{1|1}}} = 50) | 50 ([[Qaf (sura)|Qaf]]) |
{{#ifexpr: ({{{1|1}}} = 51) | 51 ([[Adh-Dhariyat]]) |
{{#ifexpr: ({{{1|1}}} = 52) | 52 ([[At-Tur]]) |
{{#ifexpr: ({{{1|1}}} = 53) | 53 ([[An-Najm]]) |
{{#ifexpr: ({{{1|1}}} = 54) | 54 ([[Al-Qamar]]) |
{{#ifexpr: ({{{1|1}}} = 55) | 55 ([[Ar-Rahman]]) |
{{#ifexpr: ({{{1|1}}} = 56) | 56 ([[Al-Waqia]]) |
{{#ifexpr: ({{{1|1}}} = 57) | 57 ([[Al-Hadid]]) |
{{#ifexpr: ({{{1|1}}} = 58) | 58 ([[Al-Mujadila]]) |
{{#ifexpr: ({{{1|1}}} = 59) | 59 ([[Al-Hashr]]) |
{{#ifexpr: ({{{1|1}}} = 60) | 60 ([[Al-Mumtahina]]) |
{{#ifexpr: ({{{1|1}}} = 61) | 61 ([[As-Saff]]) |
{{#ifexpr: ({{{1|1}}} = 62) | 62 ([[Al-Jumua]]) |
{{#ifexpr: ({{{1|1}}} = 63) | 63 ([[Al-Munafiqoon]]) |
{{#ifexpr: ({{{1|1}}} = 64) | 64 ([[At-Taghabun]]) |
{{#ifexpr: ({{{1|1}}} = 65) | 65 ([[At-Talaq]]) |
{{#ifexpr: ({{{1|1}}} = 66) | 66 ([[At-Tahrim]]) |
{{#ifexpr: ({{{1|1}}} = 67) | 67 ([[Al-Mulk]]) |
{{#ifexpr: ({{{1|1}}} = 68) | 68 ([[Al-Qalam]]) |
{{#ifexpr: ({{{1|1}}} = 69) | 69 ([[Al-Haaqqa]]) |
{{#ifexpr: ({{{1|1}}} = 70) | 70 ([[Al-Maarij]]) |
{{#ifexpr: ({{{1|1}}} = 71) | 71 ([[Nooh (sura)|Nooh]]) |
{{#ifexpr: ({{{1|1}}} = 72) | 72 ([[Al-Jinn]]) |
{{#ifexpr: ({{{1|1}}} = 73) | 73 ([[Al-Muzzammil]]) |
{{#ifexpr: ({{{1|1}}} = 74) | 74 ([[Al-Muddaththir]]) |
{{#ifexpr: ({{{1|1}}} = 75) | 75 ([[Al-Qiyama]]) |
{{#ifexpr: ({{{1|1}}} = 76) | 76 ([[Al-Insan]]) |
{{#ifexpr: ({{{1|1}}} = 77) | 77 ([[Al-Mursalat]]) |
{{#ifexpr: ({{{1|1}}} = 78) | 78 ([[An-Naba]]) |
{{#ifexpr: ({{{1|1}}} = 79) | 79 ([[An-Naziat]]) |
{{#ifexpr: ({{{1|1}}} = 80) | 80 ([[Abasa]]) |
{{#ifexpr: ({{{1|1}}} = 81) | 81 ([[At-Takwir]]) |
{{#ifexpr: ({{{1|1}}} = 82) | 82 ([[Al-Infitar]]) |
{{#ifexpr: ({{{1|1}}} = 83) | 83 ([[Al-Mutaffifin]]) |
{{#ifexpr: ({{{1|1}}} = 84) | 84 ([[Al-Inshiqaq]]) |
{{#ifexpr: ({{{1|1}}} = 85) | 85 ([[Al-Burooj]]) |
{{#ifexpr: ({{{1|1}}} = 86) | 86 ([[At-Tariq]]) |
{{#ifexpr: ({{{1|1}}} = 87) | 87 ([[Al-Ala]]) |
{{#ifexpr: ({{{1|1}}} = 88) | 88 ([[Al-Ghashiya]]) |
{{#ifexpr: ({{{1|1}}} = 89) | 89 ([[Al-Fajr (sura)|Al-Fajr]]) |
{{#ifexpr: ({{{1|1}}} = 90) | 90 ([[Al-Balad]]) |
{{#ifexpr: ({{{1|1}}} = 91) | 91 ([[Ash-Shams]]) |
{{#ifexpr: ({{{1|1}}} = 92) | 92 ([[Al-Lail]]) |
{{#ifexpr: ({{{1|1}}} = 93) | 93 ([[Ad-Dhuha]]) |
{{#ifexpr: ({{{1|1}}} = 94) | 94 ([[Al-Inshirah]]) |
{{#ifexpr: ({{{1|1}}} = 95) | 95 ([[At-Tin]]) |
{{#ifexpr: ({{{1|1}}} = 96) | 96 ([[Al-Alaq]]) |
{{#ifexpr: ({{{1|1}}} = 97) | 97 ([[Al-Qadr]]) |
{{#ifexpr: ({{{1|1}}} = 98) | 98 ([[Al-Bayyina]]) |
{{#ifexpr: ({{{1|1}}} = 99) | 99 ([[Az-Zalzala]]) |
{{#ifexpr: ({{{1|1}}} = 100) | 100 ([[Al-Adiyat]]) |
{{#ifexpr: ({{{1|1}}} = 101) | 101 ([[Al-Qaria]]) |
{{#ifexpr: ({{{1|1}}} = 102) | 102 ([[At-Takathur]]) |
{{#ifexpr: ({{{1|1}}} = 103) | 103 ([[Al-Asr]]) |
{{#ifexpr: ({{{1|1}}} = 104) | 104 ([[Al-Humaza]]) |
{{#ifexpr: ({{{1|1}}} = 105) | 105 ([[Al-Fil]]) |
{{#ifexpr: ({{{1|1}}} = 106) | 106 ([[Quraysh (sura)|Quraysh]]) |
{{#ifexpr: ({{{1|1}}} = 107) | 107 ([[Al-Ma'un]]) |
{{#ifexpr: ({{{1|1}}} = 108) | 108 ([[Al-Kawthar]]) |
{{#ifexpr: ({{{1|1}}} = 109) | 109 ([[Al-Kafirun]]) |
{{#ifexpr: ({{{1|1}}} = 110) | 110 ([[An-Nasr]]) |
{{#ifexpr: ({{{1|1}}} = 111) | 111 ([[Al-Masadd]]) |
{{#ifexpr: ({{{1|1}}} = 112) | 112 ([[Al-Ikhlas]]) |
{{#ifexpr: ({{{1|1}}} = 113) | 113 ([[Al-Falaq]]) |
{{#ifexpr: ({{{1|1}}} = 114) | 114 ([[An-Nas]]) |
error }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }}, [[ayat|verse]] [http://www.usc.edu/dept/MSA/quran/{{three digit|{{{1|1}}}}}.qmt.html#{{three digit|{{{1|1}}}}}.{{three digit|{{{2|1}}}}} {{{2|1}}}]''':'''{{cquote| {{{3|Default text}}}&mdash; <small>[[Qur'an translations|translated]] by {{#ifexpr: ({{{4|0}}} = 0) | Unknown | 
{{#ifexpr: ({{{4|0}}} = 1) | [[Salman the Persian]] |
{{#ifexpr: ({{{4|0}}} = 101) | [[Marmaduke Pickthall]] |
{{#ifexpr: ({{{4|0}}} = 102) | [[Abdullah Yusuf Ali]] |
{{#ifexpr: ({{{4|0}}} = 601) | [[Muhammad Muhsin Khan]] |
{{#ifexpr: ({{{4|0}}} = 701) | [[Mohammed Habib Shakir|M. H. Shakir]] |
{{#ifexpr: ({{{4|0}}} = 901) | [[Maulana Muhammad Ali]] |
{{#ifexpr: ({{{4|0}}} = 902) | [[Rashad Khalifa]] |
{{#ifexpr: ({{{4|0}}} = 1001) | [[Theodor Bibliander]] |
{{#ifexpr: ({{{4|0}}} = 1002) | [[Robert of Ketton]] |
{{#ifexpr: ({{{4|0}}} = 1003) | [[Andre du Ryer]] |
{{#ifexpr: ({{{4|0}}} = 1004) | [[Alexander Ross (writer)|Alexander Ross]] |
{{#ifexpr: ({{{4|0}}} = 1005) | [[Abraham Hinckelmann]] |
{{#ifexpr: ({{{4|0}}} = 1006) | [[George Sale]] |
{{#ifexpr: ({{{4|0}}} = 1007) | [[John Medows Rodwell]] |
{{#ifexpr: ({{{4|0}}} = 1008) | [[Arthur John Arberry]] |
error }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }}</small>
{{#if:{{{trans|}}}|
----
[[Transliteration]]: {{{trans}}}| }}
{{#if:{{{arab|}}}|
----
[[Arabic language|Arabic]]: {{{arab}}}| }} }}</font></div>
