
earwig / mwparserfromhell


A Python parser for MediaWiki wikicode

Home Page: https://mwparserfromhell.readthedocs.io/

License: MIT License

Python 66.81% Shell 0.74% C 32.12% Batchfile 0.33%
python parser mediawiki wikipedia

mwparserfromhell's Introduction

mwparserfromhell


mwparserfromhell (the MediaWiki Parser from Hell) is a Python package that provides an easy-to-use and outrageously powerful parser for MediaWiki wikicode. It supports Python 3.8+.

Developed by Earwig with contributions from Σ, Legoktm, and others. Full documentation is available on ReadTheDocs. Development occurs on GitHub.

Installation

The easiest way to install the parser is through the Python Package Index; you can install the latest release with pip install mwparserfromhell (get pip). Make sure your pip is up-to-date first, especially on Windows.

Alternatively, get the latest development version:

git clone https://github.com/earwig/mwparserfromhell.git
cd mwparserfromhell
python setup.py install

The comprehensive unit testing suite requires pytest (pip install pytest) and can be run with python -m pytest.

Usage

Normal usage is rather straightforward (where text is page text):

>>> import mwparserfromhell
>>> wikicode = mwparserfromhell.parse(text)

wikicode is an mwparserfromhell.Wikicode object, which acts like an ordinary str object with some extra methods. For example:

>>> text = "I has a template! {{foo|bar|baz|eggs=spam}} See it?"
>>> wikicode = mwparserfromhell.parse(text)
>>> print(wikicode)
I has a template! {{foo|bar|baz|eggs=spam}} See it?
>>> templates = wikicode.filter_templates()
>>> print(templates)
['{{foo|bar|baz|eggs=spam}}']
>>> template = templates[0]
>>> print(template.name)
foo
>>> print(template.params)
['bar', 'baz', 'eggs=spam']
>>> print(template.get(1).value)
bar
>>> print(template.get("eggs").value)
spam

Since nodes can contain other nodes, getting nested templates is trivial:

>>> text = "{{foo|{{bar}}={{baz|{{spam}}}}}}"
>>> mwparserfromhell.parse(text).filter_templates()
['{{foo|{{bar}}={{baz|{{spam}}}}}}', '{{bar}}', '{{baz|{{spam}}}}', '{{spam}}']

You can also pass recursive=False to filter_templates() and explore templates manually. This is possible because nodes can contain additional Wikicode objects:

>>> code = mwparserfromhell.parse("{{foo|this {{includes a|template}}}}")
>>> print(code.filter_templates(recursive=False))
['{{foo|this {{includes a|template}}}}']
>>> foo = code.filter_templates(recursive=False)[0]
>>> print(foo.get(1).value)
this {{includes a|template}}
>>> print(foo.get(1).value.filter_templates()[0])
{{includes a|template}}
>>> print(foo.get(1).value.filter_templates()[0].get(1).value)
template

Templates can be easily modified to add, remove, or alter params. Wikicode objects can be treated like lists, with append(), insert(), remove(), replace(), and more. They also have a matches() method for comparing page or template names, which takes care of capitalization and whitespace:

>>> text = "{{cleanup}} '''Foo''' is a [[bar]]. {{uncategorized}}"
>>> code = mwparserfromhell.parse(text)
>>> for template in code.filter_templates():
...     if template.name.matches("Cleanup") and not template.has("date"):
...         template.add("date", "July 2012")
...
>>> print(code)
{{cleanup|date=July 2012}} '''Foo''' is a [[bar]]. {{uncategorized}}
>>> code.replace("{{uncategorized}}", "{{bar-stub}}")
>>> print(code)
{{cleanup|date=July 2012}} '''Foo''' is a [[bar]]. {{bar-stub}}
>>> print(code.filter_templates())
['{{cleanup|date=July 2012}}', '{{bar-stub}}']

You can then convert code back into a regular str object (for saving the page!) by calling str() on it:

>>> text = str(code)
>>> print(text)
{{cleanup|date=July 2012}} '''Foo''' is a [[bar]]. {{bar-stub}}
>>> text == code
True

Limitations

While the MediaWiki parser generates HTML and has access to the contents of templates, among other things, mwparserfromhell acts as a direct interface to the source code only. This has several implications:

  • Syntax elements produced by a template transclusion cannot be detected. For example, imagine a hypothetical page "Template:End-bold" that contained the text </b>. While MediaWiki would correctly understand that <b>foobar{{end-bold}} translates to <b>foobar</b>, mwparserfromhell has no way of examining the contents of {{end-bold}}. Instead, it would treat the bold tag as unfinished, possibly extending further down the page.

  • Templates adjacent to external links, as in http://example.com{{foo}}, are considered part of the link. In reality, this would depend on the contents of the template.

  • When different syntax elements cross over each other, as in {{echo|''Hello}}, world!'', the parser gets confused because this cannot be represented by an ordinary syntax tree. Instead, the parser will treat the first syntax construct as plain text. In this case, only the italic tag would be properly parsed.

    Workaround: Since this commonly occurs with text formatting and text formatting is often not of interest to users, you may pass skip_style_tags=True to mwparserfromhell.parse(). This treats '' and ''' as plain text.

    A future version of mwparserfromhell may include multiple parsing modes to get around this restriction more sensibly.

Additionally, the parser lacks awareness of certain wiki-specific settings:

  • Word-ending links are not supported, since the linktrail rules are language-specific.
  • Localized namespace names aren't recognized, so file links (such as [[File:...]]) are treated as regular wikilinks.
  • Anything that looks like an XML tag is treated as a tag, even if it is not a recognized tag name, since the list of valid tags depends on loaded MediaWiki extensions.

Integration

mwparserfromhell is used by and originally developed for EarwigBot; Page objects have a parse method that essentially calls mwparserfromhell.parse() on page.get().

If you're using Pywikibot, your code might look like this:

import mwparserfromhell
import pywikibot

def parse(title):
    site = pywikibot.Site()
    page = pywikibot.Page(site, title)
    text = page.get()
    return mwparserfromhell.parse(text)

If you're not using a library, you can parse any page with the following Python 3 code (using the API and the requests library):

import mwparserfromhell
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def parse(title):
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "rvlimit": 1,
        "titles": title,
        "format": "json",
        "formatversion": "2",
    }
    headers = {"User-Agent": "My-Bot-Name/1.0"}
    req = requests.get(API_URL, headers=headers, params=params)
    res = req.json()
    revision = res["query"]["pages"][0]["revisions"][0]
    text = revision["slots"]["main"]["content"]
    return mwparserfromhell.parse(text)

mwparserfromhell's People

Contributors

andrewwang43, anticompositenumber, bryghtshadow, davidebbo, davidswinegar, declerambaul, earwig, eseiver, hugovk, jayvdb, kishorkunal-raj, lahwaacz, larivact, legoktm, nyurik, odidev, r-barnes, reinerh, ricordisamoa, valhallasw, yuvipanda


mwparserfromhell's Issues

List tags should include the item text in their contents

When parsing ordered or unordered lists, the actual text in the list item is returned as a series of separate nodes of various types following the "Tag" node, all belonging to the same parent. For example:

# this {{is}}
# a [[test]]

filter(recursive=False) will give these node types (text value in brackets): Tag (#), Text ( this ), Template ({{is}}), Text (\n), Tag (#), Text ( a ), Wikilink ([[test]]), Text (\n)

Note that the last Text node will contain the newline that actually ends the list item. The parser should parse this the same way, by including everything on that line as part of the Tag node's .contents property. So this should really return: Tag (# this {{is}}\n), Tag (# a [[test]]\n)

Similar behaviour should also apply to unordered lists (*) and to definition lists (; and :).

Allow some methods of Wikicode (like remove()) to be passed other Wikicode objects

Reported by @HazardSJ:

a.remove(b) (given a and b as Wikicode objects, b as a subset of a):

  • expected: all nodes within b are removed from a
  • actual: ValueError is raised

Note: the docs are clear that remove() doesn't accept a Wikicode object, but the functionality is useful when sections are passed. Therefore, the method should be careful to iterate correctly over b's nodes.

Not detecting all templates on [[Talk:Lost Girls]]

legoktm@localhost:~$ python
Python 2.7.3 (default, Jul 29 2012, 23:31:23) 
[GCC 4.2.1 Compatible Apple Clang 4.0 ((tags/Apple/clang-421.0.57))] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pywikibot, mwparserfromhell
>>> p=pywikibot.Page(pywikibot.Site(), 'Talk:Lost Girls')
>>> text = p.get()
>>> code = mwparserfromhell.parse(text)
>>> code.filter_templates()
[u'{{Oz-project}}', u'{{t1|Supercbbox}}']

It's missed the following template:
{{comicsproj}}

I'm using this version of the page. And I'm using mwparserfromhell 0.1 from pypi.

Templates inside section headings

Given the following code:

import mwparserfromhell as mwp
wikitext = u"== {{User:SuggestBot/suggest}} =="
code = mwp.parse(wikitext)
templates = code.filter_templates()

I then find that templates is an empty list.

I would suspect the template inside a heading to be valid markup, but maybe I'm mistaken? Looking around in various MediaWiki help pages did not suggest it wasn't OK though.

C tokenizer: emit tokens simpler than the expensive PyObject* kind

Allocating and filling the slots of PyObject*s every time we create a token (even if it is later discarded) is a large overhead; ideally, we use custom structs for each token that have the appropriate attributes.

The parser will then have to either (1) convert these tokens to PyObject*s at the end of parsing, (2) wrap them in some kind of capsule, or (3) write a C port of the builder too that uses these new tokens.

(3) is ultimately the fastest solution, but it's the most work and, since regular Python tokens are never generated, we will need a new way to run C tokenizer test cases.

Ignores outer nested template in presence of http://

This simple example (based on an actual page I was troubleshooting) fails:

{{test
| website             = {{URL|http://x.com}}
}}

filter_templates() produces:

[u'{{URL|http://x.com}}\n}}']

For a very similar example (lacking only the 'http://'):

{{test
| website             = {{URL|x.com}}
}}

filter_templates() correctly produces

[u'{{test\n| website             = {{URL|x.com}}\n}}', u'{{URL|x.com}}']

HTML entities should be valid within certain parser-blacklisted tags

  • Given: <pre>&nbsp;</pre>
  • Expected:
    [TagOpenOpen(showtag=True), Text(text="pre"), TagCloseOpen(), HTMLEntityStart(), Text(text="nbsp"), HTMLEntityEnd(), TagOpenClose(), Text(text="pre"), TagCloseClose()]
  • Actual:
    [TagOpenOpen(showtag=True), Text(text="pre"), TagCloseOpen(), Text(text="&nbsp;"), TagOpenClose(), Text(text="pre"), TagCloseClose()]

py3k stuff

Step 1: from __future__ import unicode_literals in all files.

Step 2: from . import compatibility (or from . import compat) – uses str and bytes on Python 3, and str = unicode and bytes = str on Python 2.

setup.py AttributeError with python-2.6.8

Similar to issue #44 it seems that 37003d2 changes line 13 of compat.py

causing

python setup.py install
Traceback (most recent call last):
  File "setup.py", line 26, in <module>
    from mwparserfromhell import __version__
  File "/home/ec2-user/codetree/master/rrb/snpedia/bots/mwparserfromhell/mwparserfromhell/__init__.py", line 37, in <module>
    from . import (compat, definitions, nodes, parser, smart_list, string_mixin,
  File "/home/ec2-user/codetree/master/rrb/snpedia/bots/mwparserfromhell/mwparserfromhell/compat.py", line 13, in <module>
    py3k = sys.version_info.major == 3
AttributeError: 'tuple' object has no attribute 'major'

under

Python 2.6.8 (unknown, Mar 14 2013, 09:31:22)
[GCC 4.6.2 20111027 (Red Hat 4.6.2-2)] on linux2

It seemed to like the previous structure
sys.version_info[0]
found in 53c2658
but not the
sys.version_info.major
found in 37003d2
as that named syntax was introduced in 2.7

segmentation fault (Python Interpreter Crashes)

Hello Everyone,

I am getting a segmentation fault while parsing some wiki pages.
It is very easily reproducible.
I am using Python 2.6.6.
To reproduce this, fetch the wiki page of 'Albert Einstein' and call mwparserfromhell.parse() with the text of that page.
I have observed the segmentation fault with some other pages as well.
I use the wiki API to fetch the page of Albert Einstein in XML format, then parse the XML content to get the
text of the page, and then call mwparserfromhell.parse with that text.
I have successfully used mwparserfromhell to parse hundreds of pages.
Let me know if you cannot reproduce this.

Thanks,
Ashwin

parse() crashes python

text = page.get()
wikicode = mwparserfromhell.parse(text)
references = wikicode.filter_templates()

I get a popup message at the parse() line stating that python has stopped working.
Python: 2.7.3 (default, Apr 10 2012, 23:31:26) [MSC v.1500 32 bit (Intel)]

I just did a pip uninstall and reinstalled with the most current version, and it is now non-functioning.

Template rejected by parser

I've come across this little gem here:

{{Infobox Platz
| Name=Strausberger Platz
| Alternativnamen=
| Stadtwappen=Coat of arms of Berlin.svg
| Kategorie=Platz in Berlin
| Bild=Strausberger Platz Berlin April 2006 109.jpg|miniatur
| Bild zeigt=Der Platz in Richtung Westen gesehen
| Ort=Berlin
| Ortsteil=[[Berlin-Friedrichshain]]
| Angelegt=1967
| Neugestaltet=
| Straßen=<br />Lichtenberger Straße, [[Karl-Marx-Allee]]
| Bauwerke=„Haus Berlin“
| Nutzergruppen=[[Fußgänger]], [[Radfahrer]], [[Auto]]
| Platzgestaltung=
| Baukosten=
}}

It's from here: http://de.wikipedia.org/w/index.php?title=Strausberger_Platz&oldid=112496475

As soon as the line
Bild=Strausberger Platz Berlin April 2006 109.jpg|miniatur
is added to the template, it is rejected and a Text node is created instead.
Even though this markup doesn't make sense, I would expect a Template node with a value-less parameter named "miniatur" instead of a Text node.

Template problems with empty parameters

This one is a real issue, I promise. Occasionally you run across a template with an empty final unnamed parameter, e.g.

{{foo|}}

The Template node handles empty params oddly:

>>> wikicode = mwparserfromhell.parse('{{foo|}}')
>>> tmpls = wikicode.filter_templates()
>>> tmpls[0]
u'{{foo|}}'
>>> tmpls[0].get(1)
u''
>>> tmpls[0].add(1, 'bar')
u'1=bar'
>>> tmpls[0]
u'{{foo||1=bar}}'

You can work around this by explicitly removing the empty param:

>>> wikicode = mwparserfromhell.parse('{{foo|}}')
>>> tmpls = wikicode.filter_templates()
>>> tmpls[0].remove(1)
>>> tmpls[0].add(1, 'bar')
u'bar'
>>> tmpls[0]
u'{{foo|bar}}'

Similar things happen with named params with empty values, e.g. {{foo|baz=}}. This seems to me like a bug. If it is, I can certainly take a crack at a fix.

mwparserfromhell version 0.3.2 installed via pip on Ubuntu 12.04.3.

HTML Tags: Doesn't parse ref tag name parameters with double-quotes and hyphen

Hey Earwig, try testing this:
wikicode = Builder().build(Tokenizer().tokenize('<ref name="a-b">'))
Error I get is: in my local tokenizer.py, line 472, in _actually_close_tag_opening:
if isinstance(self._stack[-1], tokens.TagAttrStart): IndexError: list index out of range
Only seems to occur when: 1) It's a ref tag, 2) name parameter is specified and has a value with certain characters in it, like - (hyphen) or = (equals), 3) the name parameter value is in double-quote. Bug? Zad68 14:10, 14 March 2013 (UTC)


Problem with iteration over slice of nodes

>>> wikicode = mwparserfromhell.parse(text)
>>> x = [node for node in wikicode.nodes[:10]]
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "<input>", line 1, in <listcomp>
  File "/home/bunyk/python/wikibot/lib/python3.2/site-packages/mwparserfromhell/smart_list.py", line 257, in __iter__
    while i < self._stop:
TypeError: unorderable types: NoneType() < int()

Issue with templates

Using 0.1.1 with python 2.7.
With this markup (simplified from http://en.wikipedia.org/w/index.php?action=raw&title=Applied%20Materials) :

{{Infobox company
| revenue          = {{nowrap|{{decrease}}
<ref name="Applied Materials, Inc.-10-K">{{cite web|url=http://pdf.secdatabase.com/1278/0000006951-12-000018.pdf |title=Applied Materials, Inc. Inc 2012 Annual Report, Form 10-K, Filing Date December 5, 2012 |publisher=secdatabase.com |accessdate =Dec 26, 2012}}</ref>}}
}}

I cannot get the template "Infobox company" from
mwparserfromhell.parse(test).filter_templates()[0].name
If I remove the newline between the {{decrease}} and the ref, it works.

C tokenizer: use Python 3's new Unicode APIs

Instead of turning self->text into a tuple of one-char Unicode strings, we should keep the original Unicode object and use the various (new) accessor methods to work with it. This is a fairly fundamental change to the way C parsing works, so I'm leaving it for v0.4.

LaTex in Wikitext, ignore <math> tags?

I seem to encounter performance issues on pages that have math in them. For instance: http://en.wikipedia.org/wiki/Spence%27s_function

Lines like

 :<math>\operatorname{Li}_2(z)+\operatorname{Li}_2(1-z)=\frac{{\pi}^2}{6}-\ln z \cdot\ln(1-z) </math>

correctly aren't recognized as templates, but take a very long time to not be recognized.

Best way forward: institute an ignoreMath=True option on filter_templates()?

Speed up parsing by storing intermediate tokens

There are certain situations where we rapidly construct and destroy and then have to reconstruct the same token stack when hitting bad routes. These are rare, but act as a major slowdown when they do occur, potentially hitting the parse cycle limit. Ideally, when failing a route, we store the tokenization of complete nodes in some kind of cache which associates them with the starting/ending head locations and the context. This cache can then be popped from if we reach that head location again with the same context, allowing us to bypass regular parsing.

Installation error

As I mentioned on IRC, I had a problem installing via the version from git (installed successfully via pip). While attempting to use the feature/html_tags branch, I ran setup.py install, but among other things, it returned:
mwparserfromhell error

IIRC, this was the same error I got before after doing a simple clone and attempting an install. Until this is fixed, back to pip I go (too bad tag filtering doesn't work there).

Strip whitespace from template names

Given the following MediaWiki code:

{{User:Foo/Bar
|param1=foo|param2=bar|param3=baz}}

The given template's name is "u'User:Foo/Bar\n'"

Is that a feature or a bug? I would have expected the newline to be stripped from the name.

Crashing on some malformed data

Hello,

I get an AttributeError that 'NoneType' object has no attribute '__iternodes__' [1].
Which is fair, but why isn't there a check for the None type here? What is the most elegant way to handle this?

I should note Wikicode that this is being called on is malformed as well [2]. It's from a wikipedia dump.

[1]
  File "xmlToOCLCNum.py", line 38, in findOCLCNums
    templates = wikicode.filter_templates(recursive=True)
  File "/usr/local/lib/python2.7/dist-packages/mwparserfromhell/wikicode.py", line 360, in filter_templates
    return list(self.ifilter_templates(recursive, matches, flags))
  File "/usr/local/lib/python2.7/dist-packages/mwparserfromhell/wikicode.py", line 320, in ifilter_templates
    return self.filter(recursive, matches, flags, forcetype=Template)
  File "/usr/local/lib/python2.7/dist-packages/mwparserfromhell/wikicode.py", line 344, in filter
    return list(self.ifilter(recursive, matches, flags, forcetype))
  File "/usr/local/lib/python2.7/dist-packages/mwparserfromhell/wikicode.py", line 301, in ifilter
    for node in nodes:
  File "/usr/local/lib/python2.7/dist-packages/mwparserfromhell/wikicode.py", line 82, in _get_all_nodes
    for child in self._get_children(node):
  File "/usr/local/lib/python2.7/dist-packages/mwparserfromhell/wikicode.py", line 59, in _get_children
    for context, child in node.__iternodes__(self._get_all_nodes):
AttributeError: 'NoneType' object has no attribute '__iternodes__'

[2]" #REDIRECT [[Folklore]] {{R with old history|printworthy}}

{{This is a redirect|with possibilities
A folktale is a story passed down from generaqtions."

Fix parsing of {{{params}}}

Currently, {{{param}}} will be parsed as a template of {param followed by the text }. This is a problem that should be fixed before releasing the parser into the wild.

git install error

C:\Python27\setups\mwparserfromhell>setup.py install
running install
running bdist_egg
running egg_info
writing mwparserfromhell.egg-info\PKG-INFO
writing top-level names to mwparserfromhell.egg-info\top_level.txt
writing dependency_links to mwparserfromhell.egg-info\dependency_links.txt
reading manifest file 'mwparserfromhell.egg-info\SOURCES.txt'
writing manifest file 'mwparserfromhell.egg-info\SOURCES.txt'
installing library code to build\bdist.win32\egg
running install_lib
running build_py
running build_ext
building 'mwparserfromhell.parser._tokenizer' extension
error: Unable to find vcvarsall.bat

Changelog

Start a changelog covering relevant updates (bugfixes, new/modified APIs) to be read when upgrading. First cover old changes from v0.1.1, then for the new v0.2.

HTML Tags: self-closing tags not handled properly

  $ python
  Python 2.7.3 (default, Aug  1 2012, 05:14:39)
  [GCC 4.6.3] on linux2
  Type "help", "copyright", "credits" or "license" for more information.
  >>> import mwparserfromhell
  >>> from mwparserfromhell.parser.tokenizer import Tokenizer
  >>> from mwparserfromhell.parser.builder import Builder

  # Without self-closing ref tag, works
  >>> wikicode = Builder().build(Tokenizer().tokenize('I has a template!<ref name=foo>{{bar}}</ref>'))
  >>> wikicode.filter_tags()
  [u'<ref name=foo>{{bar}}</ref>']
  >>> wikicode.filter_tags(recursive=True)
  [u'<ref name=foo>{{bar}}</ref>']

  # With self-closing tag, doesn't work
  >>> wikicode = Builder().build(Tokenizer().tokenize('I has a template!<ref name=foo>{{bar}}</ref><ref name=baz/>'))
  >>> wikicode.filter_tags()
  []
  >>> wikicode.filter_text()
  [u'baz']
  >>> wikicode.filter_tags(recursive=True)
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/home/zad68/.local/lib/python2.7/site-packages/mwparserfromhell-0.2.dev-py2.7-linux-x86_64.egg/mwparserfromhell/wikicode.py", line 376, in filter_tags
      return list(self.ifilter_tags(recursive, matches, flags))
    File "/home/zad68/.local/lib/python2.7/site-packages/mwparserfromhell-0.2.dev-py2.7-linux-x86_64.egg/mwparserfromhell/wikicode.py", line 301, in ifilter
      for node in nodes:
    File "/home/zad68/.local/lib/python2.7/site-packages/mwparserfromhell-0.2.dev-py2.7-linux-x86_64.egg/mwparserfromhell/wikicode.py", line 82, in _get_all_nodes
      for child in self._get_children(node):
    File "/home/zad68/.local/lib/python2.7/site-packages/mwparserfromhell-0.2.dev-py2.7-linux-x86_64.egg/mwparserfromhell/wikicode.py", line 59, in _get_children
      for context, child in node.__iternodes__(self._get_all_nodes):
  AttributeError: 'NoneType' object has no attribute '__iternodes__'

  # Edge case with self-closing tag only:
  >>> wikicode = Builder().build(Tokenizer().tokenize('<ref name=foo/>'))
  >>> wikicode.filter_tags()
  []
  >>> wikicode.filter_text()
  [u'foo']

  # If the tag isn't "ref", different but still incorrect behavior:
  # it doesn't stack trace but doesn't work either...
  >>> wikicode = Builder().build(Tokenizer().tokenize('I has<bloop name=baz/> a template!'))
  >>> wikicode.filter_tags()
  []
  >>> wikicode.filter_tags(recursive=True)
  []
  >>>
  wikicode = Builder().build(Tokenizer().tokenize("==Epidemiology==\nFoo.<ref>hi<br />there</ref>"))
  # this looks OK:
  >>> wikicode.filter_tags()
  [u'<ref>hi<br />there</ref>']
  # but doing it recursively yields slightly different stack trace
  >>> wikicode.filter_tags(recursive=True)
  Traceback (most recent call last):
  ...
  AttributeError: 'NoneType' object has no attribute 'nodes'


template.add() malfunction?

(link to source where problem occurs)

I'm running a script that adds archiveurl and archivedate parameters to templates (linked above).

When I ran the script on a page that had multiple dead links needing fixing, this (diff) happened: everything looks fine for the first reference fixed, but in the second one, you'll notice that instead of adding parameters named "archiveurl" and "archivedate", the names of the parameters were the same as their values. This isn't right.

Was my template instance somehow corrupted? Or is there a bug?

is strip_code() implemented?

>>> import mwparserfromhell
>>> expected = "bold text"
>>> actual = mwparserfromhell.parse("'''bold text'''").strip_code(True, True)
>>> assert expected == actual, actual
Traceback (most recent call last):
  File "<input>", line 1, in <module>
AssertionError: '''bold text'''

Am I doing something wrong?
I haven't found __strip__ implemented for any of the Node subclasses.

Make pattern matches on filter functions more useful

I've found it a real problem that the matches= parameter only does a regular expression match on the whole wikitext. It is rather limited, but I don't know if proper node selection like XPath or CSS selectors is feasible.

So instead I propose using a custom filter function. The matches= would accept a function that takes the Wikicode object, and returns true if it's to be included in the returned list. This lets you build your own condition, like this:

filter_templates(matches = lambda node: node.name in ["foo", "bar"] and not node.has("lang"))

If backwards compatibility is a problem, the function can check the type and either do the function-based filtering or the regex-based filtering.

setup.py AttributeError with python-3.2

With Debian wheezy amd64 and Python 3.2, I receive an AttributeError when attempting to install. Or, is Python 3.2 not supported?

$ git clone https://github.com/earwig/mwparserfromhell.git
$ cd mwparserfromhell
$ python3 setup.py install
Traceback (most recent call last):
  File "setup.py", line 26, in <module>
    from mwparserfromhell import __version__
  File "/home/user/mwparserfromhell/mwparserfromhell/__init__.py", line 37, in <module>
    from . import (compat, definitions, nodes, parser, smart_list, string_mixin,
  File "/home/user/mwparserfromhell/mwparserfromhell/nodes/__init__.py", line 35, in <module>
    from ..string_mixin import StringMixIn
  File "/home/user/mwparserfromhell/mwparserfromhell/string_mixin.py", line 43, in <module>
    class StringMixIn(object):
  File "/home/user/mwparserfromhell/mwparserfromhell/string_mixin.py", line 129, in StringMixIn
    @inheritdoc
  File "/home/user/mwparserfromhell/mwparserfromhell/string_mixin.py", line 40, in inheritdoc
    method.__doc__ = getattr(str, method.__name__).__doc__
AttributeError: type object 'str' has no attribute 'casefold'

Templates need to distinguish between unnamed params and named param '1'

The method of addressing the first unnamed parameter with template.get(1) leads to an ambiguity when trying to parse a template that includes a named parameter "1", e.g.:

>>> x = mwparserfromhell.parse('{{foo|bar|baz|1=quux}}')
>>> x.filter_templates()
[u'{{foo|bar|baz|1=quux}}']
>>> tmpl = x.filter_templates()[0]
>>> tmpl.name
u'foo'
>>> tmpl.params
[u'bar', u'baz', u'1=quux']
>>> tmpl.get(1)
u'1=quux'
>>> tmpl.get(2)
u'baz'

This becomes an actual problem in practice when trying to parse WikiProjectBannerShell templates, which include both unnamed params and a mandatory parameter named '1'.

Great work on the parser, by the way. Legoktm put me on to it earlier this evening, and it has already made some of my bot code much clearer and more robust.

The tokenizer incorrectly handles some difficult tag-related markup

  1. Bold and italics that cross contexts are handled incorrectly, because the tree structure does not support overlapping nodes (for example, ''foo'''bar''baz''', or ''foo{{bar|baz''}}). Fixing this will probably be very difficult.
  2. Open tags that do not have a close tag before the parser reaches EOF are ignored, whereas some of them should be parsed (like bold and italics) and have some kind of "hidden close" flag set.
  3. MediaWiki counts the occurrences of ; in the block before any text and uses this as the maximum number of parsable :s after. The current implementation only allows one : regardless of how many ;s there are.
  4. MediaWiki prevents some tags from crossing certain contexts (italics and bold can't cross headings, for example) but this implementation has no such restriction.
  5. The parser only recognizes a space as the separator character between the URL and its link title in [ ] tags, but MediaWiki also accepts some other syntax (e.g. [http://example.com/''Example''] is valid).

1, 4, and 5 are high priority, whereas 2 is mid and 3 is low.

C tokenizer: use goto for error handling

The Python C API docs recommend goto (gasp) for error handling in C functions, where there's a block at the end of each function that calls Py_XDECREF() on every Python object used in that function and then returns a pre-set value (0 or 1 depending on success or failure). This is better than our current system of a bunch of Py_DECREF() calls every time a function raises an error.

Receiving RuntimeError: maximum recursion depth exceeded

I'm not sure if it's this page (it does have a lot of templates), or the fact that I'm running it multiprocessed over 32 cores (see my project https://github.com/notconfusing/mwparameterhell).

But this is what I see if I catch the runtime error:

try:
    wikicode = mwparserfromhell.parse(pagetext)
    templates = wikicode.filter_templates(recursive=True)
except RuntimeError:
    print pagetext

this is the pagetext:

<noinclude>{{Template sandbox notice}}</noinclude>
<div class="boilerplate metadata rfa" style="background-color:#FFFFF5; margin: 2em 0 0 0; padding: 0 10px 0 10px; border: 1px solid #AAAAAA;">The [[Qur'an]], [[sura|chapter]] {{#ifexpr: ({{{1|1}}} = 1) | 1 ([[Al-Fatiha]]) | 
{{#ifexpr: ({{{1|1}}} = 2) | 2 ([[Al-Baqara]]) |
{{#ifexpr: ({{{1|1}}} = 3) | 3 ([[Ali Imran]]) |
{{#ifexpr: ({{{1|1}}} = 4) | 4 ([[An-Nisa]]) |
{{#ifexpr: ({{{1|1}}} = 5) | 5 ([[Al-Ma'ida]]) |
{{#ifexpr: ({{{1|1}}} = 6) | 6 ([[Al-An'am]]) |
{{#ifexpr: ({{{1|1}}} = 7) | 7 ([[Al-A'raf]]) |
{{#ifexpr: ({{{1|1}}} = 8) | 8 ([[Al-Anfal]]) |
{{#ifexpr: ({{{1|1}}} = 9) | 9 ([[At-Tawba]]) |
{{#ifexpr: ({{{1|1}}} = 10) | 10 ([[Yunus (sura)|Yunus]]) |
{{#ifexpr: ({{{1|1}}} = 11) | 11 ([[Hud (sura)|Hud]]) |
{{#ifexpr: ({{{1|1}}} = 12) | 12 ([[Yusuf (sura)|Yusuf]]) |
{{#ifexpr: ({{{1|1}}} = 13) | 13 ([[Ar-Ra'd]]) |
{{#ifexpr: ({{{1|1}}} = 14) | 14 ([[Ibrahim (sura)|Ibrahim]]) |
{{#ifexpr: ({{{1|1}}} = 15) | 15 ([[Al-Hijr]]) |
{{#ifexpr: ({{{1|1}}} = 16) | 16 ([[An-Nahl]]) |
{{#ifexpr: ({{{1|1}}} = 17) | 17 ([[Al-Isra]]) |
{{#ifexpr: ({{{1|1}}} = 18) | 18 ([[Al-Kahf]]) |
{{#ifexpr: ({{{1|1}}} = 19) | 19 ([[Maryam (sura)|Maryam]]) |
{{#ifexpr: ({{{1|1}}} = 20) | 20 ([[Ta-Ha]]) |
{{#ifexpr: ({{{1|1}}} = 21) | 21 ([[Al-Anbiya]]) |
{{#ifexpr: ({{{1|1}}} = 22) | 22 ([[Al-Hajj]]) |
{{#ifexpr: ({{{1|1}}} = 23) | 23 ([[Al-Muminun]]) |
{{#ifexpr: ({{{1|1}}} = 24) | 24 ([[An-Noor]]) |
{{#ifexpr: ({{{1|1}}} = 25) | 25 ([[Al-Furqan]]) |
{{#ifexpr: ({{{1|1}}} = 26) | 26 ([[Ash-Shu'ara]]) |
{{#ifexpr: ({{{1|1}}} = 27) | 27 ([[An-Naml]]) |
{{#ifexpr: ({{{1|1}}} = 28) | 28 ([[Al-Qisas]]) |
{{#ifexpr: ({{{1|1}}} = 29) | 29 ([[Al-Ankabut]]) |
{{#ifexpr: ({{{1|1}}} = 30) | 30 ([[Ar-Rum]]) |
{{#ifexpr: ({{{1|1}}} = 31) | 31 ([[Luqman (sura)|Luqman]]) |
{{#ifexpr: ({{{1|1}}} = 32) | 32 ([[As-Sajda]]) |
{{#ifexpr: ({{{1|1}}} = 33) | 33 ([[Al-Ahzab]]) |
{{#ifexpr: ({{{1|1}}} = 34) | 34 ([[Saba (sura)|Saba]]) |
{{#ifexpr: ({{{1|1}}} = 35) | 35 ([[Fatir]]) |
{{#ifexpr: ({{{1|1}}} = 36) | 36 ([[Ya-Seen]]) |
{{#ifexpr: ({{{1|1}}} = 37) | 37 ([[As-Saaffat]]) |
{{#ifexpr: ({{{1|1}}} = 38) | 38 ([[Sad (sura)|Sad]]) |
{{#ifexpr: ({{{1|1}}} = 39) | 39 ([[Az-Zumar]]) |
{{#ifexpr: ({{{1|1}}} = 40) | 40 ([[Ghafir]]) |
{{#ifexpr: ({{{1|1}}} = 41) | 41 ([[Fussilat]]) |
{{#ifexpr: ({{{1|1}}} = 42) | 42 ([[Ash-Shura]]) |
{{#ifexpr: ({{{1|1}}} = 43) | 43 ([[Az-Zukhruf]]) |
{{#ifexpr: ({{{1|1}}} = 44) | 44 ([[Ad-Dukhan]]) |
{{#ifexpr: ({{{1|1}}} = 45) | 45 ([[Al-Jathiya]]) |
{{#ifexpr: ({{{1|1}}} = 46) | 46 ([[Al-Ahqaf]]) |
{{#ifexpr: ({{{1|1}}} = 47) | 47 ([[Muhammad (sura)|Muhammad]]) |
{{#ifexpr: ({{{1|1}}} = 48) | 48 ([[Al-Fath]]) |
{{#ifexpr: ({{{1|1}}} = 49) | 49 ([[Al-Hujraat]]) |
{{#ifexpr: ({{{1|1}}} = 50) | 50 ([[Qaf (sura)|Qaf]]) |
{{#ifexpr: ({{{1|1}}} = 51) | 51 ([[Adh-Dhariyat]]) |
{{#ifexpr: ({{{1|1}}} = 52) | 52 ([[At-Tur]]) |
{{#ifexpr: ({{{1|1}}} = 53) | 53 ([[An-Najm]]) |
{{#ifexpr: ({{{1|1}}} = 54) | 54 ([[Al-Qamar]]) |
{{#ifexpr: ({{{1|1}}} = 55) | 55 ([[Ar-Rahman]]) |
{{#ifexpr: ({{{1|1}}} = 56) | 56 ([[Al-Waqia]]) |
{{#ifexpr: ({{{1|1}}} = 57) | 57 ([[Al-Hadid]]) |
{{#ifexpr: ({{{1|1}}} = 58) | 58 ([[Al-Mujadila]]) |
{{#ifexpr: ({{{1|1}}} = 59) | 59 ([[Al-Hashr]]) |
{{#ifexpr: ({{{1|1}}} = 60) | 60 ([[Al-Mumtahina]]) |
{{#ifexpr: ({{{1|1}}} = 61) | 61 ([[As-Saff]]) |
{{#ifexpr: ({{{1|1}}} = 62) | 62 ([[Al-Jumua]]) |
{{#ifexpr: ({{{1|1}}} = 63) | 63 ([[Al-Munafiqoon]]) |
{{#ifexpr: ({{{1|1}}} = 64) | 64 ([[At-Taghabun]]) |
{{#ifexpr: ({{{1|1}}} = 65) | 65 ([[At-Talaq]]) |
{{#ifexpr: ({{{1|1}}} = 66) | 66 ([[At-Tahrim]]) |
{{#ifexpr: ({{{1|1}}} = 67) | 67 ([[Al-Mulk]]) |
{{#ifexpr: ({{{1|1}}} = 68) | 68 ([[Al-Qalam]]) |
{{#ifexpr: ({{{1|1}}} = 69) | 69 ([[Al-Haaqqa]]) |
{{#ifexpr: ({{{1|1}}} = 70) | 70 ([[Al-Maarij]]) |
{{#ifexpr: ({{{1|1}}} = 71) | 71 ([[Nooh (sura)|Nooh]]) |
{{#ifexpr: ({{{1|1}}} = 72) | 72 ([[Al-Jinn]]) |
{{#ifexpr: ({{{1|1}}} = 73) | 73 ([[Al-Muzzammil]]) |
{{#ifexpr: ({{{1|1}}} = 74) | 74 ([[Al-Muddaththir]]) |
{{#ifexpr: ({{{1|1}}} = 75) | 75 ([[Al-Qiyama]]) |
{{#ifexpr: ({{{1|1}}} = 76) | 76 ([[Al-Insan]]) |
{{#ifexpr: ({{{1|1}}} = 77) | 77 ([[Al-Mursalat]]) |
{{#ifexpr: ({{{1|1}}} = 78) | 78 ([[An-Naba]]) |
{{#ifexpr: ({{{1|1}}} = 79) | 79 ([[An-Naziat]]) |
{{#ifexpr: ({{{1|1}}} = 80) | 80 ([[Abasa]]) |
{{#ifexpr: ({{{1|1}}} = 81) | 81 ([[At-Takwir]]) |
{{#ifexpr: ({{{1|1}}} = 82) | 82 ([[Al-Infitar]]) |
{{#ifexpr: ({{{1|1}}} = 83) | 83 ([[Al-Mutaffifin]]) |
{{#ifexpr: ({{{1|1}}} = 84) | 84 ([[Al-Inshiqaq]]) |
{{#ifexpr: ({{{1|1}}} = 85) | 85 ([[Al-Burooj]]) |
{{#ifexpr: ({{{1|1}}} = 86) | 86 ([[At-Tariq]]) |
{{#ifexpr: ({{{1|1}}} = 87) | 87 ([[Al-Ala]]) |
{{#ifexpr: ({{{1|1}}} = 88) | 88 ([[Al-Ghashiya]]) |
{{#ifexpr: ({{{1|1}}} = 89) | 89 ([[Al-Fajr (sura)|Al-Fajr]]) |
{{#ifexpr: ({{{1|1}}} = 90) | 90 ([[Al-Balad]]) |
{{#ifexpr: ({{{1|1}}} = 91) | 91 ([[Ash-Shams]]) |
{{#ifexpr: ({{{1|1}}} = 92) | 92 ([[Al-Lail]]) |
{{#ifexpr: ({{{1|1}}} = 93) | 93 ([[Ad-Dhuha]]) |
{{#ifexpr: ({{{1|1}}} = 94) | 94 ([[Al-Inshirah]]) |
{{#ifexpr: ({{{1|1}}} = 95) | 95 ([[At-Tin]]) |
{{#ifexpr: ({{{1|1}}} = 96) | 96 ([[Al-Alaq]]) |
{{#ifexpr: ({{{1|1}}} = 97) | 97 ([[Al-Qadr]]) |
{{#ifexpr: ({{{1|1}}} = 98) | 98 ([[Al-Bayyina]]) |
{{#ifexpr: ({{{1|1}}} = 99) | 99 ([[Az-Zalzala]]) |
{{#ifexpr: ({{{1|1}}} = 100) | 100 ([[Al-Adiyat]]) |
{{#ifexpr: ({{{1|1}}} = 101) | 101 ([[Al-Qaria]]) |
{{#ifexpr: ({{{1|1}}} = 102) | 102 ([[At-Takathur]]) |
{{#ifexpr: ({{{1|1}}} = 103) | 103 ([[Al-Asr]]) |
{{#ifexpr: ({{{1|1}}} = 104) | 104 ([[Al-Humaza]]) |
{{#ifexpr: ({{{1|1}}} = 105) | 105 ([[Al-Fil]]) |
{{#ifexpr: ({{{1|1}}} = 106) | 106 ([[Quraysh (sura)|Quraysh]]) |
{{#ifexpr: ({{{1|1}}} = 107) | 107 ([[Al-Ma'un]]) |
{{#ifexpr: ({{{1|1}}} = 108) | 108 ([[Al-Kawthar]]) |
{{#ifexpr: ({{{1|1}}} = 109) | 109 ([[Al-Kafirun]]) |
{{#ifexpr: ({{{1|1}}} = 110) | 110 ([[An-Nasr]]) |
{{#ifexpr: ({{{1|1}}} = 111) | 111 ([[Al-Masadd]]) |
{{#ifexpr: ({{{1|1}}} = 112) | 112 ([[Al-Ikhlas]]) |
{{#ifexpr: ({{{1|1}}} = 113) | 113 ([[Al-Falaq]]) |
{{#ifexpr: ({{{1|1}}} = 114) | 114 ([[An-Nas]]) |
error }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }}, [[ayat|verse]] [http://www.usc.edu/dept/MSA/quran/{{three digit|{{{1|1}}}}}.qmt.html#{{three digit|{{{1|1}}}}}.{{three digit|{{{2|1}}}}} {{{2|1}}}]''':'''{{cquote| {{{3|Default text}}}&mdash; <small>[[Qur'an translations|translated]] by {{#ifexpr: ({{{4|0}}} = 0) | Unknown | 
{{#ifexpr: ({{{4|0}}} = 1) | [[Salman the Persian]] |
{{#ifexpr: ({{{4|0}}} = 101) | [[Marmaduke Pickthall]] |
{{#ifexpr: ({{{4|0}}} = 102) | [[Abdullah Yusuf Ali]] |
{{#ifexpr: ({{{4|0}}} = 601) | [[Muhammad Muhsin Khan]] |
{{#ifexpr: ({{{4|0}}} = 701) | [[Mohammed Habib Shakir|M. H. Shakir]] |
{{#ifexpr: ({{{4|0}}} = 901) | [[Maulana Muhammad Ali]] |
{{#ifexpr: ({{{4|0}}} = 902) | [[Rashad Khalifa]] |
{{#ifexpr: ({{{4|0}}} = 1001) | [[Theodor Bibliander]] |
{{#ifexpr: ({{{4|0}}} = 1002) | [[Robert of Ketton]] |
{{#ifexpr: ({{{4|0}}} = 1003) | [[Andre du Ryer]] |
{{#ifexpr: ({{{4|0}}} = 1004) | [[Alexander Ross (writer)|Alexander Ross]] |
{{#ifexpr: ({{{4|0}}} = 1005) | [[Abraham Hinckelmann]] |
{{#ifexpr: ({{{4|0}}} = 1006) | [[George Sale]] |
{{#ifexpr: ({{{4|0}}} = 1007) | [[John Medows Rodwell]] |
{{#ifexpr: ({{{4|0}}} = 1008) | [[Arthur John Arberry]] |
error }} }} }} }} }} }} }} }} }} }} }} }} }} }} }} }}</small>
{{#if:{{{trans|}}}|
----
[[Transliteration]]: {{{trans}}}| }}
{{#if:{{{arab|}}}|
----
[[Arabic language|Arabic]]: {{{arab}}}| }} }}</font></div>
