The amara3-xml from uogbuji

Implement an XML builder

Came across the now archived XML Witch and thought it might be handy to have something like this in Amara. Might even be good to support async with for it, for cases of XML building interleaved with I/O.
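To make the idea concrete, here is a minimal sketch of what a `with`-based builder could look like. It uses only the standard library; `XMLBuilder`, `element` and `text` are hypothetical names for illustration, not an existing Amara or XML Witch API, and an `async with` variant would need an async context manager instead.

```python
from contextlib import contextmanager
from xml.sax.saxutils import escape, quoteattr


class XMLBuilder:
    """Hypothetical builder for illustration; not part of Amara."""

    def __init__(self):
        self._parts = []

    @contextmanager
    def element(self, name, **attrs):
        # Emit the start tag on entry and the end tag on exit
        attr_text = ''.join(
            f' {key}={quoteattr(str(value))}' for key, value in attrs.items()
        )
        self._parts.append(f'<{name}{attr_text}>')
        try:
            yield self
        finally:
            self._parts.append(f'</{name}>')

    def text(self, content):
        # Escape &, < and > in character data
        self._parts.append(escape(content))
        return self

    def __str__(self):
        return ''.join(self._parts)


def build_example():
    builder = XMLBuilder()
    with builder.element('feed', lang='en'):
        with builder.element('title'):
            builder.text('Q&A <draft>')
    return str(builder)
```

Nesting of `with` blocks mirrors element nesting, which is the main appeal of this style.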

Try out libxml2 for faster XML & HTML parsing

Possibly via Cython? Check on cross-platform deployment issues in any such case.

libxml API docs

Idea popped into my head when I stumbled across this thread.

Release 3.4.0

Release notes (update in place, for easy cut & paste):

MicroXpath: Fix selections from // and * axes
HTML5: Fix treatment of comment nodes
Code cleanup (e.g. formatting & avoiding HumpCase class names)
Parse URL sources directly from the microx command line

Migrate to Oori Data and reclaim the amara PyPI project

The Amara saga continues! I don't exactly remember why I decided to dead-end the Amara PyPI project when it hit 2.0, but I moved to a series of Amara 3 generation projects (amara3.iri, amara3.xml & amara3-names). Those were far more lone-wolf efforts, but at Oori Data we're seeing a lot of need for the sorts of capabilities that are inchoate in Amara 3.

Time to re-consolidate the Amara projects, call it a 4th generation, and move it to Oori Data GitHub.

Note: Pip & PyPI are case insensitive. PEP 426 says "All comparisons of distribution names MUST be case insensitive, and MUST consider hyphens and underscores to be equivalent." An amara 4.0.0a1 package will seamlessly supersede Amara 2.0.0.
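A quick way to see why the two names collide is the name normalization described in PEP 503, which lowercases the name and collapses runs of `-`, `_` and `.` into a single hyphen. The helper name `normalize_name` below is just for illustration.

```python
import re


def normalize_name(name):
    # PEP 503 normalization: collapse runs of -, _ and . then lowercase
    return re.sub(r'[-_.]+', '-', name).lower()


# Both spellings normalize to the same project name
same_project = normalize_name('Amara') == normalize_name('amara')
```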

Memory leaks in tree implementation

As usual for Python tree-like & graph-like structures, amara.uxml.tree seems to be plagued by memory leaks. I went ahead and replaced child-to-parent links with weakrefs as best I could, but it seems it's still leaking. I've been using the following MARC splitter program with a memory profiler.

https://gist.github.com/uogbuji/bccdb1e2fdbb7bb88459#file-marc-split-memprof-py
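For context, the weakref approach can be sketched as follows. `Node` is a hypothetical stand-in, not the actual amara.uxml.tree class, and the check relies on CPython's reference counting: with the cyclic collector disabled, the tree is only freed if no strong parent/child cycle remains.

```python
import gc
import weakref


class Node:
    """Hypothetical tree node; not amara's actual class."""

    def __init__(self, name, parent=None):
        self.name = name
        self.children = []  # strong references, parent to child
        # Weak reference, child to parent, so no reference cycle is formed
        self._parent = weakref.ref(parent) if parent is not None else None
        if parent is not None:
            parent.children.append(self)

    @property
    def parent(self):
        return self._parent() if self._parent is not None else None


def tree_is_freed_without_gc():
    was_enabled = gc.isenabled()
    gc.disable()  # rule out the cyclic collector
    try:
        root = Node('root')
        Node('child', root)
        probe = weakref.ref(root)
        del root
        # True only if reference counting alone reclaimed the tree
        return probe() is None
    finally:
        if was_enabled:
            gc.enable()
```

A probe like this only shows that the parent links are not the cause; any other strong back-reference (caches, closures, handler state) would still keep nodes alive.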

Erroneous processing of MicroXPath // & ancestor axis (also * name tests)

from amara3.uxml import html5
from amara3.uxml.treeutil import descendants, select_elements
from amara3.uxml import xml
from amara3.uxml.treeutil import *
from amara3.uxml.tree import *
from amara3.uxml.uxpath import context as xpathcontext, parse as xpathparse

import requests
resp = requests.get('http://garybyker.library.link/resource/5bLglR2qVao/')
root = html5.parse(resp.text)
xpathctx = xpathcontext(root)
ALL_IMAGES = xpathparse('//img')
it = ALL_IMAGES.compute(xpathctx)
i = next(it)  # Raises StopIteration
# root => {uxml.element (-9223363273091541553) "html" with 3 children}
# root.xml_children[2] => {uxml.element (-9223363273092099405) "body" with 10 children}

X = xpathparse('/html//img')
it = X.compute(xpathctx)
n = next(it)

The last 3 lines result in:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/uche/.local/pyenv/main/lib/python3.6/site-packages/amara3/uxml/uxpath/ast.py", line 313, in compute
    yield from self.relative.compute(new_ctx)
  File "/home/uche/.local/pyenv/main/lib/python3.6/site-packages/amara3/uxml/uxpath/ast.py", line 234, in compute
    yield from self.right.compute(new_ctx)
  File "/home/uche/.local/pyenv/main/lib/python3.6/site-packages/amara3/uxml/uxpath/ast.py", line 379, in compute
    to_process = list(child.xml_children) + to_process[1:]
AttributeError: 'comment' object has no attribute 'xml_children'

Note that this alternative works fine:

imgs = [ e for e in descendants(root) if e.xml_name == 'img' ]
# imgs => [{uxml.element (8763762573951) "img" with 0 children}, {uxml.element (8763762576069) "img" with 0 children}, {uxml.element (8763762579471) "img" with 0 children}, {uxml.element (-9223363273092194223) "img" with 0 children}, {uxml.element (8763762583444) "img" with 0 children}, {uxml.element (-9223363273092209555) "img" with 0 children}]
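The traceback suggests the descendant walk assumes every child has `xml_children`, which comment nodes lack. A guarded walk might look like the sketch below. `Element` and `Comment` here are simplified stand-ins defined for the example, not the amara3.uxml.tree classes, and this is not the actual fix applied in ast.py.

```python
class Element:
    """Stand-in for an element node (assumption, not amara's class)."""

    def __init__(self, xml_name, xml_children=None):
        self.xml_name = xml_name
        self.xml_children = xml_children or []


class Comment:
    """Stand-in for a comment node; note it has no xml_children."""

    def __init__(self, xml_value):
        self.xml_value = xml_value


def iter_descendant_elements(node):
    # Walk in document order, skipping anything that is not an element
    to_process = list(node.xml_children)
    while to_process:
        child = to_process.pop(0)
        if not isinstance(child, Element):
            continue  # comments and text have no children to expand
        yield child
        to_process = list(child.xml_children) + to_process


tree = Element('html', [
    Comment('generated'),
    Element('body', [
        Element('img'),
        Comment('spacer'),
        Element('div', [Element('img')]),
    ]),
])
names = [e.xml_name for e in iter_descendant_elements(tree)]
```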

Also not working:

PARENT_RESOURCE = xpathparse('ancestor::div[@class="thumbnail-holder"]/a/@href')
img = imgs[0]
imgctx = xpathcontext(img) #, force_root=False)
res = next(PARENT_RESOURCE.compute(imgctx), None)

Working alternative:

preparent, parent = img, img.xml_parent
while parent:
    if parent.xml_name == 'div' and 'thumbnail-holder' in parent.xml_attributes.get('class', ''):
        break
    preparent = parent
    parent = parent.xml_parent

uxml.writer not escaping attribute cdata?

I wrote some code using amara3.uxml to modify MARCXML records. I thought I'd be able to use an xmlter.sender coroutine for input and write it out losslessly with uxml.writer. That's not happening, though. When it encounters character references on input, specifically the quot character reference in this case, they get turned into literal quote characters, producing non-well-formed output.

Here's a script and some input data to reproduce
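For reference, the escaping one would expect from a writer can be sketched with the standard library alone. This is not uxml.writer's code; `serialize_attribute` is a hypothetical helper showing that `&`, `<`, `>` and the delimiting quote character must all be escaped in attribute values.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def serialize_attribute(name, value):
    # escape() handles &, < and >; the extra mapping covers double quotes,
    # which would otherwise terminate the attribute value early
    escaped = escape(value, {'"': '&quot;'})
    return f'{name}="{escaped}"'


original = 'He said "hi" & left'
attr = serialize_attribute('title', original)

# Round trip: a parser should recover the original value unchanged
parsed = ET.fromstring(f'<field {attr}/>').get('title')
```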
