ozekik / lightrdf

A fast and lightweight Python RDF parser which wraps bindings to Rust's Rio using PyO3

License: Apache License 2.0

Python 56.23% Rust 43.77%

semantic-web python rdf rust pyo3 linked-data owl ntriples turtle

lightrdf's Introduction

Ontology, Semantic Web

Open Data, Digital Twins, Geospatial

Logic, Philosophy

lightrdf's Issues

Unable to parse Starwars.ttl

Issue:

Download the Starwars Turtle file:
Starwars.ttl
Parse through all triples. Line 3569 causes the parser to crash:

rdfs:label "ประเทศไนเจอร์"@th , "Niġer"@mt , "尼日尔"@zh-SG , "尼日尔"@zh-MY , "尼日尔"@zh-Hans , "尼日尔"@zh-CN , "尼日尔"@zh , "Niyer"@tl , "Ngāika"@mi , "Nigeru"@olo , "ނީޖަރު"@dv , "Nijèr"@gcr , "නයිජර්"@si , "Nnijer"@kab , "Nìger"@co , "Nìger"@pms , "Ниҷер"@tg , "Niiser"@ff , "Níher"@gn , "Niiger"@frr , "Nigerän"@vo , "Nícher"@an , "نایجېر"@ps , "尼日"@zh-classical , "နိုင်ဂျာနိုင်ငံ"@my , "Nìjẹ̀r"@yo , "尼日"@zh-TW , "尼日"@zh-Hant , "尼日"@lzh , "નાઈજર"@gu , "Niijer"@om , "നീഷർ"@ml , "Niseer"@wo , "ኒጄር"@am , "নাইজের"@bpy , "Nixèr"@sc , "नाइजर"@new , "नाइजर"@mai , "नाइजर"@hi , "नाइजर"@dty , "नाइजर"@bho , "नाइजर"@bh , "نیجر"@mzn , "نیجر"@lrc , "نیجر"@fa , "نیجر"@azb , "نيجر"@arz , "نيجر"@ary , "نائجر"@ur , "نائجر"@pnb , "Нигермудин Орн"@xal , "Republiek Niger"@nds , "Pow Nijer"@kw , "Niger"@hif , "Niger"@hak , "Niger"@gsw , "Niger"@gag , "Niger"@fy , "Niger"@fr , "Niger"@fo , "Niger"@fiu-vro , "Niger"@fi , "Niger"@eu , "Niger"@et , "Niger"@en-GB , "Niger"@en-CA , "Niger"@en , "Niger"@ee , "Niger"@dsb , "Niger"@de-CH , "Niger"@de-AT , "Niger"@de , "Niger"@da , "Niger"@cy , "Niger"@cs , "Niger"@crh-Latn , "Niger"@crh , "Niger"@ceb , "Niger"@cdo , "Niger"@bs , "Niger"@br , "Niger"@bjn , "Niger"@ban , "Niger"@az , "Niger"@als , "Niger"@ak , "Niger"@af , "Niger"@ace , "Niger"@bcl , "Niger"@hr , "Niger"@hsb , "Niger"@hu , "Niger"@ia , "Niger"@id , "Niger"@ie , "Niger"@ig , "Niĝero"@eo , "Niger"@ilo , "Niger"@it , "Niger"@jv , "Niger"@kaa , "Niger"@ki , "Niger"@nl , "Niger"@no , "Niger"@simple , "Niger"@ts , "Niger"@uz , "Niger"@vec , "Niger"@lb , "Niger"@vep , "Niger"@lg , "Niger"@li , "Niger"@lij , "Niger"@lmo , "Niger"@vi , "Niger"@vro , "Niger"@war , "Niger"@za , "Niger"@zh-min-nan , "Niger"@min , "Niger"@ms , "Niger"@nah , "Niger"@nan , "Niger"@nb , "Niger"@nds-NL , "Niger"@nn , "Niger"@nov , "Niger"@nso , "Niger"@pam , "Niger"@pap , "Niger"@pl , "Niger"@ro , "Niger"@scn , "Niger"@sco , "Niger"@se , "Niger"@sh , "Niger"@sk , "Niger"@sl , "Niger"@sm , "Niger"@sn , "Niger"@sr-EL , 
"Niger"@st , "Niger"@stq , "Niger"@su , "Niger"@sv , "Niger"@sw , "Niger"@szy , "Niger"@tk , "Niijir"@pih , "Nizëre"@sg , "Nijar"@ha , "नाईजर"@ne , "ניז'ר"@he , "নাইজার"@bn , "Niher"@zea , "Nigeri"@rw , "Nigeri"@sq , "ニジェール"@ja , "Nizer"@ln , "Nîjer"@ku , "Nigèr"@oc , "ନାଇଜର"@or , "Нігер"@uk , "Нігер"@be-x-old , "Нігер"@be-tarask , "Нігер"@be , "Níxer"@ast , "Níxer"@gl , "ನೈಜರ್"@kn , "Nicer"@diq , "INayijari"@ss , "Ніґер"@rue , "Նիգեր"@hy , "ናይጀር"@ti , "Nijier"@jam , "नीजे"@sa , "नीजे"@pi , "ܢܝܓܪ"@arc , "Nijer"@bm , "Nijer"@din , "Nijer"@io , "Nijer"@kg , "Nijer"@lad , "Nijer"@lfn , "Nijer"@tr , "Nìgeir"@gd , "Nijera"@mg , "Nigi"@ext , "Nigeris"@bat-smg , "Nigeris"@lt , "Nigeris"@sgs , "ניזשער"@yi , "ནི་ཇར།"@bo , "Nayjar"@so , "Yn Neegeyr"@gv , "Νίγηρας"@el , "Nigēra"@lv , "ნიგერი"@xmf , "ნიგერი"@ka , "நைஜர்"@ta , "Нигер"@udm , "Нигер"@tt , "Нигер"@sr-EC , "Нигер"@sr , "Nijè"@ht , "Нигер"@sah , "Нигер"@ru , "Нигер"@os , "Нигер"@mrj , "Нигер"@mn , "Нигер"@mk , "Нигер"@ky , "Нигер"@kk , "Нигер"@ce , "Нигер"@bxr , "Нигер"@bg , "Нигер"@ba , "Нигер"@ady , "Nizɛɛrɩ"@kbp , "Nig·èr"@frp , "INayighe"@zu , "An Nígir"@ga , "니제르"@ko , "نیجەر"@ckb , "Res publica Nigritana"@la , "నైజర్"@te , "नायजर"@mr , "النيجر"@ar , "النيجر"@aeb-Arab , "نائيجر"@sd , "نىگېر"@ug , "Niqir"@qu , "Ńiger"@szl , "မိူင်းၼၢႆးၵျႃး"@shn , "尼日爾"@zh-yue , "尼日爾"@zh-MO , "尼日爾"@zh-HK , "尼日爾"@yue , "尼日爾"@wuu , "ਨਾਈਜਰ"@pa , "Níger"@ca , "Níger"@cbk-zam , "Níger"@es , "Níger"@is , "Níger"@pt , "Níger"@pt-BR ;

Error:

lightrdf.Error: error while parsing language tag 'zh-classical': A subtag may be eight characters in length at maximum on line 3569 at position 468
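
The message points at the subtag `classical`, which is nine characters long, while BCP 47 limits each subtag to eight. As a rough way to locate such tags before parsing, here is a minimal Python sketch. It is not part of lightrdf: `oversized_language_tags` is a hypothetical helper, and the regex only approximates Turtle's language-tag syntax (it can misfire on escaped quotes inside literals).

```python
import re
from typing import List

# Rough match for a language tag directly after a closing quote, e.g. "..."@zh-Hans.
# Assumption: this is an approximation, not the full Turtle grammar.
_LANG_TAG = re.compile(r'"@([A-Za-z]+(?:-[A-Za-z0-9]+)*)')

MAX_SUBTAG_LEN = 8  # the limit quoted in the error message


def oversized_language_tags(line: str) -> List[str]:
    """Return the language tags on a line that contain a subtag longer than 8 characters."""
    bad = []
    for tag in _LANG_TAG.findall(line):
        if any(len(subtag) > MAX_SUBTAG_LEN for subtag in tag.split("-")):
            bad.append(tag)
    return bad


sample = '"尼日"@zh-classical , "尼日"@zh-TW , "Нігер"@be-x-old , "Niger"@zh-min-nan'
print(oversized_language_tags(sample))  # ['zh-classical']
```

Running this over the file line by line would list the offending tags without invoking the parser at all.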

lightrdf.Error: error while parsing IRI '': No scheme found in an absolute IRI

Hi @ozekik!

Thank you for the awesome library! 👏

Unfortunately, while using your library, I got the error 🐛 mentioned in the title. 😞
But using rdflib I was not getting a similar error. 🤔

Environment

OS: Ubuntu 20.04
Python: 3.8.5
LightRDF: 0.2.1

Steps to reproduce.

Download pathways archive.

wget -q https://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/pathways.rdf.xz

Decompress it using the xz-utils package.

sudo apt install xz-utils
unxz pathways.rdf.xz

Run count_triples_lightrdf_parser.py.

python3 count_triples_lightrdf_parser.py pathways.rdf

Error log.

Traceback (most recent call last):
  File "count_triples_lightrdf_parser.py", line 8, in <module>
    for triple in parser.parse(sys.argv[1]):
lightrdf.Error: error while parsing IRI '': No scheme found in an absolute IRI

Please tell me where I am wrong. Thank you 🙏

Incorrect parsing

Hi @ozekik!

I found a bug when parsing. I used the generations.rdf file, but a similar bug appeared in many other files. For some reason the library returns this tag

<ns2:versionInfo rdf:datatype="http://www.w3.org/2001/XMLSchema#string">An example ontology created by Matthew Horridge</ns2:versionInfo>

as this string

'"An example ontology created by Matthew Horridge"^^<http://www.w3.org/2001/XMLSchema#string>'

in the last item of the triple (triple[-1]).

With the rdflib library, I did not get this problem.
Thanks.

Add namespace support

It would be convenient to pass in a map of prefix-to-namespace expansions and have the search option return CURIEs wherever contraction is possible.

While this is trivial to do in a Python wrapper, it would presumably be faster to do at the Rust level.
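
For reference, the Python-wrapper variant might look like the sketch below. `contract` and the prefix map are hypothetical names, not lightrdf API. It assumes terms arrive either as bare IRIs or wrapped in angle brackets, and it does not check that the local part is a valid CURIE reference.

```python
from typing import Mapping


def contract(iri: str, prefixes: Mapping[str, str]) -> str:
    """Contract an IRI to a CURIE using the longest matching namespace.

    Returns the input unchanged when no namespace matches.
    """
    # Accept both <http://...> and bare http://... forms.
    bare = iri[1:-1] if iri.startswith("<") and iri.endswith(">") else iri
    best = None
    for prefix, namespace in prefixes.items():
        if bare.startswith(namespace) and (best is None or len(namespace) > len(best[1])):
            best = (prefix, namespace)
    if best is None:
        return iri
    prefix, namespace = best
    return f"{prefix}:{bare[len(namespace):]}"


prefixes = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "ex": "http://example.org/",
    "exv": "http://example.org/vocab/",
}
print(contract("http://www.w3.org/2000/01/rdf-schema#label", prefixes))  # rdfs:label
print(contract("<http://example.org/vocab/x>", prefixes))  # exv:x
```

Picking the longest namespace keeps overlapping prefixes (such as `ex` and `exv` above) deterministic.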

Add support for parsing objects into literals vs URIs vs blank nodes

Currently the user has to parse the object string themselves before they can do many operations on it.

This is relatively straightforward, I think:

- ^riog\d+$ is a blank node
- Literals:
  - ^"(.*)"^^<(\S+)>$ typed
  - ^"(.*)"@\w+$ language-tagged
  - ^"(.*)"$ untyped
- Otherwise a URI

But it might be nice to centralize this, or do it in Rust for speed. To avoid the overhead of an OO interface, how about a parallel search_statements with arguments s, p, o_uri, o_literal_value, o_datatype, o_lang?

This is my use case:

https://github.com/INCATools/rdf-sql-bulkloader

For now I am doing this in Python.
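
The rules above can be sketched in Python roughly as follows. `Term` and `classify_term` are hypothetical names, not lightrdf API. Two deliberate deviations from the patterns as written: `^^` is escaped so it is a valid regex, and the language pattern allows hyphens so that tags like `zh-Hans` match. The sketch also accepts an optional `_:` prefix on blank nodes and optional angle brackets on IRIs, since the exact term format may differ between versions. Escape sequences inside literals are left untouched.

```python
import re
from typing import NamedTuple, Optional


class Term(NamedTuple):
    kind: str  # "bnode", "literal", or "uri"
    value: str
    datatype: Optional[str] = None
    lang: Optional[str] = None


_BLANK = re.compile(r'^(?:_:)?riog\d+$')
_TYPED = re.compile(r'^"(.*)"\^\^<(\S+)>$', re.DOTALL)
_LANG = re.compile(r'^"(.*)"@([A-Za-z0-9-]+)$', re.DOTALL)
_PLAIN = re.compile(r'^"(.*)"$', re.DOTALL)


def classify_term(term: str) -> Term:
    """Classify an object string as blank node, literal, or URI."""
    if _BLANK.match(term):
        return Term("bnode", term)
    match = _TYPED.match(term)
    if match:
        return Term("literal", match.group(1), datatype=match.group(2))
    match = _LANG.match(term)
    if match:
        return Term("literal", match.group(1), lang=match.group(2))
    match = _PLAIN.match(term)
    if match:
        return Term("literal", match.group(1))
    # Anything else is treated as a URI; strip angle brackets if present.
    if term.startswith("<") and term.endswith(">"):
        term = term[1:-1]
    return Term("uri", term)


print(classify_term('"Niger"@en-GB'))
print(classify_term('"2020-03-12T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime>'))
```

The fields of `Term` map directly onto the proposed `o_uri`, `o_literal_value`, `o_datatype`, and `o_lang` arguments.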

Rio libraries need updating to fix a very weird bug

When using LightRDF in the Ontology Development Kit, we have come across a very strange bug where LightRDF would fail to parse RDF/XML files that seem completely valid.

Here is a file that LightRDF fails to parse: https://github.com/INCATools/ontology-development-kit/files/10042121/tdm-bad.txt

(Sorry for the size of the file, but I was unable to reduce the error case to a minimal demonstrating example.)

Trying to parse that file with LightRDF as follows:

import sys
from lightrdf import Parser
parser = Parser()
try:
    for triple in parser.parse("tdm-bad.xml"):
        pass
except Exception as e:
    print(e)
    sys.exit(1)

yields the following error: Unexpected EOF during reading Comment.

I have no idea where the bug exactly is. However, rebuilding LightRDF after updating the Rio dependencies (rio_api, rio_turtle, and rio_xml) in Cargo.toml to their latest version (0.8.3) seems enough to fix it.

Providing a Linux-arm64 wheel

LightRDF is available as a wheel on PyPI for many combinations of systems and architectures (Windows/MacOS/Linux…, i686/x86_64/arm64…). Thanks for that!

However, one particular combination that is missing is Linux/arm64. Any chance it could be added?

Serialize RDF

I was looking for a replacement for RDFLib just for parsing, doing some BGP searching, and writing the new triples back.
It seems that LightRDF can handle parsing and BGP searching, but not serializing the RDF triples again to a file.

Are there any plans for this?
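
Until something like this exists in the library, a line-based N-Triples writer can be approximated in plain Python. The sketch below is not lightrdf API: `write_ntriples` is a hypothetical helper. It assumes literals already arrive in N-Triples syntax (as in the `"..."^^<...>` strings reported in other issues here), and that blank nodes look like `riog…` with or without a `_:` prefix.

```python
import io
import re
from typing import Iterable, TextIO, Tuple

_BARE_BNODE = re.compile(r"riog\d+")


def _nt_term(term: str) -> str:
    """Render one term in N-Triples syntax (assumes literals are already escaped)."""
    if term.startswith(("<", '"', "_:")):
        return term  # already an IRI reference, literal, or labelled blank node
    if _BARE_BNODE.fullmatch(term):
        return f"_:{term}"
    return f"<{term}>"  # bare IRI


def write_ntriples(triples: Iterable[Tuple[str, str, str]], out: TextIO) -> int:
    """Write triples as N-Triples lines and return how many were written."""
    count = 0
    for s, p, o in triples:
        out.write(f"{_nt_term(s)} {_nt_term(p)} {_nt_term(o)} .\n")
        count += 1
    return count


buffer = io.StringIO()
write_ntriples(
    [("http://example.org/s", "http://example.org/p", '"v"@en'),
     ("riog00000001", "<http://example.org/p>", "http://example.org/o")],
    buffer,
)
print(buffer.getvalue(), end="")
```

Because it consumes any iterable of triples, it can stream large inputs without holding them in memory.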

Parse from String

Hello,

I am interested in using your library for fast parsing from turtle to n-triples.
However, as the current API only supports parsing from a file, I was wondering if it would be possible to extend the library to also parse string objects?

Thanks
Lars
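
A possible workaround while only file paths are supported is to spill the string to a temporary file and hand its path to the parser. The sketch below is hedged accordingly. `parse_string` is a hypothetical helper, and the parser is passed in as a callable, so the only thing assumed about lightrdf is the path-based `Parser().parse(path)` call already used elsewhere in these issues.

```python
import os
import tempfile
from typing import Callable, Iterable, List, Tuple

Triple = Tuple[str, str, str]


def parse_string(
    data: str,
    parse_path: Callable[[str], Iterable[Triple]],
    suffix: str = ".ttl",
) -> List[Triple]:
    """Parse RDF held in a string by writing it to a temporary file first.

    The suffix matters if the parser infers the format from the file extension.
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(data)
        # Materialize the results before the file is removed.
        return list(parse_path(path))
    finally:
        os.unlink(path)


# Intended usage (requires lightrdf; not executed here):
#   import lightrdf
#   triples = parse_string(turtle_text, lightrdf.Parser().parse)
```

This costs one extra write and read of the data, so it suits modest inputs better than bulk conversion.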

lightrdf.Error: error while parsing IRI 'https://memory-alpha.fandom.com/wiki/USS_Voyager#Crew?oldid=2476206#Command_crew': Invalid IRI code point '#' on line 3135044 at position 92

First, I have to say nice job on the library. I really love the speed and simplicity of lightrdf, and it worked very well with ChEMBL_27, but I ran into an issue when I tried to read the Wikidata ttl.

Details

wikidata's file latest-all.ttl

lightrdf.Error: error while parsing IRI 'https://memory-alpha.fandom.com/wiki/USS_Voyager#Crew?oldid=2476206#Command_crew': Invalid IRI code point '#' on line 3135044 at position 92

--- nearby lines from file, including problematic line, if my sed is correct

sed -n '3135042,3135046p;3135047q' latest-all.ttl

ref:8f38c16e1f141b68f172b65f48b0982234890e56 a wikibase:Reference ;
	pr:P854 <https://memory-alpha.fandom.com/wiki/USS_Voyager#Crew?oldid=2476206#Command_crew> ;
	pr:P813 "2020-03-12T00:00:00Z"^^xsd:dateTime ;
	prv:P813 v:ae909ef12942e232eea24326bdd78c8e ;
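
The IRI on the second line contains two `#` characters, and the error position falls on the second one, which is consistent with IRI syntax allowing `#` only as the fragment delimiter. To find other such lines before parsing, a rough pre-scan could look like the sketch below. `iris_with_repeated_hash` is a hypothetical helper, not part of lightrdf, and the regex is only an approximation of Turtle's IRI reference syntax.

```python
import re
from typing import List

# Approximate match for <...> IRI references; may misfire on '<' inside literals.
_IRIREF = re.compile(r"<([^<>\s]*)>")


def iris_with_repeated_hash(line: str) -> List[str]:
    """Return IRIs on the line that contain more than one '#'."""
    return [iri for iri in _IRIREF.findall(line) if iri.count("#") > 1]


line = "\tpr:P854 <https://memory-alpha.fandom.com/wiki/USS_Voyager#Crew?oldid=2476206#Command_crew> ;"
print(iris_with_repeated_hash(line))
```

Applied line by line, this reports candidates without loading the whole dump into memory.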

wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz
gunzip and parse the file

Note: these are big files. Another strength of lightrdf is its ability to process large files without requiring the entire dataset to fit in RAM.


Parse the file 'revisions_lang=en_uris.ttl'.
For each .ttl file in this dataset, I'm simply dumping the triples to a file like this:

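import lightrdf

# input_file (path to a .ttl file) and f_triples (an open text file) are defined elsewhere
parser = lightrdf.Parser()
triples = parser.parse(str(input_file), format="ttl", base_iri=None)
for (s, p, o) in triples:
    f_triples.write(f"{s}\t{p}\t{o}\n")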


Other notes:
- If there is a way to use try/except to move past the offending line, I would like to know how that is done.
- Ubuntu 20.04, conda env, installed with pip install lightrdf
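
On the try/except question: whether the parser can resume after a failed line is not established in this thread, so the sketch below only keeps what was parsed before the failure and reports the error instead of crashing. `dump_until_error` is a hypothetical helper. With lightrdf one would presumably pass `error_type=lightrdf.Error`, the exception class named in the message above.

```python
import io
from typing import Iterable, Optional, TextIO, Tuple, Type


def dump_until_error(
    triples: Iterable[Tuple[str, str, str]],
    out: TextIO,
    error_type: Type[BaseException] = Exception,
) -> Tuple[int, Optional[BaseException]]:
    """Write triples as TSV until the iterator ends or raises error_type.

    Returns (number of triples written, the caught error or None).
    """
    count = 0
    iterator = iter(triples)
    while True:
        try:
            s, p, o = next(iterator)
        except StopIteration:
            return count, None
        except error_type as exc:
            # The iterator is assumed to be unusable after this point.
            return count, exc
        out.write(f"{s}\t{p}\t{o}\n")
        count += 1


# Intended usage (requires lightrdf; not executed here):
#   written, error = dump_until_error(parser.parse(path), f_triples, lightrdf.Error)
```

Logging the returned error alongside the count gives the line and position to inspect, while the output file still holds everything that parsed cleanly.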

Expose TriG support

It looks like Rio can handle it; there's just no module for it in lightrdf.