proycon / foliatools

A number of command-line tools for working with FoLiA (Format for Linguistic Annotation). Includes validators, converters, visualisers, and more.

License: GNU General Public License v3.0

Python 67.00% XSLT 21.15% Shell 1.08% HTML 10.77%

folia nlp computational-linguistics clarin clariah conllu converters

foliatools's Introduction

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

FoLiA Tools

A number of command-line tools are readily available for working with FoLiA, to various ends. The following tools are currently available:

foliavalidator -- Tests if documents are valid FoLiA XML. Always use this to test your documents if you produce your own FoLiA documents! See the extra documentation in the dedicated section below.
foliaquery -- Advanced query tool that searches FoLiA documents for a specified pattern, or modifies a document according to the query. Supports FQL (FoLiA Query Language) and CQL (Corpus Query Language).
foliaeval -- Evaluation tool that computes various evaluation metrics for selected annotation types, either against a gold standard reference or as a measure of inter-annotator agreement.
folia2txt -- Convert FoLiA XML to plain text (pure text, without any annotations). Use this to extract plain text from any FoLiA document.
folia2annotatedtxt -- Like above, but outputs simple token annotations inline, by appending them directly to the word using a specific delimiter.
folia2columns -- This conversion tool reads a FoLiA XML document and produces a simple columned output format (including CSV) in which each token appears on one line. Note that only simple token annotations are supported and a lot of FoLiA data can not be intuitively expressed in a simple columned format!
folia2html -- Converts a FoLiA document to a semi-interactive HTML document, with limited support for certain token annotations.
folia2dcoi -- Convert FoLiA XML to D-Coi XML (only for annotations supported by D-Coi)
foliatree -- Outputs the hierarchy of a FoLiA document.
foliacat -- Concatenate multiple FoLiA documents.
foliacount -- This script reads a FoLiA XML document and counts certain structure elements.
foliacorrect -- A tool to deal with corrections in FoLiA, can automatically accept suggestions or strip all corrections so parsers that don't know how to handle corrections can process it.
foliaerase -- Erases one or more specified annotation types from the FoLiA document.
folialangid -- Performs language detection on FoLiA documents and assigns language identifiers to different substructures.
foliaid -- Assigns IDs to elements in FoLiA documents. Use this to automatically generate identifiers on certain (or all) elements.
foliafreqlist -- Output a frequency list on tokenised FoLiA documents.
foliamerge -- Merges annotations from two or more FoLiA documents.
foliatextcontent -- A tool for adding or stripping text redundancy (i.e. text associated with multiple structural levels), supports computing and adding offset information. Use this if you want to have text available on a different level (e.g. the global text level).
foliaupgrade -- Upgrades a document to the latest FoLiA version.
alpino2folia -- Convert Alpino-DS XML to FoLiA XML
dcoi2folia -- Convert D-Coi XML to FoLiA XML
conllu2folia -- Convert files in the CONLL-U format to FoLiA XML.
rst2folia -- Convert ReStructuredText, a lightweight non-intrusive text markup language, to FoLiA, using docutils.
tei2folia -- Convert a subset of TEI to FoLiA. See the extra documentation in the section below.
folia2salt -- Convert FoLiA XML to Salt, which in turn enables further conversions (annis, paula, TCF, TigerXML, and others) through Pepper. See the extra documentation in the dedicated section below.
folia2stam -- Convert FoLiA XML to STAM, a standoff annotation model. Retains FoLiA vocabulary and enables further conversion to e.g. W3C Web Annotations.

All of these tools are written in Python, and thus require a Python 3 installation to run. More tools are added as time progresses.

Installation

The FoLiA tools are published to the Python Package Index and can be installed using pip. From the command line, type:

$ pip install folia-tools

You may need to use pip3 to ensure you have the Python 3 version. Add sudo to install it globally on your system, but we strongly recommend you use virtualenv to make a self-contained Python environment.

The FoLiA tools are also included in our LaMachine distribution.

Installation Troubleshooting

If pip is not yet available, install it as follows:

On Debian/Ubuntu-based systems:

$ sudo apt-get install python3-pip

On RedHat-based systems:

$ yum install python3-pip

On Arch Linux systems:

$ pacman -Syu python-pip

Usage

To obtain help regarding the usage of any of the available FoLiA tools, please pass the -h option on the command line to the tool you intend to use. This will provide a summary on available options and usage examples. Most of the tools can run on both a single FoLiA document, as well as a whole directory of documents, allowing also for recursion. The tools generally take one or more file names or directory names as parameters.

More about FoLiA?

Please consult the FoLiA website at https://proycon.github.io/folia for more!

Specific Tools

This section contains some extra important information for a few of the included tools.

Validating FoLiA documents using foliavalidator

The FoLiA validator is an essential tool for anybody working with FoLiA. It is very important that FoLiA documents are properly validated before they are published. This ensures that tools know what to expect when they get a FoLiA document as input for processing, and are not confronted with the nasty surprises that are far too common in the field. The degree of formal validation offered by FoLiA is something that sets it apart from many alternative annotation formats. The key tool to perform validation is foliavalidator (or its alternative C++ implementation folialint, part of FoLiA-utils).

Validation can proceed on two levels:

shallow validation - Validates the full FoLiA document, checks if all elements are valid FoLiA elements, properly used, and if the document structure is valid. Checks if all the proper annotation declarations are present and if there are no inconsistencies in the text if text is specified on multiple levels (text redundancy). Note that shallow validation already does way more than validation against the RelaxNG Schema does.
deep validation - Does all of the above, but in addition checks the actual tagsets used. It checks whether all declarations refer to a valid set definition, whether all used classes (aka tags/labels) are valid according to the declared set definitions, and whether the combination of certain classes is valid according to the set definition.

Note that validation against merely the RelaxNG schema could be called naive validation and is NOT considered sufficient FoLiA validation for most intents and purposes.

Shallow validation is invoked as: $ foliavalidator document.folia.xml. Deep validation is invoked as: $ foliavalidator --deep document.folia.xml.
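For batch use, the two invocations above can be wrapped in a small script. The following is a minimal sketch and not part of foliatools: the helper names build_validator_command and validate are hypothetical, and it assumes that foliavalidator is on the PATH and that a non-zero exit status signals a validation failure.

```python
import shutil
import subprocess


def build_validator_command(path, deep=False):
    """Build the foliavalidator command line for one document."""
    command = ["foliavalidator"]
    if deep:
        command.append("--deep")  # deep validation also checks the tagsets
    command.append(path)
    return command


def validate(path, deep=False):
    """Run foliavalidator and return True when it exits with status 0.

    Assumption: a non-zero exit status means the document did not validate.
    """
    if shutil.which("foliavalidator") is None:
        raise RuntimeError("foliavalidator not found on PATH")
    result = subprocess.run(build_validator_command(path, deep))
    return result.returncode == 0


if __name__ == "__main__":
    # Only prints the command; call validate() to actually run the tool.
    print(build_validator_command("document.folia.xml", deep=True))
```

Keeping the command construction separate from the subprocess call makes it possible to inspect or log the exact invocation without having the tool installed.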

In addition to validating, the foliavalidator tool is capable of automatically fixing certain validation problems when explicitly asked to do so, such as automatically declaring missing annotations.

Another feature of the validator is that it can act as a converter to convert FoLiA documents to explicit form (using the --explicit parameter). Explicit form is a more verbose form of XML serialisation that is easier for certain tools to parse, as it makes explicit certain details that are left implicit in normal form.

TEI to FoLiA conversion

The TEI P5 guidelines (Text Encoding Initiative) specify a widely used encoding method for machine-readable texts. It is primarily a format for capturing text structure and markup in great detail, but there are some facilities for linguistic annotation too. The sheer flexibility and complexity of TEI leads to many different TEI dialects, and consequently implementing support for all of TEI in a tool is an almost impossible task. FoLiA is more constrained than TEI with regard to structural and markup annotation, but places more focus on linguistic annotation.

The tei2folia tool performs conversion from a (sizable) subset of TEI to FoLiA, but provides no guarantee that all TEI P5 documents can be processed. Some notable things that are supported:

Conversion of text structure including divisions, paragraphs, headers & titles, lists, figures, tables (limited), front matter, back matter
Verse text (limited, no metrical analysis etc), line groups (<lg>)
Gaps
Text markup (highlighting, <hi>), emphasis, foreign, term, mentioned, names and places
- Limited corrections
Conversion of lightweight linguistic annotation.
Linguistic segments: sentences (<s>) & words (<w>), but not <cl> nor <phr>.
- Basic tokenisation (spacing) information (TEI's @join attribute)
Limited metadata

Specifically not supported (yet), non-exhaustive list:

Graphs and trees
Milestones
Span groups, interpretation groups, link groups (<spanGrp>, <interpGrp>, <linkGrp>)
Speech
Contextual information
Feature structures (<fs>, <f>)

FoLiA to STAM

STAM is a stand-off model for text annotation. It does not prescribe any vocabulary at all but allows one to reuse existing vocabularies. The folia2stam tool converts FoLiA documents to STAM, preserving the vocabulary that FoLiA predefines regarding annotation types, common attributes, etc.

Supported:

Conversion of text structure including divisions, paragraphs, headers & titles, lists, figures, tables (limited), front matter, back matter.
Conversion of inline and span annotation

Not supported yet:

Only tokenised documents (i.e. with word elements) are implemented currently
Conversion of text markup annotation
Certain higher-order annotation is not converted yet
No explicit tree structure is built yet for hierarchical annotations like syntax annotation
Do note that there is no conversion back from STAM to FoLiA XML currently (that would be complicated for multiple reasons, so might never be realized).

Vocabulary conversion:

Both FoLiA and STAM have the notion of a set or annotation dataset. In FoLiA the scope of such a set is to define the vocabulary used for a particular annotation type (e.g. a tagset). FoLiA itself already defines what annotation types exist. In STAM an annotation dataset is a broader notion and all vocabulary, even the notion of a word or sentence, comes from a set, as nothing is predefined at all aside from the STAM model's primitives.

We map most of the vocabulary of FoLiA itself to a STAM dataset with ID https://w3id.org/folia/v2/. All of FoLiA's annotation types, element types, and common attributes are defined in this set.

Each FoLiA set definition maps to a STAM dataset with the same set ID (URI). The STAM dataset defines a class key, which corresponds to FoLiA's class attribute. Any FoLiA subsets (for features) also translate to key identifiers.

The declarations inside a FoLiA document will be explicitly expressed in STAM as well: each STAM dataset will have an annotation that points to it (with a DataSetSelector). This annotation has data with key declaration (set https://w3id.org/folia/v2/) that marks it as a declaration for a specific type. The value is something like pos-annotation and corresponds one-to-one to the declaration element used in FoLiA XML. Additionally, this annotation has data with key annotationtype (same set as above), where the value corresponds to the annotation type (lowercased, e.g. pos).
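The shape of such a declaration annotation can be illustrated with plain Python dictionaries. This is only a sketch of the mapping described above and does not use the STAM library: the function names and the dictionary layout are hypothetical, and deriving the declaration value by appending -annotation to the lowercased type is an assumption that matches the pos example but may not hold for every annotation type.

```python
FOLIA_VOCAB = "https://w3id.org/folia/v2/"


def declaration_data(annotation_type):
    """Return the (set, key, value) triples carried by a declaration annotation.

    Assumption: the declaration element name is the lowercased annotation
    type followed by "-annotation", as in the pos example.
    """
    annotation_type = annotation_type.lower()
    return [
        (FOLIA_VOCAB, "declaration", annotation_type + "-annotation"),
        (FOLIA_VOCAB, "annotationtype", annotation_type),
    ]


def declaration_annotation(annotation_type, set_uri):
    """Describe an annotation that targets a whole dataset (hypothetical layout)."""
    return {
        "target": {"selector": "DataSetSelector", "dataset": set_uri},
        "data": declaration_data(annotation_type),
    }


if __name__ == "__main__":
    example = declaration_annotation(
        "POS", "http://ilk.uvt.nl/folia/sets/frog-mbpos-cgn"
    )
    for dataset, key, value in example["data"]:
        print(dataset, key, value)
```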

The FoLiA to STAM conversion is RDF-ready. That is, all identifiers are valid IRIs and all FoLiA vocabulary (https://w3id.org/folia/v2/) is backed by a formal ontology using RDF and SKOS.

FoLiA set definitions, if defined, are already in SKOS (or in the legacy format).

Being RDF-ready means that the STAM model produced by folia2stam can in turn easily be exported to W3C Web Annotations. Tooling for that conversion will be provided in Stam Tools.

FoLiA to Salt

Salt is a graph-based annotation model that is designed to act as an intermediate format in the conversion between various annotation formats. It is used by the conversion tool Pepper. Our FoLiA to Salt converter, however, is a standalone tool that is part of these FoLiA tools, rather than integrated into Pepper. You can use folia2salt to convert FoLiA XML to Salt XML and subsequently use Pepper to do conversions to other formats such as TCF, PAULA, TigerXML, GraF, Annis, etc. (there is no guarantee, though, that everything can be preserved accurately in each conversion).

The current state of this conversion is summarised below. It is, however, unlikely that this particular tool will be developed any further:

Conversion of FoLiA tokens to salt SToken nodes
- The converter only supports tokenised FoLiA documents
Text extraction (from tokens) to STextualDS node and conversion to STextualRelation edges
- preserves untokenised text only to a certain degree (using FoLiA's token spacing information only)
- not yet supported: multiple text classes
Conversion of FoLiA Inline Annotation (pos, lemma etc) to salt SAnnotation labels
Conversion of FoLiA Structure Annotation (sentences,paragraph, etc) to salt SSpan nodes and SSpanRelation edges
- converted structures will directly relate to the underlying token nodes rather than to a structural hierarchy like in FoLiA
Conversion of simple FoLiA Span Annotation (entities etc) to salt SSpan nodes and SSpanRelation edges
- Conversion of nested Span Annotation (syntax etc) to SSpan nodes and SDominanceRelation edges
- not yet supported: Span Annotation including span roles (dependencies etc) to SSpan nodes and SDominanceRelation edges
Grouping of annotation types/sets in salt SLayer nodes
Conversion of FoLiA higher order elements:
- Features
- Comments
- Descriptions
- not yet supported:
  - Relations
  - Metrics
  - Span Relations
  - String annotation
  - Alternative annotation
  - Corrections
Conversion of FoLiA phonetic content (as an extra STextualDS node and STextualRelation edges)
Convert FoLiA native metadata
not yet supported:
- Conversion of FoLiA subtoken annotation (morphology/phonology)
- Conversion of FoLiA references to audio/video sources and timing information

Our Salt conversion tries to preserve as much of the FoLiA as possible. We make extensive use of salt's capacity for specifying namespaces to hold and group the annotation type and set of an annotation. SLabel elements with the same namespace should often be considered together.

foliatools's Issues

[foliaerase] Test if foliaerase is capable of stripping markup properly

[tei2folia] Convert w@norm, w@join, fix list handling and various conversion problems

Some issues arose trying to convert a TEI document to FoLiA: http://www.deutschestextarchiv.de/book/show/wolff_anfangsgruende01_1710

There are a lot of elements that cannot be converted:

[tei2folia WARNING] Unhandled tag in structure context: s (in p) (I wonder what triggers this because this seems rather basi
[tei2folia WARNING] Unhandled tag in structure context: fw (in div) (we don't handle fw yet)
[tei2folia WARNING] Unknown tag in structure context: s (in item) (list processing seems to go wrong)

It seems this document diverges from the TEI collections we were used to hitherto.

In addition, there are interesting extra attributes in this document that should be converted, such as norm and join (on w)

foliavalidator handles feature nodes incorrectly when using EXPLICIT mode

Given the file provenance.2.0.0.folia.xml from folia-repo/examples:

when using foliavalidator without EXPLICIT mode, the output file contains this fragment:

        <w xml:id="untitled.p.1.s.1.w.1" class="WORD">
          <t>De</t>
          <pos class="LID(bep,stan,rest)" processor="p1.1" confidence="0.999701" head="LID">
            <feat subset="lwtype" class="bep"/>
            <feat subset="naamval" class="stan"/>
            <feat subset="npagr" class="rest"/>
          </pos>
          <lemma class="de"/>
        </w>

when using -x too:

        <w xml:id="untitled.p.1.s.1.w.1" typegroup="structure" set="tokconfig-nl" class="WORD" processor="p0" textclass="current">
          <t typegroup="content" set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl" class="current">De</t>
          <pos typegroup="inline" set="http://ilk.uvt.nl/folia/sets/frog-mbpos-cgn" class="LID(bep,stan,rest)" processor="p1.1" confidence="0.999701" textclass="current">
            <feat subset="head" class="LID"/>
            <feat subset="lwtype" class="bep"/>
            <feat subset="naamval" class="stan"/>
            <feat subset="npagr" class="rest"/>
          </pos>
          <lemma typegroup="inline" set="http://ilk.uvt.nl/folia/sets/frog-mblem-nl" class="de" processor="p1.2" textclass="current"/>
        </w>

There are 2 problems here:

The typegroup attribute is missing from the <feat> nodes
The 'head' attribute is not inlined as an attribute, but kept as a <feat>

folia2html: handle t-ref

requested by @pirolen

folia2txt vs. FoLiA-2text handling of <part> nodes

Given this FoLiA file:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="parts" generator="libfolia-v2.8" version="2.4.0">
  <metadata type="native">
    <annotations>
      <paragraph-annotation set="set"/>
      <part-annotation set="set"/>
      <style-annotation set="set"/>
      <hyphenation-annotation set="set"/>
      <text-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl"/>
    </annotations>
  </metadata>
  <text xml:id="parts.text">
    <p xml:id="parts.text.p1">
      <feat class="Justified" subset="par_align"/>
      <part xml:id="parts.text.p1.part.1" class="line">
        <part xml:id="parts.text.p1.part.1.part.1" class="fragment">
          <t>
            <t-style>Dit </t-style>
          </t>
        </part>
        <part xml:id="parts.text.p1.part.1.part.2" class="fragment">
          <t>
            <t-style>is een gebroken lijn</t-style>
          </t>
        </part>
      </part>
      <part xml:id="parts.text.p1.part.2" class="line" space="no">
        <t>
          <t-style>met nog een lijn die is af<t-hbr/></t-style>
        </t>
      </part>
      <part xml:id="parts.text.p1.part.3" class="line">
        <part xml:id="parts.text.p1.part.3.part.1" class="fragment">
          <t>
            <t-style>gebroken bij een hyphen.<br/></t-style>
          </t>
          </part>
          <part xml:id="parts.text.p1.part.2.part.2" class="fragment">
            <t>
              <t-style>nieuwe regel</t-style>
            </t>
          </part>
      </part>
    </p>
  </text>
</FoLiA>

Note the space="no" and <t-hbr/> and the <br/>!

folia2txt produces this text:

            Dit 
           
            is een gebroken lijn
           
          met nog een lijn die is af
        
            gebroken bij een hyphen.

           
              nieuwe regel

OTOH, FoLiA-2text produces:

Dit is een gebroken lijn met nog een lijn die is afgebroken bij een hyphen.
nieuwe regel

I wonder which of the two is "correct". That folia2txt ignores the space="no" seems a problem to me anyway.

I tried the same with one level of <part> nodes less:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="parts" generator="libfolia-v2.8" version="2.4.0">
  <metadata type="native">
    <annotations>
      <paragraph-annotation set="set"/>
      <part-annotation set="set"/>
      <style-annotation set="set"/>
      <hyphenation-annotation set="set"/>
      <text-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl"/>
    </annotations>
  </metadata>
  <text xml:id="parts.text">
    <p xml:id="parts.text.p1">
      <feat class="Justified" subset="par_align"/>
      <part xml:id="parts.text.p1.part.1." class="line">
        <t>
          <t-style>Dit </t-style>
        </t>
      </part>
      <part xml:id="parts.text.p1.part.2" class="line">
        <t>
          <t-style>is een gebroken lijn</t-style>
        </t>
      </part>
      <part xml:id="parts.text.p1.part.3" class="line" space="no">
        <t>
          <t-style>met nog een lijn die is af<t-hbr/></t-style>
        </t>
      </part>
      <part xml:id="parts.text.p1.part.4" class="line">
        <t>
          <t-style>gebroken bij een hyphen.<br/></t-style>
        </t>
      </part>
      <part xml:id="parts.text.p1.part.5" class="fragment">
        <t>
          <t-style>nieuwe regel</t-style>
        </t>
      </part>
    </p>
  </text>
</FoLiA>

In that case the results are:
folia2txt:

            Dit 
           
            is een gebroken lijn
           
          met nog een lijn die is af
        
            gebroken bij een hyphen.

           
              nieuwe regel

FoLiA-2text:

Dit is een gebroken lijn met nog een lijn die is afgebroken bij een hyphen.
nieuwe regel

(which is the same as before)

So again: which one is "correct"?

folia2html: XSL conversion results in extra spaces

Reported by @pirolen:

A space appears around the html spans that hold superscripted text, which are not there in the FoLiA.
E.g. see in the attached file this part

<t-str xml:id="FA-Prototyp_MWG-I-23_147-215_001.text.div1.p4.t-str.9">
            <t-style><feat class="Times New Roman" subset="font_family"/><feat class="10." subset="font_size"/><feat class="{70B504C2-AD38-496F-9F2A-B6E0061724F6}" subset="font_style"/>Aufsatz im Logos IV (1913, S.253ff.</t-style>
            <t-style><feat class="superscript" subset="font_typeface"/><feat class="Times New Roman" subset="font_family"/><feat class="10." subset="font_size"/><feat class="{70B504C2-AD38-496F-9F2A-B6E0061724F6}" subset="font_style"/>a</t-style>
            <t-style><feat class="Times New Roman" subset="font_family"/><feat class="10." subset="font_size"/><feat class="{70B504C2-AD38-496F-9F2A-B6E0061724F6}" subset="font_style"/>)</t-style>
            <t-style><feat class="superscript" subset="font_typeface"/><feat class="Times New Roman" subset="font_family"/><feat class="10." subset="font_size"/><feat class="{70B504C2-AD38-496F-9F2A-B6E0061724F6}" subset="font_style"/>1</t-style>
            <t-style><feat class="Times New Roman" subset="font_family"/><feat class="10." subset="font_size"/><feat class="{70B504C2-AD38-496F-9F2A-B6E0061724F6}" subset="font_style"/> ist die Terminologie tunlichst ver<t-hbr/></t-style>
          </t-str>

Solving this in XSL will be hard so this might need to be handled in a preprocessing step in folia2html itself. This issue relates to proycon/folia#92 , proycon/folia#88 , and LanguageMachines/foliautils#56

processing instruction problem?

FoLiA-tools v2.5.4, using FoLiA v2.5.1 with library FoLiApy v2.5.8

gcnd.test.folia.xml.txt

This file validates when I omit the processing instructions (<?n_elan_annotations 2?> etc), but with the processing instructions:

VALIDATION ERROR on full parse by library (stage 2/3), in data/GCND/gcnd.test.folia.xml.txt
AttributeError: 'cython_function_or_method' object has no attribute 'startswith'

(Of course, I can use a comment or something else for this type of information, so it is not a showstopper.)

[foliatextcontent] propagate markup information to higher/lower levels

If there is markup information in a higher text layer, say on paragraph level, we want to be able to replicate that markup information on lower levels (say sentences or words), if not yet available. We also want the reverse: if there is markup information on lower levels, we want to express it on higher levels too.

tei2folia: autodeclare should be enabled?

I pip installed foliatools.
On the attached small test file I ran tei2folia and got the error below.

In the main.py of foliapy I see #autodeclare is enabled (default for FoLiA v2).

$ tei2folia  --traceback  /home/pirol/quanti/devel/diagn/collate1.tei.xml -o /home/pirol/quanti/devel/diagn/
Instantiating XML parser
Converting /home/pirol/quanti/devel/diagn/collate1.tei.xml
VALIDATION ERROR on full parse by library in /home/pirol/quanti/devel/diagn/collate1.tei.xml
DeclarationError: Encountered an instance without proper declaration: Comment <comment>!
-- Full traceback follows -->
Traceback (most recent call last):
  File "/home/pirol/quanti/devel/lama/lama/lib/python3.8/site-packages/foliatools/tei2folia.py", line 86, in convert
    doc = folia.Document(tree=transformed, debug=kwargs.get('debug',0))
  File "/home/pirol/quanti/devel/lama/lama/lib/python3.8/site-packages/folia/main.py", line 7427, in __init__
    self.parsexml(kwargs['tree'])
  File "/home/pirol/quanti/devel/lama/lama/lib/python3.8/site-packages/folia/main.py", line 8646, in parsexml
    return Class.parsexml(node,self)
  File "/home/pirol/quanti/devel/lama/lama/lib/python3.8/site-packages/folia/main.py", line 3575, in parsexml
    return super(Comment,Class).parsexml(node, doc, **kwargs)
  File "/home/pirol/quanti/devel/lama/lama/lib/python3.8/site-packages/folia/main.py", line 3416, in parsexml
    instance = Class(doc, *args, **kwargs)
  File "/home/pirol/quanti/devel/lama/lama/lib/python3.8/site-packages/folia/main.py", line 3546, in __init__
    super(Comment,self).__init__(doc, *args, **kwargs)
  File "/home/pirol/quanti/devel/lama/lama/lib/python3.8/site-packages/folia/main.py", line 659, in __init__
    kwargs = self.parsecommonarguments(doc, **kwargs)
  File "/home/pirol/quanti/devel/lama/lama/lib/python3.8/site-packages/folia/main.py", line 787, in parsecommonarguments
    self.checkdeclaration()
  File "/home/pirol/quanti/devel/lama/lama/lib/python3.8/site-packages/folia/main.py", line 1190, in checkdeclaration
    raise DeclarationError("Encountered an instance without proper declaration: " + self.__class__.__name__ + " <" + self.__class__.XMLTAG + ">!")
folia.main.DeclarationError: Encountered an instance without proper declaration: Comment <comment>!
Unable to convert  /home/pirol/quanti/devel/diagn/collate1.tei.xml

collate1.tei.xml.txt

[tei2folia] Text body not getting converted from TEI5 doc

I have a few toy TEI5 XML documents that include <w> and <c> elements, and annotations as <rs type=...> elements.
tei2folia generates output from them, but the document body is empty.
What could be the reason? I am attaching the input/output docs.

The TEI was generated by INCEpTION.
It uses the DKPro Core TEI reader / writer which supports a subset of TEI. The elements are listed here: https://dkpro.github.io/dkpro-core/releases/2.2.0/docs/format-reference.html#format-Tei

N.B. I randomly chose a TEI validation method: https://trafilatura.readthedocs.io/en/latest/tutorial2.html
and the file did not validate.

I understand from the developers that the TEI reader / writer were developed using various TEI files from different sources as test material. If one has particular problems with data not validating, one can report this as an issue in the INCEpTION or DKPro Core GitHub issue trackers.

FA-MBK-4-3_035245008_0019_abpproc_entries.inctei.folia.xml.txt
FA-MBK-4-3_035245008_0019_abpproc_entries.inctei.xml.txt

foliavalidator gives non informative error message when the version attribute is missing

When validating a document without a version attribute, I get:

Error on line 0: Element FoLiA failed to validate attributes
VALIDATION ERROR against RelaxNG schema (stage 1/3), in tests/GRR.xml
Element FoLiA failed to validate attributes

A more helpful message would include a reference to the missing attribute
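One way to obtain such a message is a pre-check on the root element before schema validation. The sketch below is not how foliavalidator works internally; the function name missing_root_attributes is hypothetical, and the list of required attributes is an assumption that contains only the version attribute mentioned in this report.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"


def missing_root_attributes(xml_text, required=("version",)):
    """Return the names of required attributes absent from the FoLiA root."""
    root = ET.fromstring(xml_text)
    if root.tag != "{%s}FoLiA" % FOLIA_NS:
        raise ValueError("Root element is not FoLiA: " + root.tag)
    return [name for name in required if name not in root.attrib]


if __name__ == "__main__":
    document = (
        '<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="example">'
        '<text xml:id="example.text"/></FoLiA>'
    )
    for name in missing_root_attributes(document):
        print("Element FoLiA is missing required attribute: " + name)
```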

folia2html: Implement support for outputting based on other text classes

folia2html didn't support handling other textclasses yet, it only looked at the default 'current' text class. We need a configurable parameter to change this behaviour to output for non-default text classes.

As reported by @pirolen

NotImplementedError: <class 'foliatools.rst2folia.FoLiATranslator'> visiting unknown node type: container

Encountered this when using piereling for converting an Abbyy-produced HTML file ('.htm'), cf.
https://webservices.cls.ru.nl/piereling/a/output/error.log

Undefined environment variable: CUSTOMHTML_INDEX
Running pandoc --from=html --to=rst "input/kap1_pp1_2_adobeimgpdf_abbyy.htm.html" > "output/kap1_pp1_2_adobeimgpdf_abbyy.rst"

Running rst2folia --docid="kap1_pp1_2_adobeimgpdf_abbyy" "output/kap1_pp1_2_adobeimgpdf_abbyy.rst" "output/kap1_pp1_2_adobeimgpdf_abbyy.folia.xml"
NotImplementedError: <class 'foliatools.rst2folia.FoLiATranslator'> visiting unknown node type: container

folia2txt yields empty output on files generated by (newer?) ucto and FoLiA-abby

Have completely updated LaMachine, and run folia2txt on e.g. this file linked at the FLAT issue today
The input was an empty string.

Works fine with other files (e.g. TICCL output).

txt2folia fails on control character on single line

When a text file has invalid control characters in a line, they are stripped. But if such a character is the sole input for a word, we may end up with empty text, which is not allowed. An extra check is needed to prevent this.
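The extra check could look roughly like the following. This is a minimal sketch rather than the actual txt2folia code: the function names are hypothetical, and it assumes that "invalid control characters" means characters in Unicode category Cc.

```python
import unicodedata


def strip_control_characters(text):
    """Remove all characters in Unicode category Cc (control characters)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cc")


def clean_tokens(tokens):
    """Strip control characters and drop tokens that end up empty."""
    cleaned = (strip_control_characters(token).strip() for token in tokens)
    return [token for token in cleaned if token]


if __name__ == "__main__":
    # The second token consists of a single control character and is dropped.
    print(clean_tokens(["hello", "\x07", "wor\x00ld"]))
```

Dropping the token before a word element is created avoids ever producing empty text content.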

[folia2salt] Question: are List annotations supported yet?

In LaMachine I tried out folia2salt, but I got:

Exception: Unable to init layer for element <ListItem at 140407838899560 id=FA-MBK-4-3_035245008_0020_abpproc_partransf.text.1.div.1.p.1.list.1.item.1 set=None class=None>

I wonder if List annotations are supported.

validator rejects folia 2.0 document without a text-annotation declaration

The validator rejects documents without a text-annotation declaration.

ParseError: FoLiA exception in handling of <t> @ line 50: [DeclarationError] Set 'https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl' is used for TextContent <t>, but has no declaration!

I was under the impression that in such cases a default should be implied, like this:

      <text-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl"/>

Also this warning is quite confusing, as it mentions TextContent, and not Text.
Wouldn't 'textcontent-annotation' have been a better idea, to express this relation?
And also 'phoncontent-annotation'...

tei2folia: Adding a value to -i parameter

Hi proycon,

With Ucto, I use the parameter '--id' to which I give as value the relevant part of the file name. This then becomes the first part of the ID of each xml element. If --id is not specified, this by default becomes 'undefined'.

I tried the same in what seemed the similar parameter in tei2folia, i.e. '-i'. Adding a parameter there, breaks it. The help function gives: '-i, --ids Generate IDs for all structural elements (default: False)'. This is unclear and as a consequence I have no idea how it works. When I use it without a value, the result is that element IDs start with 'undefined'. This feature does not seem to be further documented anywhere else either.

I would very much like tei2folia to also have the same functionality in this as Ucto has. Can this be implemented, please?

Thank you!

> I think this is an example of a more fundamental question:

I think this is an example of a more fundamental question:
Do Structure elements carry some textual information, even when they don't contain any text?

For a <cell> the answer seems YES. But are there other examples?
Empty Paragraphs? Empty Sentences? I don't know.

This indeed seems the fundamental question which we hadn't considered earlier. The only three I can think of that fit this are <cell> and perhaps <row> and maybe even <item>. We would need an additional mechanism to accommodate this in the libraries.

Originally posted by @proycon in #41 (comment)

Validator should output warnings or notices if things are declared but not used

the validator rejects valid folia using 'alias'-es for set definitions

ml.txt
Given the attached file, I would expect no errors, but the validator states:

VALIDATION ERROR on full parse by library (stage 2/3), in ml.xml
ParseError: FoLiA exception in handling of <w> @ line 52: [DeclarationError] Set 'tokconfig-nld' is used for Word <w>, but has no declaration!

This is not correct, as a declaration exists, using an alias:

      <token-annotation alias="tokconfig-nld" set="https://raw.githubusercontent.com/LanguageMachines/uctodata/master/setdefinitions/tokconfig-nld.foliaset.ttl">
        <annotator processor="ucto.1"/>
      </token-annotation>
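
The lookup the reporter expects can be sketched as follows. This is a minimal illustration and not the validator's actual code: `build_alias_map` and `resolve_set` are hypothetical helpers, the declaration block is parsed with the standard library rather than with foliapy, and the sketch ignores that declarations are really kept per annotation type.

```python
import xml.etree.ElementTree as ET

# Declaration block copied from the report, wrapped in <annotations>.
DECLARATIONS = """
<annotations xmlns="http://ilk.uvt.nl/folia">
  <token-annotation alias="tokconfig-nld" set="https://raw.githubusercontent.com/LanguageMachines/uctodata/master/setdefinitions/tokconfig-nld.foliaset.ttl">
    <annotator processor="ucto.1"/>
  </token-annotation>
</annotations>
"""

def build_alias_map(annotations_xml):
    """Map every declared set URL and every alias to the full set URL."""
    mapping = {}
    for decl in ET.fromstring(annotations_xml):
        set_url = decl.get("set")
        if set_url is None:
            continue  # declaration without a set, nothing to map
        mapping[set_url] = set_url
        alias = decl.get("alias")
        if alias:
            mapping[alias] = set_url
    return mapping

def resolve_set(name, mapping):
    """Return the declared set for a set name or alias, or None if undeclared."""
    return mapping.get(name)

alias_map = build_alias_map(DECLARATIONS)
print(resolve_set("tokconfig-nld", alias_map))
```

With a lookup of this kind, a `<w>` that refers to 'tokconfig-nld' resolves to the declared set instead of triggering a DeclarationError.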

[foliatools] add conllu2folia and folia2conllu tools

Provide a converter from/to the heavily used CoNLL-U format

Also provide FoLiA Set Definitions for Universal dependencies and Universal PoS tags.

How to handle empty cells in folia2txt and FoLiA-2text

Both folia2txt and FoLiA-2text do not handle empty <cell> nodes in a table correctly, imho. I consider this to be a bug.

see this file: cell_problem.xml.txt

It has 2 rows of 3 cells each, but the upper-left cell is empty.
Both folia2txt and FoLiA-2text output:

Kop 2 | Kop 3
Rij 2 | Veld 2 | Veld 3

Which is wrong. The correct result should be:

 | Kop 2 | Kop 3
Rij 2 | Veld 2 | Veld 3

(leaving proper layout to a later moment :P )

Entering an empty text in the upper-left cell is impossible, as empty strings are forbidden in FoLiA.
What should we do? Adapt the programs to handle empty cells?
As a last resort we could add some marker in the cell that it is empty, and use that.
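
Adapting the programs could look roughly like the sketch below. It is only an illustration under assumptions: the table is a hand-written reconstruction of the case described (the attached file is not reproduced here), `table_to_text` is a hypothetical helper, and the standard library is used instead of foliapy or libfolia. The point is that a cell without text still contributes an empty field to the row.

```python
import xml.etree.ElementTree as ET

NS = {"f": "http://ilk.uvt.nl/folia"}

# Hypothetical reconstruction of the reported table: the upper-left cell is empty.
TABLE = """
<table xmlns="http://ilk.uvt.nl/folia">
  <row>
    <cell/>
    <cell><t>Kop 2</t></cell>
    <cell><t>Kop 3</t></cell>
  </row>
  <row>
    <cell><t>Rij 2</t></cell>
    <cell><t>Veld 2</t></cell>
    <cell><t>Veld 3</t></cell>
  </row>
</table>
"""

def table_to_text(table_xml, delimiter=" | "):
    """Render each row on one line; cells without text become empty fields."""
    table = ET.fromstring(table_xml)
    lines = []
    for row in table.findall("f:row", NS):
        fields = []
        for cell in row.findall("f:cell", NS):
            t = cell.find("f:t", NS)
            fields.append(t.text if t is not None and t.text else "")
        lines.append(delimiter.join(fields))
    return "\n".join(lines)

print(table_to_text(TABLE))
```

Because the empty cell is kept as an empty string before joining, the first line starts with the delimiter, which matches the result given as correct above.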

FoLiA to STAM conversion

Implement a folia2stam tool to export to
STAM. This is also relevant for
proycon/folia#102 as STAM will act as the pivot model to convert to Web
Annotation.

This tool will essentially split the text from the annotations (which we
sometimes refer to as 'untangling'). The main challenge is getting the offsets
right, but we already have FoLiA tooling that should help there.

The FoLiA-specific vocabulary will be maintained; this also relates to proycon/folia#4.

foliaupgrade is way too slow

(reported by @ceramisch)

foliaupgrade is excessively slow, taking up to 15 minutes for a document that validates in 2 seconds! Something is clearly wrong.

This affects loading times of old (to be upgraded) documents in FLAT.

conllu2folia + folia2annotatedtxt

I'm testing the conversion from CoNLL-U to FoLiA, and then to annotated text, using the following CoNLL-U file called traindata.conllu:

# newdoc id = doc1
# newpar
# sent_id = 1
# text = Ik ben de weg kwijt, kunt u me zeggen waar de Lange Wapper ligt?
1	Ik	ik	PRON	Pron|per|1|ev|nom	Case=Nom|Number=Sing|Person=1|PronType=Prs	5	nsubj	_	_
2	ben	ben	AUX	V|hulpofkopp|ott|1|ev	Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin	5	cop	_	_
3	de	de	DET	Art|bep|zijdofmv|neut	Definite=Def|PronType=Art	4	det	_	_
4	weg	weg	NOUN	N|soort|ev|neut	Number=Sing	5	obj	_	_
5	kwijt	kwijt	ADJ	Adj|attr|stell|onverv	Degree=Pos	0	root	_	SpaceAfter=No
6	,	,	PUNCT	Punc|komma	PunctType=Comm	5	punct	_	_
7	kunt	kan	VERB	V|hulp|ott|2|ev	Aspect=Imp|Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin|VerbType=Mod	5	parataxis	_	_
8	u	u	PRON	Pron|per|2|ev|nom	Case=Nom|Number=Sing|Person=2|PronType=Prs	7	nsubj	_	_
9	me	me	PRON	Pron|per|1|ev|datofacc	Case=Acc,Dat|Number=Sing|Person=1|PronType=Prs	10	obj	_	_
10	zeggen	zeg	VERB	V|trans|inf	Subcat=Tran|VerbForm=Inf	7	xcomp	_	_
11	waar	waar	ADV	Adv|gew|vrag	PronType=Int	15	mark	_	_
12	de	de	DET	Art|bep|zijdofmv|neut	Definite=Def|PronType=Art	13	det	_	_
13	Lange	Lange	PROPN	N_N|eigen|ev|neut_eigen|ev|neut	_	15	nsubj	_	_
14	Wapper	Wapper	PROPN	PROPN	_	13	flat	_	_
15	ligt	lig	VERB	V|intrans|ott|3|ev	Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Subcat=Intr|Tense=Pres|VerbForm=Fin	10	acl	_	SpaceAfter=No
16	?	?	PUNCT	Punc|vraag	PunctType=Qest	5	punct	_	_

# sent_id = 2
# text = Jazeker meneer
1	Jazeker	zeker	ADJ	Adj|attr|stell|onverv	Degree=Pos	2	amod	_	_
2	meneer	meneer	NOUN	N|soort|ev|neut	Number=Sing	0	root	_	SpacesAfter=\n

# newdoc id = doc2
# newpar
# sent_id = 1
# text = Het gaat vooruit, het gaat verbazend goed vooruit
1	Het	het	PRON	Pron|onbep|neut|zelfst	PronType=Ind	2	nsubj	_	_
2	gaat	ga	VERB	V|intrans|ott|3|ev	Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Subcat=Intr|Tense=Pres|VerbForm=Fin	0	root	_	_
3	vooruit	vooruit	ADV	Adv|gew|geenfunc|stell|onverv	Degree=Pos	2	advmod	_	SpaceAfter=No
4	,	,	PUNCT	Punc|komma	PunctType=Comm	2	punct	_	_
5	het	het	PRON	Pron|onbep|neut|zelfst	PronType=Ind	6	nsubj	_	_
6	gaat	ga	VERB	V|intrans|ott|3|ev	Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Subcat=Intr|Tense=Pres|VerbForm=Fin	2	parataxis	_	_
7	verbazend	verbazend	VERB	V|intrans|tegdw|onverv	Subcat=Intr|Tense=Pres|VerbForm=Part	6	advcl	_	_
8	goed	goed	ADJ	Adj|adv|stell|onverv	Degree=Pos|Variant=Short	6	obl	_	_
9	vooruit	vooruit	ADV	Adv|gew|geenfunc|stell|onverv	Degree=Pos	6	compound:prt	_	SpacesAfter=\n

Jan@bnosac MINGW64 ~/Dropbox/Work/RForgeBNOSAC/BNOSAC/udpipe/inst/dummydata (master)
$ conllu2folia traindata.conllu
Wrote doc1.folia.xml
Wrote doc2.folia.xml

Jan@bnosac MINGW64 ~/Dropbox/Work/RForgeBNOSAC/BNOSAC/udpipe/inst/dummydata (master)
$ folia2annotatedtxt -c text,pos,lemma doc1.folia.xml > test.tmp
Processing doc1.folia.xml
Traceback (most recent call last):
  File "c:\python39\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "c:\python39\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Python39\Scripts\folia2annotatedtxt.exe\__main__.py", line 7, in <module>
  File "c:\python39\lib\site-packages\foliatools\folia2annotatedtxt.py", line 117, in main
    process(x, outputfile)
  File "c:\python39\lib\site-packages\foliatools\folia2annotatedtxt.py", line 174, in process
    if w.paragraph() != prevpar and i > 0:
  File "c:\python39\lib\site-packages\folia\main.py", line 3844, in paragraph
    return self.ancestor(Paragraph)
  File "c:\python39\lib\site-packages\folia\main.py", line 2528, in ancestor
    raise NoSuchAnnotation
folia.main.NoSuchAnnotation

File causing the failure here (doc1.folia.xml) looks like this

<?xml version='1.0' encoding='utf-8'?>
<FoLiA xmlns="http://ilk.uvt.nl/folia" xmlns:xlink="http://www.w3.org/1999/xlink" xml:id="doc1" version="2.5.1" generator="foliapy-v2.5.6">
  <metadata type="native">
    <annotations>
      <text-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl">
        <annotator processor="proc.conllu2folia.5e385a4e"/>
      </text-annotation>
      <sentence-annotation>
        <annotator processor="proc.conllu2folia.5e385a4e"/>
      </sentence-annotation>
      <token-annotation>
        <annotator processor="proc.conllu2folia.5e385a4e"/>
      </token-annotation>
      <pos-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/universal-pos.foliaset.ttl">
        <annotator processor="proc.conllu2folia.5e385a4e"/>
      </pos-annotation>
      <pos-annotation set="undefined">
        <annotator processor="proc.conllu2folia.5e385a4e"/>
      </pos-annotation>
      <lemma-annotation set="undefined">
        <annotator processor="proc.conllu2folia.5e385a4e"/>
      </lemma-annotation>
      <dependency-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/universal-dependencies.foliaset.ttl">
        <annotator processor="proc.conllu2folia.5e385a4e"/>
      </dependency-annotation>
    </annotations>
    <provenance>
      <processor xml:id="proc.conllu2folia.5e385a4e" name="conllu2folia" type="auto" version="2.5.2" folia_version="2.5.1" command="conllu2folia traindata.conllu" host="bnosac" begindatetime="2021-08-25T17:34:37">
        <processor xml:id="proc.conllu2folia.5e385a4e.generator" name="foliapy" type="generator" version="2.5.6" folia_version="2.5.1" src="https://github.com/proycon/foliapy"/>
      </processor>
    </provenance>
  </metadata>
  <text xml:id="doc1.text">
    <s xml:id="doc1.s.1">
      <t class="original">Ik ben de weg kwijt, kunt u me zeggen waar de Lange Wapper ligt?</t>
      <w xml:id="doc1.s.1.w.1">
        <t>Ik</t>
        <pos set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/universal-pos.foliaset.ttl" class="PRON">
          <feat subset="Case" class="Nom"/>
          <feat subset="Number" class="Sing"/>
          <feat subset="Person" class="1"/>
          <feat subset="PronType" class="Prs"/>
        </pos>
        <pos set="undefined" class="Pron|per|1|ev|nom"/>
        <lemma class="ik"/>
      </w>
      <w xml:id="doc1.s.1.w.2">
        <t>ben</t>
        <pos set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/universal-pos.foliaset.ttl" class="AUX">
          <feat subset="Aspect" class="Imp"/>
          <feat subset="Mood" class="Ind"/>
          <feat subset="Number" class="Sing"/>
          <feat subset="Person" class="1"/>
          <feat subset="Tense" class="Pres"/>
          <feat subset="VerbForm" class="Fin"/>
        </pos>
        <pos set="undefined" class="V|hulpofkopp|ott|1|ev"/>
        <lemma class="ben"/>
      </w>
      <w xml:id="doc1.s.1.w.3">
        <t>de</t>
        <pos set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/universal-pos.foliaset.ttl" class="DET">
          <feat subset="Definite" class="Def"/>
          <feat subset="PronType" class="Art"/>
        </pos>
        <pos set="undefined" class="Art|bep|zijdofmv|neut"/>
        <lemma class="de"/>
      </w>
      <w xml:id="doc1.s.1.w.4">
        <t>weg</t>
        <pos set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/universal-pos.foliaset.ttl" class="NOUN">
          <feat subset="Number" class="Sing"/>
        </pos>
        <pos set="undefined" class="N|soort|ev|neut"/>
        <lemma class="weg"/>
      </w>
      <w xml:id="doc1.s.1.w.5" space="no">
        <t>kwijt</t>
        <pos set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/universal-pos.foliaset.ttl" class="ADJ">
          <feat subset="Degree" class="Pos"/>
        </pos>
        <pos set="undefined" class="Adj|attr|stell|onverv"/>
        <lemma class="kwijt"/>
      </w>
      <w xml:id="doc1.s.1.w.6">
        <t>,</t>
        <pos set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/universal-pos.foliaset.ttl" class="PUNCT">
          <feat subset="PunctType" class="Comm"/>
        </pos>
        <pos set="undefined" class="Punc|komma"/>
        <lemma class=","/>
      </w>
      <w xml:id="doc1.s.1.w.7">
        <t>kunt</t>
        <pos set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/universal-pos.foliaset.ttl" class="VERB">
          <feat subset="Aspect" class="Imp"/>
          <feat subset="Mood" class="Ind"/>
          <feat subset="Number" class="Sing"/>
          <feat subset="Person" class="2"/>
          <feat subset="Tense" class="Pres"/>
          <feat subset="VerbForm" class="Fin"/>
          <feat subset="VerbType" class="Mod"/>
        </pos>
        <pos set="undefined" class="V|hulp|ott|2|ev"/>
        <lemma class="kan"/>
      </w>
      <w xml:id="doc1.s.1.w.8">
        <t>u</t>
        <pos set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/universal-pos.foliaset.ttl" class="PRON">
          <feat subset="Case" class="Nom"/>
          <feat subset="Number" class="Sing"/>
          <feat subset="Person" class="2"/>
          <feat subset="PronType" class="Prs"/>
        </pos>
        <pos set="undefined" class="Pron|per|2|ev|nom"/>
        <lemma class="u"/>
      </w>
      <w xml:id="doc1.s.1.w.9">
        <t>me</t>
        <pos set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/universal-pos.foliaset.ttl" class="PRON">
          <feat subset="Case" class="Acc,Dat"/>
          <feat subset="Number" class="Sing"/>
          <feat subset="Person" class="1"/>
          <feat subset="PronType" class="Prs"/>
        </pos>
        <pos set="undefined" class="Pron|per|1|ev|datofacc"/>
        <lemma class="me"/>
      </w>
      <w xml:id="doc1.s.1.w.10">
        <t>zeggen</t>
        <pos set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/universal-pos.foliaset.ttl" class="VERB">
          <feat subset="Subcat" class="Tran"/>
          <feat subset="VerbForm" class="Inf"/>
        </pos>
        <pos set="undefined" class="V|trans|inf"/>
        <lemma class="zeg"/>
      </w>
      <w xml:id="doc1.s.1.w.11">
        <t>waar</t>
        <pos set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/universal-pos.foliaset.ttl" class="ADV">
          <feat subset="PronType" class="Int"/>
        </pos>
        <pos set="undefined" class="Adv|gew|vrag"/>
        <lemma class="waar"/>
      </w>
      <w xml:id="doc1.s.1.w.12">
        <t>de</t>
        <pos set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/universal-pos.foliaset.ttl" class="DET">
          <feat subset="Definite" class="Def"/>
          <feat subset="PronType" class="Art"/>
        </pos>
        <pos set="undefined" class="Art|bep|zijdofmv|neut"/>
        <lemma class="de"/>
      </w>
      <w xml:id="doc1.s.1.w.13">
        <t>Lange</t>
        <pos set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/universal-pos.foliaset.ttl" class="PROPN"/>
        <pos set="undefined" class="N_N|eigen|ev|neut_eigen|ev|neut"/>
        <lemma class="Lange"/>
      </w>
      <w xml:id="doc1.s.1.w.14">
        <t>Wapper</t>
        <pos set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/universal-pos.foliaset.ttl" class="PROPN"/>
        <pos set="undefined" class="PROPN"/>
        <lemma class="Wapper"/>
      </w>
      <w xml:id="doc1.s.1.w.15" space="no">
        <t>ligt</t>
        <pos set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/universal-pos.foliaset.ttl" class="VERB">
          <feat subset="Aspect" class="Imp"/>
          <feat subset="Mood" class="Ind"/>
          <feat subset="Number" class="Sing"/>
          <feat subset="Person" class="3"/>
          <feat subset="Subcat" class="Intr"/>
          <feat subset="Tense" class="Pres"/>
          <feat subset="VerbForm" class="Fin"/>
        </pos>
        <pos set="undefined" class="V|intrans|ott|3|ev"/>
        <lemma class="lig"/>
      </w>
      <w xml:id="doc1.s.1.w.16">
        <t>?</t>
        <pos set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/universal-pos.foliaset.ttl" class="PUNCT">
          <feat subset="PunctType" class="Qest"/>
        </pos>
        <pos set="undefined" class="Punc|vraag"/>
        <lemma class="?"/>
      </w>
      <dependencies>
        <dependency class="nsubj">
          <dep>
            <wref id="doc1.s.1.w.1" t="Ik"/>
          </dep>
          <hd>
            <wref id="doc1.s.1.w.5" t="kwijt"/>
          </hd>
        </dependency>
        <dependency class="cop">
          <dep>
            <wref id="doc1.s.1.w.2" t="ben"/>
          </dep>
          <hd>
            <wref id="doc1.s.1.w.5" t="kwijt"/>
          </hd>
        </dependency>
        <dependency class="det">
          <dep>
            <wref id="doc1.s.1.w.3" t="de"/>
          </dep>
          <hd>
            <wref id="doc1.s.1.w.4" t="weg"/>
          </hd>
        </dependency>
        <dependency class="obj">
          <dep>
            <wref id="doc1.s.1.w.4" t="weg"/>
          </dep>
          <hd>
            <wref id="doc1.s.1.w.5" t="kwijt"/>
          </hd>
        </dependency>
        <dependency class="punct">
          <hd>
            <wref id="doc1.s.1.w.5" t="kwijt"/>
          </hd>
          <dep>
            <wref id="doc1.s.1.w.6" t=","/>
          </dep>
        </dependency>
        <dependency class="parataxis">
          <hd>
            <wref id="doc1.s.1.w.5" t="kwijt"/>
          </hd>
          <dep>
            <wref id="doc1.s.1.w.7" t="kunt"/>
          </dep>
        </dependency>
        <dependency class="nsubj">
          <hd>
            <wref id="doc1.s.1.w.7" t="kunt"/>
          </hd>
          <dep>
            <wref id="doc1.s.1.w.8" t="u"/>
          </dep>
        </dependency>
        <dependency class="obj">
          <dep>
            <wref id="doc1.s.1.w.9" t="me"/>
          </dep>
          <hd>
            <wref id="doc1.s.1.w.10" t="zeggen"/>
          </hd>
        </dependency>
        <dependency class="xcomp">
          <hd>
            <wref id="doc1.s.1.w.7" t="kunt"/>
          </hd>
          <dep>
            <wref id="doc1.s.1.w.10" t="zeggen"/>
          </dep>
        </dependency>
        <dependency class="mark">
          <dep>
            <wref id="doc1.s.1.w.11" t="waar"/>
          </dep>
          <hd>
            <wref id="doc1.s.1.w.15" t="ligt"/>
          </hd>
        </dependency>
        <dependency class="det">
          <dep>
            <wref id="doc1.s.1.w.12" t="de"/>
          </dep>
          <hd>
            <wref id="doc1.s.1.w.13" t="Lange"/>
          </hd>
        </dependency>
        <dependency class="nsubj">
          <dep>
            <wref id="doc1.s.1.w.13" t="Lange"/>
          </dep>
          <hd>
            <wref id="doc1.s.1.w.15" t="ligt"/>
          </hd>
        </dependency>
        <dependency class="flat">
          <hd>
            <wref id="doc1.s.1.w.13" t="Lange"/>
          </hd>
          <dep>
            <wref id="doc1.s.1.w.14" t="Wapper"/>
          </dep>
        </dependency>
        <dependency class="acl">
          <hd>
            <wref id="doc1.s.1.w.10" t="zeggen"/>
          </hd>
          <dep>
            <wref id="doc1.s.1.w.15" t="ligt"/>
          </dep>
        </dependency>
        <dependency class="punct">
          <hd>
            <wref id="doc1.s.1.w.5" t="kwijt"/>
          </hd>
          <dep>
            <wref id="doc1.s.1.w.16" t="?"/>
          </dep>
        </dependency>
      </dependencies>
    </s>
    <s xml:id="doc1.s.2">
      <t class="original">Jazeker meneer</t>
      <w xml:id="doc1.s.2.w.1">
        <t>Jazeker</t>
        <pos set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/universal-pos.foliaset.ttl" class="ADJ">
          <feat subset="Degree" class="Pos"/>
        </pos>
        <pos set="undefined" class="Adj|attr|stell|onverv"/>
        <lemma class="zeker"/>
      </w>
      <w xml:id="doc1.s.2.w.2">
        <t>meneer</t>
        <pos set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/universal-pos.foliaset.ttl" class="NOUN">
          <feat subset="Number" class="Sing"/>
        </pos>
        <pos set="undefined" class="N|soort|ev|neut"/>
        <lemma class="meneer"/>
      </w>
      <dependencies>
        <dependency class="amod">
          <dep>
            <wref id="doc1.s.2.w.1" t="Jazeker"/>
          </dep>
          <hd>
            <wref id="doc1.s.2.w.2" t="meneer"/>
          </hd>
        </dependency>
      </dependencies>
    </s>
  </text>
</FoLiA>
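
The traceback in this report ends in `w.paragraph()` raising `NoSuchAnnotation`, and the generated document indeed has sentences directly under `<text>` with no `<p>` elements. A defensive lookup could be sketched as below. Everything here is a stand-in so that the example is self-contained: `StubWord` is hypothetical, the exception class only mimics `folia.main.NoSuchAnnotation`, and this is not presented as the actual fix in folia2annotatedtxt.

```python
class NoSuchAnnotation(Exception):
    """Stand-in for folia.main.NoSuchAnnotation (named in the traceback)."""

class StubWord:
    """Hypothetical stand-in for a FoLiA word; not the foliapy class."""
    def __init__(self, text, paragraph=None):
        self.text = text
        self._paragraph = paragraph

    def paragraph(self):
        # Mimics the reported behaviour: no paragraph ancestor raises.
        if self._paragraph is None:
            raise NoSuchAnnotation
        return self._paragraph

def paragraph_or_none(word):
    """Return the word's paragraph, or None when it has no paragraph ancestor."""
    try:
        return word.paragraph()
    except NoSuchAnnotation:
        return None

# Words taken from the first sentence, which has no enclosing paragraph.
words = [StubWord("Ik"), StubWord("ben"), StubWord("de")]
prevpar = None
breaks = 0
for i, w in enumerate(words):
    par = paragraph_or_none(w)
    if par != prevpar and i > 0:
        breaks += 1  # a paragraph boundary would be emitted here
    prevpar = par
print(breaks)
```

Treating a missing paragraph as `None` lets the paragraph comparison run without an exception on documents that only contain sentences.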

foliatextcontent: Add offsets for existing elements

Allows adding offsets after running ucto for example (ucto itself can't add offsets).

May be useful for use case knaw-huc/golden-agents-htr#1 .

Implement a way to use CorrectionHandling.ORIGINAL in folia2txt

At the moment it is impossible to use any method other than CorrectionHandling.CURRENT.
I would like to have this option.

[tei2folia] Ensure the ID is suitable for use in FoLiA (ValueError: Invalid XML NCName identifier)

Sometimes a document ID is extracted that is not a valid XML NCName, for example when converting http://worldviews.gei.de/rest/content/tei/CM_1989_FomenkyEtAl_HistoireDuCameroun_52/fre/ , as reported by @dietervu. More checks need to be implemented.
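
One of the checks could be sketched as follows. This is an assumption-laden illustration: `is_ncname` and `to_ncname` are hypothetical helpers, the pattern is an ASCII-only approximation of the XML NCName production (the real production also admits many non-ASCII characters), and the `doc` prefix is an arbitrary choice. The ID actually extracted in the reported case is not shown in the source, so the sample value below is made up.

```python
import re

# ASCII-only approximation of NCName: a letter or underscore first,
# then letters, digits, '.', '-' or '_'. No colon is allowed.
NCNAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")

def is_ncname(candidate):
    """True if the candidate matches the ASCII approximation of NCName."""
    return NCNAME_ASCII.fullmatch(candidate) is not None

def to_ncname(candidate, prefix="doc"):
    """Replace disallowed characters and make sure the first character is legal."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]", "_", candidate)
    if not cleaned:
        return prefix
    if not re.match(r"[A-Za-z_]", cleaned):
        # Starts with a digit, '.' or '-': prepend a legal prefix.
        cleaned = prefix + "." + cleaned
    return cleaned

print(to_ncname("1989 Histoire/fre"))
```

A converter could call `is_ncname` on the extracted ID and fall back to `to_ncname` (or to the input file name) when the check fails, instead of raising a ValueError.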

[rst2folia] Unable to add object of type Entry to ListItem

Error encountered by Piroska when converting from docx (via pandoc to rst) to FoLiA:

Running rst2folia --docid="kap1_pp_1_2_png2docx_abbyy_wo_endhyphens" "output/kap1_pp_1_2_png2docx_abbyy_wo_endhyphens.rst" "output/kap1_pp_1_2_png2docx_abbyy_wo_endhyphens.folia.xml"
ValueError: Unable to add object of type Entry to ListItem . Type not allowed as child.

See how we can make this more resilient.

foliavalidator: PluginException

How could I solve this error? (Noticed when trying to upload to FLAT)

$ foliavalidator myfile.folia.xml
Validated successfully

$ foliavalidator --deep myfile.folia.xml
PluginException: No plugin registered for (rdf, <class 'rdflib.parser.Parser'>)

foliavalidator gives Uninformative message on missing version

test file:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="doc" generator="libfolia-v0.11">
  <metadata type="native">
    <annotations/>
  </metadata>
  <text xml:id="text">
    <p xml:id="p1">
      <t>Een tekst.</t>
    </p>
  </text>
</FoLiA>

The validator complains:

Error on line 0: Element FoLiA failed to validate attributes
VALIDATION ERROR against RelaxNG schema (stage 1/3), in empty2.xml
Element FoLiA failed to validate attributes

libfolia/folialint DOES accept this file, but will add a bogus version value of 1.4.987

I would expect a more informative message, like:
Element FoLiA failed: missing 'version' attribute
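
A pre-check that produces such a message before schema validation could be sketched like this. It is a sketch under assumptions, not how foliavalidator works: `check_root_attributes` is a hypothetical helper, it uses the standard library, and it only tests for the `version` attribute discussed in this report.

```python
import xml.etree.ElementTree as ET

FOLIA_ROOT = "{http://ilk.uvt.nl/folia}FoLiA"

# The test file from the report, without the XML declaration.
SAMPLE = """<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="doc" generator="libfolia-v0.11">
  <metadata type="native">
    <annotations/>
  </metadata>
  <text xml:id="text">
    <p xml:id="p1">
      <t>Een tekst.</t>
    </p>
  </text>
</FoLiA>"""

def check_root_attributes(xml_text):
    """Return a list of human-readable problems found on the root element."""
    root = ET.fromstring(xml_text)
    if root.tag != FOLIA_ROOT:
        return ["Root element is not FoLiA in the FoLiA namespace"]
    problems = []
    if root.get("version") is None:
        problems.append("Element FoLiA failed: missing 'version' attribute")
    return problems

for problem in check_root_attributes(SAMPLE):
    print(problem)
```

Running a check like this ahead of the RelaxNG stage would let the tool name the missing attribute rather than only report that the attributes failed to validate.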

[folia2html] Implement superscript

Although classes for styles are not predefined in FoLiA, folia2html does interpret a few classes like "bold", "italic". An extra implementation is needed for "superscript", as requested by @pirolen. This would map nicely to HTML's <sup> element. And whilst we're at it we should do subscript too.

On an unrelated note: FLAT doesn't visualize this either currently and there it would be less trivial to implement.
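
For the folia2html side, the requested mapping can be illustrated with a small Python sketch. This only shows the idea of a class-to-tag table: the tags chosen for "bold" and "italic" are assumptions, `render_styled` is a hypothetical helper, and the real tool may implement its conversion quite differently.

```python
from html import escape

# Style class to HTML element. "superscript" and "subscript" are the requested
# additions; the tags for "bold" and "italic" are assumed for illustration.
STYLE_TAGS = {
    "bold": "b",
    "italic": "i",
    "superscript": "sup",
    "subscript": "sub",
}

def render_styled(text, style_class):
    """Wrap text in the HTML element for a known class, else in a generic span."""
    safe = escape(text)
    tag = STYLE_TAGS.get(style_class)
    if tag is None:
        return '<span class="%s">%s</span>' % (escape(style_class, quote=True), safe)
    return "<%s>%s</%s>" % (tag, safe, tag)

print(render_styled("2", "superscript"))
```

Unknown classes fall through to a plain `<span>`, so documents that use other style classes keep their class name for CSS styling.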

[foliasplit] Split a FoLiA document into multiple

New: Implement a tool that splits a FoLiA document into multiple documents, on the basis of:

- specific IDs of structural elements to take as the new roots
- a count of x instances of a particular element (e.g. for splitting every 1000 sentences)
- sections that have associated submetadata; the submetadata will then become the metadata of the new document
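
The count-based criterion (a new document after every x instances of an element) boils down to chunking a sequence. A minimal sketch follows, with `chunked` as a hypothetical helper and plain strings standing in for sentence elements; it says nothing about how IDs, metadata or declarations would be carried over.

```python
from itertools import islice

def chunked(elements, size):
    """Yield consecutive lists of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(elements)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk

# Stand-ins for sentence elements; a real tool would iterate over FoLiA sentences.
sentences = ["s.1", "s.2", "s.3", "s.4", "s.5"]
for number, part in enumerate(chunked(sentences, 2), start=1):
    print(number, part)
```

Each chunk would become the body of one output document; the last chunk may be shorter than the requested size.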

tei2folia failure on CLARIAH files VOC General Missives

Hi proycon,

I would very much like to convert all 589 TEI files produced by DANS Dirk Roorda from the OCR-ed VOC 'Generale Missiven' (13 volumes)(http://resources.huygens.knaw.nl/retroboeken/generalemissiven/#page=0&accessor=toc&view=homePane)
to FoLiA.

I get the following errors on this one:

https://github.com/Dans-labs/clariah-gm/blob/master/xml/01/p0099.xml

as well as on others from the same source.

I have no idea what is wrong, hope you can help!

Error:

(LMdev) reynaert@violet:FOLIA$ tei2folia p0099.xml
Instantiating XML parser
Converting p0099.xml
VALIDATION ERROR on full parse by library in p0099.xml
DeclarationError: Encountered an instance without proper declaration: Comment !
Unable to convert p0099.xml

Looking forward to your response! Thanks!

Martin

Conversion failure due to unclear cause

Hi,

I get this:

(LMdev) reynaert@violet:MARXENGELS$ tei2folia -i -o '-' MarxEngels-A-2003_01-The_Collected_Works_of_Karl_Marx_and_Frederick_Engels_Early_Works_1835_1844_Volume_1-V0.xml
Instantiating XML parser
Converting MarxEngels-A-2003_01-The_Collected_Works_of_Karl_Marx_and_Frederick_Engels_Early_Works_1835_1844_Volume_1-V0.xml
VALIDATION ERROR on full parse by library in MarxEngels-A-2003_01-The_Collected_Works_of_Karl_Marx_and_Frederick_Engels_Early_Works_1835_1844_Volume_1-V0.xml
ParseError: FoLiA exception in handling of @ line None (in parent @ parent line None) : [ValueError] Unable to add object of type Caption to Division . Type not allowed as child.
Unable to convert MarxEngels-A-2003_01-The_Collected_Works_of_Karl_Marx_and_Frederick_Engels_Early_Works_1835_1844_Volume_1-V0.xml

I get a similar failure with another file, there the offending object is of type Table.

In this case, I have no idea what 'type Caption' is. Or how I might avoid that this conversion fails.

Please advise.

I attach the input file:

TEST.tei2folia.zip

[tei2folia] added support for tokenised TEI (s/w element) and linguistic annotation

The converter doesn't handle tokenized TEI yet: https://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-att.linguistic.html

Related to #12

Make folia2html importable via Python

The tool is now optimized as a command line tool and invoking it from Python is cumbersome and inconsistent with most other tools.

Issue first raised in proycon/folia#106

Write inter-annotator agreement tool

(needed for CLIN shared task)

[foliatextcontent] Implement adding markup information in the text that points to the substrings

This is needed for proycon/flat#92. There is already an option for this in foliatextcontent, but it doesn't seem to work in all cases yet, most specifically the case where the text content is already present rather than generated by foliatextcontent.

foliavalidator should return an error code 0 on error

when running:
foliavalidator examples/erroneous/set_and_setless_explicit_b.2.1.0.folia.xml
there is an error detected (right so!)

But:
echo $? 1
This is wrong, it should be 0

folia2txt on soft hyphens

I have the impression that folia2txt silently removes soft hyphens. In foliautils, FoLiA-2text has an option --restore-formatting which reproduces them.
Both folia2txt and FoLiA-2text keep linebreaks when producing plain text from FoLiA -- I actually naively imagined these converters simply yield running text, but it's surely fine.

folia2columns - Paragraph annotation extraction fails on special double quote and a comma

When using folia2columns with the new paragraph extraction mode, when text contains a combination of a special double quote and a comma, any annotations for words in a paragraph after that point are not returned. This issue is observed when extracting lemma sequences but not word sequences, so it probably only occurs when extracting annotation attributes rather than text.

[foliatextcontent] allow adding offset information to existing elements

validation error on document with 2 correction processors

given the attached new_bug.xml.txt document

foliavalidator comes up with this error:

foliavalidator new_bug.xml
VALIDATION ERROR on full parse by library (stage 2/3), in new_bug.xml
ParseError: FoLiA exception in handling of @ line 34 (in parent @ parent line 33) : [DeclarationError] Encountered an instance without proper declaration: New !

This can be resolved by removing one of the correction-annotation declarations. But of course this is an excerpt from a document in which several correctors were involved.

A new-annotation doesn't exist. And also adding a set or a processor to <new> is not allowed (and undesirable)

So how to fix this document? Or does foliavalidator need fixing?

folia2html: <br> tag not converted

Seems to me that the <br> tag is not converted from the .folia.xml to the .html.

Question: Converting between FoLiA and UIMA CAS XMI XML

Would it be an idea to investigate the interoperability between the FoLiA and the "UIMA CAS XMI XML" formats?
If I understand it right, this would allow data exchange between the FoLiA and the UIMA ecosystems.

Would it be of interest to the community, and would foliapy and dkpro-cassis (https://github.com/dkpro/dkpro-cassis) be instrumental for this?

Many thanks for any pointers!

folia2html: Command line options for a single file

Not sure how -o is meant to be used when a single input file is to be converted with folia2html.

Clear:
If I do
`folia2html inputfile1 inputfile2`
both output HTML files get written to the respective directories of the input files.

Just mentioning that I was a bit confused what to do, if I want to convert a single file and write the output html to a file,
because then the above syntax does not work, i.e.
folia2html inputfile1 must be used with redirection ('>').

I then checked the --help and thought that the -o option would be what I was after, but somehow I might have missed something.
If I do
folia2html inputfile -o target_htmlfile,
I get
ERROR: File or directory not found: -o
and it seems no conversion takes place.(?)
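
One possible explanation, which is an assumption and not confirmed anywhere in this report, is option parsing that stops at the first non-option argument, so that everything after the input file (including `-o`) is treated as a file name. Python's standard `getopt` module shows both behaviours; the option spec `"o:"` below is hypothetical and is not taken from folia2html.

```python
import getopt

# The argument order from the report: input file first, then -o with a value.
argv = ["inputfile", "-o", "target.html"]

# getopt.getopt stops scanning as soon as it meets a non-option argument,
# so "-o" and its value end up among the positional arguments.
strict_opts, strict_args = getopt.getopt(argv, "o:")

# getopt.gnu_getopt allows options and positional arguments to be intermixed.
gnu_opts, gnu_args = getopt.gnu_getopt(argv, "o:")

print(strict_opts, strict_args)
print(gnu_opts, gnu_args)
```

If this is what happens, placing the options before the input file would be the workaround, and switching to intermixed parsing would be the fix.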

If I do
folia2html inputfile -o (and nothing else),
then the html file is written to the same dir as the inputfile, but there is also the complaint.
"ERROR: File or directory not found: -o".

Is foliavalidator right in rejecting this ?

given this FoLiA file:
question.xml.txt

I wonder if this isn't 'acceptable' FoLiA.
folialint accepts it, but foliavalidator says:

foliavalidator question.xml
VALIDATION ERROR on full parse by library (stage 2/3), in question.xml
ParseError: FoLiA exception in handling of <s> @ line 38 (in parent <p> @ parent line 36) : [DeclarationError] Processor ucto.1 is used for annotationtype SENTENCE, set None, but has no corresponding <annotator> referring to it from the annotations declaration block!

This message is very confusing at the least, as there IS in fact a sentence-annotation block with annotator ucto.1.

There is an "empty" sentence-annotation too, which is not needed and confusing. But still.
I think it is valid FoLiA, and if not, the message is dead wrong.