dopefishh / pympi

A python module for processing ELAN and Praat annotation files

License: MIT License

Python 100.00%

pympi's Issues

KeyError when retrieving annotations

I am trying to retrieve the annotations for each tier in an .eaf file, using Elan.py, and there seems to be an error for the lex@CHI tier in the file that I am trying to process (also happens with mwu@CHI in other files). When I look at self.tiers[self.annotations[ref]] in def get_ref_annotation_data_for_tier(self, id_tier), it tries to retrieve the first element of that structure, which is an empty dictionary for lex@CHI but not for other tiers. The second element seems to contain the data that I am looking for, but only for lex@CHI.

I am attaching the .eaf file and a python script that triggers this error.
Is it something wrong with the file or is it the code?

Thank you!

issue.zip

Proper error raising on encoding error

Hello,

I'm currently working on a pretty inconsistent dataset of TextGrid files (with @Rachine, whom you might have been in contact with). I had trouble with some files because they were encoded in utf-16be (and a few in iso-8859-1), while most files were encoded in ascii. I had no idea of this inconsistency when I started to process the dataset, and although it is obviously not really your fault, I had trouble figuring out why some TextGrid files wouldn't open with pympi.

The errors I got weren't even the same across encodings: trying to open a utf-16be file gave me an AttributeError, whereas the iso-8859-1 files gave me a UnicodeDecodeError.

It would be nice to raise a proper error when the parsing fails because of encoding errors. I don't know if it's possible, but since I've dug pretty far into the TextGrid parsing function, I could PR a potential fix if you have an idea.
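
One way such a check could look, as a minimal sketch: try a short list of candidate codecs and accept the first one that both decodes and yields the `ooTextFile` marker that opens a Praat text file. The function name `detect_codec` and the candidate list are assumptions for illustration and are not part of pympi.

```python
def detect_codec(data, candidates=("utf-8", "utf-16", "utf-16-be", "latin-1")):
    """Return the first candidate codec that decodes `data` into text
    containing the Praat header marker; raise ValueError otherwise."""
    for codec in candidates:
        try:
            text = data.decode(codec)
        except UnicodeDecodeError:
            continue
        # UTF-16 bytes can decode "successfully" as UTF-8 or latin-1 but
        # leave NUL characters between letters, so the marker is not found.
        if "ooTextFile" in text:
            return codec
    raise ValueError(
        "could not decode TextGrid with any of: " + ", ".join(candidates))
```

Checking for the marker, rather than only for a successful decode, is what separates the two failure modes described above: the wrong codec either raises a UnicodeDecodeError or silently produces text that the parser cannot match.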

Extraction function is bugged

The Eaf.extract function has a bug that renders it completely useless:

Line 478 of the elan.py file should be

eaf_out.remove_annotation(t, (ae - ab)//2, False)

Correct me if I'm mistaken. I'll open a pull request with a fix if you wish and if it makes sense to you.

Issue in add_annotation

Hi, I get the following error when I use the method add_annotation:

ValueError: Tier already contains ref annotations...

It only happens when I work with tiers that contain ref annotations. Is it not possible to use this method with this kind of tier, or do I have to pass something else in the svg_ref parameter?

Thank you so much.

Add CI or test scripts to the repo to test various python versions

pypi installation broken

For the current pypi version the installer is broken. You can still install by downloading the file and installing by hand by running python setup.py install.

The next release will have this fixed

eaf_from_chat hang

I'd like to be able to batch convert .cha files to .eaf format using your wonderful library, pympi. I've used pympi for other purposes with great success, but I'm having trouble getting it to interact with .cha files.

When I call the pympi.Elan.eaf_from_chat function, it hangs on the line where it checks the utf8 codec and continues.

In your documentation, you mention using older codecs for older files -- any help on how to track down the codec if that information isn't readily available? This may help me debug eaf_from_chat. The chat files I'm working with do have @UTF8 on line 1.

Also, have you considered opening up a gitter forum for your library? That would be a helpful place for folks to share code, generally easing the learning curve of using pympi, which is a really great tool!

Thank you for your work on this library!
-Steven

Order in method get_ref_annotation_data_for_tier()

Hi again,

When I use the method get_annotation_data_for_tier(), I get the information in the same order as in the eaf file. When I use get_ref_annotation_data_for_tier(), the order differs from the file. How is the order of the returned information determined?
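
If a stable order is needed regardless of how the library returns the data, the result can be sorted on the caller's side. This sketch assumes each returned item is a tuple whose first two fields are the begin and end times in milliseconds; the sample data is made up.

```python
def sort_by_time(annotations):
    """Return annotations ordered by begin time, then end time."""
    return sorted(annotations, key=lambda ann: (ann[0], ann[1]))


# Hypothetical data in a (begin, end, value, ref_value) shape.
unordered = [(2000, 3000, "b", "B"), (0, 1000, "a", "A")]
ordered = sort_by_time(unordered)
```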

Thank you so much again.

Functionality for multiple reference annotations of a single parent

There is a constraint that allows multiple child annotations for a single parent; in that case the prev field contains the order.
This still needs to be implemented.

Standardise arguments

Expected behaviour
Praat.init uses file_path as keyword argument while Praat.to_file uses filepath. It would be better to use the same keyword argument (with underscore for both, or without for both) for consistency.

It would be advisable to keep an optional legacy filepath or file_path argument (depending on which is chosen as standard) so as not to break existing code.
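
A minimal sketch of the legacy-keyword idea, assuming `file_path` is chosen as the standard name. The function below is a hypothetical stand-in and not the pympi method; it only shows how the old keyword could keep working while warning callers.

```python
import warnings


def to_file(file_path=None, filepath=None):
    """Hypothetical writer that prefers `file_path` but still accepts the
    legacy `filepath` keyword, warning when the old name is used."""
    if filepath is not None:
        if file_path is not None:
            raise TypeError("pass either file_path or filepath, not both")
        warnings.warn("filepath is deprecated, use file_path",
                      DeprecationWarning, stacklevel=2)
        file_path = filepath
    if file_path is None:
        raise TypeError("file_path is required")
    return file_path  # a real implementation would write the file here
```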

Add argument to suppress "unknown version" warnings

I load an eaf file:

eaf = pympi.Elan.Eaf(eaf_path)

but get a warning:

Parsing unknown version of ELAN spec... This could result in errors...

From: https://github.com/dopefishh/pympi/blob/master/pympi/Elan.py#L1465

Could you please add an argument to suppress version warning?

eaf = pympi.Elan.Eaf(eaf_path, supress_version_warning=True)
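
How such a flag could gate the warning, as a sketch only. The helper `check_version`, the `SUPPORTED_VERSIONS` tuple and the use of the `warnings` module are assumptions for illustration; how pympi actually emits the message is not shown on this page. The flag is spelled `suppress_version_warning` here.

```python
import warnings

SUPPORTED_VERSIONS = ("2.7", "2.8")  # assumed list, for illustration only


def check_version(version, suppress_version_warning=False):
    """Warn about an unknown EAF version unless the caller opted out.
    Returns True when the version is in the supported list."""
    known = version in SUPPORTED_VERSIONS
    if not known and not suppress_version_warning:
        warnings.warn("Parsing unknown version of ELAN spec... "
                      "This could result in errors...")
    return known
```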

Nested references are not parsed

Might be related to #11

When parsing an eaf file, obj.get_annotation_data_for_tier (and probably other methods too) will fail if the selected tier is a reference tier that references another reference tier instead of an alignment tier.

Made a PR that addresses the issue: #13

Difference in number of removed annotations

I just wanted to remove empty annotations in one eaf. Surprisingly, the method remove_annotation removes more annotations than ELAN does with TIER>REMOVE ANNOTATIONS>EMPTY ANNOTATIONS. I've tried removing rows in a data.frame made of annotations from ELAN and it worked exactly as in ELAN. But I'm still not sure how the method remove_annotation works.

Try

#R
library(reticulate)
library(magrittr, lib.loc = "/Library/Frameworks/R.framework/Versions/3.6/Resources/library")
conda_list()[[1]][1] %>% 
  use_condaenv(required = TRUE)
#### PYTHON ####
# coding: utf-8
# -*- coding: utf-8 -*-
import codecs
import pympi    # Import pympi to work with elan files
import os, fnmatch
import glob
import json
import csv
import sys
import re
import numpy as np
import pandas as pd
setwd("/Volumes/MAXI RUGGED/Google Drive/2020UAM/INFORMATYKA/scRiPting/Py/!PYMPI!/TRANS2020")

eaffile02235 = "000-22-35-S1.mp3.audioenhance.eaf"
eaffile12235 = "001-22-35-S1.mp3.audioenhance.eaf"
eaf_file = pympi.Eaf(eaffile02235) 
eaf_tiers = eaf_file.get_tier_names()
eaf_tiers

t = 'COACH'
anotacje_COACH = eaf_file.get_annotation_data_for_tier(t)
len(anotacje_COACH)
eaf_file.to_file(eaffile02235)

for a in range(0,len(anotacje_COACH)):
  if len(anotacje_COACH[a][2])==0:
    eaf_file.remove_annotation(t,anotacje_COACH[a][0]+1,anotacje_COACH[a][1]-1)
    
anotacje_COACH = eaf_file.get_annotation_data_for_tier('COACH')
len(anotacje_COACH) #64

aupd = pd.DataFrame(anotacje_uczestnik)
aupu = filter(aupd,aupd[2]=='')
aupu = aupd[aupd[2].map(len) > 0]
aupu = aupu.to_records(index=False)
aupu = list(aupu)
len(aupu) # 181

eaf_file.remove_tier(t)
eaf_file.add_tier(t)

for a in range(0,len(aupu)):
  eaf_file.add_annotation(t,aupu[a][0],aupu[a][1], value= aupu[a][2])
eaf_file.to_file(eaffile02235)

000-22-35-S1.mp3.audioenhance.eaf.zip
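
The call quoted in the extraction issue above, `remove_annotation(t, time, False)`, suggests that the third positional argument is a flag rather than an end time, so the loop in this report may be passing `end - 1` where a boolean is expected. That is an inference from this page and not a confirmed diagnosis. A less fragile route is the one the reporter ends with: select the non-empty annotations first, then rebuild the tier. A minimal sketch of the selection step on plain tuples, with made-up data:

```python
def drop_empty(annotations):
    """Keep only annotations whose value is non-empty after stripping
    whitespace. Each annotation is a (start, end, value) tuple."""
    return [ann for ann in annotations if ann[2].strip()]


# Hypothetical tier contents: (start_ms, end_ms, value)
tier = [(0, 500, "hello"), (500, 900, ""), (900, 1500, " "), (1500, 2000, "bye")]
kept = drop_empty(tier)
```

Treating whitespace-only values as empty is a choice made for this sketch; ELAN's own definition of an empty annotation may differ.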

Release version 1.7 has a lower version number than the previous release (1.69)

When specifying a dependency on pympi-ling in a setup.cfg file for a python package, if I include the following:

install_requires =
    pympi-ling>=1.69

…and version 1.7 is already installed, it will be removed in favour of 1.69 when installing the dependent package with pip. Alternatively, with this:

install_requires =
    pympi-ling>=1.7

…version 1.69 will satisfy that dependency, and will not be replaced with version 1.7.

The correct semantic version number should be 1.70, not 1.7.
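
The ordering follows from how installers compare versions: each dot-separated component is compared as an integer, not as a decimal fraction. A small sketch that mimics this with tuples (the helper name `release_tuple` is made up for illustration):

```python
def release_tuple(version):
    """Split a dotted version string into a tuple of integers so that
    each component is compared numerically rather than as a decimal."""
    return tuple(int(part) for part in version.split("."))


assert release_tuple("1.7") < release_tuple("1.69")    # 7 < 69
assert release_tuple("1.70") > release_tuple("1.69")   # 70 > 69
```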

Parsing unknown version of ELAN spec

Sample Code Script

import pympi
path_to_eaf = "4001.eaf"
EAF = pympi.Elan.Eaf(file_path=path_to_eaf, author='pympi')

Output

Parsing unknown version of ELAN spec... This could result in errors...

System information

python version: 3.9.0
os: Windows 10
are you up to date with the latest master?: Yes

Additional context
I'm using Elan v6.1 to produce the *.eaf file.

add_annotation does not check if the time stamp (left/right) does already exist

I would expect that, when I add an annotation, the add_annotation(...) method would first check whether the timestamp already exists in a linked tier and, if so, reuse the existing time stamp instead of adding a new one. However, that is not what I see. Is this on purpose?

I am using pympi 1.69

IOError: [Errno 2] No such file or directory: 'README.rst' when installing from pip

Hi there,

I seem to be having issues installing the latest version (v1.59) using pip.

The log output is as follows:

$ pip install pympi-ling
Collecting pympi-ling
  Using cached pympi-ling-1.59.tar.gz
    Traceback (most recent call last):
      File "<string>", line 20, in <module>
      File "/tmp/pip-build-pA3iNI/pympi-ling/setup.py", line 13, in <module>
        long_description=open('README.rst').read(),
    IOError: [Errno 2] No such file or directory: 'README.rst'
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):

      File "<string>", line 20, in <module>

      File "/tmp/pip-build-pA3iNI/pympi-ling/setup.py", line 13, in <module>

        long_description=open('README.rst').read(),

    IOError: [Errno 2] No such file or directory: 'README.rst'

    ----------------------------------------
    Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-pA3iNI/pympi-ling

This issue doesn't seem to affect the install when doing a manual install from the repository.

However, I notice in the tarball hosted on PyPI the file README.rst is missing from the package's root directory so when setup.py tries to read it in it can't be found.
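
The usual remedy for a file missing from a source distribution is to list it in MANIFEST.in (for example `include README.rst`). Independently of that, setup.py can be written so that a missing README does not abort the install. A minimal sketch in Python 3 syntax, with a hypothetical helper name:

```python
import os


def read_long_description(path="README.rst"):
    """Return the README text, or an empty string when the file was not
    shipped in the source distribution."""
    if not os.path.exists(path):
        return ""
    with open(path, encoding="utf-8") as handle:
        return handle.read()
```

In setup.py this would be used as `long_description=read_long_description()` in place of the bare `open('README.rst').read()` shown in the traceback.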

Timestamp type not checked when added

Expected behaviour
There should be an error when the datatype is not int

Actual behaviour
When adding an annotation, the datatype of start and end time is not checked
Using to_eaf() produces an eaf file but it cannot be parsed with parse_eaf because of the datatype of the timestamps
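
A sketch of the kind of check being asked for. The helper `check_timestamp` is hypothetical; it would be called on the start and end time before an annotation is stored.

```python
def check_timestamp(value, name="time"):
    """Return `value` unchanged if it is an int, else raise TypeError."""
    # bool is a subclass of int, so it is rejected explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(
            f"{name} must be an int (milliseconds), "
            f"got {type(value).__name__}")
    return value
```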

System information

python version: 3.8.8
os: Windows 10

to_textgrid fails for Eaf from sample file

I noticed the last release of this package on pypi is from 2016. While it still installs (e.g. under py37), the test suite does not pass - and I noticed bugs (e.g. under py3.5), like

>>> e = pympi.Eaf('test/sample_2.7.eaf')
>>> tg = e.to_textgrid()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/forkel/venvs/tpympi/lib/python3.5/site-packages/pympi/Elan.py", line 1328, in to_textgrid
    _, end = self.get_full_time_interval()
  File "/home/forkel/venvs/tpympi/lib/python3.5/site-packages/pympi/Elan.py", line 650, in get_full_time_interval
    (min(self.timeslots.values()), max(self.timeslots.values()))
TypeError: unorderable types: int() < NoneType()

So it doesn't look like it is fully supported under py3 - yet?
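
The traceback shows `min()` and `max()` being applied to timeslot values that include `None`, which Python 3 refuses to order. One possible fix is to ignore unset slots. The function below is a sketch with a hypothetical name and made-up data, not the pympi implementation.

```python
def full_time_interval(timeslots):
    """Return (min, max) over the time values that are set, ignoring
    slots whose value is None; (0, 0) when no slot has a value."""
    known = [value for value in timeslots.values() if value is not None]
    if not known:
        return (0, 0)
    return (min(known), max(known))


# Hypothetical timeslot table with one unaligned slot.
slots = {"ts1": 0, "ts2": None, "ts3": 4200}
```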

Support pathlib.Path objects in addition to str file paths

I propose to additionally accept pathlib.Path objects wherever file paths as str are accepted now. This makes tests simpler (when using pytest's tmp_path) and is also becoming the standard behaviour of most python stdlib modules.
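
A sketch of how this could be done at the entry points, using `os.fspath` from the standard library (available from Python 3.6). The wrapper name `normalise_path` is an assumption for illustration.

```python
import os
from pathlib import Path


def normalise_path(file_path):
    """Accept a str or a pathlib.Path and return a plain str.
    (bytes paths are passed through unchanged by os.fspath.)"""
    return os.fspath(file_path)


as_text = normalise_path(Path("corpus") / "sample.eaf")
```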

Writing an eaf object to stdout doesn't work in python 3

example:

igor@bgmlv ~/git_projects/pympi $ cat stdout_test.py

import pympi
elan_obj = pympi.Elan.Eaf()
elan_obj.to_file("-")

igor@bgmlv ~/git_projects/pympi $ . env_py2/bin/activate
(env_py2) igor@bgmlv ~/git_projects/pympi $ python stdout_test.py

<?xml version='1.0' encoding='UTF-8'?>
<ANNOTATION_DOCUMENT AUTHOR="pympi" DATE="2016-12-08T13:26:51-05:00" FORMAT="2.8" VERSION="2.8" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://www.mpi.nl/tools/elan/EAFv2.8.xsd">
	<HEADER>
		<PROPERTY NAME="lastUsedAnnotation">0</PROPERTY>
		</HEADER>
	<TIME_ORDER />
	<TIER LINGUISTIC_TYPE_REF="default-lt" TIER_ID="default" />
	<LINGUISTIC_TYPE GRAPHIC_REFERENCES="false" LINGUISTIC_TYPE_ID="default-lt" TIME_ALIGNABLE="true" />
	<CONSTRAINT DESCRIPTION="Time subdivision of parent annotation's time interval, no time gaps allowed within this interval" STEREOTYPE="Time_Subdivision" />
	<CONSTRAINT DESCRIPTION="Symbolic subdivision of a parent annotation. Annotations refering to the same parent are ordered" STEREOTYPE="Symbolic_Subdivision" />
	<CONSTRAINT DESCRIPTION="1-1 association with a parent annotation" STEREOTYPE="Symbolic_Association" />
	<CONSTRAINT DESCRIPTION="Time alignable annotations within the parent annotation's time interval, gaps are allowed" STEREOTYPE="Included_In" />
	</ANNOTATION_DOCUMENT>

(env_py2) igor@bgmlv ~/git_projects/pympi $ deactivate
igor@bgmlv ~/git_projects/pympi $ . env_py3/bin/activate
(env_py3) igor@bgmlv ~/git_projects/pympi $ python stdout_test.py

Traceback (most recent call last):
  File "stdout_test.py", line 3, in <module>
    elan_obj.to_file("-")
  File "/home/igor/git_projects/pympi/env_py3/lib/python3.4/site-packages/pympi/Elan.py", line 1315, in to_file
    to_eaf(file_path, self, pretty)
  File "/home/igor/git_projects/pympi/env_py3/lib/python3.4/site-packages/pympi/Elan.py", line 1688, in to_eaf
    file_path, xml_declaration=True, encoding='UTF-8')
  File "/usr/lib/python3.4/xml/etree/ElementTree.py", line 778, in write
    short_empty_elements=short_empty_elements)
  File "/usr/lib/python3.4/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/usr/lib/python3.4/xml/etree/ElementTree.py", line 837, in _get_writer
    yield file.write
  File "/usr/lib/python3.4/contextlib.py", line 336, in __exit__
    raise exc_details[1]
  File "/usr/lib/python3.4/contextlib.py", line 321, in __exit__
    if cb(*exc_details):
  File "/usr/lib/python3.4/contextlib.py", line 267, in _exit_wrapper
    callback(*args, **kwds)
TypeError: must be str, not bytes
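
The final error arises because, with `encoding='UTF-8'`, ElementTree produces bytes, while Python 3's `sys.stdout` is a text stream. A possible approach is to write to the underlying binary stream, `sys.stdout.buffer`, when the path is "-". This is a sketch with a hypothetical helper, not the code in Elan.py:

```python
import io
import sys
import xml.etree.ElementTree as ET


def write_tree(tree, file_path, stdout=None):
    """Write an ElementTree as UTF-8. The path "-" means standard output,
    which on Python 3 must be the underlying binary stream."""
    if file_path == "-":
        target = stdout if stdout is not None else sys.stdout.buffer
    else:
        target = file_path
    tree.write(target, xml_declaration=True, encoding="UTF-8")


# Demonstration with an in-memory binary stream standing in for stdout.
buffer = io.BytesIO()
write_tree(ET.ElementTree(ET.Element("ANNOTATION_DOCUMENT")), "-", stdout=buffer)
```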

Compatibility with ACT R library

Expected behaviour
An ELAN file exported by pympi should open in the Annotated Corpus Toolkit (ACT), or should be formatted like an original ELAN file.

Actual behaviour
The exported ELAN file cannot be processed by ACT until it has been opened and re-saved in ELAN.

System information

python version: 3.10
os: Linux Mint 21.3
are you up to date with the latest master?: yes 1.70.2

Additional context
I work both with pympi and Oliver Ehmer's Annotated Corpus Toolkit for R (ACT), which are two great pieces of code for linguists working with ELAN.
I noticed that the ELAN files exported with pympi (with or without the "pretty" parameter) could not be processed directly by ACT (see below).
However, they can be once the file has been opened and then saved in ELAN.
So I took a look at the diffs between pympi's fresh export and the ELAN overwrite and found these two issues when importing the pympi file in ACT:

1. The file would not be loaded at all: apparently this error is due to the EAF version stated in the xsi:noNamespaceSchemaLocation attribute (3.0 will be loaded, not 2.8).
2. If issue 1 is corrected (2.8>3.0), the file is loaded but the time values are not found by ACT; however, it works if the "space" character before the TIME_SLOT closing tag is removed.

Workaround found
If I bulk replace the version number (2.8>3.0) and bulk remove the space character before every closing XML tag, then the file is successfully processed by ACT.
Since the original ELAN files are not formatted this way, I thought it was more a "pympi" issue than an "ACT" issue.
So maybe some slight export modifications would be welcome in pympi?
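
The workaround could be scripted roughly as follows. This sketch reads "the space before the closing tag" as the space in ` />` at the end of self-closing elements, and it touches only the schema file name in the xsi:noNamespaceSchemaLocation attribute; both readings are assumptions. It relabels the file and does not convert it to the 3.0 format.

```python
def patch_for_act(xml_text):
    """Apply the two textual replacements described in the workaround."""
    # 1. Point the schema location at the 3.0 schema instead of 2.8.
    patched = xml_text.replace("EAFv2.8.xsd", "EAFv3.0.xsd")
    # 2. Drop the space before the end of self-closing tags: " />" -> "/>".
    return patched.replace(" />", "/>")


sample = '<TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0" />'
patched_sample = patch_for_act(sample)
```

A plain string replacement would also alter any annotation value that happens to contain ` />`, so a production version would be safer working on the parsed XML.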

Thank you for your work,
Lucien

Problem saving EAF files

Hi again,

I have found a problem, when I do simply:

eaf = pympi.Elan.Eaf(input_file)

eaf.to_file (output_file)

The tiers in input_file and output_file are not in the same order. What can I do?

Thank you so much again.

Remove ref annotations

Hi again,

I'm trying to remove some reference annotations with the methods remove_annotation() or clean_time_slots(), but they have no effect. When I tried with time annotations it worked properly... Is there anything that I could do?

Thank you so much, I should send you a couple of beers more ;)

XML Namespace problem when opening and saving a file

When opening and saving an EAF file with pympi, the resulting XML file fails to validate due to namespace information being added twice to the output file.

An MWE is:

import pympi
filename = "minimalExample.eaf"
eafObj = pympi.Elan.Eaf(filename)
eafObj.to_file(filename)

A very easy and short eaf file with which the error can be reproduced is attached (I needed to change the extension from eaf to txt to upload it): minimalExample.txt

Merging transcriptions and subtracting tiers

I wonder how I can use pympi to merge two different eaf files. Would putting all tiers of each file together in a new file work? Has anybody tried that?

Subtraction is more complicated: ELAN generates a tier from subtraction (menu>tier>generate tier from Subtraction), but I need to do it on 80 files, so again I wonder how I can use pympi for this.

I can read tier1 from one file and tier2 from another as a dataframe or pympi object, then subtract the timestamps and add a new tier to one of the files. Has anyone tried that?
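
The subtraction step itself can be done on plain (start, end) pairs before anything is written back to a file. The sketch below implements one reading of "subtraction": keep the parts of the intervals in the first tier that are not covered by any interval in the second. Whether this matches ELAN's menu function exactly is an assumption.

```python
def subtract_intervals(a, b):
    """Return the parts of the intervals in `a` that are not covered by
    any interval in `b`. Intervals are (start, end) pairs."""
    result = []
    b = sorted(b)
    for start, end in sorted(a):
        cursor = start
        for b_start, b_end in b:
            if b_end <= cursor:
                continue          # entirely before the uncovered part
            if b_start >= end:
                break             # entirely after this interval
            if b_start > cursor:
                result.append((cursor, b_start))
            cursor = max(cursor, b_end)
            if cursor >= end:
                break
        if cursor < end:
            result.append((cursor, end))
    return result


# Hypothetical tiers, times in milliseconds.
tier1 = [(0, 1000), (2000, 3000)]
tier2 = [(500, 1500), (2200, 2400)]
difference = subtract_intervals(tier1, tier2)
```

The resulting pairs could then be added to a new tier, one annotation per pair.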

Associating a controlled vocabulary with a linguistic type?

I've looked through the API documentation and haven't found any means to do this.

Problem with the order of the annotations in the tiers

Hi (I'm here again, I'm sorry...),

I have found another issue, when I do simply:

eaf = pympi.Elan.Eaf(input_file)

eaf.to_file (output_file)

The annotations in the tiers of input_file and output_file are not in the same order, so when I use the add_ref_annotation() method, some values are repeated incorrectly.

Thank you so much.

Can you please update the version for pip install?

Expected behaviour
I love working with this package, and expected that pip install would install the most recent version, especially the Elan version error message suppression fix.

Actual behaviour
Installing with pip installs an old version.

System information

python version: 3
os: osx
are you up to date with the latest master?: n/a

copy tier within the same file

Expected behaviour
I need to copy (duplicate) a tier within one file or from one file to another, and I don't understand the syntax: how is the copy function supposed to work?

Actual behaviour
I'd like to copy a tier and change its name within one file. Next, I'd like to loop over tens of other eaf files. I thought copy_tier would work, but it doesn't. I've tried to copy a tier from one object to another, but then got errors (see screenshot).

unorderable types in python3

Everything has worked fine for me in python 2.7, and I certainly owe you a beer if we ever cross paths!
However, I've been working with some unicode annotation lately so I thought it would be better to use python3. In python3, I've been unable to get things working, because I get a TypeError when I try to load my .eaf file. It seems python 2 allows for comparison of int & NoneType, but python3 does not.
I'm trying to poke around in your code and get it working in python3. If I come up with a decent solution, I'll pass it along.
Here is the error:

myeaf = pympi.Elan.Eaf('120604_00.eaf', 'LK')
Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/pympi/Elan.py", line 117, in __init__
        parse_eaf(file_path, self)
    File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/pympi/Elan.py", line 1322, in parse_eaf
        if int(elem1.attrib['TIME_SLOT_ID'][2:]) > eaf_obj.maxts:
TypeError: unorderable types: int() > NoneType()

EAF 3.0 file format support

Currently only 2.7 and 2.8 are officially supported and a warning is given when another version is imported.

The scheme is available here: http://www.mpi.nl/tools/elan/EAFv3.0.xsd

A human readable explanation is available here:
https://www.mpi.nl/tools/elan/EAF_Annotation_Format_3.0_and_ELAN.pdf

It should be added to the supported list after checking and implementing the differences in spec.

problem with to_eaf

Hi,

I'm trying to save an elan file using the function pympi.Elan.to_eaf().
The saved file cannot be opened in ELAN, the screen remains blank.
I checked and all of the information seems to be present in the Eaf object. Any clue what could be wrong?

Thank you!
Best,
Eva

Can't remove tiers in loop

I have tens of eaf files with different tiers to be removed and I've written a script to remove them using a loop. Surprisingly, it doesn't work properly. I attach two sample eaf files and the script.
SAMPLE.zip
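
The attached script is not shown here, so the cause is unknown. Two common pitfalls with this kind of loop are deleting from a collection while iterating over it, and asking for the removal of a tier that a particular file does not contain. The sketch below avoids both on a plain dict; the tier table is hypothetical and the real structure inside pympi may differ.

```python
def remove_matching(tiers, unwanted):
    """Delete every key of `tiers` that appears in `unwanted`.
    Iterates over a snapshot of the keys so deleting is safe, and
    silently skips names that are not present."""
    removed = []
    for name in list(tiers):
        if name in unwanted:
            del tiers[name]
            removed.append(name)
    return removed


tiers = {"COACH": [], "CLIENT": [], "notes": []}
removed = remove_matching(tiers, {"CLIENT", "notes", "missing"})
```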

TextGrids cannot be read if they contain special/IPA characters

Expected behaviour
Read in a textgrid (long format) using: tg = pympi.Praat.TextGrid(path_to_textgrid)

Actual behaviour
Throws an AttributeError (included below) and halts if the contents of any interval tier contain non-ASCII characters such as ɪ or ŋ or ɛ. All other TextGrids are imported without issues as expected.

System information

python version: 3.x (Jupyter Notebook kernel)
os: Mac OS 13.4.1 (Ventura)
are you up to date with the latest master?: Yes

Offending notebook cell (which imports any TGs not containing ɛ or ɪ just fine):

for subj in os.listdir(corpus):
    for file in os.listdir(os.path.join(corpus,subj)):
        if not file.endswith(".TextGrid"):
            continue
        print(file)
        tg = pympi.Praat.TextGrid(os.path.join(corpus,subj,file))

Full traceback of the issue I am encountering is included below.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[40], line 11
      9     continue
     10 print(file)
---> 11 tg = pympi.Praat.TextGrid(os.path.join(corpus,subj,file))
     12 for tier in tg.get_tiers():
     13     print(tier.name)

File ~/miniconda3/envs/cameroon/lib/python3.11/site-packages/pympi/Praat.py:44, in TextGrid.__init__(self, file_path, xmin, xmax, codec)
     42 else:
     43     with open(file_path, 'rb') as f:
---> 44         self.from_file(f, codec)

File ~/miniconda3/envs/cameroon/lib/python3.11/site-packages/pympi/Praat.py:101, in TextGrid.from_file(self, ifile, codec)
     99 # Skip the Headers and empty line
    100 next(ifile), next(ifile), next(ifile)
--> 101 self.xmin = float(nn(ifile, regfloat))
    102 self.xmax = float(nn(ifile, regfloat))
    103 # Skip <exists>

File ~/miniconda3/envs/cameroon/lib/python3.11/site-packages/pympi/Praat.py:94, in TextGrid.from_file.<locals>.nn(ifile, pat)
     92 def nn(ifile, pat):
     93     line = next(ifile).decode(codec)
---> 94     return pat.search(line).group(1)

AttributeError: 'NoneType' object has no attribute 'group'
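
The AttributeError comes from calling `.group(1)` on a failed match. A file saved in a different encoding than the one used for decoding (Praat may write UTF-16 when the text contains non-ASCII characters) would produce exactly that, although this is not confirmed for the files in this report. A guard could turn the failure into a readable message. The pattern below is a stand-in for illustration; the pattern used by pympi may differ.

```python
import re

# Stand-in pattern for illustration only.
FLOAT_AT_END = re.compile(r"([\d.]+)\s*$")


def read_float_field(raw_line, codec="utf-8"):
    """Decode one line and return the trailing number as a string,
    raising a descriptive ValueError instead of AttributeError."""
    line = raw_line.decode(codec)
    match = FLOAT_AT_END.search(line)
    if match is None:
        raise ValueError(
            f"expected a number at the end of {line!r}; "
            f"is the file really encoded as {codec}?")
    return match.group(1)
```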

Deeper nested child annotations

Hi! I have been trying to automate some of our project workflows with pympi, but I have run into problems with the tier structure I want. The structure is:

- reference tier (refT, independent)
    \- transcription tier (orthT, symbolic association)
        \- token tier (wordT, symbolic subdivision)

The starting point looks like this:

And with the token tier populated it will be like this:

The problem I have is that it doesn't seem possible to create the annotations on the word level so that they are correctly referenced to the transcription tier; it seems necessary to set the references to the topmost tier.

If I do:

elan_file = pympi.Eaf(file_path='example.eaf')
elan_file.add_ref_annotation(id_tier='word@Niko', tier2='ref@Niko', time=10, value='Words')
elan_file.add_ref_annotation(id_tier='word@Niko', tier2='ref@Niko', time=10, value='here', prev='a' + str(elan_file.maxaid))
elan_file.add_ref_annotation(id_tier='word@Niko', tier2='ref@Niko', time=10, value='.', prev='a' + str(elan_file.maxaid))

Things go fine, and the file works, but the internal arrangement of references is quite different from what is got when the annotations are added in ELAN, for example by tokenizing the transcription tier with "Tokenize tier…".

If I try to refer directly into transcription level tiers I get an error:

elan_file.add_ref_annotation(id_tier='word@Niko', tier2='orth@Niko', time=10, value='Words')
elan_file.add_ref_annotation(id_tier='word@Niko', tier2='orth@Niko', time=10, value='here', prev='a' + str(elan_file.maxaid))
elan_file.add_ref_annotation(id_tier='word@Niko', tier2='orth@Niko', time=10, value='.', prev='a' + str(elan_file.maxaid))
...
/Users/niko/.local/lib/python3.6/site-packages/pympi/Elan.py in add_ref_annotation(self, id_tier, tier2, time, value, prev, svg)
    332                 break
    333         if not ann:
--> 334             raise ValueError('There is no annotation to reference to.')
    335         aid = self.generate_annotation_id()
    336         self.annotations[aid] = id_tier

ValueError: There is no annotation to reference to.

In ELAN XML the problem looks like this, the question mark points where the annotation should, as far as I understand, refer to:

Is there some way to add the lower-level tiers with correct references? Of course it seems that the current arrangement also works, but it is bit dangerous in the longer run as programmatically manipulated files may have different structure from the ones which have been edited manually, and differences in tier structures make it impossible to parse the content correctly just by looking into tier relations and id's. Or the logic would be different between files.

Script working fine until I save file in ELAN 5.9 and EAF file gets corrupted

Hi, I'm trying to add some tiers and non-overlapping segments to my EAF file.

I'm using the following code:

    eaf = pympi.Eaf(fullPath)
    # nothing happens if the tier already exists
    eaf.add_tier("code")
    eaf.add_tier("code_num")
    eaf.add_tier("on_off")
    eaf.add_tier("context")
    eaf.add_tier("note")
    
    i = 0
    for segmento in row["Tiempos en milisegundos"].split(" "):
        segmento = segmento.split("-")
        timeFrom = int(segmento[0])
        timeTo = int(segmento[1])

        eaf.add_annotation("code", timeFrom, timeTo, value="")
        eaf.add_annotation("code_num", timeFrom, timeTo, value=str(i))
        eaf.add_annotation("on_off", timeFrom, timeTo, value=f"{timeFrom}_{timeTo}")
        eaf.add_annotation("context", timeFrom - 120000, timeTo + 60000, value=" ")
        eaf.add_annotation("note", timeFrom - 120000, timeTo + 60000, value="RandomSampling para variation sets con ACLEW")
        eaf.to_file(f"{targetDir}/{filename}")
        
        i += 1

The EAF files are created and I can open them with ELAN 5.9. I can see selected segments and everything seems to be working fine.

The problem is when I add a new segment from ELAN and save, the file gets corrupted and cannot be opened any more.

Examining the EAF file I can see that for instance these lines:

<TIER TIER_ID="code" LINGUISTIC_TYPE_REF="dependency">
    <ANNOTATION>
        <ALIGNABLE_ANNOTATION ANNOTATION_ID="a3327" TIME_SLOT_REF1="ts6653" TIME_SLOT_REF2="ts6654">
	        <ANNOTATION_VALUE />
        </ALIGNABLE_ANNOTATION>
    </ANNOTATION>

become:

 <TIER LINGUISTIC_TYPE_REF="dependency" TIER_ID="code">
    <ANNOTATION>
        <ALIGNABLE_ANNOTATION ANNOTATION_ID="a3332" TIME_SLOT_REF1="" TIME_SLOT_REF2="">
            <ANNOTATION_VALUE></ANNOTATION_VALUE>
        </ALIGNABLE_ANNOTATION>
    </ANNOTATION>

TIME_SLOT_REF1 and 2 are empty! :(

The original EAF files were created using chat2elan from the CLAN project. Opening and editing these EAF files using ELAN 5.9 works just fine.

System information

python version: 3.8.1
os: Linux Mint 19.3 based on Ubuntu 18.04
are you up to date with the latest master?: I tried with pip install version and also cloning this repo
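
One detail in the script that may be worth ruling out: `timeFrom - 120000` becomes negative for any segment that starts in the first two minutes. Whether this is related to the corruption is not established here. A sketch of the segment parsing with the context window clamped at zero (helper names are made up):

```python
def parse_segments(text):
    """Turn "100-200 300-450" into [(100, 200), (300, 450)]."""
    segments = []
    for chunk in text.split():
        start, end = chunk.split("-")
        segments.append((int(start), int(end)))
    return segments


def context_window(start, end, before=120000, after=60000):
    """Widen a segment, never letting the start drop below 0 ms."""
    return (max(0, start - before), end + after)
```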

Which minimal python version should we target?

py3.5 is already EOL - but still in use (e.g. by me :) and fairly easy to support). py3.4 has some limitations in pathlib, e.g. read_text is missing. So I'd vote for >=3.5.

Extracting new object

Expected behaviour
I'd like to trim annotations within certain boundaries and I thought the extract method would help me do that. Suppose I have a 10-minute-long eaf file and I'd like to keep 5 minutes of annotations from it on all tiers.

Actual behaviour
I've tried it with big and small files, including a test file with a few annotations on one tier, and the extract method does not trim annotations. Is that normal behaviour? Is there any other method for that?
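
Independently of what extract does, trimming can be expressed on the annotation tuples themselves. The sketch below assumes (begin, end, value) tuples in milliseconds and is not a description of the pympi method.

```python
def trim(annotations, start, end):
    """Clip (begin, end, value) tuples to the window [start, end] and
    drop annotations that fall completely outside it."""
    trimmed = []
    for begin, finish, value in annotations:
        if finish <= start or begin >= end:
            continue  # no overlap with the window
        trimmed.append((max(begin, start), min(finish, end), value))
    return trimmed


# Hypothetical tier: keep only what lies between 0.5 s and 5 s.
tier = [(0, 1000, "a"), (4000, 6000, "b"), (7000, 8000, "c")]
window = trim(tier, 500, 5000)
```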

Fix failing tests

Some tests are failing:

test_clean_time_slots
test_extract
test_shift_annotations
