audformat's Issues

audformat.testing.create_audio_files does not set db.root

Example:

import audformat.testing

db = audformat.testing.create_db(minimal=True)
db.raters['rater'] = audformat.Rater()
db.schemes['str'] = audformat.Scheme(str)
audformat.testing.add_table(db, 'table', 'segmented', columns={'str': ('str', 'rater')})
audformat.testing.create_audio_files(db, './database', file_duration='0.1s')

db.root

Afterwards, db.root does not return the expected value. It should return audeer.safe_path('./database') instead.

audformat.define.Usage is not a type

As discussed in #34 (comment) we should not write:

usage: define.Usage = define.Usage.UNRESTRICTED

as

type(audformat.define.Usage.COMMERCIAL) != audformat.define.Usage

This holds for other define entries as well.

to_segmented_index() returns wrong index type for end

Compare the output of:

>>> idx = audformat.segmented_index('a')
>>> idx
MultiIndex([('a', '0 days', NaT)],
           names=['file', 'start', 'end'])
>>> idx.get_level_values('end')
TimedeltaIndex([NaT], dtype='timedelta64[ns]', name='end', freq=None)

with

>>> idx = audformat.utils.to_segmented_index(audformat.filewise_index('a'))
>>> idx
MultiIndex([('a', '0 days', 'NaT')],
           names=['file', 'start', 'end'])
>>> idx.get_level_values('end')
DatetimeIndex(['NaT'], dtype='datetime64[ns]', name='end', freq=None)

This is very unfortunate as it makes it much harder to work with the indices in other applications, e.g. it is not obvious how to easily calculate a duration when the type of end can differ.

Progress bar in pick/drop_files() and map_files()

On a large Database object, using the methods pick_files(), drop_files() and map_files() may take quite some time to complete. We should consider adding a verbose argument to display a progress bar.
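
A rough sketch of what the verbose option could look like, using a tqdm-style progress bar on top of the existing Database.map_files() (names and details are assumptions, not the final API):

import tqdm


def map_files_with_progress(db, func, verbose=False):
    # Pre-compute the mapping on a list of files so the progress bar
    # knows the total, then apply it via the existing Database.map_files().
    files = list(db.files)
    mapping = {
        file: func(file)
        for file in tqdm.tqdm(files, desc='Map files', disable=not verbose)
    }
    db.map_files(lambda file: mapping[file])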

Speed up conversion to segmented table without NaT

When converting from a filewise to a segmented index we usually call utils.to_segmented_index(), where we have the option to set the end of the segments to the file duration. This can be a bottleneck for large tables since the file duration has to be calculated for every file. However, usually we get our tables from a database that we load with audb, where the duration of every file is stored in the dependency table. So I wonder if we can find a way to benefit from this information to speed up the conversion to segmented tables.

A possible solution might be that audb attaches a table with file durations to the Database object it returns, and we add a segmented option to Table.get() and Column.get(). If set to True, we return a segmented index where the file duration can be accessed directly from the attached duration table.
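
A rough sketch of how the lookup could work, assuming a pandas Series with the duration in seconds per file is attached to the database (all names here are made up):

import pandas as pd


def segmented_index_from_durations(files, durations):
    # ``durations`` maps file -> duration in seconds,
    # e.g. taken from the audb dependency table.
    starts = pd.to_timedelta([0] * len(files), unit='s')
    ends = pd.to_timedelta(durations.loc[files].values, unit='s')
    return pd.MultiIndex.from_arrays(
        [files, starts, ends],
        names=['file', 'start', 'end'],
    )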

audformat.Table.update() too restrictive

At the moment we require that the scheme of a table that is used to update an existing one has to match exactly, but there are certain scenarios where it makes total sense that it is only included in the existing scheme (i.e. a subset of it) without matching.

E.g. consider the following scenario with speaker ID labels:

import audformat.testing


db = audformat.testing.create_db(minimal=True)
db.schemes['s'] = audformat.Scheme(str, labels=['1', '2'])
db['t'] = audformat.Table(audformat.filewise_index(['a', 'b']))
db['t']['s'] = audformat.Column(scheme_id='s')
db['t']['s'].set(['1', '2'])

db_new = audformat.testing.create_db(minimal=True)
db_new.schemes['s'] = audformat.Scheme(str, labels=['1'])
db_new['t'] = audformat.Table(audformat.filewise_index(['c']))
db_new['t']['s'] = audformat.Column(scheme_id='s')
db_new['t']['s'].set(['1'])

db.update(db_new)

this fails with

...
ValueError: Cannot update database, found different value for 'db.schemes['s']':
dtype: str
labels: ['1', '2']
!=
dtype: str
labels: ['1']
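
A relaxed check could accept the update when the incoming scheme is included in the existing one, i.e. the dtypes agree and the new labels are a subset of the existing labels. A minimal sketch of the idea (not the actual implementation):

def schemes_compatible(scheme, scheme_new):
    # Accept an update if dtypes agree and the new labels are contained
    # in the existing labels, instead of requiring exact equality.
    if scheme.dtype != scheme_new.dtype:
        return False
    if scheme.labels is None or scheme_new.labels is None:
        return scheme.labels == scheme_new.labels
    return set(scheme_new.labels) <= set(scheme.labels)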

How to combine all tables of a database

We have the nice addition feature in audformat:

import audb

db = audb.load('emodb', full_path=False)
(db['emotion'] + db['files']).get()

which results in

                   emotion  @emotion               duration speaker transcription
file                                                                             
wav/03a01Fa.wav  happiness      0.90 0 days 00:00:01.898250       3           a01
wav/03a01Nc.wav    neutral      1.00 0 days 00:00:01.611250       3           a01
wav/03a01Wa.wav      anger      0.95 0 days 00:00:01.877812       3           a01
wav/03a02Fc.wav  happiness      0.85 0 days 00:00:02.006250       3           a02
wav/03a02Nc.wav    neutral      1.00 0 days 00:00:01.439812       3           a02
...                    ...       ...                    ...     ...           ...
wav/16b10Lb.wav    boredom      1.00 0 days 00:00:03.442687      16           b10
wav/16b10Tb.wav    sadness      0.90 0 days 00:00:03.500625      16           b10
wav/16b10Td.wav    sadness      0.95 0 days 00:00:03.934187      16           b10
wav/16b10Wa.wav      anger      1.00 0 days 00:00:02.414125      16           b10
wav/16b10Wb.wav      anger      1.00 0 days 00:00:02.522499      16           b10

[535 rows x 5 columns]

But the following does not work:

sum([db[table] for table in db.tables]).get()

this results in

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-db0f88d6bede> in <module>
----> 1 sum([db[table] for table in db.tables]).get()

TypeError: unsupported operand type(s) for +: 'int' and 'Table'

I'm also not sure if it should work, but as __add__ is working, I thought sum() should work as well?
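
The reason is that sum() starts from the integer 0 and therefore evaluates 0 + Table first, which fails unless Table also implements __radd__. Until that is decided, a workaround that only relies on the existing __add__ is to reduce over the tables directly:

import functools

# db as loaded with audb above
tables = [db[table] for table in db.tables]
combined = functools.reduce(lambda t1, t2: t1 + t2, tables)
print(combined.get())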

Populate string-based scheme with NaN values

I want to add data from a pandas.DataFrame column to a Column that corresponds to a string scheme, where the dataframe column has NaN entries.

Code sample to reproduce error:

import pandas as pd

import audformat

db = audformat.Database(name='foo')

df = pd.DataFrame()
df['file'] = ['A', 'B', 'C']
df['bar'] = ['C', 'D', None]
df.set_index('file', inplace=True)

db.schemes['bar'] = audformat.Scheme(labels=['C', 'D'])
db.tables['bar'] = audformat.Table(index=df.index)
db.tables['bar']['bar'] = audformat.Column(scheme_id='bar')
db['bar']['bar'].set(df['bar'])

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/audeering.local/atriant/envs/devaice/lib/python3.6/site-packages/audformat/core/column.py", line 251, in set
    assert_values(values, scheme)
  File "/home/audeering.local/atriant/envs/devaice/lib/python3.6/site-packages/audformat/core/column.py", line 41, in assert_values
    values = np.unique(values)
  File "<__array_function__ internals>", line 6, in unique
  File "/home/audeering.local/atriant/envs/devaice/lib/python3.6/site-packages/numpy/lib/arraysetops.py", line 261, in unique
    ret = _unique1d(ar, return_index, return_inverse, return_counts)
  File "/home/audeering.local/atriant/envs/devaice/lib/python3.6/site-packages/numpy/lib/arraysetops.py", line 322, in _unique1d
    ar.sort()
TypeError: '<' not supported between instances of 'NoneType' and 'str'

This used to work with audata. The offending line is this: https://github.com/audeering/audformat/blob/master/audformat/core/column.py#L41, as np.unique does not work on a mix of strings and None/float values. See:

import numpy as np

np.unique(['A', 'B', None])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<__array_function__ internals>", line 6, in unique
  File "/home/audeering.local/atriant/envs/devaice/lib/python3.6/site-packages/numpy/lib/arraysetops.py", line 261, in unique
    ret = _unique1d(ar, return_index, return_inverse, return_counts)
  File "/home/audeering.local/atriant/envs/devaice/lib/python3.6/site-packages/numpy/lib/arraysetops.py", line 322, in _unique1d
    ar.sort()
TypeError: '<' not supported between instances of 'NoneType' and 'str'

If I pass in the data as db['bar']['bar'].set(df['bar'].dropna()) instead, then I get a different error because the indices do not match.

@frankenjoe @hagenw if this is confirmed on your side as unwanted behavior, I propose to change the offending line from np.unique(values) to set(values) to solve it:

>>> set(['A', 'B', None])
{None, 'A', 'B'}
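
An alternative sketch that also copes with missing values would be to let pandas compute the unique values and drop NaN/None before checking against the scheme labels:

import pandas as pd

values = pd.Series(['A', 'B', None])
print(values.dropna().unique())
# ['A' 'B']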

Make objects like Table and Column independent of database

At the moment a Table stores inside its object that it is connected to a database.
This leads to several problems when doing something like table1 + table2 which would return a table that is not connected to any database.
This might be ok as users just want to access the data. The downside of detached tables is that we will lose the scheme information of the underlying columns.
See also the discussion in #49 (comment)

If we changed the current behavior, it would break the API.

Summation of tables is not commutative

We allow for addition of tables, in which case I would expect that the order in which we list the tables does not matter, but this is not the case:

import audformat


files = ['file1.wav', 'file2.wav', 'file3.wav']
starts = [0, 1]
ends = [1, 2]
set1 = [True, False, False]
set2 = [False, True, False]
set3 = [False, False, True]
rating1 = [1, 0]
rating2 = [0, 1]

db = audformat.Database(name='test')
db.schemes['rating'] = audformat.Scheme(dtype='int')
db.schemes['sets'] = audformat.Scheme(dtype='str')
db.splits['train'] = audformat.Split(type='train')
db.splits['test'] = audformat.Split(type='test')

index = audformat.filewise_index(files)
db['files'] = audformat.Table(index)
db['files']['set1'] = audformat.Column(scheme_id='sets')
db['files']['set1'].set(set1)
db['files']['set2'] = audformat.Column(scheme_id='sets')
db['files']['set2'].set(set2)
db['files']['set3'] = audformat.Column(scheme_id='sets')
db['files']['set3'].set(set3)

index = audformat.segmented_index((files[0], files[0]), starts, ends)
db['rating.train'] = audformat.Table(index, split_id='train')
db['rating.train']['rating'] = audformat.Column(scheme_id='rating')
db['rating.train']['rating'].set(rating1)
index = audformat.segmented_index((files[1], files[1]), starts, ends)
db['rating.test'] = audformat.Table(index, split_id='test')
db['rating.test']['rating'] = audformat.Column(scheme_id='rating')
db['rating.test']['rating'].set(rating2)

Then do:

>>> (db['files'] + db['rating.train'] + db['rating.test']).get()
                                            set1   set2   set3  rating
file      start           end
file1.wav 0 days 00:00:00 0 days 00:00:01   True  False  False       1
          0 days 00:00:01 0 days 00:00:02   True  False  False       0
file2.wav 0 days 00:00:00 0 days 00:00:01    NaN    NaN    NaN       0
                          NaT              False   True  False    <NA>
          0 days 00:00:01 0 days 00:00:02    NaN    NaN    NaN       1
file3.wav 0 days 00:00:00 NaT              False  False   True    <NA>

>>> (db['rating.train'] + db['rating.test'] + db['files']).get()
                                           rating   set1   set2   set3
file      start           end
file1.wav 0 days 00:00:00 0 days 00:00:01       1   True  False  False
          0 days 00:00:01 0 days 00:00:02       0   True  False  False
file2.wav 0 days 00:00:00 0 days 00:00:01       0  False   True  False
          0 days 00:00:01 0 days 00:00:02       1  False   True  False
file3.wav 0 days 00:00:00 NaT                <NA>  False  False   True

>>> (db['rating.train'] + db['files'] + db['rating.test']).get()
                                           rating   set1   set2   set3
file      start           end                                         
file1.wav 0 days 00:00:00 0 days 00:00:01       1   True  False  False
          0 days 00:00:01 0 days 00:00:02       0   True  False  False
file2.wav 0 days 00:00:00 0 days 00:00:01       0    NaN    NaN    NaN
                          NaT                <NA>  False   True  False
          0 days 00:00:01 0 days 00:00:02       1    NaN    NaN    NaN
file3.wav 0 days 00:00:00 NaT                <NA>  False  False   True

I would say the result in the middle is what we would expect, because it allows for:

>>> df = (db['rating.train'] + db['rating.test'] + db['files']).get()
>>> df[df.set2 == True]['rating']
file       start            end            
file2.wav  0 days 00:00:00  0 days 00:00:01    0
           0 days 00:00:01  0 days 00:00:02    1
Name: rating, dtype: Int64

which is identical to

>>> (db['rating.train'] + db['rating.test']).get(index=db['files'].df[db['files'].df.set2 == True].index)
                                           rating
file      start           end                    
file2.wav 0 days 00:00:00 0 days 00:00:01       0
          0 days 00:00:01 0 days 00:00:02       1

But those will not work for the first and last example, e.g.

>>> df = (db['rating.train'] + db['files'] + db['rating.test']).get()
>>> df[df.set2 == True]['rating']
file       start   end
file2.wav  0 days  NaT    <NA>
Name: rating, dtype: Int64

I'm not completely sure yet if this is the cause of the error we discussed in the chat on Friday, but I think we should tackle this one first.

Add error message if something other than an index is used for Table.get(index=)?

First, create an example database with a filewise and a segmented table:

import audformat


files = ['file1.wav']
starts = [0, 1]
ends = [1, 2]
duration = [1]
rating = [1, 0]

db = audformat.Database(name='test')
db.schemes['rating'] = audformat.Scheme(dtype='int')
db.schemes['duration'] = audformat.Scheme(dtype='time')

index = audformat.filewise_index(files)
db['files'] = audformat.Table(index)
db['files']['duration'] = audformat.Column(scheme_id='duration')

index = audformat.segmented_index([files[0], files[0]], starts, ends)
db['rating'] = audformat.Table(index)
db['rating']['rating'] = audformat.Column(scheme_id='rating')
db['rating']['rating'].set(rating)

Then the following works nicely:

>>> db['rating'].get(index=db['files'].index)
                                           rating
file      start           end                    
file1.wav 0 days 00:00:00 0 days 00:00:01       1
          0 days 00:00:01 0 days 00:00:02       0

and

>>> db['rating'].get(index=db['files'].df.index)
                                           rating
file      start           end                    
file1.wav 0 days 00:00:00 0 days 00:00:01       1
          0 days 00:00:01 0 days 00:00:02       0

but not

>>> db['rating'].get(index=db['files'].df)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-30-d0e28b6c3201> in <module>
----> 1 db['rating'].get(index=db['files'].df)

~/.envs/audformat/lib/python3.6/site-packages/audformat/core/table.py in get(self, index, map, copy)
    495                 result = self._df.loc[index]
    496             else:
--> 497                 files = index.get_level_values(define.IndexField.FILE)
    498                 if self.is_filewise:  # index is segmented
    499                     result = pd.DataFrame(

~/.envs/audformat/lib/python3.6/site-packages/pandas/core/generic.py in __getattr__(self, name)
   5139             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5140                 return self[name]
-> 5141             return object.__getattribute__(self, name)
   5142 
   5143     def __setattr__(self, name: str, value) -> None:

AttributeError: 'DataFrame' object has no attribute 'get_level_values'

This is expected, as the documentation states that you have to provide an index, but I'm wondering if we should add an error message pointing in that direction as well?
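
A minimal sketch of such a check at the top of Table.get() (function name and wording are assumptions):

import pandas as pd


def _assert_index_argument(index):
    # Fail early with a hint at the likely mistake
    # (passing a dataframe instead of its index).
    if index is not None and not isinstance(index, pd.Index):
        raise ValueError(
            f"'index' must be a pandas.Index, but got "
            f"{type(index).__name__}. Did you mean to pass 'df.index'?"
        )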

Avoid index.empty

In #107 and #108 we showed that we can significantly speed up loading a database by avoiding index.empty. There are two more uses of index.empty in utils.duration() and utils.concat(). We should fix it there, too.
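
For reference, a small sketch of the replacement; both variants give the same answer, but len() avoids the extra work that .empty can trigger on a MultiIndex:

import pandas as pd

idx = pd.MultiIndex.from_arrays([[], [], []], names=['file', 'start', 'end'])

# instead of: if idx.empty:
if len(idx) == 0:
    print('index is empty')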

Speedup by specifying dtype for index columns when reading CSV files

Reading CSV files with pandas.read_csv() is faster if you provide data types of the columns, e.g. https://towardsdatascience.com/๏ธ-load-the-same-csv-file-10x-times-faster-and-with-10x-less-memory-๏ธ-e93b485086c7

At the moment we specify data types for all the data columns,
but not for the index columns.

I'm also wondering if dtype = schemes[column.scheme_id].to_pandas_dtype() is sufficient to detect categorical data types instead of strings.
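
A sketch of what passing dtypes and converters for the index columns could look like for a segmented table (the CSV content and reader options here are assumptions, not the current loading code):

import io

import pandas as pd

csv = 'file,start,end,value\nf1.wav,0 days,0 days 00:00:01,0.5\n'
df = pd.read_csv(
    io.StringIO(csv),
    dtype={'file': 'string', 'value': 'float64'},
    converters={'start': pd.to_timedelta, 'end': pd.to_timedelta},
    index_col=['file', 'start', 'end'],
)
print(df)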

Allow Database.load() to not load the tables into memory

For databases with lots of annotations it can happen that we have many CSV files with a total size of more than 5 GB.
When loading such a database it can take a very long time (>30 minutes).

One solution would be to not load all the CSV files into their dataframes immediately, but only when they are requested by Column.get(), Table.get(), or Table.df.
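
A minimal sketch of the idea (this is not the actual audformat API): the CSV file is only parsed the first time the dataframe is accessed.

import pandas as pd


class LazyTable:
    def __init__(self, path):
        self._path = path
        self._df = None

    @property
    def df(self):
        # Parse the CSV file on first access and cache the result.
        if self._df is None:
            self._df = pd.read_csv(self._path)
        return self._df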

Improve format specification headings

Our current naming of the sections in the documentation makes total sense for the navigation menu:

(screenshot of the navigation menu omitted)

But it becomes less obvious if you navigate on a small screen where you don't see that menu and use the next/previous buttons, because then you will see a page called Introduction in the middle of the documentation and have no clue that you are now in a new chapter.

This problem is not urgent, and I also don't have a good idea how to fix this.

Request table by index with additional files

Currently, this is not working:

db = audformat.testing.create_db(minimal=True)
db.schemes['int'] = audformat.Scheme(int)
audformat.testing.add_table(
    db,
    'files',
    'filewise',
    num_files=[0, 1],
)
print(db['files'].get())
audformat.testing.add_table(
    db,
    'segments',
    'segmented',
    num_files=[1, 2],
)
print(db['segments'].get())
df = db['files'].get(index=db['segments'].index)
print(df)
               int
file              
audio/000.wav   42
audio/001.wav   33
                                                                   int
file          start                     end                           
audio/001.wav 0 days 00:00:00.112525598 0 days 00:00:00.651033666   29
              0 days 00:00:00.774044425 0 days 00:00:01.252506888   53
              0 days 00:00:02.059782689 0 days 00:00:02.436929941   91
              0 days 00:00:02.506858415 0 days 00:00:02.967737843   97
              0 days 00:00:03.548951851 0 days 00:00:04.280189899   40
audio/002.wav 0 days 00:00:01.050809893 0 days 00:00:01.472755921   70
              0 days 00:00:01.583978939 0 days 00:00:01.859228829   82
              0 days 00:00:02.071727758 0 days 00:00:03.610085480   97
              0 days 00:00:03.890518902 0 days 00:00:03.916600049   67
              0 days 00:00:04.290822547 0 days 00:00:04.882961055   20
Traceback (most recent call last):
...
    "Passing list-likes to .loc or [] with any missing labels "
KeyError: "Passing list-likes to .loc or [] with any missing labels is no longer supported. The following labels were missing: Index(['audio/002.wav', 'audio/002.wav', 'audio/002.wav', 'audio/002.wav',\n       'audio/002.wav'],\n      dtype='object', name='file'). See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"

As we can see, pandas is complaining that db['segments'].index has a reference to a file 'audio/002.wav' that is not in db['files'].

Question: should we add support for this case? And if so, how should we handle it? Just ignore those files?

cc @hagenw
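
One possible behavior would be to silently ignore files that are missing from the requested table, which a user can already emulate today, e.g. (continuing the example above):

# Keep only index rows whose file also exists in db['files'].
files = db['files'].df.index
index = db['segments'].index
index = index[index.get_level_values('file').isin(files)]
df = db['files'].get(index=index)
print(df)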

Forbid duplicated index entries?

As some methods like audformat.Table.drop_files() will fail if the database contains duplicated index entries, we should maybe forbid this in the first place.

The question is if we can find a good place to do this. The first places that come to mind are Column.set() and maybe Database.save(). I guess this will still not be completely safe as you can also directly assign the dataframe, but maybe that is then not a bug, but a feature for the power user.
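
A sketch of what such a check could look like (the helper name is an assumption):

import audformat


def assert_no_duplicates(index):
    # Could be called from Column.set() and/or Database.save().
    if index.has_duplicates:
        duplicates = index[index.duplicated()].unique()
        raise ValueError(
            f'Found duplicated index entries: {list(duplicates)}'
        )


assert_no_duplicates(audformat.filewise_index(['a.wav', 'b.wav']))  # passes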

Add conversion examples for common databases

I think we should add example conversion scripts to audformat for the following databases:

  • emodb
  • iemocap
  • librispeech
  • mozillacommonvoice
  • msppodcast

And maybe

  • audioset
  • cmu-mosei
  • timit
  • voxceleb1
  • voxceleb2

The question is how to do this best. We could add a single repository where we collect those conversion scripts, e.g. audformat-examples. Or we could create one repository per database. I'm also not sure how to best combine it with the code of publishing databases with audb for those where we plan to do it, e.g. emodb.

Utility function for computing EWE

The topic of computing the EWE (evaluator weighted estimator) has come up several times. As its computation is straightforward but not trivial, I would offer to add a utility function here to compute it, then a user can automatically compute it for their dataset and easily add it to a conversion script.

The approach I have in mind is to add two functions, one to compute annotator confidence, and the other to compute the EWE. Those will roughly look as follows:

import pandas as pd


class ComputeEWE:
    def __init__(self, confidences):
        self.confidences = confidences

    def __call__(self, row):
        # NaN != NaN, so this keeps only raters with a value in this row
        raters = [x for x in self.confidences if row[x] == row[x]]
        total = sum([row[x] * self.confidences[x] for x in raters])
        total /= sum([self.confidences[x] for x in raters])
        return total


def compute_ewe(df, confidences):
    rater_names = list(set(confidences.keys()) & set(df.columns))
    valid_confidences = {}
    for key in rater_names:
        valid_confidences[key] = confidences[key]
    return pd.DataFrame(
        data=df.apply(ComputeEWE(valid_confidences), axis=1),
        index=df.index,
        columns=['EWE'],
    )


def rater_confidence(df, raters=None):
    if raters is None:
        raters = df.columns
    confidences = {}
    for rater in raters:
        df_rater = df[rater].dropna().astype(float)
        df_others = df.drop(rater, axis=1).mean(axis=1).dropna()
        indices = df_rater.index.intersection(df_others.index)
        # assumes audbenchmark is available for the correlation metric
        confidences[rater] = audbenchmark.metric.pearson_cc(
            df_rater.loc[indices],
            df_others.loc[indices],
        )
    return confidences
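
For illustration, the proposed functions could then be used like this on a toy ratings dataframe with two raters (column names are made up; the confidences would normally come from rater_confidence()):

df = pd.DataFrame({'rater1': [0.2, 0.8, 0.5], 'rater2': [0.3, 0.7, 0.6]})
confidences = {'rater1': 0.9, 'rater2': 0.8}
print(compute_ewe(df, confidences))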

@hagenw @frankenjoe what do you think about this?

Table.drop_index is not working as expected

Try:

import audformat
import audformat.testing


db = audformat.testing.create_db(minimal=True)
db.name = 'testing'
db.schemes['scheme'] = audformat.Scheme(
    labels=['positive', 'neutral', 'negative']
)   
audformat.testing.add_table(
    db, 
    'emotion',
    audformat.define.IndexType.SEGMENTED,
    num_files=5,
    columns={'emotion': ('scheme', None)}
)   
db.schemes['speaker'] = audformat.Scheme(
    labels=['adam', 'eve']
)   
db['files'] = audformat.Table(db.files)
db['files']['speaker'] = audformat.Column(scheme_id='speaker')
db['files']['speaker'].set(
    ['adam', 'adam', 'eve', 'eve'],
    index=audformat.filewise_index(db.files[:4]),
)

This results in:

>>> db.files
Index(['audio/001.wav', 'audio/002.wav', 'audio/003.wav', 'audio/004.wav',
       'audio/005.wav'],
      dtype='object', name='file')

Now, let's try to remove one entry:

>>> import os
>>> remove_file = os.path.join('audio', '001.wav')
>>> db['files'].drop_index(audformat.filewise_index(remove_file), inplace=True)
>>> db.files
Index(['audio/001.wav', 'audio/002.wav', 'audio/003.wav', 'audio/004.wav',
       'audio/005.wav'],
      dtype='object', name='file')

The file 'audio/001.wav' is still part of the index, but it shouldn't be.

Implement __eq__ for Database

Try:

>>> import audformat
>>> import copy
>>> db1 = audformat.testing.create_db(minimal=True)
>>> db2 = copy.deepcopy(db1)
>>> db1 == db2
False

But it should return True. This is not the case as we have not implemented __eq__ for database objects.
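
A possible sketch of what __eq__ could do, assuming that comparing the dumped header via str() plus all table dataframes is sufficient (this is not existing code):

def databases_equal(db1, db2):
    # Compare the header dump and all table dataframes.
    if str(db1) != str(db2):
        return False
    if sorted(db1.tables) != sorted(db2.tables):
        return False
    return all(db1[name].df.equals(db2[name].df) for name in db1.tables)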

Database.update() with copy_media=True ends up with tmp folder

I have not created a minimal example yet, but when trying to add new data for mozillacommonvoice into the build dir I ran the following command:

db.update(db_new, copy_media=True)

Afterwards, all the new data should also be stored inside build. Instead it is stored under build~.

Enhance output of audformat.Database.description

At the moment we get the following:

>>> import audb
>>> db = audb.load('emodb', version='1.1.0')
>>> db.description
'Berlin Database of Emotional Speech. A German database of emotional utterances spoken by actors recorded as a part of the DFG funded research project SE462/3-1 in 1997 and 1999. Recordings took place in the anechoic chamber of the Technical University Berlin, department of Technical Acoustics. It contains about 500 utterances from ten different actors expressing basic six emotions and neutral.'

which is not very nice to read.

It gets even worse if you have some real formatting in the description string.
For example, for audioset the description contains:

AudioSet ontology categories of the two top hierarchies:

Human sounds            Animal                   Music
|-Human voice           |-Domestic animals, pets |-Musical instrument
|-Whistling             |-Livestock, farm        |-Music genre
|-Respiratory sounds    | animals, working       |-Musical concepts
|-Human locomotion      | animals                |-Music role
|-Digestive             \-Wild animals           \-Music mood
|-Hands
|-Heart sounds,         Sounds of things         Natural sounds
| heartbeat             |-Vehicle                |-Wind
|-Otoacoustic emission  |-Engine                 |-Thunderstorm
\-Human group actions   |-Domestic sounds,       |-Water
                        | home sounds            \-Fire
Source-ambiguous sounds |-Bell
|-Generic impact sounds |-Alarm                  Channel, environment
|-Surface contact       |-Mechanisms             and background
|-Deformable shell      |-Tools                  |-Acoustic environment
|-Onomatopoeia          |-Explosion              |-Noise
|-Silence               |-Wood                   \-Sound reproduction
\-Other sourceless      |-Glass
                        |-Liquid
                        |-Miscellaneous sources
                        \-Specific impact sounds

which would be nice to preserve when printing to the screen.

Where to store license of a database?

We don't have any db.license entry at the moment. Before it was handled inside the Gradle settings, but I think we should add an option to store it also in the database.
Of course, it can be done already by using the meta field or adding it to the description, but I think we should add an extra entry for it.

Some licenses might come with a URL, so maybe we should add such an option as well.

audformat.utils.union() takes too long

This function is used in audinterface to combine segments detected by a Segment object, like VAD.
But it can take very long for a typical database, see audeering/audinterface#26

To verify this I created 100 MultiIndex indexes that each contain 10,000 entries (1,000 files with 10 start and end times each).
I then benchmarked how long it takes to join a certain number of those indexes.

import time

import numpy as np
import pandas as pd

import audformat


def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]


def measure(idx):
    start = time.time()
    i = audformat.utils.union(idx)
    end = time.time()
    return end - start


files = [f'a{n}' for n in range(1000)]
starts = [pd.to_timedelta(t, unit='s') for t in range(1000)]
ends = [pd.to_timedelta(n + 10 * np.random.rand()) for n in range(1000)]
# Split into lists of 10 entries each
starts = list(chunks(starts, 10))
ends = list(chunks(ends, 10))

# Create tuples
idx_tuples = []
for start, end in zip(starts, ends):
    idx_tuples.append([(f, s, e) for s, e in zip(start, end) for f in files])
idx = []
for idx_tuple in idx_tuples:
    idx.append(pd.MultiIndex.from_tuples(idx_tuple, names=['file', 'start', 'end']))

Then we get:

>>> measure(idx[:10])
9.416870594024658
>>> measure(idx[:20])
32.61396503448486
>>> measure(idx[:30])
71.54374980926514
>>> measure(idx[:40])
128.81993174552917
>>> measure(idx[:50])
203.0398144721985
>>> measure(idx[:60])
274.86833333969116

First I tested how fast an alternative method would be that doesn't handle the union aspect:

def measure(idx):
    start = time.time()
    idx = pd.concat([i.to_frame() for i in idx])
    i = pd.MultiIndex.from_frame(idx)
    end = time.time()
    return end - start

This returns:

>>>  measure(idx[:10])
0.03077840805053711
>>> measure(idx[:20])
0.08491826057434082
>>> measure(idx[:30])
0.11141705513000488
>>> measure(idx[:40])
0.13216853141784668
>>> measure(idx[:50])
0.1430981159210205
>>> measure(idx[:60])
0.15899324417114258

We clearly need to improve on this. Hopefully we can find a solution that also handles the union part nicely.
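
A sketch of a possible direction: concatenate the indices as frames and drop duplicated rows afterwards, instead of repeatedly calling Index.union() (whether this keeps all properties of the current implementation still needs to be checked):

import pandas as pd


def fast_union(objs):
    # Stack all indices as dataframes, remove duplicated rows,
    # and convert back to a MultiIndex.
    df = pd.concat([obj.to_frame(index=False) for obj in objs])
    df = df.drop_duplicates()
    return pd.MultiIndex.from_frame(df)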

NaT is not returned when indexing an index

To me the following looks like a pandas bug:

>>> idx = audformat.segmented_index(['a'], [0], [pd.NaT])
>>> idx
MultiIndex([('a', '0 days', NaT)],
           names=['file', 'start', 'end'])
>>> idx.get_level_values('end')
TimedeltaIndex([NaT], dtype='timedelta64[ns]', name='end', freq=None)
>>> idx[0]
('a', Timedelta('0 days 00:00:00'), nan)

I would have expected that idx[0] should return

('a', Timedelta('0 days 00:00:00'), Timedelta(NaT))

or if this is possible

('a', Timedelta('0 days 00:00:00'), NaT)

I don't think that this will have big implications, but it took me a while to fix a test that was failing due to the returned nan value.

It might be that pandas doesn't consider it a bug, as the following still works:

>>> pd.isnull(idx[0][2])
True

so it doesn't matter if it is NaT or nan.

But the problem is that we lose the type information of the object, as timedelta and datetime values return the same nan, and some other functions might then interpret it as a datetime index and fail.

Setting values including `nan` and strings to an audformat table with string labels needed to be passed as a list

I wanted to set values on an audformat table whose possible labels were the string labels below:

labels={
    "h": {'category': 'happy'},
    "n": {'category': 'neutral'},
    "b": {'category': 'bored'},
    "a": {'category': 'angry'},
},

The values from the dataframe contain [nan 'b' 'h' 'a' 'n']. I was getting the error below:

Traceback (most recent call last):
  File "create.py", line 409, in <module>
    main()
  File "create.py", line 376, in main
    db['segments']['supposed_emotion'].set(df_segments.supposed_emotion.values)
  File "/home/audeering.local/mmadadi/.local/lib/python3.6/site-packages/audformat/core/column.py", line 251, in set
    assert_values(values, scheme)
  File "/home/audeering.local/mmadadi/.local/lib/python3.6/site-packages/audformat/core/column.py", line 41, in assert_values
    values = np.unique(values)
  File "<__array_function__ internals>", line 6, in unique
  File "/home/audeering.local/mmadadi/.local/lib/python3.6/site-packages/numpy/lib/arraysetops.py", line 261, in unique
    ret = _unique1d(ar, return_index, return_inverse, return_counts)
  File "/home/audeering.local/mmadadi/.local/lib/python3.6/site-packages/numpy/lib/arraysetops.py", line 322, in _unique1d
    ar.sort()
TypeError: '<' not supported between instances of 'str' and 'float'

I was able to solve it by passing the values as a list:

db['segments']['supposed_emotion'].set(list(df_segments.supposed_emotion.values))

Make schemes safe against user changes

At the moment a user can change the dtype or labels of a scheme, but this will not automatically update the related tables.
So we should make those attributes properties that return a copy and provide setter functions. BTW, you can add a setter function to a Python property, so a user should be able to do something like schemes['my-scheme'].labels[0] = 'a' if we want to allow for it.
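
A rough sketch of the idea (not the actual audformat class):

class SchemeSketch:
    def __init__(self, labels=None):
        self._labels = list(labels or [])

    @property
    def labels(self):
        # Return a copy, so in-place changes cannot silently
        # bypass the update of related tables.
        return list(self._labels)

    @labels.setter
    def labels(self, labels):
        self._labels = list(labels)
        # ... here the related tables could be updated accordingly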

Add method to move files between tables

Let's say you would like to move a list of files from an existing train table to an existing test table.
The easiest solution I found so far would be:

db['tmp'] = db['train'].pick_files(files)
db['tmp'].split_id = db['test'].split_id
db['train'].drop_files(files, inplace=True)
db['test'].update(db['tmp'])
db.drop_tables('tmp')

so it might be easier to have something like:

db['test'].move_files(db['train'], files)

Of course it will only work if your columns match, but the same is true for update().

Loading can fail with missing timedelta unit

When loading a database that was stored before with audformat the following error can arise:

>>> db = audformat.Database.load('db')
...
~/git/audeering/audformat/audformat/core/table.py in <lambda>(x)
   1036         if self.type == define.IndexType.SEGMENTED:
   1037             converters[define.IndexField.START] = \
-> 1038                 lambda x: pd.to_timedelta(x)
   1039             converters[define.IndexField.END] = \
   1040                 lambda x: pd.to_timedelta(x)

~/.envs/test/lib/python3.6/site-packages/pandas/core/tools/timedeltas.py in to_timedelta(arg, unit, errors)
    120
    121     # ...so it must be a scalar value. Return scalar.
--> 122     return _coerce_scalar_to_timedelta_type(arg, unit=unit, errors=errors)
    123
    124

~/.envs/test/lib/python3.6/site-packages/pandas/core/tools/timedeltas.py in _coerce_scalar_to_timedelta_type(r, unit, errors)
    126     """Convert string 'r' to a timedelta object."""
    127     try:
--> 128         result = Timedelta(r, unit)
    129     except ValueError:
    130         if errors == "raise":

pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.Timedelta.__new__()

pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.parse_timedelta_string()

ValueError: no units specified

To further track this down I added unit='s' to both pd.to_timedelta() calls and repeated the command. Then we get:

~/git/audeering/audformat/audformat/core/table.py in <lambda>(x)
   1036         if self.type == define.IndexType.SEGMENTED:
   1037             converters[define.IndexField.START] = \
-> 1038                 lambda x: pd.to_timedelta(x, unit='s')
   1039             converters[define.IndexField.END] = \
   1040                 lambda x: pd.to_timedelta(x, unit='s')

~/.envs/test/lib/python3.6/site-packages/pandas/core/tools/timedeltas.py in to_timedelta(arg, unit, errors)
    117 
    118     if isinstance(arg, str) and unit is not None:
--> 119         raise ValueError("unit must not be specified if the input is/contains a str")
    120 
    121     # ...so it must be a scalar value. Return scalar.

ValueError: unit must not be specified if the input is/contains a str

So it seems that the start and end values contain strings or are read as strings.
If I look at the corresponding entries in the CSV files, I see only valid floats there.
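
Whatever the root cause turns out to be, a more forgiving converter could accept both plain numbers (interpreted as seconds) and strings with a unit. This is just a hypothetical sketch, not the actual fix:

import pandas as pd


def to_timedelta(value):
    # Plain numbers stored as strings have no unit -> treat as seconds;
    # everything else is passed to pandas unchanged.
    try:
        return pd.to_timedelta(float(value), unit='s')
    except ValueError:
        return pd.to_timedelta(value)


print(to_timedelta('1.5'))              # 0 days 00:00:01.500000
print(to_timedelta('0 days 00:00:01'))  # 0 days 00:00:01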

Input type for audformat.segmented_index()

In the documentation the following input types are allowed for starts and ends:

(screenshot of the documented parameter types omitted)

But it also works for integers:

>>> audformat.segmented_index(['a.wav', 'a.wav'], [0, 1], [1, 2])
MultiIndex([('a.wav',           '0 days 00:00:00', ...),
            ('a.wav', '0 days 00:00:00.000000001', ...)],
           names=['file', 'start', 'end'])

and floats as well (except that the decimal places are no longer considered):

>>> audformat.segmented_index(['a.wav', 'a.wav'], [0., 1.1], [1., 2.1])
MultiIndex([('a.wav',           '0 days 00:00:00', ...),
            ('a.wav', '0 days 00:00:00.000000001', ...)],
           names=['file', 'start', 'end'])

I would propose to allow int as an input type as well and to add a note to the documentation that integers are then handled as ns.
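
Until this is documented, passing explicit timedeltas avoids the surprise of numbers being interpreted as nanoseconds:

import pandas as pd

import audformat

idx = audformat.segmented_index(
    ['a.wav', 'a.wav'],
    pd.to_timedelta([0, 1], unit='s'),
    pd.to_timedelta([1, 2], unit='s'),
)
print(idx)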

Broken error message with get(map=...)

import audformat.testing


db = audformat.testing.create_db()
db['files']['label_map_str'].get(map='bad')
ValueError: Cannot map '{}' to 'label_map_str'. Expected one of ['prop1', 'prop2'].

assert_index() can be a bottleneck

See audeering/audb#149 (comment)

Possible solutions:

  • Removing it everywhere and leave it completely to the user to call assert_index().
  • Separate obj.has_duplicates from assert_index() (e.g. add assert_no_duplicates()) and call it only if really needed.
  • Add an argument to assert_index() (and functions that call it) to disable checking for duplicates (see the sketch below).
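
A sketch of the third option; the exact signature is an assumption:

import pandas as pd


def assert_index(index, *, check_duplicates=True):
    if not isinstance(index, pd.Index):
        raise ValueError('Not a pandas.Index.')
    # Only run the potentially expensive duplicate check when requested.
    if check_duplicates and index.has_duplicates:
        raise ValueError('Found duplicated index entries.')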

Problem when concatenating non-nullable dtypes

utils.concat() does not properly work for dtypes that are not nullable:

E.g.:

try:
    y1 = pd.Series(1, audformat.filewise_index('f1'))
    y2 = pd.Series(1, audformat.filewise_index('f2'))
    audformat.utils.concat([y1, y2])
except Exception as ex:
    print(ex)
Found overlapping data in column 'None':
      left  right
file             
f2       0      1

and:

try:
    y1 = pd.Series(True, audformat.filewise_index('f1'))
    y2 = pd.Series(True, audformat.filewise_index('f2'))
    audformat.utils.concat([y1, y2])
except Exception as ex:
    print(ex)
Found overlapping data in column 'None':
       left  right
file              
f2    False   True

The problem is that we first create an empty table in which we then insert the data. However, before we insert the data we check for an overlap, which is a problem if the empty table does not contain NA values, but e.g. 0 in case of int or False in case of bool. The function then detects an overlap and raises an error.

The issue can be avoided by switching to a nullable dtype. E.g. the following works:

y1 = pd.Series(1., audformat.filewise_index('f1'), dtype='Int64')
y2 = pd.Series(1., audformat.filewise_index('f2'), dtype='Int64')
audformat.utils.concat([y1, y2])

y1 = pd.Series(True, audformat.filewise_index('f1'), dtype='boolean')
y2 = pd.Series(True, audformat.filewise_index('f2'), dtype='boolean')
audformat.utils.concat([y1, y2])

Introduce Table.update()

Let's have a look at the following example:

db1 = audformat.testing.create_db(minimal=True)
db1.schemes['int'] = audformat.Scheme(int)
audformat.testing.add_table(db1, 'table', 'filewise')

db2 = audformat.testing.create_db(minimal=True)
db2.schemes['str'] = audformat.Scheme(str)
audformat.testing.add_table(db2, 'table', 'filewise')

audformat.utils.concat(
    [
        db1['table'].df,
        db2['table'].df
    ],
)
               int         str
file                          
audio/001.wav   47  NhDdzJX0YA
audio/002.wav   43  qdd5Onf25M
audio/003.wav   76  UIUHk0dGdi
audio/004.wav   97  WCXPkAxL5M
audio/005.wav   25  Aos20DyoUo

As we see, we have successfully combined tables from two different databases using audformat.utils.concat().

Hence, a user might be surprised to learn that adding the two tables with + raises an error:

(db1['table'] + db2['table']).get()
Bad column ID 'str', expected one of ['int']

The reason is that the new Table we create is actually assigned to db1:

table._df = df

But at the same time, we assign the second column to scheme str, which is only known by db2:

scheme_id=scheme_ids[column_id],

So currently we can only safely combine tables with + from the same database. To overcome this limitation I suggest to not assign the new table to db1 and set the scheme_id and rater_id of all columns to None. Then we can safely do:

(db1['table'] + db2['table']).get()
               int         str
file                          
audio/001.wav   49  vaRA2y1rj4
audio/002.wav   42  IhpThA81cj
audio/003.wav   51  2oRdTHlYNr
audio/004.wav   31  ePZfabVHvd
audio/005.wav    5  FiZYPYcQZI

The only downside is that if we do:

db['table1'] = db['table1'] + db['table2']  # removes scheme and rater from table1

we actually remove the scheme and rater information from table1.

Therefore I suggest to introduce a new function Table.update() for this use-case:

db['table1'].update(db['table2'])

The advantage here is that we don't have a detached table as an intermediate result so we can keep scheme and rater information, and we can even copy missing schemes/raters from the other table to db so that also the following works:

db1['table'].update(db2['table'])  # copy missing schemes and raters used in db2['table'] to db1

@hagenw please comment and if you agree I will prepare a MR

Iterating over db.tables can raise a RuntimeError

Try the following:

import audformat.testing

db = audformat.testing.create_db()
for table in db.tables:
    db[f'{table}.new'] = audformat.Table()

This results in:

RuntimeError                              Traceback (most recent call last)
<ipython-input-4-a72911a3944b> in <module>
----> 1 for table in db.tables:
      2     db[f'{table}.new'] = audformat.Table()
      3 

RuntimeError: OrderedDict mutated during iteration

A work around is to first get a list of tables:

import audformat.testing

db = audformat.testing.create_db()
tables = list(db.tables)
for table in tables:
    db[f'{table}.new'] = audformat.Table()

This works as expected.

As this is very hard to debug for a user, I think we should try to fix it to also work when directly iterating through db.tables.

I haven't tested it for other entries, but I guess it might be the same for db.schemes etc.

Setting labels to an extended table does not work

db = audformat.Database('test')
db.schemes['set'] = audformat.Scheme('str')
db.schemes['set'].labels = {}
db['sets'] = audformat.Table()
db['sets']['set'] = audformat.Column(scheme_id='set')
set_id = 'abc'
db.schemes['set'].labels[set_id] = 'extra info'
idx = audformat.filewise_index(['a', 'b', 'c'])
db['sets'].extend_index(idx, inplace=True)
db['sets']['set'].set(set_id, index=idx)
db['sets'].df

this returns

      set
file     
a     NaN
b     NaN
c     NaN

But the output should be

      set
file     
a     abc
b     abc
c     abc

Speed up caching of tables

As shown in audeering/audb#38 (comment), using uncompressed pickle files is much faster when storing large dataframes. As we don't care about the size of the cached tables, but only about the speed of loading them, we should change the behavior for caching tables as well.

I think we can even implement it in a backward-compatible way: loading a compressed pickle file without specifying the compression should fail, and we can use try-except to catch that and then load using compression='xz'.
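
A sketch of the backward-compatible loading described above (not the final implementation):

import pandas as pd


def read_cached_table(path):
    try:
        # New behavior: uncompressed pickle.
        return pd.read_pickle(path)
    except Exception:
        # Old cache files were written with xz compression.
        return pd.read_pickle(path, compression='xz')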

Error saving NaN for a boolean scheme

This is a minimal example with one table that has a column linked to boolean scheme:

import audformat


db = audformat.testing.create_db(minimal=True)
db.schemes['scheme'] = audformat.Scheme(audformat.define.DataType.BOOL)
db['t'] = audformat.Table(audformat.filewise_index(['f1']))
db['t']['c'] = audformat.Column(scheme_id='scheme')

By default, the labels in the column are initialized with NaN and we get the following expected output:

db['t'].get()
        c
file     
f1    NaN

However, saving and loading the database as CSV fails (using pickle works):

db.save('db')
audformat.Database.load('db')
ValueError: Bool column has NA values in column 1

There are two solutions:

  1. we do not support NaN for boolean schemes and initialize with False
  2. we try to find a fix for loading empty values to a boolean scheme from CSV (see the sketch below)
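
For the second solution, pandas' nullable 'boolean' dtype can represent missing values, unlike the plain numpy bool dtype, so the loading code would have to map boolean schemes to it (sketch):

import pandas as pd

y = pd.Series([True, None], dtype='boolean')
print(y)
# 0    True
# 1    <NA>
# dtype: boolean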

Add data from a Series with non-matching dtype

At the moment you can experience the following error if you try to add data from a pd.Series that has a dtype that does not match the one specified in the corresponding scheme of the column:

>>> db['answer']['rating'].set(df.rating)
...
TypeError: '<' not supported between instances of 'NoneType'/'float' and 'str'

The problem is that a user might not be able to figure out what is wrong by herself/himself.
I see two solutions:

  1. Add a custom error message that points to the workaround of using list() or setting the correct dtype (see the sketch below)
  2. Do some kind of conversion (e.g. using list()) internally and ignore the dtype setting of the pd.Series
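
A sketch of what the first solution could look like inside the failing check (wording and placement are assumptions):

import numpy as np


def assert_values(values, scheme):
    # Wrap the failing np.unique() call and point the user
    # to the likely cause and a workaround.
    try:
        values = np.unique(values)
    except TypeError as ex:
        raise TypeError(
            'Could not compare the given values against the scheme. '
            'This usually happens if the dtype of the values does not '
            'match the scheme, e.g. strings mixed with NaN. '
            'Converting the values with list() or setting a matching '
            'dtype may help.'
        ) from ex
    # ... continue with the existing checks against ``scheme``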

Allow for more index columns?

At the moment we have the problem that we don't cover some use cases with the available tables in audformat.
If the input is a comparison between two files, i.e. a file in the index combined with a second file in another column and a third column holding the result, we cannot handle it, as we do not allow for duplicate index entries.

One example would be a table used for verification experiments:

                                                   verification file  same speaker
file                                                                              
wav/id10270/x6uYqmx31kE/00001.wav  wav/id10300/ize_eiCFEg0/00003.wav         False
wav/id10270/x6uYqmx31kE/00001.wav  wav/id10270/GWXujl-xAVM/00017.wav          True
wav/id10270/x6uYqmx31kE/00001.wav  wav/id10273/0OCW1HUxZyg/00001.wav         False
wav/id10270/x6uYqmx31kE/00001.wav  wav/id10270/8jEAjG6SegY/00022.wav          True

Bug in Table.__add__

Bug 1

In #42 the following problem was discovered:

db = audformat.testing.create_db(minimal=True)
db.schemes['int'] = audformat.Scheme(int)
audformat.testing.add_table(
    db,
    'files',
    'filewise',
    num_files=[0, 1],
)
print(db['files'].get())
audformat.testing.add_table(
    db,
    'segments',
    'segmented',
    num_files=[1, 2],
)
db['segments'].df.drop(columns='int', inplace=True)
df = (db['files'] + db['segments']).get()
print(df.dropna())
               int
file              
audio/000.wav    9
audio/001.wav    8
                                                                   int
file          start                     end                           
audio/000.wav 0 days 00:00:00           NaT                          8
audio/001.wav 0 days 00:00:00.008223082 0 days 00:00:00.436778253    8
              0 days 00:00:00.741062295 0 days 00:00:02.177251004    8
              0 days 00:00:02.348365842 0 days 00:00:02.634602780    8
              0 days 00:00:02.778845133 0 days 00:00:03.194477961    8
audio/002.wav 0 days 00:00:04.135094273 0 days 00:00:04.583586552    9

Expected output is:

                                                                   int
file          start                     end                           
audio/000.wav 0 days 00:00:00           NaT                          9
audio/001.wav 0 days 00:00:00.008223082 0 days 00:00:00.436778253    8
              0 days 00:00:00.741062295 0 days 00:00:02.177251004    8
              0 days 00:00:02.348365842 0 days 00:00:02.634602780    8
              0 days 00:00:02.778845133 0 days 00:00:03.194477961    8
audio/002.wav 0 days 00:00:04.135094273 0 days 00:00:04.583586552    nan

It's probably related to the fact that db['segments'] is empty.

Bug 2

db = audformat.testing.create_db(minimal=True)
db.schemes['int'] = audformat.Scheme(int)
audformat.testing.add_table(
    db,
    'files',
    'filewise',
    num_files=[0, 1],
)
print(db['files'].get())

db.schemes['float'] = audformat.Scheme(float)
audformat.testing.add_table(
    db,
    'segments',
    'segmented',
    num_files=[1, 2],
    num_segments_per_file=2,
    columns='float',
)
print(db['segments'].get())
print((db['files'] + db['segments']).get())
               int
file              
audio/000.wav   56
audio/001.wav   80
                                                                      float
file          start                     end                                
audio/001.wav 0 days 00:00:00.707328684 0 days 00:00:01.122211418  0.932964
              0 days 00:00:02.551719975 0 days 00:00:03.259764254  0.891229
audio/002.wav 0 days 00:00:01.756307851 0 days 00:00:02.541227282  0.616572
              0 days 00:00:04.928252041 0 days 00:00:04.977779637  0.432051
                                                                    int     float
file          start                     end                                      
audio/000.wav 0 days 00:00:00           NaT                          80       NaN
audio/001.wav 0 days 00:00:00.707328684 0 days 00:00:01.122211418    80  0.932964
              0 days 00:00:02.551719975 0 days 00:00:03.259764254  <NA>  0.891229
audio/002.wav 0 days 00:00:01.756307851 0 days 00:00:02.541227282  <NA>  0.616572
              0 days 00:00:04.928252041 0 days 00:00:04.977779637    56  0.432051

Expected result is:

file          start                     end                                      
audio/000.wav 0 days 00:00:00           NaT                          56       NaN
audio/001.wav 0 days 00:00:00           NaT                          80       NaN
audio/001.wav 0 days 00:00:00.707328684 0 days 00:00:01.122211418    <NA>  0.932964
              0 days 00:00:02.551719975 0 days 00:00:03.259764254    <NA>  0.891229
audio/002.wav 0 days 00:00:01.756307851 0 days 00:00:02.541227282    <NA>  0.616572
              0 days 00:00:04.928252041 0 days 00:00:04.977779637    <NA>  0.432051
