audeering / audformat
Format to store media files and annotations
Home Page: https://audeering.github.io/audformat/
License: Other
Example:
import audformat.testing
db = audformat.testing.create_db(minimal=True)
db.raters['rater'] = audformat.Rater()
db.schemes['str'] = audformat.Scheme(str)
audformat.testing.add_table(db, 'table', 'segmented', columns={'str': ('str', 'rater')})
audformat.testing.create_audio_files(db, './database', file_duration='0.1s')
db.root
results in a different path, but it should return audeer.safe_path('./database') instead.
As discussed in #34 (comment) we should not write:
usage: define.Usage = define.Usage.UNRESTRICTED
because
type(audformat.define.Usage.COMMERCIAL) != audformat.define.Usage
This holds for the other define entries as well.
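Since the type check above shows the entries are plain strings, a minimal sketch of the corrected annotation (my assumption) would be:
usage: str = define.Usage.UNRESTRICTED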
Compare the output of:
>>> idx = audformat.segmented_index('a')
>>> idx
MultiIndex([('a', '0 days', NaT)],
names=['file', 'start', 'end'])
>>> idx.get_level_values('end')
TimedeltaIndex([NaT], dtype='timedelta64[ns]', name='end', freq=None)
with
>>> idx = audformat.utils.to_segmented_index(audformat.filewise_index('a'))
>>> idx
MultiIndex([('a', '0 days', 'NaT')],
names=['file', 'start', 'end'])
>>> idx.get_level_values('end')
DatetimeIndex(['NaT'], dtype='datetime64[ns]', name='end', freq=None)
This is very unfortunate, as it makes it much harder to work with the indices in other applications, e.g. it is not obvious how to calculate a duration when the type of end can differ.
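A small sketch of where this bites, using the two indices from above; subtracting the levels only yields durations when end is a TimedeltaIndex:
import audformat

idx = audformat.utils.to_segmented_index(audformat.filewise_index('a'))
starts = idx.get_level_values('start')  # TimedeltaIndex
ends = idx.get_level_values('end')      # DatetimeIndex here, TimedeltaIndex elsewhere
durations = ends - starts               # yields datetimes instead of timedeltas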
On a large Database object, using the methods pick_files(), drop_files() and map_files() may take quite some time to complete. We should consider adding a verbose argument to display a progress bar.
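A sketch of how such a verbose argument could look, assuming audeer.progress_bar() is used for the loop (method signature illustrative):
import audeer

def pick_files(self, files, verbose=False):
    # hypothetical signature; wrap the existing per-file work in a progress bar
    for file in audeer.progress_bar(files, desc='Pick files', disable=not verbose):
        ...  # existing per-file logic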
When converting from a filewise to a segmented index we usually call utils.to_segmented_index(), where we have the option to set the end of the segments to the file duration. This can be a bottleneck for large tables since the file duration has to be calculated for every file. However, usually we get our tables from a database that we load with audb, where the duration of every file is stored in the dependency table. So I wonder if we can find a way to benefit from this information to speed up the conversion to segmented tables.
A possible solution might be that audb attaches a table with file durations to the Database object it returns and we add a segmented option to Table.get() and Column.get(). If set to True, we return a segmented index where we can access the file duration directly from the attached duration table.
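A rough sketch of the idea, with the names on the audb side illustrative:
import audb
import pandas as pd

deps = audb.dependencies('emodb')
durations = {
    file: pd.to_timedelta(deps.duration(file), unit='s')
    for file in deps.media
}
# to_segmented_index() could then look up the end times here
# instead of reading every file from disk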
At the moment we require that the scheme of a table that is used to update an existing one has to match, but there are certain scenarios where it makes perfect sense that it is only included in the existing scheme, but does not match exactly.
E.g. consider the following scenario with speaker ID labels:
import audformat.testing
db = audformat.testing.create_db(minimal=True)
db.schemes['s'] = audformat.Scheme(str, labels=['1', '2'])
db['t'] = audformat.Table(audformat.filewise_index(['a', 'b']))
db['t']['s'] = audformat.Column(scheme_id='s')
db['t']['s'].set(['1', '2'])
db_new = audformat.testing.create_db(minimal=True)
db_new.schemes['s'] = audformat.Scheme(str, labels=['1'])
db_new['t'] = audformat.Table(audformat.filewise_index(['c']))
db_new['t']['s'] = audformat.Column(scheme_id='s')
db_new['t']['s'].set(['1'])
db.update(db_new)
this fails with
...
ValueError: Cannot update database, found different value for 'db.schemes['s']':
dtype: str
labels: ['1', '2']
!=
dtype: str
labels: ['1']
We have the nice addition feature in audformat:
import audb
db = audb.load('emodb', full_path=False)
(db['emotion'] + db['files']).get()
which results in
emotion @emotion duration speaker transcription
file
wav/03a01Fa.wav happiness 0.90 0 days 00:00:01.898250 3 a01
wav/03a01Nc.wav neutral 1.00 0 days 00:00:01.611250 3 a01
wav/03a01Wa.wav anger 0.95 0 days 00:00:01.877812 3 a01
wav/03a02Fc.wav happiness 0.85 0 days 00:00:02.006250 3 a02
wav/03a02Nc.wav neutral 1.00 0 days 00:00:01.439812 3 a02
... ... ... ... ... ...
wav/16b10Lb.wav boredom 1.00 0 days 00:00:03.442687 16 b10
wav/16b10Tb.wav sadness 0.90 0 days 00:00:03.500625 16 b10
wav/16b10Td.wav sadness 0.95 0 days 00:00:03.934187 16 b10
wav/16b10Wa.wav anger 1.00 0 days 00:00:02.414125 16 b10
wav/16b10Wb.wav anger 1.00 0 days 00:00:02.522499 16 b10
[535 rows x 5 columns]
But the following does not work:
sum([db[table] for table in db.tables]).get()
this results in
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-9-db0f88d6bede> in <module>
----> 1 sum([db[table] for table in db.tables]).get()
TypeError: unsupported operand type(s) for +: 'int' and 'Table'
I'm also not sure if it should work, but since __add__ works I thought sum() should as well?
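As a workaround, sum() starts from the integer 0, so one can either pass a start value or reduce over + instead:
import functools
import operator

tables = [db[table] for table in db.tables]
functools.reduce(operator.add, tables).get()
# or equivalently: sum(tables[1:], tables[0]).get()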
I want to add data to a Column with a string scheme from a pandas.DataFrame column that has NaN entries.
Code sample to reproduce error:
import pandas as pd
import audformat
db = audformat.Database(name='foo')
df = pd.DataFrame()
df['file'] = ['A', 'B', 'C']
df['bar'] = ['C', 'D', None]
df.set_index('file', inplace=True)
db.schemes['bar'] = audformat.Scheme(labels=['C', 'D'])
db.tables['bar'] = audformat.Table(index=df.index)
db.tables['bar']['bar'] = audformat.Column(scheme_id='bar')
db['bar']['bar'].set(df['bar'])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/audeering.local/atriant/envs/devaice/lib/python3.6/site-packages/audformat/core/column.py", line 251, in set
assert_values(values, scheme)
File "/home/audeering.local/atriant/envs/devaice/lib/python3.6/site-packages/audformat/core/column.py", line 41, in assert_values
values = np.unique(values)
File "<__array_function__ internals>", line 6, in unique
File "/home/audeering.local/atriant/envs/devaice/lib/python3.6/site-packages/numpy/lib/arraysetops.py", line 261, in unique
ret = _unique1d(ar, return_index, return_inverse, return_counts)
File "/home/audeering.local/atriant/envs/devaice/lib/python3.6/site-packages/numpy/lib/arraysetops.py", line 322, in _unique1d
ar.sort()
TypeError: '<' not supported between instances of 'NoneType' and 'str'
This used to work with audata. The offending line is this: https://github.com/audeering/audformat/blob/master/audformat/core/column.py#L41 as np.unique does not work on mixed values such as strings and None. See:
import numpy as np
np.unique(['A', 'B', None])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<__array_function__ internals>", line 6, in unique
File "/home/audeering.local/atriant/envs/devaice/lib/python3.6/site-packages/numpy/lib/arraysetops.py", line 261, in unique
ret = _unique1d(ar, return_index, return_inverse, return_counts)
File "/home/audeering.local/atriant/envs/devaice/lib/python3.6/site-packages/numpy/lib/arraysetops.py", line 322, in _unique1d
ar.sort()
TypeError: '<' not supported between instances of 'NoneType' and 'str'
If I pass in the data as db['bar']['bar'].set(df['bar'].dropna()) instead, then I get a different error because the indices do not match.
@frankenjoe @hagenw if this is confirmed on your side as unwanted behavior, I propose to change the offending line from np.unique(values) to set(values) to solve it:
>>> set(['A', 'B', None])
{None, 'A', 'B'}
At the moment a Table stores inside its object that it is connected to a database.
This leads to several problems when doing something like table1 + table2, which would return a table that is not connected to any database.
This might be ok as users just want to access the data. The downside of detached tables is that we will lose the scheme information of the underlying columns.
See also the discussion in #49 (comment)
If we changed the current behavior, it would break the API.
We allow for addition of tables, in which case I would expect that the order in which we list the tables does not matter, but this is not the case:
import audformat
files = ['file1.wav', 'file2.wav', 'file3.wav']
starts = [0, 1]
ends = [1, 2]
set1 = [True, False, False]
set2 = [False, True, False]
set3 = [False, False, True]
rating1 = [1, 0]
rating2 = [0, 1]
db = audformat.Database(name='test')
db.schemes['rating'] = audformat.Scheme(dtype='int')
db.schemes['sets'] = audformat.Scheme(dtype='str')
db.splits['train'] = audformat.Split(type='train')
db.splits['test'] = audformat.Split(type='test')
index = audformat.filewise_index(files)
db['files'] = audformat.Table(index)
db['files']['set1'] = audformat.Column(scheme_id='sets')
db['files']['set1'].set(set1)
db['files']['set2'] = audformat.Column(scheme_id='sets')
db['files']['set2'].set(set2)
db['files']['set3'] = audformat.Column(scheme_id='sets')
db['files']['set3'].set(set3)
index = audformat.segmented_index((files[0], files[0]), starts, ends)
db['rating.train'] = audformat.Table(index, split_id='train')
db['rating.train']['rating'] = audformat.Column(scheme_id='rating')
db['rating.train']['rating'].set(rating1)
index = audformat.segmented_index((files[1], files[1]), starts, ends)
db['rating.test'] = audformat.Table(index, split_id='test')
db['rating.test']['rating'] = audformat.Column(scheme_id='rating')
db['rating.test']['rating'].set(rating2)
Then do:
>>> (db['files'] + db['rating.train'] + db['rating.test']).get()
set1 set2 set3 rating
file start end
file1.wav 0 days 00:00:00 0 days 00:00:01 True False False 1
0 days 00:00:01 0 days 00:00:02 True False False 0
file2.wav 0 days 00:00:00 0 days 00:00:01 NaN NaN NaN 0
NaT False True False <NA>
0 days 00:00:01 0 days 00:00:02 NaN NaN NaN 1
file3.wav 0 days 00:00:00 NaT False False True <NA>
>>> (db['rating.train'] + db['rating.test'] + db['files']).get()
rating set1 set2 set3
file start end
file1.wav 0 days 00:00:00 0 days 00:00:01 1 True False False
0 days 00:00:01 0 days 00:00:02 0 True False False
file2.wav 0 days 00:00:00 0 days 00:00:01 0 False True False
0 days 00:00:01 0 days 00:00:02 1 False True False
file3.wav 0 days 00:00:00 NaT <NA> False False True
>>> (db['rating.train'] + db['files'] + db['rating.test']).get()
rating set1 set2 set3
file start end
file1.wav 0 days 00:00:00 0 days 00:00:01 1 True False False
0 days 00:00:01 0 days 00:00:02 0 True False False
file2.wav 0 days 00:00:00 0 days 00:00:01 0 NaN NaN NaN
NaT <NA> False True False
0 days 00:00:01 0 days 00:00:02 1 NaN NaN NaN
file3.wav 0 days 00:00:00 NaT <NA> False False True
I would say the result in the middle is what we would expect, because it allows for:
>>> df = (db['rating.train'] + db['rating.test'] + db['files']).get()
>>> df[df.set2 == True]['rating']
file start end
file2.wav 0 days 00:00:00 0 days 00:00:01 0
0 days 00:00:01 0 days 00:00:02 1
Name: rating, dtype: Int64
which is identical to
>>> (db['rating.train'] + db['rating.test']).get(index=db['files'].df[db['files'].df.set2 == True].index)
rating
file start end
file2.wav 0 days 00:00:00 0 days 00:00:01 0
0 days 00:00:01 0 days 00:00:02 1
But this does not work for the first and last examples, e.g.
>>> df = (db['rating.train'] + db['files'] + db['rating.test']).get()
>>> df[df.set2 == True]['rating']
file start end
file2.wav 0 days NaT <NA>
Name: rating, dtype: Int64
I'm not completely sure yet if this is the cause of the error we discussed in the chat on Friday, but I think we should tackle this one first.
First, create an example database with a filewise and a segmented table:
import audformat
files = ['file1.wav']
starts = [0, 1]
ends = [1, 2]
duration = [1]
rating = [1, 0]
db = audformat.Database(name='test')
db.schemes['rating'] = audformat.Scheme(dtype='int')
db.schemes['duration'] = audformat.Scheme(dtype='time')
index = audformat.filewise_index(files)
db['files'] = audformat.Table(index)
db['files']['duration'] = audformat.Column(scheme_id='duration')
index = audformat.segmented_index([files[0], files[0]], starts, ends)
db['rating'] = audformat.Table(index)
db['rating']['rating'] = audformat.Column(scheme_id='rating')
db['rating']['rating'].set(rating)
Then the following works nicely:
>>> db['rating'].get(index=db['files'].index)
rating
file start end
file1.wav 0 days 00:00:00 0 days 00:00:01 1
0 days 00:00:01 0 days 00:00:02 0
and
>>> db['rating'].get(index=db['files'].df.index)
rating
file start end
file1.wav 0 days 00:00:00 0 days 00:00:01 1
0 days 00:00:01 0 days 00:00:02 0
but not
>>> db['rating'].get(index=db['files'].df)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-30-d0e28b6c3201> in <module>
----> 1 db['rating'].get(index=db['files'].df)
~/.envs/audformat/lib/python3.6/site-packages/audformat/core/table.py in get(self, index, map, copy)
495 result = self._df.loc[index]
496 else:
--> 497 files = index.get_level_values(define.IndexField.FILE)
498 if self.is_filewise: # index is segmented
499 result = pd.DataFrame(
~/.envs/audformat/lib/python3.6/site-packages/pandas/core/generic.py in __getattr__(self, name)
5139 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5140 return self[name]
-> 5141 return object.__getattribute__(self, name)
5142
5143 def __setattr__(self, name: str, value) -> None:
AttributeError: 'DataFrame' object has no attribute 'get_level_values'
This is expected, as the documentation states that you have to provide an index, but I'm wondering if we should add an error message pointing in that direction as well?
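A minimal sketch of such a message, placed at the top of Table.get() (signature taken from the traceback above):
import pandas as pd

def get(self, index=None, map=None, copy=True):
    if isinstance(index, pd.DataFrame):
        raise ValueError(
            'index must be a pandas.Index, not a pandas.DataFrame. '
            'Did you mean to pass df.index instead?'
        )
    ...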
Reading CSV files with pandas.read_csv() is faster if you provide the data types of the columns, e.g. https://towardsdatascience.com/๏ธ-load-the-same-csv-file-10x-times-faster-and-with-10x-less-memory-๏ธ-e93b485086c7
At the moment we specify data types for all the data columns, but not for the index columns.
I'm also wondering if dtype = schemes[column.scheme_id].to_pandas_dtype() is sufficient to detect categorical data types instead of strings.
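A sketch of what providing index dtypes could look like when reading a filewise table (file name and column names illustrative):
import pandas as pd

df = pd.read_csv(
    'db.files.csv',
    index_col='file',
    dtype={
        'file': 'string',       # index column
        'speaker': 'category',  # a scheme with labels
    },
)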
For databases with lots of annotations it can happen that we have lots of CSV files whose total size is >5 GB.
Loading such a database can take a very long time (>30 minutes).
One solution would be to not load all the CSV files into the corresponding dataframes upfront, but only when they are requested by Column.get(), Table.get() or Table.df.
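A minimal sketch of such lazy loading, assuming the table remembers the path it was loaded from (names illustrative):
class Table:

    def __init__(self, path):
        self._path = path
        self._df = None  # nothing loaded yet

    @property
    def df(self):
        if self._df is None:  # read the CSV file on first access only
            self._df = self._load_csv(self._path)  # stands for the existing CSV reading code
        return self._df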
Let's assume you would like to store the exact date and time of a recording using datetime.datetime. Then it is not obvious from the current documentation what the data type of the scheme should be: TIME or DATE.
I would assume that TIME should cover duration values rather than datetime values, correct?
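If that assumption holds, usage would look like this (a sketch):
import audformat

db = audformat.Database(name='test')
db.schemes['duration'] = audformat.Scheme('time')  # durations as timedeltas
db.schemes['recorded'] = audformat.Scheme('date')  # points in time as datetimes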
/cc @agfcrespi
Our current naming of the sections in the documentation makes perfect sense for the navigation menu.
But it becomes less obvious if you navigate on a small screen where you don't see that menu and use the next/prev buttons, because then you will see a page called Introduction in the middle of the documentation and have no clue that you are in a new chapter.
This problem is not urgent, and I also don't have a good idea how to fix this.
Currently, this is not working:
db = audformat.testing.create_db(minimal=True)
db.schemes['int'] = audformat.Scheme(int)
audformat.testing.add_table(
db,
'files',
'filewise',
num_files=[0, 1],
)
print(db['files'].get())
audformat.testing.add_table(
db,
'segments',
'segmented',
num_files=[1, 2],
)
print(db['segments'].get())
df = db['files'].get(index=db['segments'].index)
print(df)
int
file
audio/000.wav 42
audio/001.wav 33
int
file start end
audio/001.wav 0 days 00:00:00.112525598 0 days 00:00:00.651033666 29
0 days 00:00:00.774044425 0 days 00:00:01.252506888 53
0 days 00:00:02.059782689 0 days 00:00:02.436929941 91
0 days 00:00:02.506858415 0 days 00:00:02.967737843 97
0 days 00:00:03.548951851 0 days 00:00:04.280189899 40
audio/002.wav 0 days 00:00:01.050809893 0 days 00:00:01.472755921 70
0 days 00:00:01.583978939 0 days 00:00:01.859228829 82
0 days 00:00:02.071727758 0 days 00:00:03.610085480 97
0 days 00:00:03.890518902 0 days 00:00:03.916600049 67
0 days 00:00:04.290822547 0 days 00:00:04.882961055 20
Traceback (most recent call last):
...
"Passing list-likes to .loc or [] with any missing labels "
KeyError: "Passing list-likes to .loc or [] with any missing labels is no longer supported. The following labels were missing: Index(['audio/002.wav', 'audio/002.wav', 'audio/002.wav', 'audio/002.wav',\n 'audio/002.wav'],\n dtype='object', name='file'). See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
As we can see, pandas is complaining that db['segments'].index has a reference to a file 'audio/002.wav' that is not in db['files'].
Question: should we add support for this case? And if so, how should we handle it? Just ignore those files?
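If we decide to ignore them, a user can already do that today by intersecting with the available files first:
idx = db['segments'].index
mask = idx.get_level_values('file').isin(db['files'].files)
df = db['files'].get(index=idx[mask])  # keeps only files known to db['files']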
cc @hagenw
As some methods like audformat.Table.drop_files() will fail if the database contains duplicated index entries, we should maybe forbid duplicates in the first place.
The question is if we can find a good place to do this. The first places that come to mind are Column.set() and maybe Database.save(). I guess this will still not be completely safe, as you can also assign the dataframe directly, but maybe that is then not a bug, but a feature for power users.
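A minimal sketch of the check itself:
def assert_no_duplicates(index):
    if index.has_duplicates:
        duplicates = index[index.duplicated()]
        raise ValueError(f'Found duplicated index entries:\n{duplicates}')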
I looked at the languages that are included in mozillacommonvoice 2.0.0 and saw German, fra. I was wondering how to provide the language value in order to avoid such inconsistencies, but looking at https://audeering.github.io/audformat/api.html#database it just states
It would be nice if we added a statement there about the format in which the languages should be provided.
In the example above I provided fra as fr.
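The docstring could point to audformat.utils.map_language(), which maps language names and two-letter codes to ISO 639-3:
>>> import audformat
>>> audformat.utils.map_language('fr')
'fra'
>>> audformat.utils.map_language('French')
'fra'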
I think we should add example conversion scripts to audformat for the following databases:
And maybe
The question is how to do this best. We could add a single repository where we collect those conversion scripts, e.g. audformat-examples. Or we could create one repository per database. I'm also not sure how to best combine it with the code for publishing databases with audb for those where we plan to do it, e.g. emodb.
We currently recommend storing the file duration:
https://audeering.github.io/audformat/data-conventions.html#file-duration-and-temporal-data
However, as discussed in audb we can get the file duration from the dependencies. So maybe we should remove file duration from the conventions.
The topic of computing the EWE has come up several times. As its computation is straightforward but not trivial, I would offer to add a utility function here to compute it, then a user can automatically compute it for their dataset and easily add it to a conversion script.
The approach I have in mind is to add two functions, one to compute annotator confidence, and the other to compute the EWE. Those will roughly look as follows:
import audbenchmark
import pandas as pd


class ComputeEWE:

    def __init__(self, confidences):
        self.confidences = confidences

    def __call__(self, row):
        # row[x] == row[x] is False for NaN, so this skips missing ratings
        raters = [x for x in self.confidences if row[x] == row[x]]
        # confidence-weighted average over the available raters
        total = sum([row[x] * self.confidences[x] for x in raters])
        total /= sum([self.confidences[x] for x in raters])
        return total


def compute_ewe(df, confidences):
    # only keep confidences of raters that appear as columns
    rater_names = list(set(confidences.keys()) & set(df.columns))
    valid_confidences = {}
    for key in rater_names:
        valid_confidences[key] = confidences[key]
    return pd.DataFrame(
        data=df.apply(ComputeEWE(valid_confidences), axis=1),
        index=df.index,
        columns=['EWE'],
    )


def rater_confidence(df, raters=None):
    if raters is None:
        raters = df.columns
    confidences = {}
    for rater in raters:
        # correlate each rater with the mean rating of all other raters
        df_rater = df[rater].dropna().astype(float)
        df_others = df.drop(rater, axis=1).mean(axis=1).dropna()
        indices = df_rater.index.intersection(df_others.index)
        confidences[rater] = audbenchmark.metric.pearson_cc(
            df_rater.loc[indices],
            df_others.loc[indices],
        )
    return confidences
@hagenw @frankenjoe what do you think about this?
Try:
import audformat
import audformat.testing
db = audformat.testing.create_db(minimal=True)
db.name = 'testing'
db.schemes['scheme'] = audformat.Scheme(
labels=['positive', 'neutral', 'negative']
)
audformat.testing.add_table(
db,
'emotion',
audformat.define.IndexType.SEGMENTED,
num_files=5,
columns={'emotion': ('scheme', None)}
)
db.schemes['speaker'] = audformat.Scheme(
labels=['adam', 'eve']
)
db['files'] = audformat.Table(db.files)
db['files']['speaker'] = audformat.Column(scheme_id='speaker')
db['files']['speaker'].set(
['adam', 'adam', 'eve', 'eve'],
index=audformat.filewise_index(db.files[:4]),
)
This results in:
>>> db.files
Index(['audio/001.wav', 'audio/002.wav', 'audio/003.wav', 'audio/004.wav',
'audio/005.wav'],
dtype='object', name='file')
Now, let's try to remove one entry:
>>> import os
>>> remove_file = os.path.join('audio', '001.wav')
>>> db['files'].drop_index(audformat.filewise_index(remove_file), inplace=True)
>>> db.files
Index(['audio/001.wav', 'audio/002.wav', 'audio/003.wav', 'audio/004.wav',
'audio/005.wav'],
dtype='object', name='file')
The file 'audio/001.wav' is still part of the index, but it shouldn't be.
Try:
>>> import audformat
>>> import copy
>>> db1 = audformat.testing.create_db(minimal=True)
>>> db2 = copy.deepcopy(db1)
>>> db1 == db2
False
But it should return True. It doesn't, because we have not implemented __eq__ for database objects.
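A minimal sketch of what __eq__ could look like, comparing header and table content (illustrative only):
class Database:

    def __eq__(self, other):
        if not isinstance(other, Database):
            return NotImplemented
        if str(self) != str(other):  # compare the header representation
            return False
        return all(
            self[table_id].df.equals(other[table_id].df)
            for table_id in self.tables
        )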
I have not created a minimal example yet, but when trying to add new data for mozillacommonvoice to the build dir I ran the following command:
db.update(db_new, copy_media=True)
Afterwards, all the new data should also be stored inside build. Instead it is stored under build~.
At the moment we get the following:
>>> import audb
>>> db = audb.load('emodb', version='1.1.0')
>>> db.description
'Berlin Database of Emotional Speech. A German database of emotional utterances spoken by actors recorded as a part of the DFG funded research project SE462/3-1 in 1997 and 1999. Recordings took place in the anechoic chamber of the Technical University Berlin, department of Technical Acoustics. It contains about 500 utterances from ten different actors expressing basic six emotions and neutral.'
which is not very nice to read.
It gets even worse if you have some real formatting in the description string.
For example, for audioset the description contains:
AudioSet ontology categories of the two top hierarchies:
Human sounds Animal Music
|-Human voice |-Domestic animals, pets |-Musical instrument
|-Whistling |-Livestock, farm |-Music genre
|-Respiratory sounds | animals, working |-Musical concepts
|-Human locomotion | animals |-Music role
|-Digestive \-Wild animals \-Music mood
|-Hands
|-Heart sounds, Sounds of things Natural sounds
| heartbeat |-Vehicle |-Wind
|-Otoacoustic emission |-Engine |-Thunderstorm
\-Human group actions |-Domestic sounds, |-Water
| home sounds \-Fire
Source-ambiguous sounds |-Bell
|-Generic impact sounds |-Alarm Channel, environment
|-Surface contact |-Mechanisms and background
|-Deformable shell |-Tools |-Acoustic environment
|-Onomatopoeia |-Explosion |-Noise
|-Silence |-Wood \-Sound reproduction
\-Other sourceless |-Glass
|-Liquid
|-Miscellaneous sources
\-Specific impact sounds
It would be nice if we could preserve this formatting when printing to screen.
We don't have a db.license entry at the moment. Before, it was handled inside the Gradle settings, but I think we should add an option to store it in the database as well.
Of course, this can already be done by using the meta field or adding it to the description, but I think we should add a dedicated entry for it.
Some licenses might come with a URL, so maybe we should add such an option as well.
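A sketch of how this could look (field names are a proposal, not existing API):
db = audformat.Database(
    name='emodb',
    license='CC0-1.0',
    license_url='https://creativecommons.org/publicdomain/zero/1.0/',
)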
This function is used in audinterface to combine segments detected by a Segment object, like a VAD.
But it can take a very long time for a typical database, see audeering/audinterface#26
To verify this I created 100 MultiIndex objects that each contain 10,000 entries (1,000 files with 10 start and end times each).
I then benchmarked how long it takes to join a certain number of those indices.
import time
import numpy as np
import pandas as pd
import audformat
def chunks(lst, n):
"""Yield successive n-sized chunks from lst."""
for i in range(0, len(lst), n):
yield lst[i:i + n]
def measure(idx):
start = time.time()
i = audformat.utils.union(idx)
end = time.time()
return end - start
files = [f'a{n}' for n in range(1000)]
starts = [pd.to_timedelta(t, unit='s') for t in range(1000)]
ends = [pd.to_timedelta(n + 10 * np.random.rand()) for n in range(1000)]
# Split into lists of 10 entries each
starts = list(chunks(starts, 10))
ends = list(chunks(ends, 10))
# Create tuples
idx_tuples = []
for start, end in zip(starts, ends):
idx_tuples.append([(f, s, e) for s, e in zip(start, end) for f in files])
idx = []
for idx_tuple in idx_tuples:
idx.append(pd.MultiIndex.from_tuples(idx_tuple, names=['file', 'start', 'end']))
Then we get:
>>> measure(idx[:10])
9.416870594024658
>>> measure(idx[:20])
32.61396503448486
>>> measure(idx[:30])
71.54374980926514
>>> measure(idx[:40])
128.81993174552917
>>> measure(idx[:50])
203.0398144721985
>>> measure(idx[:60])
274.86833333969116
First, I tested how fast an alternative method works that does not handle the union aspect:
def measure(idx):
start = time.time()
idx = pd.concat([i.to_frame() for i in idx])
i = pd.MultiIndex.from_frame(idx)
end = time.time()
return end - start
This returns:
>>> measure(idx[:10])
0.03077840805053711
>>> measure(idx[:20])
0.08491826057434082
>>> measure(idx[:30])
0.11141705513000488
>>> measure(idx[:40])
0.13216853141784668
>>> measure(idx[:50])
0.1430981159210205
>>> measure(idx[:60])
0.15899324417114258
We clearly need to improve on this. Hopefully we can find a solution that also handles the union part nicely.
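A possible direction that keeps the union semantics by dropping duplicates after the fast concatenation (a sketch, not benchmarked):
import pandas as pd

def fast_union(objs):
    df = pd.concat([obj.to_frame(index=False) for obj in objs])
    df = df.drop_duplicates()  # restores the union semantics
    # note: unlike Index.union() the result is not sorted
    return pd.MultiIndex.from_frame(df)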
To me the following looks like a pandas bug:
>>> idx = audformat.segmented_index(['a'], [0], [pd.NaT])
>>> idx
MultiIndex([('a', '0 days', NaT)],
names=['file', 'start', 'end'])
>>> idx.get_level_values('end')
TimedeltaIndex([NaT], dtype='timedelta64[ns]', name='end', freq=None)
>>> idx[0]
('a', Timedelta('0 days 00:00:00'), nan)
I would have expected that idx[0] returns
('a', Timedelta('0 days 00:00:00'), Timedelta(NaT))
or, if this is possible,
('a', Timedelta('0 days 00:00:00'), NaT)
I don't think that this will have big implications, but it took me a while to fix a test that was failing due to the returned nan value.
It might be that pandas doesn't consider it a bug, as the following still works:
>>> pd.isnull(idx[0][2])
True
so it doesn't matter if it is NaT or nan.
But the problem is that we lose the type information of the object: timedelta and datetime return the same value, so other functions might interpret it as a datetime index and fail.
I wanted to set values on an audformat table whose possible labels were string labels, as below:
labels={
"h": {'category': 'happy'},
"n": {'category': 'neutral'},
"b": {'category': 'bored'},
"a": {'category': 'angry'},
},
The values from the dataframe contain [nan 'b' 'h' 'a' 'n']. I was getting the error below:
Traceback (most recent call last):
File "create.py", line 409, in <module>
main()
File "create.py", line 376, in main
db['segments']['supposed_emotion'].set(df_segments.supposed_emotion.values)
File "/home/audeering.local/mmadadi/.local/lib/python3.6/site-packages/audformat/core/column.py", line 251, in set
assert_values(values, scheme)
File "/home/audeering.local/mmadadi/.local/lib/python3.6/site-packages/audformat/core/column.py", line 41, in assert_values
values = np.unique(values)
File "<__array_function__ internals>", line 6, in unique
File "/home/audeering.local/mmadadi/.local/lib/python3.6/site-packages/numpy/lib/arraysetops.py", line 261, in unique
ret = _unique1d(ar, return_index, return_inverse, return_counts)
File "/home/audeering.local/mmadadi/.local/lib/python3.6/site-packages/numpy/lib/arraysetops.py", line 322, in _unique1d
ar.sort()
TypeError: '<' not supported between instances of 'str' and 'float'
I was able to solve it by passing the values as a list:
db['segments']['supposed_emotion'].set(list(df_segments.supposed_emotion.values))
At the moment a user can change the dtype or labels of a scheme, but this will not automatically update the related tables.
So we should make those attributes properties that return a copy and provide setter functions. BTW, you can add a setter function to a Python property, so a user should be able to do something like schemes['my-scheme'].labels[0] = 'a' if we want to allow for it.
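A sketch of the property approach (helper names illustrative):
import copy

class Scheme:

    @property
    def labels(self):
        return copy.deepcopy(self._labels)  # mutating the copy has no side effect

    @labels.setter
    def labels(self, labels):
        self._labels = labels
        self._update_tables()  # illustrative: re-apply the dtype to all linked columns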
Let's say you would like to move a list of files from an existing train table to an existing test table.
The easiest solution I found so far would be:
db['tmp'] = db['train'].pick_files(files)
db['tmp'].split_id = db['test'].split_id
db['train'].drop_files(files, inplace=True)
db['test'].update(db['tmp'])
db.drop_tables('tmp')
so it might be easier to have something like:
db['test'].move_files(db['train'], files)
Of course it will only work if your columns match, but the same is true for update().
In the documentation we provide a couple of nice usage examples. To make it easier for a user to find those, we could directly link them in the docstrings of the corresponding methods. E.g. in Column.get() and Table.get() we should add a link to https://audeering.github.io/audformat/map-scheme.html
When loading a database that was stored before with audformat, the following error can arise:
>>> db = audformat.Database.load('db')
...
~/git/audeering/audformat/audformat/core/table.py in <lambda>(x)
1036 if self.type == define.IndexType.SEGMENTED:
1037 converters[define.IndexField.START] = \
-> 1038 lambda x: pd.to_timedelta(x)
1039 converters[define.IndexField.END] = \
1040 lambda x: pd.to_timedelta(x)
~/.envs/test/lib/python3.6/site-packages/pandas/core/tools/timedeltas.py in to_timedelta(arg, unit, errors)
120
121 # ...so it must be a scalar value. Return scalar.
--> 122 return _coerce_scalar_to_timedelta_type(arg, unit=unit, errors=errors)
123
124
~/.envs/test/lib/python3.6/site-packages/pandas/core/tools/timedeltas.py in _coerce_scalar_to_timedelta_type(r, unit, errors)
126 """Convert string 'r' to a timedelta object."""
127 try:
--> 128 result = Timedelta(r, unit)
129 except ValueError:
130 if errors == "raise":
pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.Timedelta.__new__()
pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.parse_timedelta_string()
ValueError: no units specified
To further track this down I added unit='s' to both pd.to_timedelta() calls and repeated the command. Then we get:
~/git/audeering/audformat/audformat/core/table.py in <lambda>(x)
1036 if self.type == define.IndexType.SEGMENTED:
1037 converters[define.IndexField.START] = \
-> 1038 lambda x: pd.to_timedelta(x, unit='s')
1039 converters[define.IndexField.END] = \
1040 lambda x: pd.to_timedelta(x, unit='s')
~/.envs/test/lib/python3.6/site-packages/pandas/core/tools/timedeltas.py in to_timedelta(arg, unit, errors)
117
118 if isinstance(arg, str) and unit is not None:
--> 119 raise ValueError("unit must not be specified if the input is/contains a str")
120
121 # ...so it must be a scalar value. Return scalar.
ValueError: unit must not be specified if the input is/contains a str
So it seems that the start and end values contain strings or are read as strings.
If I look at the corresponding entries in the CSV files, I see only valid floats there.
In the documentation the following input types are allowed for starts and ends:
But it also works for integers:
>>> audformat.segmented_index(['a.wav', 'a.wav'], [0, 1], [1, 2])
MultiIndex([('a.wav', '0 days 00:00:00', ...),
('a.wav', '0 days 00:00:00.000000001', ...)],
names=['file', 'start', 'end'])
and floats as well (except that the digits after the decimal point are no longer considered):
>>> audformat.segmented_index(['a.wav', 'a.wav'], [0., 1.1], [1., 2.1])
MultiIndex([('a.wav', '0 days 00:00:00', ...),
('a.wav', '0 days 00:00:00.000000001', ...)],
names=['file', 'start', 'end'])
I would propose that we add int as an allowed type as well and add a note to the documentation that such values are then handled as ns.
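In the meantime, seconds can be passed explicitly by converting first, e.g.:
import pandas as pd
import audformat

idx = audformat.segmented_index(
    ['a.wav', 'a.wav'],
    pd.to_timedelta([0, 1.1], unit='s'),  # 0 s and 1.1 s
    pd.to_timedelta([1, 2.1], unit='s'),  # 1 s and 2.1 s
)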
import audformat.testing
db = audformat.testing.create_db()
db['files']['label_map_str'].get(map='bad')
ValueError: Cannot map '{}' to 'label_map_str'. Expected one of ['prop1', 'prop2'].
See audeering/audb#149 (comment)
Possible solutions:
- Remove the obj.has_duplicates check from assert_index() (e.g. add assert_no_duplicates()) and call it only if really needed.
- Add an argument to assert_index() (and functions that call it) to disable checking for duplicates.
utils.concat() does not properly work for dtypes that are not nullable:
E.g.:
try:
y1 = pd.Series(1, audformat.filewise_index('f1'))
y2 = pd.Series(1, audformat.filewise_index('f2'))
audformat.utils.concat([y1, y2])
except Exception as ex:
print(ex)
Found overlapping data in column 'None':
left right
file
f2 0 1
and:
try:
y1 = pd.Series(True, audformat.filewise_index('f1'))
y2 = pd.Series(True, audformat.filewise_index('f2'))
audformat.utils.concat([y1, y2])
except Exception as ex:
print(ex)
Found overlapping data in column 'None':
left right
file
f2 False True
The problem is that we first create an empty table into which we then insert the data. However, before we insert the data we check for an overlap, which is a problem if the empty table does not contain NA values but e.g. 0 in the case of int or False in the case of bool. The function then detects an overlap and raises an error.
The issue can be avoided by switching to a nullable dtype. E.g. the following works:
y1 = pd.Series(1., audformat.filewise_index('f1'), dtype='Int64')
y2 = pd.Series(1., audformat.filewise_index('f2'), dtype='Int64')
audformat.utils.concat([y1, y2])
y1 = pd.Series(True, audformat.filewise_index('f1'), dtype='boolean')
y2 = pd.Series(True, audformat.filewise_index('f2'), dtype='boolean')
audformat.utils.concat([y1, y2])
Let's have a look at the following example:
db1 = audformat.testing.create_db(minimal=True)
db1.schemes['int'] = audformat.Scheme(int)
audformat.testing.add_table(db1, 'table', 'filewise')
db2 = audformat.testing.create_db(minimal=True)
db2.schemes['str'] = audformat.Scheme(str)
audformat.testing.add_table(db2, 'table', 'filewise')
audformat.utils.concat(
[
db1['table'].df,
db2['table'].df
],
)
int str
file
audio/001.wav 47 NhDdzJX0YA
audio/002.wav 43 qdd5Onf25M
audio/003.wav 76 UIUHk0dGdi
audio/004.wav 97 WCXPkAxL5M
audio/005.wav 25 Aos20DyoUo
As we can see, we have successfully combined tables from two different databases using audformat.utils.concat().
Hence, a user might be surprised to learn that adding the two tables with + raises an error:
(db1['table'] + db2['table']).get()
Bad column ID 'str', expected one of ['int']
The reason is that the new Table we create is actually assigned to db1:
audformat/audformat/core/table.py
Line 795 in c045b92
But at the same time, we assign the second column to scheme str, which is only known by db2:
audformat/audformat/core/table.py
Line 790 in c045b92
So currently we can only safely combine tables with + from the same database. To overcome this limitation I suggest to not assign the new table to db1 and to set the scheme_id and rater_id of all columns to None. Then we can safely do:
(db1['table'] + db2['table']).get()
int str
file
audio/001.wav 49 vaRA2y1rj4
audio/002.wav 42 IhpThA81cj
audio/003.wav 51 2oRdTHlYNr
audio/004.wav 31 ePZfabVHvd
audio/005.wav 5 FiZYPYcQZI
The only downside is that if we do:
db['table1'] = db['table1'] + db['table2']  # removes scheme and rater from table1
we actually remove the scheme and rater information from table1.
Therefore I suggest introducing a new function Table.update() for this use case:
db['table1'].update(db['table2'])
The advantage here is that we don't have a detached table as an intermediate result, so we can keep scheme and rater information, and we can even copy missing schemes/raters from the other table to db so that the following also works:
@hagenw please comment and if you agree I will prepare a MR
Try the following:
import audformat.testing
db = audformat.testing.create_db()
for table in db.tables:
db[f'{table}.new'] = audformat.Table()
This results in:
RuntimeError Traceback (most recent call last)
<ipython-input-4-a72911a3944b> in <module>
----> 1 for table in db.tables:
2 db[f'{table}.new'] = audformat.Table()
3
RuntimeError: OrderedDict mutated during iteration
A work around is to first get a list of tables:
import audformat.testing
db = audformat.testing.create_db()
tables = list(db.tables)
for table in tables:
db[f'{table}.new'] = audformat.Table()
This works as expected.
As this is very hard to debug for a user, I think we should try to fix it so that directly iterating through db.tables works as well.
I haven't tested it for other entries, but I guess it might be the same for db.schemes etc.
db = audformat.Database('test')
db.schemes['set'] = audformat.Scheme('str')
db.schemes['set'].labels = {}
db['sets'] = audformat.Table()
db['sets']['set'] = audformat.Column(scheme_id='set')
set_id = 'abc'
db.schemes['set'].labels[set_id] = 'extra info'
idx = audformat.filewise_index(['a', 'b', 'c'])
db['sets'].extend_index(idx, inplace=True)
db['sets']['set'].set(set_id, index=idx)
db['sets'].df
this returns
set
file
a NaN
b NaN
c NaN
But the output should be
set
file
a abc
b abc
c abc
As shown in audeering/audb#38 (comment), using uncompressed pickle files is much faster when storing large dataframes. As we don't care about the size of the cached tables, but only about the speed of loading them, we should change the behavior for caching tables as well.
I think we can even implement it in a backward compatible way: loading a compressed pickle file without specifying the compression should fail, so we can use try-except to catch that and then load using compression='xz'.
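A sketch of that loading logic with pandas:
import pandas as pd

try:
    df = pd.read_pickle(path)  # new cache files: uncompressed
except Exception:
    df = pd.read_pickle(path, compression='xz')  # old cache files: compressed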
I want to add zh-tw (Taiwanese Chinese) to mozillacommonvoice, but there is no ISO 639-3 language code for it, compare https://en.wikipedia.org/wiki/Taiwanese_Mandarin
At the moment we can only add languages that are in ISO 639-3, as we call audformat.utils.map_language() on them, which fails if the language code is not available.
This is a minimal example with one table that has a column linked to a boolean scheme:
import audformat
db = audformat.testing.create_db(minimal=True)
db.schemes['scheme'] = audformat.Scheme(audformat.define.DataType.BOOL)
db['t'] = audformat.Table(audformat.filewise_index(['f1']))
db['t']['c'] = audformat.Column(scheme_id='scheme')
By default, the values in the column are initialized with NaN
and we get the following expected output:
db['t'].get()
c
file
f1 NaN
However, saving and loading the database as CSV fails (using pickle works):
db.save('db')
audformat.Database.load('db')
ValueError: Bool column has NA values in column 1
There are two solutions:
- forbid NaN for boolean schemes and initialize with False
- switch to pandas' nullable boolean dtype so NA values can be stored
I created a dummy pull request to check the current master: #92
And it seems that something had broken our tests: https://github.com/audeering/audformat/pull/92/checks?check_run_id=3162446359
At the moment you can experience the following error if you try to add data from a pd.Series that has a dtype that does not match the one specified in the corresponding scheme of the column:
>>> db['answer']['rating'].set(df.rating)
...
TypeError: '<' not supported between instances of 'NoneType'/'float' and 'str'
The problem is that a user might not be able to figure out what is wrong by herself/himself.
I see two solutions:
- document that the user has to convert the values with list() or set the correct dtype
- convert the values (e.g. with list()) internally and ignore the dtype setting of the pd.Series
At the moment we have the problem that we don't cover some use cases with our available tables in audformat.
If the input is a comparison between two files, or a file combined with another column, plus a third column with the result, we cannot handle it, as we do not allow for duplicated index entries.
One example would be a table used for verification experiments:
verification file same speaker
file
wav/id10270/x6uYqmx31kE/00001.wav wav/id10300/ize_eiCFEg0/00003.wav False
wav/id10270/x6uYqmx31kE/00001.wav wav/id10270/GWXujl-xAVM/00017.wav True
wav/id10270/x6uYqmx31kE/00001.wav wav/id10273/0OCW1HUxZyg/00001.wav False
wav/id10270/x6uYqmx31kE/00001.wav wav/id10270/8jEAjG6SegY/00022.wav True
In #42 the following problem was discovered:
db = audformat.testing.create_db(minimal=True)
db.schemes['int'] = audformat.Scheme(int)
audformat.testing.add_table(
db,
'files',
'filewise',
num_files=[0, 1],
)
print(db['files'].get())
audformat.testing.add_table(
db,
'segments',
'segmented',
num_files=[1, 2],
)
db['segments'].df.drop(columns='int', inplace=True)
df = (db['files'] + db['segments']).get()
print(df.dropna())
int
file
audio/000.wav 9
audio/001.wav 8
int
file start end
audio/000.wav 0 days 00:00:00 NaT 8
audio/001.wav 0 days 00:00:00.008223082 0 days 00:00:00.436778253 8
0 days 00:00:00.741062295 0 days 00:00:02.177251004 8
0 days 00:00:02.348365842 0 days 00:00:02.634602780 8
0 days 00:00:02.778845133 0 days 00:00:03.194477961 8
audio/002.wav 0 days 00:00:04.135094273 0 days 00:00:04.583586552 9
Expected output is:
int
file start end
audio/000.wav 0 days 00:00:00 NaT 9
audio/001.wav 0 days 00:00:00.008223082 0 days 00:00:00.436778253 8
0 days 00:00:00.741062295 0 days 00:00:02.177251004 8
0 days 00:00:02.348365842 0 days 00:00:02.634602780 8
0 days 00:00:02.778845133 0 days 00:00:03.194477961 8
audio/002.wav 0 days 00:00:04.135094273 0 days 00:00:04.583586552 nan
It's probably related to the fact that db['segments'] is empty.
db = audformat.testing.create_db(minimal=True)
db.schemes['int'] = audformat.Scheme(int)
audformat.testing.add_table(
db,
'files',
'filewise',
num_files=[0, 1],
)
print(db['files'].get())
db.schemes['float'] = audformat.Scheme(float)
audformat.testing.add_table(
db,
'segments',
'segmented',
num_files=[1, 2],
num_segments_per_file=2,
columns='float',
)
print(db['segments'].get())
print((db['files'] + db['segments']).get())
int
file
audio/000.wav 56
audio/001.wav 80
float
file start end
audio/001.wav 0 days 00:00:00.707328684 0 days 00:00:01.122211418 0.932964
0 days 00:00:02.551719975 0 days 00:00:03.259764254 0.891229
audio/002.wav 0 days 00:00:01.756307851 0 days 00:00:02.541227282 0.616572
0 days 00:00:04.928252041 0 days 00:00:04.977779637 0.432051
int float
file start end
audio/000.wav 0 days 00:00:00 NaT 80 NaN
audio/001.wav 0 days 00:00:00.707328684 0 days 00:00:01.122211418 80 0.932964
0 days 00:00:02.551719975 0 days 00:00:03.259764254 <NA> 0.891229
audio/002.wav 0 days 00:00:01.756307851 0 days 00:00:02.541227282 <NA> 0.616572
0 days 00:00:04.928252041 0 days 00:00:04.977779637 56 0.432051
Expected result is:
file start end
audio/000.wav 0 days 00:00:00 NaT 56 NaN
audio/001.wav 0 days 00:00:00 NaT 80 NaN
audio/001.wav 0 days 00:00:00.707328684 0 days 00:00:01.122211418 <NA> 0.932964
0 days 00:00:02.551719975 0 days 00:00:03.259764254 <NA> 0.891229
audio/002.wav 0 days 00:00:01.756307851 0 days 00:00:02.541227282 <NA> 0.616572
0 days 00:00:04.928252041 0 days 00:00:04.977779637 <NA> 0.432051
At the moment the 40 MB source file of emodb has to be downloaded for every documentation test.
There is no need for this and we should enable caching of that file.