audeering / audb

Manage audio and video databases

Home Page: https://audeering.github.io/audb/

License: Other

Python 99.87% Shell 0.13%

annotation audio data mlops

audb's Issues

Split storage of metadata and media in cache?

At the moment we create a single folder for each flavor and store media, tables, header and the dependency files in it.
In principle this is not needed, as only the media files change between flavors (and, in the current implementation, the db.meta['audb'] entry in the header).

So splitting the storage might make it easier to solve #10 and #11.

Support for flavors on Windows broken

audresample does not work on Windows at the moment and needs to be fixed in order to support mixing or resampling on Windows, see audeering/audresample#5

DOC: Add example how to get duration from dependencies

Since we store the file duration in the dependencies, we don't have to store it again in the tables of a database. However, we should add a usage example showing how to get the duration from there. Currently we only show how to get the total duration:

https://audeering.github.io/audb/load.html#metadata-and-header-only

Add media and tables argument to some audb.info functions

For all audb.info functions that use the dependency file under the hood, it would be great to add a media and tables option as we have in audb.load() in order to filter for only parts of the database. The following functions would be affected:

audb.info.bit_depths()
audb.info.channels()
audb.info.duration()
audb.info.formats()
audb.info.sampling_rates()

Mirror database to another repository

Currently, it's not possible to publish a database on two different repositories with the same version. This prevents ending up with different databases published under the same version. However, we might want to mirror a database to another repository. I propose to implement audb.mirror() for this use case.

Missing link in documentation

It seems a link is missing after "You can find the example code at ..."

Copy other version of files inside the cache

Say you have two versions of a database 1.0.0 and 1.0.1 and the second just changes something in the header.
If I load the second version, it will again download the audio data. Instead we could copy them from the cache of the first version.
This would be especially meaningful for databases that are growing over time.

Add a check that the path to the data is not absolute

At the moment it seems to be possible to publish a database that contains absolute paths in the tables to the data without raising an error during publication (see https://gitlab.audeering.com/data/myai/-/issues/12).

This is a little bit unfortunate and we should see if we can add a check that only relative paths are added.
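Such a check could look like the following minimal sketch. It only uses the standard library; the function names `find_absolute_paths()` and `assert_relative_paths()` are made up for illustration and are not part of audb. A path is treated as absolute if either POSIX or Windows rules consider it absolute.

```python
from pathlib import PurePosixPath, PureWindowsPath


def find_absolute_paths(files):
    """Return all entries of ``files`` that are absolute paths.

    A path counts as absolute if either POSIX or Windows rules say so,
    so that '/data/a.wav' and 'C:/data/a.wav' are both caught.
    """
    return [
        file for file in files
        if PurePosixPath(file).is_absolute()
        or PureWindowsPath(file).is_absolute()
    ]


def assert_relative_paths(files):
    """Raise ``ValueError`` if any path in ``files`` is absolute."""
    absolute = find_absolute_paths(files)
    if absolute:
        raise ValueError(
            f"Found {len(absolute)} absolute path(s), e.g. '{absolute[0]}'. "
            "Only relative paths are allowed."
        )
```

Calling `assert_relative_paths()` on the file index of every table before the upload starts would reject such a database early.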

audb.info.duration() can fail for large databases

Try:

>>> audb.info.duration('audioset')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-141-86306742f7d6> in <module>
----> 1 audb.info.duration('audioset')

~/git/audeering/audb/audb/core/info.py in duration(name, version)
    118     deps = dependencies(name, version=version)
    119     return pd.to_timedelta(
--> 120         sum([deps.duration(file) for file in deps.media]),
    121         unit='s',
    122     )

TypeError: unsupported operand type(s) for +: 'float' and 'NoneType'

But if I inspect the entries in the dependency dataframe there are no missing values:

>>> deps = audb.dependencies('audioset')
>>> df = deps()
>>> df['duration'].isnull().sum()
0
>>> df['duration'].sum()
19879979.737141866

So maybe instead of doing

sum([deps.duration(file) for file in deps.media]),

we should just do

deps()['duration'].sum()
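The error message suggests that at least one per-file lookup returned None, even though the dataframe itself has no missing values. The following sketch reproduces the failure with plain Python values (stand-ins, not taken from audioset) and shows a sum that tolerates None. The helper `total_duration()` is hypothetical.

```python
def total_duration(durations):
    """Sum durations in seconds, skipping entries that are ``None``."""
    return sum(d for d in durations if d is not None)


# Stand-in values; in audb they would come from the dependency table
durations = [3.84, None, 4.608]

try:
    sum(durations)  # mirrors the failing call in audb.info.duration()
except TypeError as error:
    message = str(error)

total = total_duration(durations)
```

Skipping None silently hides whatever caused the lookup to fail, so summing the dataframe column as proposed above is probably the cleaner fix.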

Allow to request media files without tables

Currently, audb.load seems to contact the repository server to download metadata even when the requested database/media are already present locally. It would be nice if there was a way to disable this behaviour, mainly to avoid the multi-second delay of fetching this metadata.

Conan provides something similar for its install command. Only if the opt-in flag --update is set, it will always check the remote for newer versions. If the flag is not set and the requested artifacts can be provided by the local cache, the remote is not contacted at all and the command runs very quickly. Something similar (but maybe opt-out instead of opt-in) could work for audb.
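The Conan-like behaviour can be sketched as follows. This is not audb code: `load_cached()`, the dictionary used as cache, and the `fetch_remote` callable are assumptions made for illustration. The remote is only contacted if the entry is missing or if an update is requested explicitly.

```python
def load_cached(key, cache, fetch_remote, update=False):
    """Return the entry for ``key``, contacting the remote only if needed.

    ``cache`` is a dict standing in for the local cache folder,
    ``fetch_remote`` is a hypothetical callable that downloads an entry.
    """
    if update or key not in cache:
        cache[key] = fetch_remote(key)
    return cache[key]
```

With `update=False` as default this corresponds to Conan's opt-in flag; an opt-out variant would simply flip the default.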

Using wrong arguments does not raise an error

If you do

audb.load('emodb', metadata_only=True)

instead of

audb.load('emodb', only_metadata=True)

it will not raise an error, but simply ignore the wrong argument.
This happens due to our backwards compatibility handling code.

It's not a big deal, but it is unfortunate as in this case it just downloads all the media files, which is not what the user intended.
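One way to keep backward compatibility and still catch typos is to accept only a known set of deprecated names and to reject everything else. The sketch below uses a hypothetical stand-in for the loader; the deprecated name `old_tables` is invented for the example and is not an actual audb argument.

```python
def load(name, *, only_metadata=False, **kwargs):
    """Hypothetical stand-in for a loader with backward-compatible kwargs."""
    # Deprecated argument names that are still accepted (invented example)
    deprecated = {'old_tables': 'tables'}
    options = {'only_metadata': only_metadata}
    for key, value in kwargs.items():
        if key not in deprecated:
            # Unknown names fail loudly instead of being ignored
            raise TypeError(
                f"load() got an unexpected keyword argument '{key}'"
            )
        options[deprecated[key]] = value
    return name, options
```

With this pattern, `load('emodb', metadata_only=True)` raises a TypeError instead of downloading all media files.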

Converting to given flavor format in tables fails for empty tables

If one of the tables is empty in your database and you request format='wav' or a different format, the renaming of the files in the tables will fail with:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/scratch/shuber/envs/shuber_env/lib/python3.6/site-packages/audb/core/load.py", line 863, in load
    _fix_media_ext(db.tables.values(), flavor.format, num_workers, verbose)
  File "/scratch/shuber/envs/shuber_env/lib/python3.6/site-packages/audb/core/load.py", line 187, in _fix_media_ext
    task_description='Fix format',
  File "/scratch/shuber/envs/shuber_env/lib/python3.6/site-packages/audeer/core/utils.py", line 444, in run_tasks
    results[index] = task_func(*param[0], **param[1])
  File "/scratch/shuber/envs/shuber_env/lib/python3.6/site-packages/audb/core/load.py", line 179, in job
    inplace=True,
  File "/scratch/shuber/envs/shuber_env/lib/python3.6/site-packages/pandas/core/indexes/multi.py", line 830, in set_levels
    if is_list_like(levels[0]):
  File "/scratch/shuber/envs/shuber_env/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 4104, in __getitem__
    return getitem(key)
IndexError: index 0 is out of bounds for axis 0 with size 0

audb.load() always connects to the backend

If the database is stored in the cache and you request it with the version:

audb.load(database, version=version)

audb.load() should not need a connection to the backend, but just load the data from cache.

This is not the case as in

audb/audb/core/load.py

Line 666 in b0f6c30

backend = lookup_backend(name, version)

we always establish a connection to the backend.

We should fix this, but it might make more sense to first tackle #46.

DOC: docstring for Repository missing

In the current master we have

Add a curses-based UI to browse and play a database

Before, we had an implementation using https://github.com/frankenjoe/pandasgui.git for easily inspecting a table and playing back some audio.

I was wondering if we should instead implement a curses-based interface using urwid.

Support for flavors on MacOS broken

audresample needs to be fixed for MacOS to support flavors that require mixing or resampling, see audeering/audresample#4

Requesting single media files is much slower in version 1.1.0

First reported in #63

Requesting single media files from a database takes much longer now:

audb 1.0.4

>>> # Starting with empty cache
>>> timeit.timeit('audb.load("msppodcast", version="2.3.0", media=["Audios/MSP-PODCAST_0001_0008.wav"], full_path=False, verbose=True)', number=1, setup="import audb")
37.210156934015686
>>> # Loading from cache
>>> timeit.timeit('audb.load("msppodcast", version="2.3.0", media=["Audios/MSP-PODCAST_0001_0008.wav"], full_path=False, verbose=True)', number=1, setup="import audb")
4.945569497998804

audb 1.1.0

>>> # Starting with empty cache
>>> timeit.timeit('audb.load("msppodcast", version="2.3.0", media=["Audios/MSP-PODCAST_0001_0008.wav"], full_path=False, verbose=True)', number=1, setup="import audb")
93.68407620198559
>>> # Loading from cache
>>> timeit.timeit('audb.load("msppodcast", version="2.3.0", media=["Audios/MSP-PODCAST_0001_0008.wav"], full_path=False, verbose=True)', number=1, setup="import audb")
39.96311020699795

Add audb.info.files?

As we can filter by media (files) when loading a database, I was wondering if we should provide an easy way to get all files that come with a database.

At the moment you can get that info with:

db = audb.load('emodb', only_metadata=True)
db.files

or faster with

deps = audb.dependencies('emodb')
deps.media

To get a list of all tables in a database you can do:

list(audb.info.tables('emodb'))

I was first thinking about audb.info.media('emodb'), but this exists already and returns information on the media type. In general, audb.info deals with header information, and only a few functions access the dependency file instead, e.g. audb.info.channels().

So we could think about adding something like audb.info.files() or audb.list_media().

@agfcrespi also reported that he searched the documentation for "files", not for "media", when he was looking for such a function.

Undocumented dependencies on libsndfile and SoX

I am trying to use audb in our CI. I install it via pip install audb==1.1.2 but it appears it has an additional undocumented dependency on libsndfile:

import audb
OSError: sndfile library not found

I haven't tested it yet but it seems the apt package libsndfile1 needs to be installed on Ubuntu: https://stackoverflow.com/questions/55086834/cant-import-soundfile-python

I think this should be documented or ideally the dependency be included in the pip package if possible.

Avoid regex warning?

I'm seeing this warning from time to time:

 /opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/audb/core/load.py:180: FutureWarning: The default value of regex will change from True to False in a future version.
  table.df.index = table.df.index.str.replace(cur_ext, new_ext)

Should we overwrite the default value to suppress it?
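Passing the regex argument explicitly to str.replace() is what the warning asks for, and the choice matters for the result. The sketch below uses the standard library `re` module instead of pandas to show the difference between treating the extension as a pattern and treating it as literal text; the file name is made up.

```python
import re

cur_ext, new_ext = '.wav', '.flac'
file = 'xwav/a.wav'

# As a regex, '.' matches any character, so 'xwav' is replaced as well
as_regex = re.sub(cur_ext, new_ext, file)

# Escaping the pattern and anchoring it at the end only touches the extension
as_literal = re.sub(re.escape(cur_ext) + '$', new_ext, file)
```

Here `as_regex` is `'.flac/a.flac'`, whereas `as_literal` is `'xwav/a.flac'`, so an escaped (or non-regex) replacement seems to be the safer default for file extensions.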

Readd command line interface

Before we had a command line interface based on fire, but we removed it as we had several problems with it.

Would be nice to readd a subset of those functions.

Let audb.load() and audb.load_to() share more code

At the moment audb.load() and audb.load_to() are implemented more or less independently of each other, which makes no sense and is risky to maintain. We should try to share as much code between them as possible.

audb.publish() should raise error if author is missing

It is ok in audformat.Database to allow for empty author and license entries, but audb.publish() should raise an error if:

author is missing
~~license is missing~~

Right management for shared folder is not implemented

In the shared folder databases have to be shared by different users, which is not working at the moment:

>>> audb.load("mgb5", version="1.0.0", cache_root="/data/audb")
PermissionError: [Errno 13] Permission denied: '/data/audb/mgb5/1.0.0/ebbb9037

as the data was downloaded before by another user.

In older versions of audb we handled this, I do not remember exactly how, but I guess it is related to these lines of code:

        # Set permissions for to be stored files to the one from cache folder
        current_permission = os.stat(cache_root).st_mode & 0o777
        mask = 0o777 - current_permission
        current_mask = os.umask(mask)
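A self-contained version of this idea could look like the sketch below. The context manager name `umask_from_folder()` is hypothetical. It derives the umask from the permissions of the given folder and restores the previous umask afterwards, which the snippet above does not show. It relies on POSIX permission semantics.

```python
import contextlib
import os


@contextlib.contextmanager
def umask_from_folder(folder):
    """Temporarily set the umask so new entries match ``folder``'s permissions."""
    current_permission = os.stat(folder).st_mode & 0o777
    # Mask out every permission bit that the folder itself does not have
    mask = 0o777 & ~current_permission
    previous_mask = os.umask(mask)
    try:
        yield mask
    finally:
        # Always restore the process-wide umask
        os.umask(previous_mask)
```

Wrapping the download in `with umask_from_folder(cache_root):` would then create new folders with the same permissions as the cache root. Files that already belong to another user are not changed by this.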

Speed up publish when no media files were altered

Currently publish() always checks if media was changed. To do that, the checksum of all media files in the database has to be calculated. This can take quite a while on large databases. However, most of the time only the metadata is changed and maybe new media is added. Changes to existing media are rather rare. So I wonder if we should give the user the option to skip the test for altered media.

Documentation TOC is wrongly formatted

whereas if I build it locally I get the desired result:

audb.load() can fail when searching for cached versions

I tried to load audioset on compute4 and got:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-7-1d6b92c78f3c> in <module>
----> 1 db = audb.load(name, version=version, only_metadata=True, full_path=False, tables='Human-sounds.unbalanced-train')

~/.envs/audb/lib/python3.6/site-packages/audb/core/load.py in load(name, version, only_metadata, bit_depth, channels, format, mixdown, sampling_rate, tables, media, removed_media, full_path, cache_root, num_workers, verbose, **kwargs)
    821             cache_root,
    822             num_workers,
--> 823             verbose,
    824         )
    825

~/.envs/audb/lib/python3.6/site-packages/audb/core/load.py in _load_tables(tables, backend, db_root, db_root_tmp, db, version, cached_versions, deps, flavor, cache_root, num_workers, verbose)
    502                 version,
    503                 flavor,
--> 504                 cache_root,
    505             )
    506         if cached_versions:

~/.envs/audb/lib/python3.6/site-packages/audb/core/load.py in _cached_versions(name, version, flavor, cache_root)
     33
     34     df = cached(cache_root=cache_root)
---> 35     df = df[df.name == name]
     36
     37     cached_versions = []

~/.envs/audb/lib/python3.6/site-packages/pandas/core/generic.py in __getattr__(self, name)
   5139             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5140                 return self[name]
-> 5141             return object.__getattribute__(self, name)
   5142
   5143     def __setattr__(self, name: str, value) -> None:

AttributeError: 'DataFrame' object has no attribute 'name'

Revisit command line interface

At the moment we have audbget to download single tables.

There are four changes that we might want to make:

rename it to audb get instead
use the table argument of audb.load() directly
make it work with the media argument
provide possibility to store files to a folder

Add load from cache to audb.info.header()

At the moment audb.info.header() always downloads the header file from the backend,
but it might be that the header is already stored in the cache folder.
So, in principle we could load it from there.

Loading a database using uppercase file extensions can fail

If you have stored your WAV files with an upper-case extension, e.g. WAV or Wav, loading the wav flavor might fail: audb knows from the dependency file that the files are already in WAV format and will not convert them, but it will still rename all table entries to the lower-case extension.

Here is an example of a corresponding dependency file of aspeechdb:

>>> deps = audb.dependencies('aspeechdb', version='1.0.0')
>>> deps()
                archive  bit_depth  channels                          checksum  duration format  removed  sampling_rate  type version
db.dev.csv          dev          0         0  d2978b3411c3f4ab69c5577e5d06c1ba     0.000    csv        0              0     0   1.0.0
db.test.csv        test          0         0  4c8cbb71a5e7ffe60b5fb3c5ba45debb     0.000    csv        0              0     0   1.0.0
db.train.csv      train          0         0  2f97e2352310462ae397b8ad3bffd1b7     0.000    csv        0              0     0   1.0.0
audio/1.1.WAV     audio         16         1  fd2045cd60e1b9b12e84b5862c5f835f     3.840    wav        0          16000     1   1.0.0
audio/1.10.WAV    audio         16         1  7e726d6eacd6dc52cff3d26e078c7da7     4.608    wav        0          16000     1   1.0.0
...                 ...        ...       ...                               ...       ...    ...      ...            ...   ...     ...
audio/99.79.WAV   audio         16         1  afc7d8211fb5a31a635293a95bdbfa68     4.096    wav        0          16000     1   1.0.0
audio/99.78.WAV   audio         16         1  8c802aef907927a4dfa319edb5b22a6b     4.608    wav        0          16000     1   1.0.0
audio/99.8.WAV    audio         16         1  74fe29eecd8f2ac01a04b4fe00a0690b     3.840    wav        0          16000     1   1.0.0
audio/99.9.WAV    audio         16         1  cd18866eaf082a01d9b6ec1df4de9c2d     4.096    wav        0          16000     1   1.0.0
audio/99.80.WAV   audio         16         1  cf14e49c36e809bc637ba92d8df274eb     4.608    wav        0          16000     1   1.0.0

[16404 rows x 10 columns]

Speed up audb.Dependencies.load()

As we need to load the dependency table for nearly every operation we do in audb it would be nice to speed this up.
E.g. for audioset running audb.Dependencies.load() takes around 130s.

The problem is that we don't have many options. We already specify the data types for every column of the corresponding CSV file.

Temporary cache folder not locked

As the cache folder is given by the flavor, it can happen that two users will download at the same time to the same folder.

This can fail at the moment, e.g.

FileNotFoundError: [Errno 2] No such file or directory: '/data/audb/projectsmile-salamander-agent-tone/12.4.1/e2677cd6~/data/2020_09_08/7af930bb3f0645be933bc717826e9635_
7KiW/7af930bb3f0645be933bc717826e9635_7KiW.wav' -> '/data/audb/projectsmile-salamander-agent-tone/12.4.1/e2677cd6/data/2020_09_08/7af930bb3f0645be933bc717826e9635_7KiW/$
af930bb3f0645be933bc717826e9635_7KiW.wav'

as we don't lock the temporary (and the cache?) folder.
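A very simple form of locking can be built on exclusive file creation. The sketch below is an assumption about how this could be done, not how audb handles it; `folder_lock()` and the `.lock` suffix are invented for the example.

```python
import contextlib
import os


@contextlib.contextmanager
def folder_lock(folder):
    """Hold an exclusive lock file next to ``folder`` while the block runs.

    Raises ``FileExistsError`` if the lock is already held.
    """
    lock_file = folder.rstrip('/\\') + '.lock'
    # O_CREAT | O_EXCL makes creation fail if the file already exists
    fd = os.open(lock_file, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        yield lock_file
    finally:
        os.close(fd)
        os.remove(lock_file)
```

A second user trying to download to the same folder would get a FileExistsError and could wait or retry. A lock file left behind after a crash would need extra handling, for example a timeout.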

Wrong return type for some audb.info functions

The following functions all have dict as return type, but the correct return type is audformat.core.common.HeaderDict:

audb.info.media()
audb.info.meta()
audb.info.raters()
audb.info.schemes()
audb.info.splits()
audb.info.tables()

There are two solutions:

1. change the return type to audformat.core.common.HeaderDict
2. convert the return values to dict

The advantage of 1. would be that the result remains identical to calling e.g. db.media, its disadvantage is that audformat.core.common.HeaderDict is not documented.
The advantage of 2. would be that it returns a well known type, the disadvantage of 2. is that it wouldn't be any longer identical to calling e.g. db.media.

audb.available() stores the wrong backend for databases published on different backends

It is totally fine to publish different versions of a database on different backends. For example, emodb 0.2.2 and 1.0.1 are stored on Repository('data-public-local', 'https://artifactory.audeering.com/artifactory', 'artifactory'), whereas version 1.1.0 is published on Repository('data-public', 'https://audeering.jfrog.io/artifactory', 'artifactory'). But if you request the list of available databases:

>>> df = audb.available(only_latest=True)
>>> df.loc['emodb']
backend                                         artifactory
host          https://artifactory.audeering.com/artifactory
repository                                data-public-local
version                                               1.1.0
Name: emodb, dtype: object

it shows the wrong repository.

This happens because we set backend information and database version independently of each other in audb.available():

            if name not in match:
                match[name] = { 
                    'backend': repository.backend,
                    'host': repository.host,
                    'repository': repository.name,
                    'version': [], 
                }
            match[name]['version'].append(version)
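A possible fix is to always store the repository together with the version it belongs to. The sketch below illustrates this for the only_latest case with plain tuples. `latest_per_database()` is a hypothetical helper, and the version ordering is simplified to purely numeric 'X.Y.Z' strings.

```python
def latest_per_database(entries):
    """Return the latest version of each database with its own repository.

    ``entries`` is an iterable of ``(name, version, repository)`` tuples.
    """
    def version_key(version):
        # Simplified ordering that assumes purely numeric 'X.Y.Z' versions
        return tuple(int(part) for part in version.split('.'))

    latest = {}
    for name, version, repository in entries:
        known = latest.get(name)
        if known is None or version_key(version) > version_key(known['version']):
            # Version and repository are always stored together
            latest[name] = {'repository': repository, 'version': version}
    return latest
```

For the emodb example above this returns data-public as the repository of version 1.1.0.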

Improve speed or be more verbose

When loading a large database (e.g. voxceleb) it might take over 60 seconds before audb shows the first progress bar after showing the text message:

Get:   voxceleb2-videos v1.0.0
Cache: /data/work3/hwierstorf/audb/voxceleb2-videos/1.0.0/2208f75e

We should try to speed up audb or if not possible, maybe show another progress bar or a text message indicating that audb is doing something.

Raise error if conversion is not possible before downloading media files

If you try to resample MP4 files without specifying format='wav' or format='flac', you will get an error telling you to specify it,
but this error only appears after all the data has been downloaded. It seems much more convenient to raise this error before downloading.

It should be possible, as we store the format inside the database dependency files.
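Such an early check could be sketched as follows. The function `check_conversion()` and its arguments are assumptions for illustration; the rule it encodes (conversion of non-WAV/FLAC media requires format='wav' or format='flac') is taken from the description above.

```python
def check_conversion(source_formats, target_format=None, convert=False):
    """Fail before any download if a needed target format is missing.

    ``source_formats`` are the formats listed in the dependency table,
    ``convert`` states whether resampling or mixing was requested.
    """
    targets = {'wav', 'flac'}
    if not convert or target_format in targets:
        return
    offending = sorted({f.lower() for f in source_formats} - targets)
    if offending:
        raise ValueError(
            f"Cannot convert media of format {offending} "
            "without format='wav' or format='flac'."
        )
```

As the formats are available from the dependency table, this check only needs the dependency file and no media.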

audb.dependencies() should cache the dependency file

As it can take a long time to download the dependency file of a big database and it is loaded by audb.load() anyway,
we should also cache it with audb.dependencies().

Raise error if table does not exist during load()

We should raise an error instead of returning empty tables when requesting a non-existing table.

E.g. at the moment we get:

>>> db = audb.load('emodb', tables='noise', verbose=False)
>>> db.files
Index([], dtype='object', name='file')
>>> list(db.tables)
[]

Instead it should raise an error and maybe present a list of available tables, which could be generated with:

>>> list(audb.info.tables('emodb'))
['emotion', 'files']
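The check itself is small. The following sketch assumes exact table names (no pattern matching) and uses a hypothetical helper `check_tables()`; the list of available tables would come from audb.info.tables() as shown above.

```python
def check_tables(requested, available):
    """Raise ``ValueError`` if a requested table is not available."""
    if isinstance(requested, str):
        requested = [requested]
    missing = [table for table in requested if table not in available]
    if missing:
        raise ValueError(
            f"Could not find table(s) {missing}. "
            f"Available tables: {sorted(available)}"
        )
    return requested
```

For `check_tables('noise', ['emotion', 'files'])` this raises an error that lists 'emotion' and 'files' as available tables.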

Search is broken in the documentation

The search is broken in the documentation, e.g.: https://audeering.github.io/audb/search.html?q=full_path&check_keywords=yes&area=default gives a JavaScript error:

Uncaught ReferenceError: Stemmer is not defined
    query https://audeering.github.io/audb/_static/searchtools.js:158
    setIndex https://audeering.github.io/audb/_static/searchtools.js:98
    <anonymous> https://audeering.github.io/audb/search.html?q=full_path&check_keywords=yes&area=default line 2 > injectedScript:1

We had this issue a while ago with other packages and it was fixed there by updating audeering-sphinx-theme or one of the other packages if I remember correctly.

Download single files from archives for filtering media

It might be possible that we could download single media files also for the case that they are stored in an archive.

The only problem is that this needs to be somehow supported by the backend.
For example, in Artifactory you can do something like this:

r = audfactory.rest_api_get(
    f'{host}/{repository}/{name}/media/{archive}/{version}/{archive}-{version}.zip!/{filename}'
)
with open(dst_filename, 'wb') as fp:
    fp.write(r.content)

Solve pandas future warnings

In the tests we are getting the following warnings at the moment (Python 3.7), copied from https://github.com/audeering/audb/pull/40/checks?check_run_id=2444852881

  /home/runner/work/audb/audb/audb/core/load.py:127: FutureWarning: The default value of regex will change from True to False in a future version.
    table.df.index = table.df.index.str.replace(cur_ext, new_ext)

  /home/runner/work/audb/audb/audb/core/load.py:130: FutureWarning: The default value of regex will change from True to False in a future version.
    table.df.index.levels[0].str.replace(cur_ext, new_ext),

  /home/runner/work/audb/audb/audb/core/load.py:132: FutureWarning: inplace is deprecated and will be removed in a future version.
    inplace=True,

  /home/runner/work/audb/audb/audb/core/load.py:156: FutureWarning: inplace is deprecated and will be removed in a future version.
    root + table.df.index.levels[0], 'file', inplace=True,

Not sure if the regex warnings need any action, but I list them here.

Load always creates temporary folder

It seems that audb.load() always creates a temporary folder, even if it is not needed, e.g. when the database was already completely downloaded. Usually a user will not notice, unless she is missing write permissions, e.g. when reading from the shared cache. So it would be safer to only create a temporary folder if it is actually needed.

audb.dependencies() could use the cached files as well

When we store a database to the cache, we also cache its dependency file.
This means that audb.dependencies() does not always need to download the dependency table from the backend; instead, it could first look into the cache.

Check that all media and tables are available before start upload

Make sure we are not starting a publishing process that will fail and leave behind a corrupted database.

tqdm exception in audb 1.1.0

After updating to audb 1.1.0, the following fails with an exception:

b = audbenchmark.load(
        name='arousal',
        subgroup='ser.msppodcast.regression',
        version='1.0.0',
        verbose=True
    )
b.load_test_set()
Get:   msppodcast v1.0.1
Cache: /media/chausner/Linux/Secured/audb/msppodcast/1.0.1/854d7d2f
  0%|                         [00:00<?] Missing tables
  0%|                         [00:00<?] Cached files
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/chausner/.local/lib/python3.6/site-packages/audbenchmark/core/benchmark.py", line 266, in load_test_set
    return self._load_set(self.test_set)
  File "/home/chausner/.local/lib/python3.6/site-packages/audbenchmark/core/benchmark.py", line 695, in _load_set
    _, y = d()
  File "/home/chausner/.local/lib/python3.6/site-packages/audbenchmark/core/data.py", line 72, in __call__
    db, data = self._call()
  File "/home/chausner/.local/lib/python3.6/site-packages/audbenchmark/core/data.py", line 189, in _call
    return self._columns._call()
  File "/home/chausner/.local/lib/python3.6/site-packages/audbenchmark/core/data.py", line 332, in _call
    verbose=self.verbose,
  File "/home/chausner/.local/lib/python3.6/site-packages/audb/core/load.py", line 737, in load
    verbose,
  File "/home/chausner/.local/lib/python3.6/site-packages/audb/core/load.py", line 451, in _get_tables_from_cache
    task_description='Copy tables',
  File "/home/chausner/.local/lib/python3.6/site-packages/audeer/core/utils.py", line 441, in run_tasks
    disable=not progress_bar,
  File "/home/chausner/.local/lib/python3.6/site-packages/audeer/core/tqdm.py", line 100, in progress_bar
    leave=config.TQDM_LEAVE,
  File "/home/chausner/.local/lib/python3.6/site-packages/tqdm/_tqdm.py", line 945, in __init__
    self.display()
  File "/home/chausner/.local/lib/python3.6/site-packages/tqdm/_tqdm.py", line 1335, in display
    self.sp(self.__repr__() if msg is None else msg)
  File "/home/chausner/.local/lib/python3.6/site-packages/tqdm/_tqdm.py", line 979, in __repr__
    return self.format_meter(**self.format_dict)
  File "/home/chausner/.local/lib/python3.6/site-packages/tqdm/_tqdm.py", line 452, in format_meter
    return bar_format.format(bar='?', **format_dict)
KeyError: 'percentage'

After downgrading to audb 1.0.4, it works again.

Wrong version number shown in the docs

The latest release of audb is v1.1.3, published on 2021/05/18, but the docs show:

They have been updated though

Wrong audio metadata stored for MP3 files in dependencies

See:

>>> deps = audb.dependencies('kit01')
>>> deps()
                                                       archive  bit_depth  channels                          checksum  duration format  removed  sampling_rate  type version
db.files.csv                                             files          0         0  aa46f52940b57000779ddb63b375da35       0.0    csv        0              0     0   1.0.0
db.emotion.csv                                         emotion          0         0  a5301d2fd6744287df9e8f61f8734bc8       0.0    csv        0              0     0   1.0.0
data/2N5YF_action_prompt_angry_bordcomputer_zoo...  kit01-data          0         0  02d7854ab410b7ce47fdaacb05e8d2e3       0.0    mp3        0              0     1   1.0.0
data/2N5YF_action_prompt_angry_connected_drive_...  kit01-data          0         0  1e47f6130648091be7f542174bae3dff       0.0    mp3        0              0     1   1.0.0
data/2N5YF_action_prompt_angry_fahrzeugstatus_z...  kit01-data          0         0  a42a3294d985597da7ccb092a6a09f72       0.0    mp3        0              0     1   1.0.0
...                                                        ...        ...       ...                               ...       ...    ...      ...            ...   ...     ...
data/Z1XVJ_dialog_prompt_surprised2_vier_zoom_c...  kit01-data          0         0  a858311a1e0a34b9dd62f7dfa733dbb7       0.0    mp3        0              0     1   1.0.0
data/Z1XVJ_dialog_prompt_surprised2_zwei_zoom_c...  kit01-data          0         0  e75227d19a940bca27ba96039bb228eb       0.0    mp3        0              0     1   1.0.0
data/Z1XVJ_dialog_prompt_surprised2_menue_zoom_...  kit01-data          0         0  3f0ac983669e6ecbd9cfaa5cfee27fad       0.0    mp3        0              0     1   1.0.0
data/Z1XVJ_dialog_prompt_surprised2_vorherige_z...  kit01-data          0         0  cbbae5a9ec9fb1533bc257c07d08b475       0.0    mp3        0              0     1   1.0.0
data/Z1XVJ_dialog_prompt_surprised2_zurueck_zoo...  kit01-data          0         0  8764cf91e87517033fd00fad0d859b92       0.0    mp3        0              0     1   1.0.0

[6077 rows x 10 columns]

I guess this can happen if we publish on a device that does not have MP3 support when using audiofile.

BUG: empty tables when filtering tables and requesting a different format

import audb


db = audb.load(
    'testdata',
    tables='emotion.dev.gold',
)
db['emotion.dev.gold'].get()

                                                                                                  emotion
file                                               start                  end                            
/media/jwagner/Data/audb/testdata/1.6.0/d3b62a9... 0 days 00:00:02.374641 0 days 00:00:04.101248  unhappy
                                                   0 days 00:00:05.445999 0 days 00:00:13.061626    happy
                                                   0 days 00:00:13.960496 0 days 00:00:14.897836    happy
                                                   0 days 00:00:21.454417 0 days 00:00:28.235479  unhappy
                                                   0 days 00:00:31.573883 0 days 00:00:35.081475    happy
                                                   0 days 00:00:46.336832 0 days 00:00:49.666294  neutral
                                                   0 days 00:00:53.288169 0 days 00:00:57.397128  neutral
/media/jwagner/Data/audb/testdata/1.6.0/d3b62a9... 0 days 00:00:04.441153 0 days 00:00:05.069330  unhappy
                                                   0 days 00:00:08.263919 0 days 00:00:13.812035  neutral
                                                   0 days 00:00:15.163421 0 days 00:00:18.361817  neutral
                                                   0 days 00:00:23.164433 0 days 00:00:23.922398  neutral
                                                   0 days 00:00:32.178272 0 days 00:00:35.268576  neutral
                                                   0 days 00:00:42.320708 0 days 00:00:42.838252  neutral
                                                   0 days 00:00:47.715380 0 days 00:00:48.870772  neutral
                                                   0 days 00:00:50.283749 0 days 00:00:51.663219    happy
                                                   0 days 00:00:57.899072 0 days 00:00:58.337701  unhappy
/media/jwagner/Data/audb/testdata/1.6.0/d3b62a9... 0 days 00:00:04.113521 0 days 00:00:06.757677  unhappy
                                                   0 days 00:00:07.189011 0 days 00:00:09.499757  neutral
                                                   0 days 00:00:10.056658 0 days 00:00:17.380463  neutral
                                                   0 days 00:00:20.189824 0 days 00:00:24.043259    happy
                                                   0 days 00:00:25.961979 0 days 00:00:26.743246  neutral
                                                   0 days 00:00:27.502626 0 days 00:00:27.814136  neutral
                                                   0 days 00:00:36.057617 0 days 00:00:39.101987  neutral
                                                   0 days 00:00:43.284854 0 days 00:00:46.313790  neutral
                                                   0 days 00:00:49.399081 0 days 00:00:49.681210    happy
                                                   0 days 00:00:53.789372 0 days 00:00:59.017524    happy

Looks ok, but if we set format='flac' we get:

db = audb.load(
    'testdata',
    tables='emotion.dev.gold',
    format='flac',
)
db['emotion.dev.gold'].get()

Empty DataFrame
Columns: [emotion]
Index: []

To solve the issue we should change the file extension in the tables after applying the filtering.
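The proposed order of operations can be illustrated with plain lists. This is a simplified sketch, not the audb implementation; `filter_then_rename()` is a hypothetical helper and the file names are made up.

```python
import os


def filter_then_rename(files, media, new_ext):
    """Filter ``files`` by their original names, then change the extension."""
    # Filtering must use the original names, as ``media`` lists those
    selected = [file for file in files if file in media]
    return [os.path.splitext(file)[0] + '.' + new_ext for file in selected]
```

Renaming first would turn `audio/1.wav` into `audio/1.flac`, which no longer matches any entry in `media` and leads to the empty table shown above.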

Clean up audb.publish()

We need to revisit helper functions _find_media() and _put_media() in audb.publish(). The functions are not well separated yet and need some more comments.

Long file path can fail on Windows

As long as long file paths are not supported in audeer (audeering/audeer#15), loading a database with long file paths might fail.