audeering / audbcards
Data cards for audio datasets
Home Page: https://audeering.github.io/audbcards/
License: Other
Currently, the RST page of a data card is rendered by audbcards.Datacard._render_template(), which calls content = template.render(dataset), where dataset is a dictionary of the form {'name': 'database-name', ...}; the keys are used in the templates to specify what should be displayed on the RST page.
If a user had access to dataset, e.g. in the form of audbcards.Datacard.dataset, it would be very easy to extend a data card with some special entries without first having to update the audbcards package.
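A minimal sketch of the proposed access, with a stand-in Datacard class; the property name dataset follows the issue, everything else is invented for illustration:

```python
# Stand-in for audbcards.Datacard; only the `dataset` property matters here.
class Datacard:
    def __init__(self, name):
        # Dictionary that is later consumed by template.render(**dataset)
        self._dataset = {"name": name}

    @property
    def dataset(self):
        return self._dataset


datacard = Datacard("emodb")
# Extend the data card with a special entry without updating audbcards:
datacard.dataset["contact"] = "info@audeering.com"
```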
I also see two challenges we need to solve when adding this feature:
- audbcards.Datacard.dataset cannot simply be set in __init__(), as the dictionary is only filled when _render_template() is called
- dataset might not be the best name, as it is easy to confuse with an actual audbcards.Dataset object
It would be great if a data card could be created for a not yet published database by loading it from a folder.
This would have the advantage of getting information on the data before publishing it, and maybe spotting some bugs in the data.
The biggest change this would require is replacing calls to the dependency table, as it would not be available.
At the moment we support only Artifactory as a backend, as audfactory is used in Dataset.publication:
audbcards/audbcards/core/dataset.py
Lines 215 to 224 in bd291f3
and Dataset.repository_link also hardcodes an Artifactory server address:
audbcards/audbcards/core/dataset.py
Lines 226 to 235 in bd291f3
At the moment we have a cache_root argument to audbcards.Dataset and reuse the cache for audbcards.Datacard.example. Instead we should introduce an official cache folder under ~/.cache/audbcards and have two separate folders dataset and datacard there.
In #28 (comment) @ChristianGeng suggested to cache the single properties of audbcards.Dataset. My impression is that __init__() already takes too long (~0.5 s), so we might simply try to save the whole Dataset object as a pickle file (PKL) in the cache.
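A minimal sketch of such pickle-based caching, assuming a hypothetical helper cached_dataset() and the cache layout proposed above (~/.cache/audbcards/dataset); none of these names are part of the actual audbcards API:

```python
import os
import pickle


# Hypothetical helper: store a pickled Dataset object under
# <cache_root>/dataset/<name>-<version>.pkl and load it on the next call.
def cached_dataset(name, version, create, cache_root="~/.cache/audbcards"):
    cache_dir = os.path.join(os.path.expanduser(cache_root), "dataset")
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f"{name}-{version}.pkl")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    obj = create(name, version)  # e.g. audbcards.Dataset(name, version)
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return obj
```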
This is the second part of #4.
At the moment audbcards.Dataset.repository_link simply expects an Artifactory backend, e.g.
NOTE: the following code needs to be executed using the add-test-db-fixture branch at the moment.
>>> import audbcards
>>> dataset = audbcards.Dataset('emodb', '1.3.0')
>>> dataset.repository_link
'`data-public <https://audeering.jfrog.io/artifactory/webapp/#/artifacts/browse/tree/General/data-public/emodb>`__'
One solution to make this independent of the used backend would be to introduce an if-clause. For example, we can return the above string if we detect the artifactory backend (as returned by audb.Repository.backend) and return just the name otherwise:
'data-public'
@ChristianGeng any thoughts on this?
It is not always clear whether the languages are named in a consistent fashion.
Examples:
This would be an additional property that uses an ISO mapping as implemented e.g. in audformat.utils.map_language.
It needs to be checked whether this is needed, as ISO mappings might lose dialectal information.
For example, I am dealing with data that contain e.g. Moroccan Arabic and don't want to lose the dialectal information.
I have added a PR that implements this feature.
In the present form this can be used to count the number of datasets containing English, but it cannot be used to ask questions about segments or files. I know that there will be #31, which will count the segments altogether.
But when one needs to break down by property and count the number of segments, this currently fails.
For example, without digging into tables we cannot answer how many segments are annotated for Armenian or for females. This is beyond the scope of this package, I take it?
The motivation currently is only to clean up dirty language mappings that may occur when publishers publish unnormalized language fields; for example, 'Deutsch', 'german', and 'de' might all map to ISO 'deu'. Such duplicates distort language counts when operating on an array of databases. This can be seen as an intermediate step, as there might be better alternatives to this approach.
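The cleanup of dirty language fields can be sketched with a toy mapping table; a real implementation could instead call audformat.utils.map_language() to map to ISO 639-3 codes:

```python
from collections import Counter

# Toy normalization table; real code could use audformat.utils.map_language().
ISO_MAP = {
    "deutsch": "deu",
    "german": "deu",
    "de": "deu",
    "english": "eng",
    "en": "eng",
}


def normalize_language(language):
    return ISO_MAP.get(language.lower(), language)


# Unnormalized language fields collected from several databases
languages = ["Deutsch", "german", "de", "English", "en"]
counts = Counter(normalize_language(lang) for lang in languages)
# counts == Counter({"deu": 3, "eng": 2})
```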
Please add a file tests/test_dataset.py that uses the db fixture to test the return values of all methods/properties/attributes of audbcards.Dataset.
As mentioned in #4, publication and repository_link do not support the file-system backend yet, so you could skip those for now.
For a start you could use this example:
import pandas as pd
import pytest

import audb
import audbcards


def test_dataset(db):
    dataset = audbcards.Dataset(pytest.NAME, pytest.VERSION)

    # __init__
    assert dataset.name == pytest.NAME
    assert dataset.version == pytest.VERSION
    assert dataset.repository == pytest.REPOSITORY
    expected_header = audb.info.header(
        pytest.NAME,
        version=pytest.VERSION,
    )
    assert str(dataset.header) == str(expected_header)
    expected_deps = audb.dependencies(
        pytest.NAME,
        version=pytest.VERSION,
    )
    pd.testing.assert_frame_equal(dataset.deps(), expected_deps())

    # archives
    assert dataset.archives == str(len(db.files))

    # ...
If some methods have multiple test parameters or are more complicated to test, it might make sense to add a separate def test_<method_name>() for each such method.
In #13 we are introducing audbcards.Datacard._trim_trailing_whitespace(), which has no effect on the tables created for emodb.
We should check with other datasets whether this is needed, or whether it can be removed.
In #28 (comment) @ChristianGeng proposed to add a property audbcards.Dataset.datapoints that would return the number of files or segments based on a possibly existing files or segments table.
The downside is that neither a files nor a segments table has to exist inside a database. The ground truth to get all files in a database is audformat.Database.files, and audformat.Database.segments to get all segments. We already have audbcards.Dataset.files to get the number of files, so it seems reasonable to also add audbcards.Dataset.segments.
This has one downside though: in order to get the number of all possible segments, we need to load all tables first and calculate the union of existing segments, compare https://github.com/audeering/audformat/blob/07b000266735ce460af3e4c09b611c15a63f76c0/audformat/core/database.py#L280-L286.
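The union of segments across tables can be sketched with pandas directly, assuming each segmented table is indexed by (file, start, end); this mirrors what audformat.Database.segments does internally, with invented example data:

```python
import pandas as pd

# Two segmented table indexes with one overlapping segment
table_a = pd.MultiIndex.from_tuples(
    [("f1.wav", 0.0, 1.0), ("f1.wav", 1.0, 2.0)],
    names=["file", "start", "end"],
)
table_b = pd.MultiIndex.from_tuples(
    [("f1.wav", 1.0, 2.0), ("f2.wav", 0.0, 0.5)],
    names=["file", "start", "end"],
)

# The union de-duplicates the overlapping segment
segments = table_a.union(table_b)
number_of_segments = len(segments)  # 3 unique segments
```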
In audb and audformat we usually refer to the created artifact as database. In the literature, the term dataset is most often used instead. At the moment, we use Dataset as the name for the class representing a database for which we want to create a data card. Should we stay with that name or change to Database?
@ChristianGeng @Bahaabrougui any opinion on this?
It might be useful to create a data card for non-published databases, e.g. when preparing the first version of a database, as it might help to spot bugs or to add information to a corresponding pull request.
Currently we use the publish_db and db fixtures to publish a single example database that can then be used in the tests.
But in order to test certain functionality or border cases like #34 and #32 we would need to use different databases.
We need to check if this can be handled by providing arguments to the publish_db fixture, or if we should publish several different databases in the fixture, or if we should publish databases only as part of the single tests.
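One way the fixture-argument option could look is pytest's fixture parametrization; the fixture name publish_db follows the issue, while the database specs below are invented for illustration:

```python
import pytest

# Invented specs for several example databases a test might need
DATABASES = {
    "minimal": {"tables": []},
    "with-schemes": {"tables": ["files"], "schemes": ["speaker"]},
}


@pytest.fixture(params=DATABASES)
def publish_db(request):
    # The real fixture would build and publish the database described by
    # the spec before yielding it; here we only yield the spec itself.
    spec = DATABASES[request.param]
    yield {"name": request.param, **spec}


def test_database(publish_db):
    # Runs once per entry in DATABASES
    assert "name" in publish_db
```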
When building a data card with sphinx, the user can change the location of the build folder, which we need to handle inside the methods/functions that copy files to the build folder.
At the moment we hardcode how the actual data cards will look in the end, e.g.
audbcards/audbcards/core/dataset.py
Lines 311 to 328 in bd291f3
It would be much more user friendly if we could provide HTML template(s) (maybe with Jinja2?) so that a user can configure the resulting data card.
/cc @ChristianGeng
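A user-configurable template could be sketched with Jinja2, assuming the template receives the same dictionary that is passed to template.render() in audbcards.Datacard._render_template(); the template content and dictionary values here are invented:

```python
import jinja2

# Invented RST-style template a user could supply to customize a data card
template = jinja2.Template(
    "{{ name }}\n"
    "{{ '=' * name|length }}\n"
    "\n"
    "License: {{ license }}\n"
)

# Stand-in for the dictionary built by audbcards.Datacard
dataset = {"name": "emodb", "license": "CC0-1.0"}
content = template.render(**dataset)
```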
As mentioned in #12, the methods create_datacard_page() and create_datasets_page() both require a /datasets folder to exist in order to create the RST files. However, the datasets folder is only created when the player property is called, and that property is called in the later stages of create_datacard_page(), so the method fails because the folder does not yet exist when the earlier code runs.
We will need to explicitly create /datasets under the BUILD folder, and change the paths used in the methods from e.g. datasets/{dataset.name}.rst to BUILD/datasets/{dataset.name}.rst.
cc/ @hagenw
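Creating the folder up front can be sketched with the standard library; ensure_datasets_folder is a hypothetical helper, and build stands for the BUILD folder:

```python
import os


# Hypothetical helper: make sure BUILD/datasets exists before any
# RST files or media are written into it.
def ensure_datasets_folder(build):
    datasets_dir = os.path.join(build, "datasets")
    os.makedirs(datasets_dir, exist_ok=True)  # no error if it already exists
    return datasets_dir
```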
Inside audbcards.Datacard we currently do:
# Add audio player for example file
dataset['example'] = self.example
dataset['player'] = self.player(dataset['example'])
But it can happen that self.example does not find a suitable file and returns ''. In this case the code will break.
We need to fix this and add tests for it.
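A possible guard for this case can be sketched as follows; player here is a stand-in for audbcards.Datacard.player, and its output format is invented:

```python
# Stand-in for audbcards.Datacard.player; the markup is invented.
def player(example_file):
    return f"<audio src='{example_file}'></audio>"


example = ""  # self.example found no suitable file
dataset = {"example": example}
# Only create a player when there actually is an example file,
# otherwise render the data card without an audio player:
dataset["player"] = player(example) if example else ""
```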
At the moment we have repository as an argument to Dataset, but it is not used to limit audb.config.REPOSITORIES to exactly that repository. I think it would be safer if we did that.
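The limiting step can be sketched in plain Python; SimpleNamespace stands in for audb.Repository, and in the real code the assignment would target audb.config.REPOSITORIES:

```python
from types import SimpleNamespace

# Stand-ins for the configured audb repositories
REPOSITORIES = [
    SimpleNamespace(name="data-public", backend="artifactory"),
    SimpleNamespace(name="data-local", backend="file-system"),
]


def limit_repositories(repositories, name):
    # Keep only the repository that was passed to Dataset(repository=...)
    return [repo for repo in repositories if repo.name == name]


REPOSITORIES = limit_repositories(REPOSITORIES, "data-public")
```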
When Dataset.player gets called, the waveplot with the name db.png gets created under the project's root and not under the build folder (see screenshot below).
This can easily be fixed by changing the path to which {dataset.name}.png gets exported.
cc/ @hagenw
The current implementation breaks if the db object has no schemes.
The bugfix is implemented by changing the _scheme_table_columns function to return an empty list of columns when there are no schemes in the header_dict that holds the schemes.
As a consequence, schemes_table becomes an empty list of lists:
>>> dataset.schemes_table == [[]]
True
In order for this to work, the template needs to check whether there is at least one data row apart from the empty header row.
One could possibly also have used None for the dataset schemes_table in order to avoid the [[]] empty list of lists for schemes_table. Then the check in the template would have been simpler, i.e. a schemes_table is defined check would have been sufficient.
The downside would have been that this breaks the list-of-lists signature typing.List[typing.List[str]] of schemes_table.
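The row check the template has to perform can be sketched in plain Python, assuming schemes_table is a list of rows whose first row is the (possibly empty) header:

```python
# Sketch of the check the template needs: render the schemes table only
# when there is at least one data row apart from the header row.
def has_scheme_rows(schemes_table):
    return len(schemes_table) > 1
```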
In #13 we introduce the new audbcards.Datacard class that handles the rendering of the actual data card. We still continue with audbcards.Dataset, as this might be handy to gather statistical information about datasets.
Having this in mind, it might be better to move all the functionality that creates figures or audio player elements from audbcards.Dataset to audbcards.Datacard, and focus on the actual values in audbcards.Dataset.
Pandas will deprecate applymap in the foreseeable future. We however still use it in the calculation of the Dataset.schemes_table property.
It should be replaced without altering the functionality.
Links:
http://pandas.pydata.org/docs/reference/api/pandas.DataFrame.map.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.applymap.html
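The replacement can be sketched as follows: DataFrame.map was introduced in pandas 2.1 as the successor of applymap, and getattr keeps the code working on older pandas versions; the example data is invented:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# DataFrame.map (pandas >= 2.1) replaces the deprecated applymap;
# fall back to applymap on older pandas versions.
elementwise = getattr(df, "map", df.applymap)
result = elementwise(str)
```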