audeering / audbcards
Data cards for audio datasets
Home Page: https://audeering.github.io/audbcards/
License: Other
Currently, the RST page of a data card is rendered by audbcards.Datacard._render_template(), which calls content = template.render(dataset), where dataset is a dictionary of the form {'name': 'database-name', ...}; the keys are used in the templates to specify what should be displayed on the RST page.
If a user had access to dataset, e.g. in the form of audbcards.Datacard.dataset, it would be very easy to extend a data card with some special entries without first having to update the audbcards package.
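A minimal sketch of the proposed access, with a stand-in Datacard class; the property name dataset follows the issue, everything else is invented for illustration:

```python
# Stand-in for audbcards.Datacard; only the `dataset` property matters here.
class Datacard:
    def __init__(self, name):
        # Dictionary that is later consumed by template.render(**dataset)
        self._dataset = {"name": name}

    @property
    def dataset(self):
        return self._dataset


datacard = Datacard("emodb")
# Extend the data card with a special entry without updating audbcards:
datacard.dataset["contact"] = "info@audeering.com"
```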
I also see two challenges we need to solve when adding this feature:
- audbcards.Datacard.dataset cannot simply be set in __init__(), as the dictionary is only filled when _render_template() is called
- dataset might not be the best name, as it is easy to confuse with an actual audbcards.Dataset object
It would be great if a data card could be created for a not yet published database by loading it from a folder.
This would have the advantage of getting information on the data before publishing it, and maybe spotting some bugs in the data.
The biggest change this would require is replacing calls to the dependency table, as it would not be available.
At the moment we support only Artifactory as a backend, as audfactory is used in Dataset.publication:
audbcards/audbcards/core/dataset.py
Lines 215 to 224 in bd291f3
and Dataset.repository_link also hardcodes an Artifactory server address:
audbcards/audbcards/core/dataset.py
Lines 226 to 235 in bd291f3
At the moment we have a cache_root argument to audbcards.Dataset and reuse the cache for audbcards.Datacard.example. Instead we should introduce an official cache folder under ~/.cache/audbcards and have two separate folders dataset and datacard there.
In #28 (comment) @ChristianGeng suggested to cache the single properties of audbcards.Dataset. My impression is that __init__() already takes too long (~0.5 s), so we might simply try to save the whole Dataset object as a pickle file (PKL) in the cache.
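A minimal sketch of such pickle-based caching, assuming a hypothetical helper cached_dataset() and the cache layout proposed above (~/.cache/audbcards/dataset); none of these names are part of the actual audbcards API:

```python
import os
import pickle


# Hypothetical helper: store a pickled Dataset object under
# <cache_root>/dataset/<name>-<version>.pkl and load it on the next call.
def cached_dataset(name, version, create, cache_root="~/.cache/audbcards"):
    cache_dir = os.path.join(os.path.expanduser(cache_root), "dataset")
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f"{name}-{version}.pkl")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    obj = create(name, version)  # e.g. audbcards.Dataset(name, version)
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return obj
```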
This is the second part of #4.
At the moment audbcards.Dataset.repository_link simply expects an Artifactory backend, e.g.
NOTE: the following code needs to be executed using the add-test-db-fixture branch at the moment.
>>> import audbcards
>>> dataset = audbcards.Dataset('emodb', '1.3.0')
>>> dataset.repository_link
'`data-public <https://audeering.jfrog.io/artifactory/webapp/#/artifacts/browse/tree/General/data-public/emodb>`__'
One solution to make this independent of the used backend would be to introduce an if-clause. For example, we can return the above string if we detect the artifactory backend (as returned by audb.Repository.backend) and return just the name otherwise:
'data-public'
@ChristianGeng any thoughts on this?
It is not always clear whether the languages are named in a consistent fashion.
Examples:
This would be an additional property that uses an ISO mapping as implemented e.g. in audformat.utils.map_language.
It needs to be checked whether this is needed, as ISO mappings might lose dialectal information.
For example, I am dealing with data that contain e.g. Moroccan Arabic and don't want to lose the dialectal information.
I have added a PR that implements this feature.
In the present form this can be used to count the number of datasets containing English, but it cannot be used to ask questions about segments or files. I know that there will be #31, which will count the segments altogether.
But when one needs to break down by property and count the number of segments, this currently fails.
For example, without digging into tables we cannot answer how many segments are annotated for Armenian or for females. This is beyond the scope of this package, I take it?
The motivation currently is only to clean up dirty language mappings that may occur when publishers publish unnormalized language fields; for example, 'Deutsch', 'german', and 'de' might all map to ISO 'deu'. Such duplicates distort language counts when operating on an array of databases. This can be seen as an intermediate step, as there might be better alternatives to this approach.
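The cleanup of dirty language fields can be sketched with a toy mapping table; a real implementation could instead call audformat.utils.map_language() to map to ISO 639-3 codes:

```python
from collections import Counter

# Toy normalization table; real code could use audformat.utils.map_language().
ISO_MAP = {
    "deutsch": "deu",
    "german": "deu",
    "de": "deu",
    "english": "eng",
    "en": "eng",
}


def normalize_language(language):
    return ISO_MAP.get(language.lower(), language)


# Unnormalized language fields collected from several databases
languages = ["Deutsch", "german", "de", "English", "en"]
counts = Counter(normalize_language(lang) for lang in languages)
# counts == Counter({"deu": 3, "eng": 2})
```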
Please add a file tests/test_dataset.py that uses the db fixture to test the return values of all methods/properties/attributes of audbcards.Dataset.
As mentioned in #4, publication and repository_link do not support the file-system backend yet, so you could skip those for now.
For a start you could use this example:
import pandas as pd
import pytest

import audb
import audbcards


def test_dataset(db):
    dataset = audbcards.Dataset(pytest.NAME, pytest.VERSION)

    # __init__
    assert dataset.name == pytest.NAME
    assert dataset.version == pytest.VERSION
    assert dataset.repository == pytest.REPOSITORY
    expected_header = audb.info.header(
        pytest.NAME,
        version=pytest.VERSION,
    )
    assert str(dataset.header) == str(expected_header)
    expected_deps = audb.dependencies(
        pytest.NAME,
        version=pytest.VERSION,
    )
    pd.testing.assert_frame_equal(dataset.deps(), expected_deps())

    # archives
    assert dataset.archives == str(len(db.files))

    # ...
If some methods have multiple test parameters or are more complicated to test, it might make sense to add a separate def test_<method_name>() for each such method.
In #13 we are introducing audbcards.Datacard._trim_trailing_whitespace(), which has no effect on the tables created for emodb.
We should check with other datasets whether this is needed, or whether it can be removed.
In #28 (comment) @ChristianGeng proposed to add a property audbcards.Dataset.datapoints that would return the number of files or segments based on a possibly existing files or segments table.
The downside is that neither a files nor a segments table has to exist inside a database. The ground truth to get all files in a database is audformat.Database.files, and audformat.Database.segments to get all segments. We already have audbcards.Dataset.files to get the number of files, so it seems reasonable to also add audbcards.Dataset.segments.
This has one downside though: in order to get the number of all possible segments, we need to load all tables first and calculate the union of existing segments, compare https://github.com/audeering/audformat/blob/07b000266735ce460af3e4c09b611c15a63f76c0/audformat/core/database.py#L280-L286.
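The union of segments across tables can be sketched with pandas directly, assuming each segmented table is indexed by (file, start, end); this mirrors what audformat.Database.segments does internally, with invented example data:

```python
import pandas as pd

# Two segmented table indexes with one overlapping segment
table_a = pd.MultiIndex.from_tuples(
    [("f1.wav", 0.0, 1.0), ("f1.wav", 1.0, 2.0)],
    names=["file", "start", "end"],
)
table_b = pd.MultiIndex.from_tuples(
    [("f1.wav", 1.0, 2.0), ("f2.wav", 0.0, 0.5)],
    names=["file", "start", "end"],
)

# The union de-duplicates the overlapping segment
segments = table_a.union(table_b)
number_of_segments = len(segments)  # 3 unique segments
```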
In audb and audformat we usually refer to the created artifact as database. In the literature, the term dataset is most often used instead. At the moment, we use Dataset as the name for the class representing a database for which we want to create a data card. Should we stay with that name or change to Database?
@ChristianGeng @Bahaabrougui any opinion on this?
It might be useful to create a data card for non-published databases, e.g. when preparing the first version of a database, as it might help to spot bugs or to add information to a corresponding pull request.
Currently we use the publish_db and db fixtures to publish a single example database that can then be used in the tests.
But in order to test certain functionality or border cases like #34 and #32 we would need to use different databases.
We need to check if this can be handled by providing arguments to the publish_db fixture, or if we should publish several different databases in the fixture, or if we should publish databases only as part of the single tests.
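One way the fixture-argument option could look is pytest's fixture parametrization; the fixture name publish_db follows the issue, while the database specs below are invented for illustration:

```python
import pytest

# Invented specs for several example databases a test might need
DATABASES = {
    "minimal": {"tables": []},
    "with-schemes": {"tables": ["files"], "schemes": ["speaker"]},
}


@pytest.fixture(params=DATABASES)
def publish_db(request):
    # The real fixture would build and publish the database described by
    # the spec before yielding it; here we only yield the spec itself.
    spec = DATABASES[request.param]
    yield {"name": request.param, **spec}


def test_database(publish_db):
    # Runs once per entry in DATABASES
    assert "name" in publish_db
```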
When building a data card with sphinx, the user can change the location of the build folder, which we need to handle inside the methods/functions that copy files to the build folder.
At the moment we hardcode how the actual data cards will look in the end, e.g.
audbcards/audbcards/core/dataset.py
Lines 311 to 328 in bd291f3
It would be much more user friendly if we could provide HTML template(s) (maybe with Jinja2?) so that a user can configure the resulting data card.
/cc @ChristianGeng
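A user-configurable template could be sketched with Jinja2, assuming the template receives the same dictionary that is passed to template.render() in audbcards.Datacard._render_template(); the template content and dictionary values here are invented:

```python
import jinja2

# Invented RST-style template a user could supply to customize a data card
template = jinja2.Template(
    "{{ name }}\n"
    "{{ '=' * name|length }}\n"
    "\n"
    "License: {{ license }}\n"
)

# Stand-in for the dictionary built by audbcards.Datacard
dataset = {"name": "emodb", "license": "CC0-1.0"}
content = template.render(**dataset)
```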
As mentioned in #12, the methods create_datacard_page() and create_datasets_page() both require a /datasets folder to exist in order to create the RST files. However, the datasets folder is only created when the player property is called, and that property is called in the later stages of create_datacard_page(), so the method fails because the folder does not yet exist when the earlier code runs.
We will need to explicitly create /datasets under the BUILD folder, and change the paths used in the methods from e.g. datasets/{dataset.name}.rst to BUILD/datasets/{dataset.name}.rst.
cc/ @hagenw
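Creating the folder up front can be sketched with the standard library; ensure_datasets_folder is a hypothetical helper, and build stands for the BUILD folder:

```python
import os


# Hypothetical helper: make sure BUILD/datasets exists before any
# RST files or media are written into it.
def ensure_datasets_folder(build):
    datasets_dir = os.path.join(build, "datasets")
    os.makedirs(datasets_dir, exist_ok=True)  # no error if it already exists
    return datasets_dir
```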
Inside audbcards.Datacard we currently do:
# Add audio player for example file
dataset['example'] = self.example
dataset['player'] = self.player(dataset['example'])
But it can happen that self.example does not find a suitable file and returns ''. In this case the code will break.
We need to fix this and add tests for it.
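A possible guard for this case can be sketched as follows; player here is a stand-in for audbcards.Datacard.player, and its output format is invented:

```python
# Stand-in for audbcards.Datacard.player; the markup is invented.
def player(example_file):
    return f"<audio src='{example_file}'></audio>"


example = ""  # self.example found no suitable file
dataset = {"example": example}
# Only create a player when there actually is an example file,
# otherwise render the data card without an audio player:
dataset["player"] = player(example) if example else ""
```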
At the moment we have repository as an argument to Dataset, but it is not used to limit audb.config.REPOSITORIES to exactly that repository. I think it would be safer if we did that.
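The limiting step can be sketched in plain Python; SimpleNamespace stands in for audb.Repository, and in the real code the assignment would target audb.config.REPOSITORIES:

```python
from types import SimpleNamespace

# Stand-ins for the configured audb repositories
REPOSITORIES = [
    SimpleNamespace(name="data-public", backend="artifactory"),
    SimpleNamespace(name="data-local", backend="file-system"),
]


def limit_repositories(repositories, name):
    # Keep only the repository that was passed to Dataset(repository=...)
    return [repo for repo in repositories if repo.name == name]


REPOSITORIES = limit_repositories(REPOSITORIES, "data-public")
```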
When Dataset.player gets called, the waveplot with the name db.png gets created under the project's root and not under the build folder (see screenshot below).
This can easily be fixed by changing the path to which {dataset.name}.png gets exported.
cc/ @hagenw
The current implementation breaks if the db object has no schemes.
The bugfix is implemented by changing the _scheme_table_columns function to return an empty list of columns when there are no schemes in the header_dict that holds the schemes.
As a consequence, schemes_table becomes an empty list of lists:
>>> dataset.schemes_table == [[]]
True
In order for this to work, the template needs to check whether there is at least one data row apart from the empty header row.
One could possibly also have used None for the dataset schemes_table in order to avoid the [[]] empty list of lists for schemes_table. Then the check in the template would have been simpler, i.e. a schemes_table is defined check would have been sufficient.
The downside would have been that this breaks the list-of-lists signature typing.List[typing.List[str]] of schemes_table.
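The row check the template has to perform can be sketched in plain Python, assuming schemes_table is a list of rows whose first row is the (possibly empty) header:

```python
# Sketch of the check the template needs: render the schemes table only
# when there is at least one data row apart from the header row.
def has_scheme_rows(schemes_table):
    return len(schemes_table) > 1
```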
In #13 we introduce the new audbcards.Datacard class that handles the rendering of the actual data card. We still continue with audbcards.Dataset, as this might be handy to gather statistical information about datasets.
Having this in mind, it might be better to move all the functionality that creates figures or audio player elements from audbcards.Dataset to audbcards.Datacard, and focus on the actual values in audbcards.Dataset.
Pandas will deprecate applymap in the foreseeable future. We however still use it in the calculation of the Dataset.schemes_table property.
It should be replaced without altering the functionality.
Links:
http://pandas.pydata.org/docs/reference/api/pandas.DataFrame.map.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.applymap.html
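The replacement can be sketched as follows: DataFrame.map was introduced in pandas 2.1 as the successor of applymap, and getattr keeps the code working on older pandas versions; the example data is invented:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# DataFrame.map (pandas >= 2.1) replaces the deprecated applymap;
# fall back to applymap on older pandas versions.
elementwise = getattr(df, "map", df.applymap)
result = elementwise(str)
```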