
Comments (10)

frankenjoe commented on August 23, 2024

to store the opened backend in a global variable,

Not sure if this is possible, since the files of a database can be distributed across several repositories.


hagenw commented on August 23, 2024

We do not support distribution across several repositories at the moment, compare #233
and

audb/audb/core/load.py

Lines 694 to 707 in 069cc04

if missing_files:
    if backend is None:
        backend = lookup_backend(db.name, version)
    if files_type == "media":
        _get_media_from_backend(
            db.name,
            missing_files,
            db_root,
            flavor,
            deps,
            backend,
            num_workers,
            verbose,
        )


hagenw commented on August 23, 2024

But you have of course a valid point: it might be a good idea to support loading data from different repositories. This should then be done in a way that does not require establishing a connection to all backends for every single file.
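A minimal sketch of how that could look, assuming we cache one opened backend per repository; the get_backend() helper and the module-level dictionary are purely illustrative and not audb's actual implementation, while audbackend.backend.Artifactory and open() are the audbackend 2.0.0 calls shown later in this thread:

import audbackend


# Hypothetical module-level cache: one opened backend per (host, repository)
_backends = {}


def get_backend(host: str, repository: str):
    """Return an opened backend, connecting at most once per repository."""
    key = (host, repository)
    if key not in _backends:
        backend = audbackend.backend.Artifactory(host, repository)
        backend.open()  # authenticate only on first access
        _backends[key] = backend
    return _backends[key]

This way, iterating over many files would reuse the already opened backend of their repository instead of re-authenticating for every file.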


ChristianGeng commented on August 23, 2024

ones

For me it started making sense when I replaced this with "once". Conceptually this makes a lot of sense: if you work with database connections this will also be beneficial for speed, and it results in a much nicer design (and you do not have to pass around credentials all the time either).

How to implement such a "one-instance-only" feature is also a good question. What comes to my mind first are concepts like the singleton design pattern (or another creational pattern, e.g. Borg, I think). A quick Google search suggests that this is a matter of debate, see e.g. here. This post suggests a slight preference for creational patterns but also states that "good design is as little design as possible".
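For illustration, a minimal Borg-style sketch (shared state instead of a single instance); the class and attribute names are made up, and the connected flag only stands in for an actual open()/authenticate call:

class BackendConnection:
    """Borg pattern: every instance shares the same state dictionary."""

    _shared_state = {}

    def __init__(self, host, repository):
        self.__dict__ = self._shared_state
        if not getattr(self, "connected", False):
            # Placeholder for the actual connection/authentication
            self.host = host
            self.repository = repository
            self.connected = True


# Both "instances" share one connection state
a = BackendConnection("https://audeering.jfrog.io/artifactory", "data-public")
b = BackendConnection("https://audeering.jfrog.io/artifactory", "data-public")
assert a is not b and a.__dict__ is b.__dict__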

I am currently uncertain how the backend_interface interfaces are constructed: ideally, I believe, the connection would be created and the authentication dealt with once you iterate over config.REPOSITORIES, but I haven't understood the exact mechanism yet. I can try to dig into this and read up more.

I see another caveat in instantiating a connection only once: server-side request-response timeouts, which probably exist not only for REST APIs but also for RDBMS systems.


hagenw commented on August 23, 2024

ones

Thanks, I corrected it to "once".

I see another caveat in instantiating a connection only once: server-side request-response timeouts

Good question, I don't know how long you can use an ArtifactoryPath object before you need to re-authenticate. The ideal design in this case would be that the corresponding backend object from audbackend tries to reconnect when a timeout/authentication error is raised by the backend. The problem is that this requires the error raised by the backend to be distinguishable from other errors.
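A sketch of that reconnect idea, assuming the backend exposes open() and close() and raises a dedicated error type for expired sessions; AuthenticationError and the decorator are hypothetical and not part of audbackend:

import functools


class AuthenticationError(Exception):
    """Hypothetical error a backend would raise on timeout/re-authentication."""


def reconnect_on_auth_error(method):
    """Retry a backend method once after re-opening the connection."""

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        try:
            return method(self, *args, **kwargs)
        except AuthenticationError:
            self.close()  # assumed to exist on the backend object
            self.open()   # re-authenticate and try again
            return method(self, *args, **kwargs)

    return wrapper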


hagenw commented on August 23, 2024

In #388 I updated the implementation that uses audbackend 2.0.0 (#386, solution 1) to pass an open backend on from function to function (solution 2). Solution 1 also reuses an already opened backend when downloading media files, as this happens anyway within a single function.

I then benchmarked the different solutions on compute5 for loading emodb to an empty cache folder.

Implementation Execution time
main branch 20 s
solution 1 11 s
solution 2 11 s
Benchmark code
import time

import audb
import audeer


cache = "./cache"
audeer.rmdir(cache)

t0 = time.time()
db = audb.load(
    "emodb",
    version="1.4.1",
    num_workers=4,
    verbose=False,
    cache_root=cache,
)
t = time.time() - t0
print(f"Execution time: {t:.2f} s")

So it seems that it will not be that important to reuse the same connection when loading the header and dependency files of a dataset, as most of the time is spent on loading media files and tables anyway. But it would of course still be nice to have a good design.


ChristianGeng commented on August 23, 2024

In #388 I updated the implementation that uses audbackend 2.0.0 (#386, solution 1) to pass an open backend on from function to function (solution 2). Solution 1 also reuses an already opened backend when downloading media files, as this happens anyway within a single function.

I then benchmarked the different solutions on compute5 for loading emodb to an empty cache folder.

Implementation Execution time
main branch 20 s
solution 1 11 s
solution 2 11 s

So improving connection handling further does not buy anything. I was trying to find the code locations and technical documentation responsible for the real speed advantage (down to 11 s) but was not successful.


hagenw commented on August 23, 2024

In audeering/audbackend#215, under the section "Speedup audbackend.backend.Artifactory", the following code example is given:

import time

import audbackend


backend = audbackend.backend.Artifactory("https://audeering.jfrog.io/artifactory", "data-public")
backend.open()
interface = audbackend.interface.Maven(backend)

t0 = time.time()
interface.exists("/emodb", "1.4.1")
interface.exists("/emodb/db", "1.4.1")
interface.exists("/emodb/meta", "1.4.1")
interface.exists("/emodb/media", "1.4.1")
t = time.time()
print(f"{t - t0:.3f} s")

This simply asks for four different paths whether the location exists on the backend. With the previous implementation (referenced above as main branch) this takes 0.872 s, as every call creates an artifactory.ArtifactoryPath that authenticates at the backend. With the new implementation (solution 1) it only authenticates when backend.open() is called and takes 0.178 s.

The reason why solution 2 doesn't provide much improvement for audb is that most time is spent on downloading media files, which is done within a single function for most files, compare

audb/audb/core/load.py

Lines 401 to 485 in 069cc04

def _get_media_from_backend(
    name: str,
    media: typing.Sequence[str],
    db_root: str,
    flavor: typing.Optional[Flavor],
    deps: Dependencies,
    backend: audbackend.Backend,
    num_workers: typing.Optional[int],
    verbose: bool,
):
    r"""Load media from backend."""
    # figure out archives
    archives = set()
    archive_names = set()
    for file in media:
        archive_name = deps.archive(file)
        archive_version = deps.version(file)
        archives.add((archive_name, archive_version))
        archive_names.add(archive_name)
    # collect all files that will be extracted,
    # if we have more files than archives
    if len(deps.files) > len(deps.archives):
        files = list()
        for file in deps.media:
            archive = deps.archive(file)
            if archive in archive_names:
                files.append(file)
        media = files

    # create folder tree to avoid race condition
    # in os.makedirs when files are unpacked
    # using multi-processing
    db_root_tmp = database_tmp_root(db_root)
    utils.mkdir_tree(media, db_root)
    utils.mkdir_tree(media, db_root_tmp)

    def job(archive: str, version: str):
        archive = backend.join(
            "/",
            name,
            define.DEPEND_TYPE_NAMES[define.DependType.MEDIA],
            archive + ".zip",
        )
        # extract and move all files that are stored in the archive,
        # even if only a single file from the archive was requested
        files = backend.get_archive(
            archive,
            db_root_tmp,
            version,
            tmp_root=db_root_tmp,
        )
        for file in files:
            if os.name == "nt":  # pragma: no cover
                file = file.replace(os.sep, "/")
            if flavor is not None:
                bit_depth = deps.bit_depth(file)
                channels = deps.channels(file)
                sampling_rate = deps.sampling_rate(file)
                src_path = os.path.join(db_root_tmp, file)
                file = flavor.destination(file)
                dst_path = os.path.join(db_root_tmp, file)
                flavor(
                    src_path,
                    dst_path,
                    src_bit_depth=bit_depth,
                    src_channels=channels,
                    src_sampling_rate=sampling_rate,
                )
                if src_path != dst_path:
                    os.remove(src_path)
            audeer.move_file(
                os.path.join(db_root_tmp, file),
                os.path.join(db_root, file),
            )

    audeer.run_tasks(
        job,
        params=[([archive, version], {}) for archive, version in archives],
        num_workers=num_workers,
        progress_bar=verbose,
        task_description="Load media",
    )

    audeer.rmdir(db_root_tmp)

There the backend needs to be opened only once, and it is then reused with solution 1. What solution 2 does on top is to reuse it also when the header, the dependency table, the attachments, and the tables are loaded, whereas in solution 1 each of those requires at least one additional authentication. But if you have 1000 media files, where solution 1 already saves 999 authentications, saving another 5 with solution 2 does of course not provide much benefit.
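For illustration, the call pattern of solution 2 boils down to something like the following sketch; all names are placeholders and FakeBackend only stands in for an opened audbackend backend:

class FakeBackend:
    """Stand-in for a backend that authenticates in open()."""

    def __init__(self):
        self.authentications = 0

    def open(self):
        self.authentications += 1  # the single authentication


def _load_header(backend):
    assert backend.authentications == 1  # reuses the open backend


def _load_tables(backend):
    assert backend.authentications == 1


def _load_media(backend):
    assert backend.authentications == 1


def load():
    backend = FakeBackend()
    backend.open()         # connect/authenticate once ...
    _load_header(backend)  # ... and pass the open backend on
    _load_tables(backend)
    _load_media(backend)
    return backend


assert load().authentications == 1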


hagenw commented on August 23, 2024

We have now drastically reduced the connections to the server, as we only need to connect once for downloading all media files and once for downloading all tables.
But the solution also seems to have a downside: on the public Artifactory server we see that the download always fails after some time (#409). This error does not happen on our internal Artifactory server.


hagenw commented on August 23, 2024

As shown in #387 (comment), the current reduction of connections to the server already seems sufficient. Instead, we should try to solve #409 before spending more time on reducing the connections even further. I will close this issue for now.

