
Comments (10)

frankenjoe commented on August 23, 2024

to store the opened backend in a global variable,

Not sure if this is possible, since the files of a database can be distributed across several repositories.


hagenw commented on August 23, 2024

We do not support distribution across several repositories at the moment, compare #233
and

audb/audb/core/load.py

Lines 694 to 707 in 069cc04

if missing_files:
    if backend is None:
        backend = lookup_backend(db.name, version)
    if files_type == "media":
        _get_media_from_backend(
            db.name,
            missing_files,
            db_root,
            flavor,
            deps,
            backend,
            num_workers,
            verbose,
        )


hagenw commented on August 23, 2024

But you have of course a valid point: it might be a good idea to support loading data from different repositories. This should then be done in a way that does not require establishing a connection to all backends for every single file.
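A minimal sketch of how that could look, assuming we cache one opened backend per repository; the get_backend() helper and the module-level dictionary are purely illustrative and not audb's actual implementation, while audbackend.backend.Artifactory and open() are the audbackend 2.0.0 calls shown later in this thread:

import audbackend


# Hypothetical module-level cache: one opened backend per (host, repository)
_backends = {}


def get_backend(host: str, repository: str):
    """Return an opened backend, connecting at most once per repository."""
    key = (host, repository)
    if key not in _backends:
        backend = audbackend.backend.Artifactory(host, repository)
        backend.open()  # authenticate only on first access
        _backends[key] = backend
    return _backends[key]

This way, iterating over many files would reuse the already opened backend of their repository instead of re-authenticating for every file.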


ChristianGeng commented on August 23, 2024

ones

For me it started making sense when I replaced this with "once". Conceptually this makes a lot of sense: if you work with database connections this will also be beneficial for speed, and it results in a much nicer design (and you do not have to pass around credentials all the time either).

How to implement such a "one-instance-only" feature is also a good question. What comes to my mind first are concepts like the singleton design pattern (or another creational pattern, e.g. Borg, I think). A quick Google search suggests that this is a matter of debate, see e.g. here. This post suggests a slight preference for creational patterns but also states that "good design is as little design as possible".
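For illustration, a minimal Borg-style sketch (shared state instead of a single instance); the class and attribute names are made up, and the connected flag only stands in for an actual open()/authenticate call:

class BackendConnection:
    """Borg pattern: every instance shares the same state dictionary."""

    _shared_state = {}

    def __init__(self, host, repository):
        self.__dict__ = self._shared_state
        if not getattr(self, "connected", False):
            # Placeholder for the actual connection/authentication
            self.host = host
            self.repository = repository
            self.connected = True


# Both "instances" share one connection state
a = BackendConnection("https://audeering.jfrog.io/artifactory", "data-public")
b = BackendConnection("https://audeering.jfrog.io/artifactory", "data-public")
assert a is not b and a.__dict__ is b.__dict__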

I am currently uncertain how the backend_interface interfaces are constructed: ideally, I believe, the connection would be created and the authentication dealt with once you iterate over config.REPOSITORIES, but I haven't understood the exact mechanism yet. I can try to dig into this and read up more.

I see another caveat in instantiating a connection only once: server-side request-response timeouts, which probably exist not only for REST APIs but also for RDBMS systems.


hagenw commented on August 23, 2024

ones

Thanks, I corrected it to "once".

I see another caveat in instantiating a connection only once: server-side request-response timeouts

Good question, I don't know how long you can use an ArtifactoryPath object before you need to re-authenticate. The ideal design in this case would be that the corresponding backend object from audbackend tries to reconnect when a timeout/authentication error is raised by the backend. The problem is that this requires the error raised by the backend to be distinguishable from other errors.
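A sketch of that reconnect idea, assuming the backend exposes open() and close() and raises a dedicated error type for expired sessions; AuthenticationError and the decorator are hypothetical and not part of audbackend:

import functools


class AuthenticationError(Exception):
    """Hypothetical error a backend would raise on timeout/re-authentication."""


def reconnect_on_auth_error(method):
    """Retry a backend method once after re-opening the connection."""

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        try:
            return method(self, *args, **kwargs)
        except AuthenticationError:
            self.close()  # assumed to exist on the backend object
            self.open()   # re-authenticate and try again
            return method(self, *args, **kwargs)

    return wrapper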


hagenw commented on August 23, 2024

In #388 I updated the implementation that uses audbackend 2.0.0 (#386, solution 1) to pass an open backend on from function to function (solution 2). Solution 1 also reuses an already opened backend when downloading media files, as this happens anyway within a single function.

I then benchmarked the different solutions on compute5 for loading emodb to an empty cache folder.

Implementation Execution time
main branch 20 s
solution 1 11 s
solution 2 11 s
Benchmark code
import time

import audb
import audeer


cache = "./cache"
audeer.rmdir(cache)

t0 = time.time()
db = audb.load(
    "emodb",
    version="1.4.1",
    num_workers=4,
    verbose=False,
    cache_root=cache,
)
t = time.time() - t0
print(f"Execution time: {t:.2f} s")

So it seems that it will not be that important to reuse the same connection when loading the header and dependency files of a dataset, as most of the time is spent on loading media files and tables anyway. But it would of course still be nice to have a good design.


ChristianGeng commented on August 23, 2024

In #388 I updated the implementation that uses audbackend 2.0.0 (#386, solution 1) to pass an open backend on from function to function (solution 2). Solution 1 also reuses an already opened backend when downloading media files, as this happens anyway within a single function.

I then benchmarked the different solutions on compute5 for loading emodb to an empty cache folder.

Implementation Execution time
main branch 20 s
solution 1 11 s
solution 2 11 s

So improving connection handling further does not buy anything. I was trying to find the code locations and technical documentation responsible for the real speed advantage (down to 11 s) but was not successful.


hagenw commented on August 23, 2024

In audeering/audbackend#215, under the section "Speedup audbackend.backend.Artifactory", the following code example is given:

import time

import audbackend


backend = audbackend.backend.Artifactory("https://audeering.jfrog.io/artifactory", "data-public")
backend.open()
interface = audbackend.interface.Maven(backend)

t0 = time.time()
interface.exists("/emodb", "1.4.1")
interface.exists("/emodb/db", "1.4.1")
interface.exists("/emodb/meta", "1.4.1")
interface.exists("/emodb/media", "1.4.1")
t = time.time()
print(f"{t - t0:.3f} s")

This simply asks for four different paths whether the location exists on the backend. With the previous implementation (referenced above as main branch) this takes 0.872 s, as every call creates an artifactory.ArtifactoryPath that authenticates at the backend. With the new implementation (solution 1) it only authenticates when backend.open() is called and takes 0.178 s.

The reason why solution 2 doesn't provide much improvement for audb is that most time is spent on downloading media files, which is done within a single function for most files, compare

audb/audb/core/load.py

Lines 401 to 485 in 069cc04

def _get_media_from_backend(
    name: str,
    media: typing.Sequence[str],
    db_root: str,
    flavor: typing.Optional[Flavor],
    deps: Dependencies,
    backend: audbackend.Backend,
    num_workers: typing.Optional[int],
    verbose: bool,
):
    r"""Load media from backend."""
    # figure out archives
    archives = set()
    archive_names = set()
    for file in media:
        archive_name = deps.archive(file)
        archive_version = deps.version(file)
        archives.add((archive_name, archive_version))
        archive_names.add(archive_name)
    # collect all files that will be extracted,
    # if we have more files than archives
    if len(deps.files) > len(deps.archives):
        files = list()
        for file in deps.media:
            archive = deps.archive(file)
            if archive in archive_names:
                files.append(file)
        media = files

    # create folder tree to avoid race condition
    # in os.makedirs when files are unpacked
    # using multi-processing
    db_root_tmp = database_tmp_root(db_root)
    utils.mkdir_tree(media, db_root)
    utils.mkdir_tree(media, db_root_tmp)

    def job(archive: str, version: str):
        archive = backend.join(
            "/",
            name,
            define.DEPEND_TYPE_NAMES[define.DependType.MEDIA],
            archive + ".zip",
        )
        # extract and move all files that are stored in the archive,
        # even if only a single file from the archive was requested
        files = backend.get_archive(
            archive,
            db_root_tmp,
            version,
            tmp_root=db_root_tmp,
        )
        for file in files:
            if os.name == "nt":  # pragma: no cover
                file = file.replace(os.sep, "/")
            if flavor is not None:
                bit_depth = deps.bit_depth(file)
                channels = deps.channels(file)
                sampling_rate = deps.sampling_rate(file)
                src_path = os.path.join(db_root_tmp, file)
                file = flavor.destination(file)
                dst_path = os.path.join(db_root_tmp, file)
                flavor(
                    src_path,
                    dst_path,
                    src_bit_depth=bit_depth,
                    src_channels=channels,
                    src_sampling_rate=sampling_rate,
                )
                if src_path != dst_path:
                    os.remove(src_path)
            audeer.move_file(
                os.path.join(db_root_tmp, file),
                os.path.join(db_root, file),
            )

    audeer.run_tasks(
        job,
        params=[([archive, version], {}) for archive, version in archives],
        num_workers=num_workers,
        progress_bar=verbose,
        task_description="Load media",
    )

    audeer.rmdir(db_root_tmp)

There the backend needs to be opened only once, and it is then reused with solution 1. What solution 2 does on top is to reuse it also when the header, the dependency table, the attachments, and the tables are loaded, whereas in solution 1 each of those requires at least one additional authentication. But if you have 1000 media files, where solution 1 already saves 999 authentications, saving another 5 with solution 2 does of course not provide much benefit.
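For illustration, the call pattern of solution 2 boils down to something like the following sketch; all names are placeholders and FakeBackend only stands in for an opened audbackend backend:

class FakeBackend:
    """Stand-in for a backend that authenticates in open()."""

    def __init__(self):
        self.authentications = 0

    def open(self):
        self.authentications += 1  # the single authentication


def _load_header(backend):
    assert backend.authentications == 1  # reuses the open backend


def _load_tables(backend):
    assert backend.authentications == 1


def _load_media(backend):
    assert backend.authentications == 1


def load():
    backend = FakeBackend()
    backend.open()         # connect/authenticate once ...
    _load_header(backend)  # ... and pass the open backend on
    _load_tables(backend)
    _load_media(backend)
    return backend


assert load().authentications == 1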


hagenw commented on August 23, 2024

We have now drastically reduced the connections to the server, as we only need to connect once for downloading all media files and once for downloading all tables.
But the solution also seems to have a downside: on the public Artifactory server we see that the download always fails after some time (#409). This error does not happen on our internal Artifactory server.


hagenw commented on August 23, 2024

As shown in #387 (comment), the current reduction of connections to the server already seems sufficient. Instead, we should try to solve #409 before spending more time on reducing the connections even further. I will close this issue for now.

