Comments (10)
to store the opened backend in a global variable,
Not sure if possible since files of a database can be distributed across several repositories.
from audb.
We do not support distribution across several repositories at the moment, compare #233 and lines 694 to 707 in 069cc04.
from audb.
But you have of course a valid point, it might be a good idea to support loading data from different repositories. But this should also be done in a way that we do not have to establish a connection to all backends for every single file.
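One way to avoid connecting to all backends for every single file is to open a backend lazily and keep it in a cache keyed by repository. The following is only a minimal sketch of that idea: `Repository`, `FakeBackend`, and `get_backend` are hypothetical names for illustration and are not part of audb or audbackend.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Repository:
    """Hypothetical stand-in for a repository entry (name and host)."""

    name: str
    host: str


class FakeBackend:
    """Hypothetical backend that counts how often a connection is opened."""

    opened = 0

    def __init__(self, repository):
        self.repository = repository

    def open(self):
        FakeBackend.opened += 1


# Module-level cache: maps a repository to its already opened backend
_open_backends = {}


def get_backend(repository):
    """Return the opened backend for a repository, connecting on first use only."""
    if repository not in _open_backends:
        backend = FakeBackend(repository)
        backend.open()
        _open_backends[repository] = backend
    return _open_backends[repository]


repos = [Repository("data-public", "host-a"), Repository("data-local", "host-b")]

# 100 files that all live in the first repository need a single connection;
# the second repository is never contacted
for _ in range(100):
    get_backend(repos[0])
print(FakeBackend.opened)
```

With such a cache a connection is only established for repositories that are actually needed, and at most once per repository.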
from audb.
ones
For me it started making sense when I replaced this with "once". Conceptually this makes a lot of sense. If you work with database connections this will also be beneficial for speed, and it results in a much nicer design (and you do not have to pass around credentials all the time either).
The question of how to implement a "one-instance-only" feature is also a good one. What comes to my mind first are concepts like the singleton design pattern (or another creational pattern, e.g. Borg I think). A quick Google search suggests that this is a matter of debate, like e.g. here. This post suggests a slight preference for creational patterns but also states that "good design is as little design as possible".
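For reference, the two patterns mentioned here could look roughly as follows in Python. This is a generic sketch of the patterns themselves, not a proposal for the audb code base; the `connections` attribute is a made-up placeholder for whatever shared state a backend would hold.

```python
class Singleton:
    """Classic singleton: every call to the class returns the same object."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.connections = 0
        return cls._instance


class Borg:
    """Borg (monostate): instances differ, but all share one attribute dict."""

    _shared_state = {}

    def __init__(self):
        self.__dict__ = self._shared_state
        self.__dict__.setdefault("connections", 0)


a, b = Singleton(), Singleton()
a.connections += 1
print(a is b, b.connections)  # True 1

c, d = Borg(), Borg()
c.connections += 1
print(c is d, d.connections)  # False 1
```

In both cases the state is shared; the difference is only whether object identity is shared as well.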
I am currently uncertain how the `backend_interface` interfaces are constructed: I believe that ideally the connection would be created and authentication be dealt with once you iterate over `config.REPOSITORIES`, but I haven't understood the exact mechanisms yet. I can try to dig into this and read up more.
I see another additional caveat in instantiating a connection only once: server-side request-response timeouts, which probably exist not only for REST APIs but also for RDBMSs.
from audb.
ones
Thanks, I corrected it to "once".
I see another additional caveat in instantiating a connection only once: server-side request-response timeouts
Good question, I don't know how long you can use an `ArtifactoryPath` object without the need to re-authenticate. The ideal design in this case would be that the corresponding backend object from `audbackend` tries to reconnect when an error regarding timeout/authentication is raised by the backend. The problem is that this requires that the error raised by the backend is different from other errors.
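A minimal sketch of such a reconnect logic, assuming the backend raises a dedicated error for expired sessions. `SessionExpiredError`, `FlakyBackend`, and `call_with_reconnect` are hypothetical names used for illustration only; they do not exist in audbackend.

```python
class SessionExpiredError(Exception):
    """Hypothetical error a backend raises for timeout/authentication failures."""


class FlakyBackend:
    """Hypothetical backend whose session can expire after being opened."""

    def __init__(self):
        self.open_calls = 0
        self._expired = False

    def open(self):
        self.open_calls += 1
        self._expired = False

    def expire(self):
        self._expired = True

    def exists(self, path):
        if self._expired:
            raise SessionExpiredError("session timed out")
        return True


def call_with_reconnect(backend, method, *args, retries=1):
    """Call a backend method, reopening the connection if the session expired."""
    for attempt in range(retries + 1):
        try:
            return getattr(backend, method)(*args)
        except SessionExpiredError:
            if attempt == retries:
                raise
            backend.open()  # re-authenticate, then try again


backend = FlakyBackend()
backend.open()
backend.expire()  # simulate a server-side timeout
result = call_with_reconnect(backend, "exists", "/emodb")
print(result, backend.open_calls)  # True 2
```

Only `SessionExpiredError` triggers a reconnect, all other exceptions propagate unchanged. This is exactly why the error needs to be distinguishable from other errors.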
from audb.
I updated the implementation of using `audbackend` 2.0.0 (#386, solution 1) in #388 to pass on an open backend from function to function (solution 2). Solution 1 also reuses an already opened backend when downloading media files, as this happens anyway within a single function.
I then benchmarked the different solutions on `compute5` for loading `emodb` to an empty cache folder.
Implementation | Execution time |
---|---|
main branch | 20 s |
solution 1 | 11 s |
solution 2 | 11 s |
Benchmark code
```python
import time

import audb
import audeer

# Start from an empty cache folder
cache = "./cache"
audeer.rmdir(cache)

# Measure the time needed to load the database
t0 = time.time()
db = audb.load(
    "emodb",
    version="1.4.1",
    num_workers=4,
    verbose=False,
    cache_root=cache,
)
t = time.time() - t0
print(f"Execution time: {t:.2f} s")
```
So it seems that reusing the same connection when loading the header and dependency files of a dataset is not that important anyway, as most of the time is spent on loading media files and tables. But it would of course still be nice to have a good design.
from audb.
I updated the implementation of using `audbackend` 2.0.0 (#386, solution 1) in #388 to pass on an open backend from function to function (solution 2). Solution 1 also reuses an already opened backend when downloading media files, as this happens anyway within a single function.
I then benchmarked the different solutions on `compute5` for loading `emodb` to an empty cache folder.
Implementation | Execution time |
---|---|
main branch | 20 s |
solution 1 | 11 s |
solution 2 | 11 s |
So improving connection handling does not buy anything. I was trying to find the code locations and technical documentation that are responsible for the real speed advantage (down to 11 s), but was not successful.
from audb.
In audeering/audbackend#215 under section "Speedup audbackend.backend.Artifactory" there is a code example given:
```python
import time

import audbackend

# Open the backend once, which authenticates at the server
backend = audbackend.backend.Artifactory("https://audeering.jfrog.io/artifactory", "data-public")
backend.open()
interface = audbackend.interface.Maven(backend)

# Measure four consecutive requests on the opened backend
t0 = time.time()
interface.exists("/emodb", "1.4.1")
interface.exists("/emodb/db", "1.4.1")
interface.exists("/emodb/meta", "1.4.1")
interface.exists("/emodb/media", "1.4.1")
t = time.time()
print(f"{t - t0:.3f} s")
```
This simply asks for four different paths whether the location exists on the backend. With the previous implementation (referenced above as `main` branch) this takes 0.872 s, as every call creates an `artifactory.ArtifactoryPath` that authenticates at the backend. With the new implementation (solution 1) it only authenticates when `backend.open()` is called and takes 0.178 s.
The reason why solution 2 doesn't provide much improvement for `audb` is that most of the time is spent on downloading media files, which is done within a single function for most files, compare lines 401 to 485 in 069cc04.
There the backend needs to be opened only once, and then it is reused with solution 1. What solution 2 does on top is to reuse it also when the header, the dependency table, the attachments, and the tables are loaded, whereas in solution 1 all of those require at least one other authentication each. But if you have 1000 media files, which saves 999 authentications with solution 1, saving another 5 with solution 2 does of course not provide much benefit.
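The argument can be made concrete with a small back-of-the-envelope count. The sketch below assumes one authentication per media file on the `main` branch, a single authentication for all media files with solution 1, and five additional loading steps (header, dependency table, attachments, tables) that each authenticate exactly once unless the backend is passed along. Those numbers are simplifying assumptions taken from the example above, not measurements.

```python
def count_authentications(num_media, num_other_steps):
    """Estimate the number of authentications per implementation."""
    return {
        # every media file and every other step authenticates on its own
        "main": num_media + num_other_steps,
        # one authentication for all media files, one per other step
        "solution 1": 1 + num_other_steps,
        # a single opened backend is passed from function to function
        "solution 2": 1,
    }


counts = count_authentications(num_media=1000, num_other_steps=5)
for name, n in counts.items():
    share = n / counts["main"]
    print(f"{name}: {n} authentications ({share:.1%} of main)")
```

Under these assumptions solution 1 already removes 999 of the 1005 authentications, so the remaining 5 that solution 2 saves hardly matter.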
from audb.
We have now drastically reduced the connections to the server, as we need to connect only once for downloading all media files, and only once for downloading all tables.
But the solution also seems to have a downside. On the public Artifactory server we see that the download always fails after some time: #409. This error does not happen on our internal Artifactory server.
from audb.
As shown in #387 (comment), the current reduction of connections to the server already seems sufficient. Instead, we should try to solve #409 before spending more time on reducing the connections even further. I will close this issue for now.
from audb.