iterative / dvc-data
DVC's data management subsystem
Home Page: https://dvc.org
License: Apache License 2.0
The current diff() is pretty naive: it just lists all keys and then generates differences. We need to walk and diff instead, so that we can propagate hierarchical status (e.g. unknown in `dvc data status`) and stop early when dir hashes match (e.g. if a dataset has the same `.dir` md5 in both indexes, there is no point in walking into it and we can short-circuit quickly).
Needed to finish migrating `dvc data status` to index for iterative/dvc#8761
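The short-circuiting walk can be sketched with a toy tree. This is a minimal illustration, not the dvc-data API: `Node` and `walk_diff` are hypothetical names, and a directory's `.dir` hash stands in for real dir objects. Subtrees whose hashes match are skipped entirely.

```python
# Hypothetical sketch of a walk-based diff that short-circuits on matching
# directory hashes; Node/walk_diff are illustrative, not dvc-data API.
from typing import Dict, Iterator, Optional, Tuple


class Node:
    def __init__(self, hash_: str, children: Optional[Dict[str, "Node"]] = None):
        self.hash = hash_  # file md5 or directory .dir md5
        self.children = children or {}


def walk_diff(
    old: Optional[Node],
    new: Optional[Node],
    prefix: Tuple[str, ...] = (),
) -> Iterator[Tuple[Tuple[str, ...], str]]:
    # Short-circuit: identical (.dir) hashes mean identical subtrees.
    if old is not None and new is not None and old.hash == new.hash:
        return
    if old is None:
        yield prefix, "added"
    elif new is None:
        yield prefix, "deleted"
    else:
        yield prefix, "modified"
    keys = set(old.children if old else {}) | set(new.children if new else {})
    for key in sorted(keys):
        yield from walk_diff(
            old.children.get(key) if old else None,
            new.children.get(key) if new else None,
            prefix + (key,),
        )
```

Because `walk_diff` is a generator, a caller can also stop after the first yielded change, which gives the "propagate status and stop early" behaviour for free.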
- `FileSystem`. (@skshetry)
- `CallbackMixin`, since callbacks are now generally supported. (@skshetry)
- `NoDirectoriesMixin`, merge with `HTTPFileSystem`. (@skshetry)
- `to_json`/`from_json` instead of `config`. (@skshetry)
- `fs.utils`/`dvc.utils.fs`/`System` (low priority)
- `Cloud` fixtures, so that they can work with any `fsspec`-compatible filesystems by default.
- `Cloud.get_url()` and `Cloud.config` out of it. (low priority)

The latest release renamed the cli from `dvc-data` to `dvc`, conflicting with https://github.com/iterative/dvc. Assuming that was just a typo?
Line 62 in 4ccb420
Currently this is the only user of those objects, and it requires manual assignment to every entry in a tree, which is very costly and rather pointless. We could assign `fs` and `path` instead, but that would only create a similar problem (which actually already exists too). We should probably just introduce some kind of factory/callback/map that would generate an fs/path pair from an entry.
Another way of approaching this could be to supply those factories to the index (it uses odb/remote to lazy-load directories anyway).
Related to iterative/dvc#8827
Hello,
I described a bug in the dvc issue iterative/dvc#8420 that would be more relevant here. Can you check it?
We currently use an in-memory prefix trie, but for large enough indexes it would be much nicer to be able to use a proper db, and also to use it for operations like `diff` more efficiently (e.g. directly in a sql query instead of fetching and comparing in python).
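As a rough illustration of the db idea, here is a sqlite sketch that computes a diff entirely in SQL. The table and column names (`old_idx`, `new_idx`, `key`, `md5`) are assumptions for the example, not the actual dvc-data schema.

```python
# Illustrative only: compute added/deleted/modified keys in SQL instead of
# loading both indexes into Python and comparing there.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE old_idx (key TEXT PRIMARY KEY, md5 TEXT);
    CREATE TABLE new_idx (key TEXT PRIMARY KEY, md5 TEXT);
    """
)
con.executemany("INSERT INTO old_idx VALUES (?, ?)", [("a", "1"), ("b", "2")])
con.executemany("INSERT INTO new_idx VALUES (?, ?)", [("b", "3"), ("c", "4")])

# FULL OUTER JOIN emulated via LEFT JOINs + UNION ALL (older sqlite lacks it).
diff_rows = con.execute(
    """
    SELECT o.key, 'deleted' FROM old_idx o
      LEFT JOIN new_idx n ON o.key = n.key WHERE n.key IS NULL
    UNION ALL
    SELECT n.key, 'added' FROM new_idx n
      LEFT JOIN old_idx o ON n.key = o.key WHERE o.key IS NULL
    UNION ALL
    SELECT o.key, 'modified' FROM old_idx o
      JOIN new_idx n ON o.key = n.key WHERE o.md5 != n.md5
    """
).fetchall()
```

The whole comparison runs inside the database engine, so for a large index the Python side only ever sees the changed rows.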
Both the `ref` object and `refodb` are great: they allow us to read and write to this virtual odb as if we were dealing with regular `HashFile` objects. But the problem is that we can't actually work with `ref` objects as `ref` objects and do things like `transfer` them from `memodb` to `localodb` (e.g. if we want to make them persistent).
We should better separate refobj/refodb from the underlying rawobj/rawodb and provide easy access to them.
Capturing *link info will allow us to be smarter about deciding whether we need to relink stuff (e.g. in a subsequent "noop" `dvc add`) and greatly improve performance there.
Important: don't forget that we should not write those to dvc files on the dvc side (at least for now, to preserve current behaviour).
Similar to iterative/scmrepo#81 (comment), I was wondering if we could have a faster version of status/diff that would return early if things are modified in the repo. This might be useful in non-granular status/diff.
Ideally it would do staging and diffing together, in a generator, so that one can be piped into the other and iterated together.
With #208 implemented, a daemon could even keep writing the index so that we don't have to rebuild anything at all when we need to use it in dvc.
Should probably be on top of existing diff, similar to https://github.com/jelmer/dulwich/blob/90ff89e3254054c7bfc723a201e64398c441831e/dulwich/diff_tree.py#L192
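The "stage and diff together, piped as generators" idea can be sketched as below. Everything here is hypothetical scaffolding (`stage_entries`, `diff_entries`, `is_dirty`, and using the raw value as a stand-in for a hash), not the dvc-data implementation.

```python
# Sketch: staging and diffing as chained generators so the consumer can stop
# at the first change. Names and the toy "hash" are illustrative assumptions.
from typing import Dict, Iterator, Tuple


def stage_entries(workspace: Dict[str, str]) -> Iterator[Tuple[str, str]]:
    # Lazily "stage" entries; here the stored text doubles as the hash.
    for key in sorted(workspace):
        yield key, workspace[key]


def diff_entries(old: Dict[str, str], staged: Iterator[Tuple[str, str]]) -> Iterator[str]:
    seen = set()
    for key, h in staged:
        seen.add(key)
        if key not in old:
            yield f"added: {key}"
        elif old[key] != h:
            yield f"modified: {key}"
    for key in sorted(old.keys() - seen):
        yield f"deleted: {key}"


def is_dirty(old: Dict[str, str], workspace: Dict[str, str]) -> bool:
    # Pulling a single item from the pipeline gives the early return:
    # staging stops as soon as the first difference is found.
    return next(diff_entries(old, stage_entries(workspace)), None) is not None
```

Because `diff_entries` consumes `stage_entries` lazily, `is_dirty` only stages as many entries as it takes to find the first change.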
Currently we only support loading `.dir` objects from `ObjectStorage`, but directories can be stored in `FileStorage` too (e.g. in gitfs or in some backed-up location), and we should be able to load those as well.
Required for iterative/dvc#8789, because cloud versioning imports are `FileStorage`-based, and once you support chained imports you have to be able to dynamically load the index on each level of the chain from either `ObjectStorage` or `FileStorage`.
E.g. paths with `/` in them are perfectly legal on Windows, but will mess up our data structures.
`restore` is kind of the opposite of `save`: given an index with hashes, it needs to restore it using the odb into a virtual dataset. This is extremely useful for those in-between states where we don't yet have a real workspace to work with (e.g. we didn't checkout your dataset) but want to virtually reconstruct it using the cache. This allows us to operate on datasets no matter how they are actually stored (e.g. a real dataset on s3 vs a dvc-cached dataset).
Another example, admittedly a bit unrelated, is dvc's `run-cache` thing that goes on an out-by-out basis trying to fetch everything one-by-one; with `restore` functionality we could virtually build a dataset using the run-cache.
Needed for iterative/dvc#8761, because there we have to compare a virtually restored dataset with a real one on the cloud.
Think of `restore` as building the index we would get out of a dataset that we've actually tried to `checkout` from cache. E.g. if some cache files were missing, those files will be missing from the workspace.
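The restore semantics above can be sketched in a few lines. This is a toy model, assuming an index as a `{key: md5}` mapping and an odb as a `{md5: content}` mapping; the real objects are of course richer.

```python
# Minimal sketch of "restore": reconstruct a virtual dataset from an index
# plus an odb, without touching any real workspace. Entries missing from the
# cache are simply absent from the result, mirroring a checkout with misses.
from typing import Dict, Optional


def restore(index: Dict[str, str], odb: Dict[str, bytes]) -> Dict[str, bytes]:
    virtual: Dict[str, bytes] = {}
    for key, md5 in index.items():
        content: Optional[bytes] = odb.get(md5)
        if content is not None:  # skip entries whose cache files are missing
            virtual[key] = content
    return virtual
```

The point of the sketch is the contract, not the code: the result behaves like a checked-out dataset, so the same comparison logic can run against it and against a real dataset on the cloud.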
This is non-critical, but it might be nice to fix.
It seems that if you run `dvc migrate` on a repo that doesn't contain any 2.0 structure, then an error is raised in `dvc_data/hashfile/db/migrate.py`:
2023-09-07 17:26:54,828 ERROR: unexpected error - not enough values to unpack (expected 2, got 0)
Traceback (most recent call last):
File "/home/joncrall/.pyenv/versions/3.11.2/envs/pyenv3.11.2/lib/python3.11/site-packages/dvc/cli/__init__.py", line 209, in main
ret = cmd.do_run()
^^^^^^^^^^^^
File "/home/joncrall/.pyenv/versions/3.11.2/envs/pyenv3.11.2/lib/python3.11/site-packages/dvc/cli/command.py", line 26, in do_run
return self.run()
^^^^^^^^^^
File "/home/joncrall/.pyenv/versions/3.11.2/envs/pyenv3.11.2/lib/python3.11/site-packages/dvc/commands/cache.py", line 44, in run
migrate_2_to_3(self.repo, dry=self.args.dry)
File "/home/joncrall/.pyenv/versions/3.11.2/envs/pyenv3.11.2/lib/python3.11/site-packages/dvc/cachemgr.py", line 135, in migrate_2_to_3
migration = prepare(src, dest, callback=cb)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/joncrall/.pyenv/versions/3.11.2/envs/pyenv3.11.2/lib/python3.11/site-packages/dvc_data/hashfile/db/migrate.py", line 55, in prepare
paths, oids = zip(*executor.imap_unordered(func, src_paths))
^^^^^^^^^^^
ValueError: not enough values to unpack (expected 2, got 0)
It looks like you probably just need to run:

    items = list(executor.imap_unordered(func, src_paths))
    if items:
        paths, oids = zip(*items)
    else:
        paths, oids = [], []

instead.
Here is a MWE to reproduce:

    import ubelt as ub


    def simple_demo_repo(dvc_root):
        """
        Build a simple repo using only standard dvc commands for upstream MWEs
        """
        # Build in a staging area first
        assert not dvc_root.exists(), 'directory must not exist yet'
        dvc_root.ensuredir()

        def cmd(command):
            return ub.cmd(command, cwd=dvc_root, verbose=2, system=True)

        cmd('git init')
        cmd('dvc init')
        cmd('dvc config core.autostage true')
        cmd('dvc config cache.type symlink,reflink,hardlink,copy')
        cmd('dvc config cache.protected true')
        cmd('dvc config core.analytics false')
        cmd('dvc config core.check_update false')

        # Build basic data
        (dvc_root / 'test-set1').ensuredir()
        assets_dpath = (dvc_root / 'test-set1/assets').ensuredir()
        for idx in range(1, 21):
            fpath = assets_dpath / f'asset_{idx:03d}.data'
            fpath.write_text(str(idx) * 100)
        manifest_fpath = (dvc_root / 'test-set1/manifest.txt')
        manifest_fpath.write_text('pretend-data')
        root_fpath = dvc_root / 'root_file'
        root_fpath.write_text('----' * 100)

        cmd(f'dvc add {root_fpath}')
        cmd(f'dvc add {manifest_fpath}')
        cmd(f'dvc add {assets_dpath}')
        cmd('git commit -am "initial commit"')


    def mwe():
        # Build a simple fresh dvc repo
        dvc_root = ub.Path.appdir('simpledvc', 'simple_demo')
        dvc_root.delete()
        simple_demo_repo(dvc_root)
        _ = ub.cmd('dvc cache migrate -vvv', cwd=dvc_root, verbose=3, system=True)
DVC doctor:
(pyenv3.11.2) joncrall@toothbrush:~/.cache/simpledvc/simple_demo$ dvc doctor
DVC version: 3.19.0 (pip)
-------------------------
Platform: Python 3.11.2 on Linux-6.2.0-32-generic-x86_64-with-glibc2.35
Subprojects:
dvc_data = 2.16.0
dvc_objects = 1.0.1
dvc_render = 0.5.3
dvc_task = 0.3.0
scmrepo = 1.3.1
Supports:
azure (adlfs = 2023.4.0, knack = 0.10.1, azure-identity = 1.12.0),
gdrive (pydrive2 = 1.15.4),
gs (gcsfs = 2023.6.0),
hdfs (fsspec = 2023.6.0, pyarrow = 11.0.0),
http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
oss (ossfs = 2021.8.0),
s3 (s3fs = 2023.6.0, boto3 = 1.26.76),
ssh (sshfs = 2023.4.1),
webdav (webdav4 = 0.9.8),
webdavs (webdav4 = 0.9.8),
webhdfs (fsspec = 2023.6.0)
Config:
Global: /home/joncrall/.config/dvc
System: /etc/xdg/xdg-ubuntu/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/vgubuntu-root
Caches: local
Remotes: None
Workspace directory: ext4 on /dev/mapper/vgubuntu-root
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/cba64d0f7628d6e7cf6a9216093a7519
To handle dir loading errors.
Don't forget the tests (maybe even for import/get/ls iterative/dvc#9785)
Hi,
I have a dvc repository with a total size of 1.2TB and about 300,000 files. I understand that with this many files, I cannot expect all dvc operations to be fast, but when I add one small file and perform dvc commit, it takes 3 minutes to finish. Furthermore, the console output during the commit seems a bit strange to me:
The empty output for two minutes confuses me, and the time it takes for whatever it is doing then seems a bit long to me.
To find out what it is doing in that time, I attached a debugger during that time with pyrasite and obtained this stacktrace:
File "dvc/__main__.py", line 7, in <module>
File "dvc/cli/__init__.py", line 185, in main
File "dvc/cli/command.py", line 22, in do_run
File "dvc/commands/commit.py", line 20, in run
File "dvc/repo/__init__.py", line 48, in wrapper
File "dvc/repo/commit.py", line 66, in commit
File "funcy/decorators.py", line 45, in wrapper
File "dvc/stage/decorators.py", line 43, in rwlocked
File "funcy/decorators.py", line 66, in __call__
File "dvc/stage/__init__.py", line 548, in commit
File "dvc/output.py", line 713, in commit
File "dvc/output.py", line 676, in _checkout
File "dvc_data/hashfile/checkout.py", line 274, in checkout
File "dvc_data/hashfile/checkout.py", line 221, in _checkout
File "dvc_data/hashfile/checkout.py", line 115, in _checkout_file
File "dvc_data/hashfile/state.py", line 107, in save
File "diskcache/core.py", line 823, in __setitem__
File "diskcache/core.py", line 796, in set
File "contextlib.py", line 142, in __exit__
File "diskcache/core.py", line 744, in _transact
Since dvc_data appears there, this is hopefully the right repository for this issue.
dvc version: 2.41.1
The hard drive is an SSD with xfs. Reflinks are enabled.
At the moment, there is no logging for `index.checkout`, which makes it harder to find what's failing during `checkout`.
E.g. `checkout` uses `entry.fs` and `entry.path`, which could be easily derived in a factory based on the corresponding outputs. E.g. if we have an output `a`, then the `("a",)`, `("a", "b")`, `("a", "b", "c")` etc. entries in the index all share the same output.fs/odb/remote, and the path is `join(output.path, *key)`.
This will remove the annoying need to fill up all of those fields when creating an index, and will make the index slimmer and easier to handle (e.g. serialize).
This is low priority, unless it gets in the way in particular scenarios. Just noting it down.
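The factory idea can be sketched as follows. `Output` here is a hypothetical stand-in (its `fs` is just a string for the example), not dvc's actual output class; the point is only that one closure per output replaces per-entry `fs`/`path` fields.

```python
# Illustrative factory that derives (fs, path) for an index entry key from its
# output, instead of storing fs/path on every entry. Output is a stand-in.
import posixpath
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Output:
    fs: str    # stand-in for a filesystem object
    path: str  # root path of the output


def make_fs_path_factory(output: Output) -> Callable[[Tuple[str, ...]], Tuple[str, str]]:
    def factory(key: Tuple[str, ...]) -> Tuple[str, str]:
        # Every entry under the output shares its fs; path is join(output.path, *key).
        return output.fs, posixpath.join(output.path, *key)
    return factory
```

An index could then store one factory per output and compute `entry.fs`/`entry.path` on demand, which also keeps serialized indexes free of per-entry filesystem state.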
Hello,
`dvc-objects` has just released the 1.4.x version, which causes an import error (iterative/dvc-objects#241):
ERROR: unexpected error - cannot import name 'umask' from 'dvc_objects.fs.system' (/opt/hostedtoolcache/Python/3.10.4/x64/lib/python3.10/site-packages/dvc_objects/fs/system.py)
I think this is because `umask` was removed in the new version of `dvc-objects`.
`dvc-data`: from dvc_objects.fs.system import umask
`dvc-objects` changes: iterative/dvc-objects@1.4.1...1.4.2

Currently, if we detect that some file is corrupted, we completely delete it, which takes quite a bit of time for large files and is also lossy, as it might be the last source of your useful data. We should just move the corrupted file instead (e.g. `.dvc/cache/12/345` -> `.dvc/cache/bad/12345`) so one could recover it if needed.
For the record: `bad` is like git lfs's `.git/lfs/bad`.
It'd be nice if we could figure out a way to optimize `Tree.from_list()`, which is taking more than 1s to load one `.dir` file.
2 0.017 0.009 2.621 1.310 __init__.py:23(load)
2 0.000 0.000 2.604 1.302 tree.py:175(load)
2 0.563 0.281 2.452 1.226 tree.py:152(from_list)
202605 0.090 0.000 1.433 0.000 <attrs generated init dvc_data.hashfile.diff.Change>:1(__init__)
202605 0.228 0.000 1.343 0.000 diff.py:36(_)
405207 1.044 0.000 1.191 0.000 meta.py:75(from_dict)
202604 0.060 0.000 0.816 0.000 _make.py:1718(__ne__)
202604 0.678 0.000 0.756 0.000 <attrs generated eq dvc_data.hashfile.diff.TreeEntry>:1(__eq__)
405210 0.492 0.000 0.621 0.000 diff.py:103(_in_cache)
405210 0.252 0.000 0.564 0.000 diff.py:94(_get)
42 0.001 0.000 0.444 0.011 __init__.py:1(<module>)
405207 0.231 0.000 0.326 0.000 hash_info.py:20(from_dict)
810421 0.209 0.000 0.299 0.000 diff.py:26(__bool__)
Some filesystems like s3/azure/etc. have built-in object versioning (e.g. you can always access the previous version of an s3 object by using its `version-id`), which means that
for us. From the odb perspective, the implementation will likely look a lot like the old `refdb`: dvc objects that reference a path with a version-id in it, but we won't need to validate them beyond ensuring that they exist, because the versions are immutable.
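A sketch of what such a reference object might look like; `VersionedRef` and its `url()` helper are hypothetical names for illustration, not the old `refdb` API. Since cloud object versions are immutable, the ref never needs re-hashing, only an existence check.

```python
# Illustrative ref to a versioned cloud object: path + version-id is enough to
# address immutable content, so no content validation is needed after creation.
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionedRef:
    path: str
    version_id: str

    def url(self) -> str:
        # s3-style versioned URL; other clouds use analogous query parameters.
        return f"{self.path}?versionId={self.version_id}"
```

Being frozen and hashable, such refs could be stored and deduplicated in an odb just like hash-addressed objects.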
We currently have a junky version of `fetch` based on `odb` that is not used anywhere. It was part of early experiments (not dvc exp) and is no longer needed.
In `dvc fetch` we currently do two things:
we need to take 2), make it dedup based on source fs/path, and download stuff into a temporary location (note that we are not talking about reproducing the structure of indexes there, but purely stashing data somewhere). This will allow us to download stuff optimally across different indexes (e.g. across different git revisions), which also means that `fetch` should probably accept multiple indexes and not just one. And it should probably update `storage_info.data` as a result.
State should be replaced by using the data index, which is easier to work with and easier to update. Note that this is not a 1-to-1 replacement, but rather requires working with data from the index point of view.
For example, in `state.get` we retrieve the entry for a particular path and then check if the recorded metadata matches the one from an actual `stat()`. With the index we should instead build a new index from the filesystem and then transfer md5s from the old index to the new index entries where the metadata matches. The latter is a pure sql operation that can be done more efficiently.
This is also important for NFS, to reduce the number of sqlite databases that we have to deal with.
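The "transfer md5s where metadata matches" step described above can be expressed as a single SQL UPDATE over two index tables. The schema (`key`, `mtime`, `size`, `md5`) is an assumption for the sketch, not dvc-data's actual state schema.

```python
# Sketch: carry hashes from the old index to a freshly built one in one SQL
# statement, instead of per-path Python lookups. Schema is illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE old_idx (key TEXT PRIMARY KEY, mtime REAL, size INT, md5 TEXT);
    CREATE TABLE new_idx (key TEXT PRIMARY KEY, mtime REAL, size INT, md5 TEXT);
    """
)
con.executemany("INSERT INTO old_idx VALUES (?, ?, ?, ?)",
                [("a", 1.0, 10, "aaa"), ("b", 2.0, 20, "bbb")])
# New index built from the filesystem: md5 unknown; "a" unchanged, "b" touched.
con.executemany("INSERT INTO new_idx VALUES (?, ?, ?, ?)",
                [("a", 1.0, 10, None), ("b", 3.0, 20, None)])

# Transfer md5 only where mtime and size match; otherwise it stays NULL and
# the file has to be re-hashed.
con.execute(
    """
    UPDATE new_idx SET md5 = (
        SELECT o.md5 FROM old_idx o
        WHERE o.key = new_idx.key
          AND o.mtime = new_idx.mtime AND o.size = new_idx.size
    )
    """
)
rows = dict(con.execute("SELECT key, md5 FROM new_idx"))
```

Only entries left with a NULL md5 need rehashing, so the expensive work is scoped by the database rather than by Python iteration.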
dvc-data/src/dvc_data/build.py
Lines 124 to 127 in 5f6ba22
Currently, `dvc data status` etc. might fail if we have a `.dvcignore` inside a tracked directory. Do we really need to raise?
I think it would be sufficient to skip the file or print a warning; failing is too strict.
dvc data status --untracked --unchanged --granular
ERROR: .dvcignore file should not be in collected dir path: '/home/saugat/projects/iterative/example-get-started/data/features/.dvcignore'
We don't define a proper root node when using Trie, which makes us rely on an obscure behaviour like this
dvc-data/src/dvc_data/hashfile/tree.py
Line 101 in 2e6d0ca
We are relying on this quirk of `Trie` right now and instead should use some kind of root convention. We didn't use `/` before to avoid associating it with POSIX paths, but we might indeed want to use that, unless there are better ideas in mind. Obviously we can just go with `ROOT = "/"` defined for now and rename it any time later if needed.

On the example-get-started repo, if you delete the `.dir` file, it shows confusing results, sometimes reporting `added` vs `modified`:
$ rm -rf $(dvc-data o2p 20b786b6e6f80e2b3fcf17827ad18597.dir)
$ dvc data status
Not in cache:
(use "dvc pull <file>..." to update your local storage)
data/prepared/
DVC committed changes:
(git commit the corresponding dvc files to update the repo)
added: data.xml
DVC uncommitted changes:
(use "dvc commit <file>..." to track changes)
added: data/prepared/
(there are other changes not tracked by dvc, use "git status" to see)
$ dvc data status --granular
DVC committed changes:
(git commit the corresponding dvc files to update the repo)
added: data.xml
DVC uncommitted changes:
(use "dvc commit <file>..." to track changes)
added: data/prepared/test.tsv
added: data/prepared/train.tsv
See iterative/dvc#7943 (comment).
Possibly related: iterative/dvc#7661
We collect all the files we need to download in the form of an index, and it would be great to cache it, so we don't have to recollect it every time. This will dramatically reduce `dvc fetch` time by skipping "cache collection" after the first run.
After #341 this became very straightforward, and I already have a POC which needs to be cleaned up and submitted.
Currently, we have to iterate through `diff.{changed,added,deleted,modified,unchanged}` and check for the ROOT (`''`) entry to get the summary.
It'd be great if we could get access to this in a simpler way.
Also, please take a look at how I try to avoid the ROOT diff in https://github.com/skshetry/dvc/blob/e4506e7dcb6dd6668a622b16bb6b8c73790b167f/dvc/commands/data.py#L85-L100.
See #277 (comment)
At the moment, when users do `dvc add data`, we copy all of the files in that directory and then check them back out again. This is done as part of relinking, which is not necessary except for symlinks/hardlinks.
Currently in `dvc data status`, we don't care about cache checks for `HEAD` or `workspace`; we only care about the index. But `diff` currently checks for all the objects, whether from the old object or the new object.
Lines 128 to 129 in ee9e6f7
Currently index entries contain a single `meta` instance that really holds meta information from several different filesystems: `entry.meta` may end up holding a local md5/inode/mtime but can also contain a remote etag/version_id. Meta information should really be tracked per filesystem; the current behavior makes merging and comparing metadata a mess (especially at the DVC level), and essentially makes it impossible to use more than one cloud-versioned remote at a time in DVC.
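One possible shape for per-filesystem meta, sketched with dataclasses. The field set and the fs keys (`"local"`, `"s3"`) are illustrative assumptions, not dvc-data's `Meta` class.

```python
# Sketch: one Meta per filesystem on each entry, instead of a single merged
# Meta mixing local md5/inode/mtime with remote etag/version_id.
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Meta:
    md5: Optional[str] = None
    etag: Optional[str] = None
    version_id: Optional[str] = None


@dataclass
class Entry:
    key: str
    meta: Dict[str, Meta] = field(default_factory=dict)  # keyed per filesystem

    def meta_for(self, fs: str) -> Meta:
        # Each filesystem gets its own slot, so remotes never clobber
        # each other's metadata and merges become per-fs comparisons.
        return self.meta.setdefault(fs, Meta())
```

With this shape, two cloud-versioned remotes would simply occupy two different keys in `entry.meta`, side-stepping the clobbering problem described above.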
OS : Ubuntu 20.04
Python : 3.10
DVC-data : 3.7.0 (but the bug is still present on the main branch)
I am using a `DVCFileSystem` object to get files from a remote repository. To make the process efficient, I added a local cache to prevent downloading the same md5 again. On that side, everything is good. However, after the file is downloaded into the cache, it gets copied into the final directory instead of symlinked, as it is supposed to be by configuration.
While debugging, I found an error for this use-case in the `fs.py` module, more precisely in the `get_files` method of `DataFileSystem`. When an md5 is absent from the cache, it gets downloaded using the `_cache_remote_file` method, but then gets copied, since the later `_transfer` uses the `storage` options from the `remote` instead of the `cache_storage` options it should use.
Steps to replicate:
- Create a `DVCFileSystem` with `remote_config` and cache configuration via `config`.
- Set the cache link type to `symlink` or any other link type.
- Use the `get` method to pull a file from remote storage to a location.
- Instead of a link at that location, it will be a copy.
, it will be a copySeveral different dvc cloud versioning/worktree behaviors are offloaded into index.checkout
now (version-aware push, worktree push, worktree update/checkout) and the new flags controlling the behavior don't really belong in index.checkout
. We should separate these behaviors properly, but don't have time to do so right now before the initial cloud versioning release
index.checkout
should not be modifying the input new/old indexes (right now we update meta in the "new" index to support worktree push)
With `ThreadPoolExecutor` and the `cats-dogs` dataset:
default remote:
time dvc push -r s3-unversioned
2801 files pushed
dvc push -r s3-unversioned 41.37s user 7.50s system 10% cpu 7:56.26 total
time dvc pull -r s3-unversioned
A cats-dogs/
1 file added and 2800 files fetched
dvc pull -r s3-unversioned 12.03s user 4.40s system 21% cpu 1:14.68 total
version_aware = true
remote:
time dvc push -r s3-versioned
2800 files pushed
dvc push -r s3-versioned 21.65s user 3.40s system 12% cpu 3:13.01 total
time dvc pull -r s3-versioned
A cats-dogs/
1 file added and 2800 files fetched
dvc pull -r s3-versioned 11.19s user 4.03s system 20% cpu 1:15.42 total
Not sure why the versioned remote push performs so much faster than the unversioned one on my machine after these changes; it may be due to the same listing performance problems noted in the `gc` issue iterative/dvc#5961 (comment) (we don't do full remote listing for versioned remotes).
Originally posted by @pmrowla in #246 (comment)
Might as well combine it with usage report daemon from iterative/dvc#693 .