iterative / dvc-data
DVC's data management subsystem
Home Page: https://dvc.org
License: Apache License 2.0
The current diff() is pretty naive: it just lists all keys and then generates differences. We need to walk and diff instead, so that we can propagate hierarchical status (e.g. unknown in `dvc data status`) and stop early when dir hashes match (e.g. if a dataset has the same `.dir` md5 in both indexes, there is no point in walking into it and we can short-circuit quickly).
Needed to finish migrating `dvc data status` to index for iterative/dvc#8761
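The short-circuiting walk can be sketched with a toy tree. This is a minimal illustration, not the dvc-data API: `Node` and `walk_diff` are hypothetical names, and a directory's `.dir` hash stands in for real dir objects. Subtrees whose hashes match are skipped entirely.

```python
# Hypothetical sketch of a walk-based diff that short-circuits on matching
# directory hashes; Node/walk_diff are illustrative, not dvc-data API.
from typing import Dict, Iterator, Optional, Tuple


class Node:
    def __init__(self, hash_: str, children: Optional[Dict[str, "Node"]] = None):
        self.hash = hash_  # file md5 or directory .dir md5
        self.children = children or {}


def walk_diff(
    old: Optional[Node],
    new: Optional[Node],
    prefix: Tuple[str, ...] = (),
) -> Iterator[Tuple[Tuple[str, ...], str]]:
    # Short-circuit: identical (.dir) hashes mean identical subtrees.
    if old is not None and new is not None and old.hash == new.hash:
        return
    if old is None:
        yield prefix, "added"
    elif new is None:
        yield prefix, "deleted"
    else:
        yield prefix, "modified"
    keys = set(old.children if old else {}) | set(new.children if new else {})
    for key in sorted(keys):
        yield from walk_diff(
            old.children.get(key) if old else None,
            new.children.get(key) if new else None,
            prefix + (key,),
        )
```

Because `walk_diff` is a generator, a caller can also stop after the first yielded change, which gives the "propagate status and stop early" behaviour for free.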
- `FileSystem`. (@skshetry)
- `CallbackMixin`, since callbacks are now generally supported. (@skshetry)
- `NoDirectoriesMixin`, merge with `HTTPFileSystem`. (@skshetry)
- `to_json`/`from_json` instead of `config`. (@skshetry)
- `fs.utils`/`dvc.utils.fs`/`System` (low priority)
- `Cloud` fixtures, so that they can work with any `fsspec`-compatible filesystems by default.
- `Cloud.get_url()` and `Cloud.config` out of it. (low priority)

The latest release renamed the cli from `dvc-data` to `dvc`, conflicting with https://github.com/iterative/dvc. Assuming that was just a typo?
Line 62 in 4ccb420
Currently this is the only user of those objects, and it requires manual assignment to every entry in a tree, which is very costly and rather pointless. We could assign `fs` and `path` instead, but that would only create a similar problem (which actually already exists too). We should probably just introduce some kind of factory/callback/map that would generate an fs/path pair from an entry.
Another way of approaching this could be to supply those factories to the index (it uses odb/remote to lazy-load directories anyway).
Related to iterative/dvc#8827
Hello,
I described a bug in the dvc issue iterative/dvc#8420 that would be more relevant here. Can you check it?
We currently use an in-memory prefix trie, but for large enough indexes it would be much nicer to be able to use a proper db, and also to use it for operations like `diff` more efficiently (e.g. directly in a sql query instead of fetching and comparing in python).
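As a rough illustration of the db idea, here is a sqlite sketch that computes a diff entirely in SQL. The table and column names (`old_idx`, `new_idx`, `key`, `md5`) are assumptions for the example, not the actual dvc-data schema.

```python
# Illustrative only: compute added/deleted/modified keys in SQL instead of
# loading both indexes into Python and comparing there.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE old_idx (key TEXT PRIMARY KEY, md5 TEXT);
    CREATE TABLE new_idx (key TEXT PRIMARY KEY, md5 TEXT);
    """
)
con.executemany("INSERT INTO old_idx VALUES (?, ?)", [("a", "1"), ("b", "2")])
con.executemany("INSERT INTO new_idx VALUES (?, ?)", [("b", "3"), ("c", "4")])

# FULL OUTER JOIN emulated via LEFT JOINs + UNION ALL (older sqlite lacks it).
diff_rows = con.execute(
    """
    SELECT o.key, 'deleted' FROM old_idx o
      LEFT JOIN new_idx n ON o.key = n.key WHERE n.key IS NULL
    UNION ALL
    SELECT n.key, 'added' FROM new_idx n
      LEFT JOIN old_idx o ON n.key = o.key WHERE o.key IS NULL
    UNION ALL
    SELECT o.key, 'modified' FROM old_idx o
      JOIN new_idx n ON o.key = n.key WHERE o.md5 != n.md5
    """
).fetchall()
```

The whole comparison runs inside the database engine, so for a large index the Python side only ever sees the changed rows.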
Both the `ref` object and `refodb` are great: they allow us to read and write to this virtual odb as if we were dealing with regular `HashFile` objects. But the problem is that we can't actually work with `ref` objects as `ref` objects and do things like `transfer` them from `memodb` to `localodb` (e.g. if we want to make them persistent).
We should better separate refobj/refodb from the underlying rawobj/rawodb and provide easy access to them.
Capturing *link info will allow us to be smarter about deciding whether we need to relink stuff (e.g. in a subsequent "noop" `dvc add`) and greatly improve performance there.
Important: don't forget that we should not write those to dvc files on the dvc side (at least for now, to preserve current behaviour).
Similar to iterative/scmrepo#81 (comment), I was wondering if we could have a faster version of status/diff that would return early if things are modified in the repo. This might be useful in non-granular status/diff.
Ideally it would do staging and diffing together, in a generator, so that one can be piped into the other and iterated together.
With #208 implemented, a daemon could even keep writing the index so that we don't have to rebuild anything at all when we need to use it in dvc.
Should probably be on top of existing diff, similar to https://github.com/jelmer/dulwich/blob/90ff89e3254054c7bfc723a201e64398c441831e/dulwich/diff_tree.py#L192
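The "stage and diff together, piped as generators" idea can be sketched as below. Everything here is hypothetical scaffolding (`stage_entries`, `diff_entries`, `is_dirty`, and using the raw value as a stand-in for a hash), not the dvc-data implementation.

```python
# Sketch: staging and diffing as chained generators so the consumer can stop
# at the first change. Names and the toy "hash" are illustrative assumptions.
from typing import Dict, Iterator, Tuple


def stage_entries(workspace: Dict[str, str]) -> Iterator[Tuple[str, str]]:
    # Lazily "stage" entries; here the stored text doubles as the hash.
    for key in sorted(workspace):
        yield key, workspace[key]


def diff_entries(old: Dict[str, str], staged: Iterator[Tuple[str, str]]) -> Iterator[str]:
    seen = set()
    for key, h in staged:
        seen.add(key)
        if key not in old:
            yield f"added: {key}"
        elif old[key] != h:
            yield f"modified: {key}"
    for key in sorted(old.keys() - seen):
        yield f"deleted: {key}"


def is_dirty(old: Dict[str, str], workspace: Dict[str, str]) -> bool:
    # Pulling a single item from the pipeline gives the early return:
    # staging stops as soon as the first difference is found.
    return next(diff_entries(old, stage_entries(workspace)), None) is not None
```

Because `diff_entries` consumes `stage_entries` lazily, `is_dirty` only stages as many entries as it takes to find the first change.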
Currently we only support loading `.dir` objects from `ObjectStorage`, but directories can be stored in `FileStorage` too (e.g. in gitfs or in some backed-up location), and we should be able to load those as well.
Required for iterative/dvc#8789, because cloud versioning imports are `FileStorage`-based, and once you support chained imports you have to be able to dynamically load the index on each level of the chain from either `ObjectStorage` or `FileStorage`.
E.g. paths with `/` in them are perfectly legal on Windows, but will mess up our data structures.
`restore` is kind of the opposite of `save`: given an index with hashes, it needs to restore it using the odb into a virtual dataset. This is extremely useful for those in-between states where we don't yet have a real workspace to work with (e.g. we didn't checkout your dataset) but want to virtually reconstruct it using the cache. This allows us to operate on datasets no matter how they are actually stored (e.g. a real dataset on s3 vs a dvc-cached dataset).
Another example, admittedly a bit unrelated, is dvc's `run-cache` thing that goes on an out-by-out basis trying to fetch everything one-by-one; with `restore` functionality we could virtually build a dataset using the run-cache.
Needed for iterative/dvc#8761, because there we have to compare a virtually restored dataset with a real one on the cloud.
Think of `restore` as building the index we would get out of a dataset that we've actually tried to `checkout` from cache. E.g. if some cache files were missing, those files will be missing from the workspace.
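The restore semantics above can be sketched in a few lines. This is a toy model, assuming an index as a `{key: md5}` mapping and an odb as a `{md5: content}` mapping; the real objects are of course richer.

```python
# Minimal sketch of "restore": reconstruct a virtual dataset from an index
# plus an odb, without touching any real workspace. Entries missing from the
# cache are simply absent from the result, mirroring a checkout with misses.
from typing import Dict, Optional


def restore(index: Dict[str, str], odb: Dict[str, bytes]) -> Dict[str, bytes]:
    virtual: Dict[str, bytes] = {}
    for key, md5 in index.items():
        content: Optional[bytes] = odb.get(md5)
        if content is not None:  # skip entries whose cache files are missing
            virtual[key] = content
    return virtual
```

The point of the sketch is the contract, not the code: the result behaves like a checked-out dataset, so the same comparison logic can run against it and against a real dataset on the cloud.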
This is non-critical, but it might be nice to fix.
It seems that if you run `dvc migrate` on a repo that doesn't contain any 2.0 structure, then an error is raised in `dvc_data/hashfile/db/migrate.py`:
2023-09-07 17:26:54,828 ERROR: unexpected error - not enough values to unpack (expected 2, got 0)
Traceback (most recent call last):
File "/home/joncrall/.pyenv/versions/3.11.2/envs/pyenv3.11.2/lib/python3.11/site-packages/dvc/cli/__init__.py", line 209, in main
ret = cmd.do_run()
^^^^^^^^^^^^
File "/home/joncrall/.pyenv/versions/3.11.2/envs/pyenv3.11.2/lib/python3.11/site-packages/dvc/cli/command.py", line 26, in do_run
return self.run()
^^^^^^^^^^
File "/home/joncrall/.pyenv/versions/3.11.2/envs/pyenv3.11.2/lib/python3.11/site-packages/dvc/commands/cache.py", line 44, in run
migrate_2_to_3(self.repo, dry=self.args.dry)
File "/home/joncrall/.pyenv/versions/3.11.2/envs/pyenv3.11.2/lib/python3.11/site-packages/dvc/cachemgr.py", line 135, in migrate_2_to_3
migration = prepare(src, dest, callback=cb)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/joncrall/.pyenv/versions/3.11.2/envs/pyenv3.11.2/lib/python3.11/site-packages/dvc_data/hashfile/db/migrate.py", line 55, in prepare
paths, oids = zip(*executor.imap_unordered(func, src_paths))
^^^^^^^^^^^
ValueError: not enough values to unpack (expected 2, got 0)
It looks like you probably just need to run:

    items = list(executor.imap_unordered(func, src_paths))
    if items:
        paths, oids = zip(*items)
    else:
        paths, oids = [], []

instead.
Here is a MWE to reproduce:

    import ubelt as ub


    def simple_demo_repo(dvc_root):
        """
        Build a simple repo using only standard dvc commands for upstream MWEs
        """
        # Build in a staging area first
        assert not dvc_root.exists(), 'directory must not exist yet'
        dvc_root.ensuredir()

        def cmd(command):
            return ub.cmd(command, cwd=dvc_root, verbose=2, system=True)

        cmd('git init')
        cmd('dvc init')
        cmd('dvc config core.autostage true')
        cmd('dvc config cache.type symlink,reflink,hardlink,copy')
        cmd('dvc config cache.protected true')
        cmd('dvc config core.analytics false')
        cmd('dvc config core.check_update false')

        # Build basic data
        (dvc_root / 'test-set1').ensuredir()
        assets_dpath = (dvc_root / 'test-set1/assets').ensuredir()
        for idx in range(1, 21):
            fpath = assets_dpath / f'asset_{idx:03d}.data'
            fpath.write_text(str(idx) * 100)
        manifest_fpath = (dvc_root / 'test-set1/manifest.txt')
        manifest_fpath.write_text('pretend-data')
        root_fpath = dvc_root / 'root_file'
        root_fpath.write_text('----' * 100)

        cmd(f'dvc add {root_fpath}')
        cmd(f'dvc add {manifest_fpath}')
        cmd(f'dvc add {assets_dpath}')
        cmd('git commit -am "initial commit"')


    def mwe():
        # Build a simple fresh dvc repo
        dvc_root = ub.Path.appdir('simpledvc', 'simple_demo')
        dvc_root.delete()
        simple_demo_repo(dvc_root)
        _ = ub.cmd('dvc cache migrate -vvv', cwd=dvc_root, verbose=3, system=True)
DVC doctor:
(pyenv3.11.2) joncrall@toothbrush:~/.cache/simpledvc/simple_demo$ dvc doctor
DVC version: 3.19.0 (pip)
-------------------------
Platform: Python 3.11.2 on Linux-6.2.0-32-generic-x86_64-with-glibc2.35
Subprojects:
dvc_data = 2.16.0
dvc_objects = 1.0.1
dvc_render = 0.5.3
dvc_task = 0.3.0
scmrepo = 1.3.1
Supports:
azure (adlfs = 2023.4.0, knack = 0.10.1, azure-identity = 1.12.0),
gdrive (pydrive2 = 1.15.4),
gs (gcsfs = 2023.6.0),
hdfs (fsspec = 2023.6.0, pyarrow = 11.0.0),
http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
oss (ossfs = 2021.8.0),
s3 (s3fs = 2023.6.0, boto3 = 1.26.76),
ssh (sshfs = 2023.4.1),
webdav (webdav4 = 0.9.8),
webdavs (webdav4 = 0.9.8),
webhdfs (fsspec = 2023.6.0)
Config:
Global: /home/joncrall/.config/dvc
System: /etc/xdg/xdg-ubuntu/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/vgubuntu-root
Caches: local
Remotes: None
Workspace directory: ext4 on /dev/mapper/vgubuntu-root
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/cba64d0f7628d6e7cf6a9216093a7519
To handle dir loading errors.
Don't forget the tests (maybe even for import/get/ls iterative/dvc#9785)
Hi,
I have a dvc repository with a total size of 1.2TB and about 300,000 files. I understand that with this many files, I cannot expect all dvc operations to be fast, but when I add one small file and perform dvc commit, it takes 3 minutes to finish. Furthermore, the console output during the commit seems a bit strange to me:
The empty output for two minutes confuses me, and the time it takes for whatever it is doing then seems a bit long to me.
To find out what it is doing in that time, I attached a debugger during that time with pyrasite and obtained this stacktrace:
File "dvc/__main__.py", line 7, in <module>
File "dvc/cli/__init__.py", line 185, in main
File "dvc/cli/command.py", line 22, in do_run
File "dvc/commands/commit.py", line 20, in run
File "dvc/repo/__init__.py", line 48, in wrapper
File "dvc/repo/commit.py", line 66, in commit
File "funcy/decorators.py", line 45, in wrapper
File "dvc/stage/decorators.py", line 43, in rwlocked
File "funcy/decorators.py", line 66, in __call__
File "dvc/stage/__init__.py", line 548, in commit
File "dvc/output.py", line 713, in commit
File "dvc/output.py", line 676, in _checkout
File "dvc_data/hashfile/checkout.py", line 274, in checkout
File "dvc_data/hashfile/checkout.py", line 221, in _checkout
File "dvc_data/hashfile/checkout.py", line 115, in _checkout_file
File "dvc_data/hashfile/state.py", line 107, in save
File "diskcache/core.py", line 823, in __setitem__
File "diskcache/core.py", line 796, in set
File "contextlib.py", line 142, in __exit__
File "diskcache/core.py", line 744, in _transact
Since dvc_data appears there, this is hopefully the right repository for this issue.
dvc version: 2.41.1
The hard drive is an SSD with xfs. Reflinks are enabled.
At the moment, there is no logging for `index.checkout`, which makes it harder to find what's failing during `checkout`.
E.g. `checkout` uses `entry.fs` and `entry.path`, which could be easily derived in a factory based on the corresponding outputs. E.g. if we have an output `a`, then the `("a",)`, `("a", "b")`, `("a", "b", "c")` etc. entries in the index all share the same output.fs/odb/remote, and the path is `join(output.path, *key)`.
This will remove the annoying need to fill up all of those fields when creating an index, and will make the index slimmer and easier to handle (e.g. serialize).
This is low priority, unless it gets in the way in particular scenarios. Just noting it down.
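The factory idea can be sketched as follows. `Output` here is a hypothetical stand-in (its `fs` is just a string for the example), not dvc's actual output class; the point is only that one closure per output replaces per-entry `fs`/`path` fields.

```python
# Illustrative factory that derives (fs, path) for an index entry key from its
# output, instead of storing fs/path on every entry. Output is a stand-in.
import posixpath
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Output:
    fs: str    # stand-in for a filesystem object
    path: str  # root path of the output


def make_fs_path_factory(output: Output) -> Callable[[Tuple[str, ...]], Tuple[str, str]]:
    def factory(key: Tuple[str, ...]) -> Tuple[str, str]:
        # Every entry under the output shares its fs; path is join(output.path, *key).
        return output.fs, posixpath.join(output.path, *key)
    return factory
```

An index could then store one factory per output and compute `entry.fs`/`entry.path` on demand, which also keeps serialized indexes free of per-entry filesystem state.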
Hello,
`dvc-objects` has just released the 1.4.x version, which causes an import error (iterative/dvc-objects#241):
ERROR: unexpected error - cannot import name 'umask' from 'dvc_objects.fs.system' (/opt/hostedtoolcache/Python/3.10.4/x64/lib/python3.10/site-packages/dvc_objects/fs/system.py)
I think this is because `umask` was removed in the new version of `dvc-objects`.
`dvc-data`: from dvc_objects.fs.system import umask
`dvc-objects` changes: iterative/dvc-objects@1.4.1...1.4.2

Currently, if we detect that some file is corrupted, we completely delete it, which takes quite a bit of time for large files and is also lossy, as it might be the last source of your useful data. We should just move the corrupted file instead (e.g. `.dvc/cache/12/345` -> `.dvc/cache/bad/12345`) so one could recover it if needed.
For the record: `bad` is like git lfs's `.git/lfs/bad`.
It'd be nice if we could figure out a way to optimize `Tree.from_list()`, which is taking more than 1s to load one `.dir` file.
2 0.017 0.009 2.621 1.310 __init__.py:23(load)
2 0.000 0.000 2.604 1.302 tree.py:175(load)
2 0.563 0.281 2.452 1.226 tree.py:152(from_list)
202605 0.090 0.000 1.433 0.000 <attrs generated init dvc_data.hashfile.diff.Change>:1(__init__)
202605 0.228 0.000 1.343 0.000 diff.py:36(_)
405207 1.044 0.000 1.191 0.000 meta.py:75(from_dict)
202604 0.060 0.000 0.816 0.000 _make.py:1718(__ne__)
202604 0.678 0.000 0.756 0.000 <attrs generated eq dvc_data.hashfile.diff.TreeEntry>:1(__eq__)
405210 0.492 0.000 0.621 0.000 diff.py:103(_in_cache)
405210 0.252 0.000 0.564 0.000 diff.py:94(_get)
42 0.001 0.000 0.444 0.011 __init__.py:1(<module>)
405207 0.231 0.000 0.326 0.000 hash_info.py:20(from_dict)
810421 0.209 0.000 0.299 0.000 diff.py:26(__bool__)
Some filesystems like s3/azure/etc. have built-in object versioning (e.g. you can always access the previous version of an s3 object by using its `version-id`), which means that
for us. From the odb perspective, the implementation will likely look a lot like the old `refdb`: dvc objects that reference a path with a version-id in it, but we won't need to validate them beyond ensuring that they exist, because the versions are immutable.
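A sketch of what such a reference object might look like; `VersionedRef` and its `url()` helper are hypothetical names for illustration, not the old `refdb` API. Since cloud object versions are immutable, the ref never needs re-hashing, only an existence check.

```python
# Illustrative ref to a versioned cloud object: path + version-id is enough to
# address immutable content, so no content validation is needed after creation.
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionedRef:
    path: str
    version_id: str

    def url(self) -> str:
        # s3-style versioned URL; other clouds use analogous query parameters.
        return f"{self.path}?versionId={self.version_id}"
```

Being frozen and hashable, such refs could be stored and deduplicated in an odb just like hash-addressed objects.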
We currently have a junky version of `fetch` based on `odb` that is not used anywhere. It was part of early experiments (not dvc exp) and is no longer needed.
In `dvc fetch` we currently do two things:
we need to take 2), make it dedup based on source fs/path, and download stuff into a temporary location (note that we are not talking about reproducing the structure of indexes there, but purely stashing data somewhere). This will allow us to download stuff optimally across different indexes (e.g. across different git revisions), which also means that `fetch` should probably accept multiple indexes and not just one. And it should probably update `storage_info.data` as a result.
State should be replaced by using the data index, which is easier to work with and easier to update. Note that this is not a 1-to-1 replacement, but rather requires working with data from the index point of view.
For example, in `state.get` we retrieve the entry for a particular path and then check if the recorded metadata matches the one from an actual `stat()`. With the index we should instead build a new index from the filesystem and then transfer md5s from the old index to the new index entries where the metadata matches. The latter is a pure sql operation that can be done more efficiently.
This is also important for NFS, to reduce the number of sqlite databases that we have to deal with.
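The "transfer md5s where metadata matches" step described above can be expressed as a single SQL UPDATE over two index tables. The schema (`key`, `mtime`, `size`, `md5`) is an assumption for the sketch, not dvc-data's actual state schema.

```python
# Sketch: carry hashes from the old index to a freshly built one in one SQL
# statement, instead of per-path Python lookups. Schema is illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE old_idx (key TEXT PRIMARY KEY, mtime REAL, size INT, md5 TEXT);
    CREATE TABLE new_idx (key TEXT PRIMARY KEY, mtime REAL, size INT, md5 TEXT);
    """
)
con.executemany("INSERT INTO old_idx VALUES (?, ?, ?, ?)",
                [("a", 1.0, 10, "aaa"), ("b", 2.0, 20, "bbb")])
# New index built from the filesystem: md5 unknown; "a" unchanged, "b" touched.
con.executemany("INSERT INTO new_idx VALUES (?, ?, ?, ?)",
                [("a", 1.0, 10, None), ("b", 3.0, 20, None)])

# Transfer md5 only where mtime and size match; otherwise it stays NULL and
# the file has to be re-hashed.
con.execute(
    """
    UPDATE new_idx SET md5 = (
        SELECT o.md5 FROM old_idx o
        WHERE o.key = new_idx.key
          AND o.mtime = new_idx.mtime AND o.size = new_idx.size
    )
    """
)
rows = dict(con.execute("SELECT key, md5 FROM new_idx"))
```

Only entries left with a NULL md5 need rehashing, so the expensive work is scoped by the database rather than by Python iteration.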
dvc-data/src/dvc_data/build.py
Lines 124 to 127 in 5f6ba22
Currently, `dvc data status` etc. might fail if we have a `.dvcignore` inside a tracked directory. Do we really need to raise?
I think it would be sufficient to skip the file or print a warning; failing is too strict.
dvc data status --untracked --unchanged --granular
ERROR: .dvcignore file should not be in collected dir path: '/home/saugat/projects/iterative/example-get-started/data/features/.dvcignore'
We don't define a proper root node when using Trie, which makes us rely on an obscure behaviour like this
dvc-data/src/dvc_data/hashfile/tree.py
Line 101 in 2e6d0ca
We are relying on this quirk of `Trie` right now and instead should use some kind of root convention. We didn't use `/` before to avoid associating it with POSIX paths, but we might indeed want to use that, unless there are better ideas in mind. Obviously we can just go with `ROOT = "/"` defined for now and rename it any time later if needed.

On the example-get-started repo, if you delete the `.dir` file, it shows confusing results, sometimes reporting `added` vs `modified`:
$ rm -rf $(dvc-data o2p 20b786b6e6f80e2b3fcf17827ad18597.dir)
$ dvc data status
Not in cache:
(use "dvc pull <file>..." to update your local storage)
data/prepared/
DVC committed changes:
(git commit the corresponding dvc files to update the repo)
added: data.xml
DVC uncommitted changes:
(use "dvc commit <file>..." to track changes)
added: data/prepared/
(there are other changes not tracked by dvc, use "git status" to see)
$ dvc data status --granular
DVC committed changes:
(git commit the corresponding dvc files to update the repo)
added: data.xml
DVC uncommitted changes:
(use "dvc commit <file>..." to track changes)
added: data/prepared/test.tsv
added: data/prepared/train.tsv
See iterative/dvc#7943 (comment).
Possibly related: iterative/dvc#7661
We collect all the files we need to download in the form of an index, and it would be great to cache it, so we don't have to recollect it every time. This will dramatically reduce `dvc fetch` time by skipping "cache collection" after the first run.
After #341 this became very straightforward, and I already have a POC which needs to be cleaned up and submitted.
Currently, we have to iterate through `diff.{changed,added,deleted,modified,unchanged}` and check for the ROOT (`''`) entry to get the summary.
It'd be great if we could get access to this in a simpler way.
Also, please take a look at how I try to avoid the ROOT diff in https://github.com/skshetry/dvc/blob/e4506e7dcb6dd6668a622b16bb6b8c73790b167f/dvc/commands/data.py#L85-L100.
See #277 (comment)
At the moment, when users do `dvc add data`, we copy all of the files in that directory and then check them back out again. This is done as part of relinking, which is not necessary except for symlinks/hardlinks.
Currently in `dvc data status`, we don't care about cache checks for `HEAD` or `workspace`; we only care about the index. But `diff` currently checks for all the objects, whether from the old object or the new object.
Lines 128 to 129 in ee9e6f7
Currently index entries contain a single `meta` instance that really holds meta information from several different filesystems: `entry.meta` may end up holding a local md5/inode/mtime but can also contain a remote etag/version_id. Meta information should really be tracked per filesystem; the current behavior makes merging and comparing metadata a mess (especially at the DVC level), and essentially makes it impossible to use more than one cloud-versioned remote at a time in DVC.
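One possible shape for per-filesystem meta, sketched with dataclasses. The field set and the fs keys (`"local"`, `"s3"`) are illustrative assumptions, not dvc-data's `Meta` class.

```python
# Sketch: one Meta per filesystem on each entry, instead of a single merged
# Meta mixing local md5/inode/mtime with remote etag/version_id.
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Meta:
    md5: Optional[str] = None
    etag: Optional[str] = None
    version_id: Optional[str] = None


@dataclass
class Entry:
    key: str
    meta: Dict[str, Meta] = field(default_factory=dict)  # keyed per filesystem

    def meta_for(self, fs: str) -> Meta:
        # Each filesystem gets its own slot, so remotes never clobber
        # each other's metadata and merges become per-fs comparisons.
        return self.meta.setdefault(fs, Meta())
```

With this shape, two cloud-versioned remotes would simply occupy two different keys in `entry.meta`, side-stepping the clobbering problem described above.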
OS : Ubuntu 20.04
Python : 3.10
DVC-data : 3.7.0 (but the bug is still present on the main branch)
I am using a `DVCFileSystem` object to get files from a remote repository. To make the process efficient, I added a local cache to prevent downloading the same md5 again. On that side, everything is good. However, after the file is downloaded into the cache, it gets copied into the final directory instead of symlinked, as it is supposed to be by configuration.
While debugging, I found an error for this use-case in the `fs.py` module, more precisely in the `get_files` method of `DataFileSystem`. When an md5 is absent from the cache, it gets downloaded using the `_cache_remote_file` method, but then gets copied, since the later `_transfer` uses the `storage` options from the `remote` instead of the `cache_storage` options it should use.
Steps to replicate:
- Create a `DVCFileSystem` with `remote_config` and cache configuration via `config`.
- Set the cache link type to `symlink` or any other link type.
- Use the `get` method to pull a file from remote storage to a location.
- Instead of a link at that location, it will be a copy.
, it will be a copySeveral different dvc cloud versioning/worktree behaviors are offloaded into index.checkout
now (version-aware push, worktree push, worktree update/checkout) and the new flags controlling the behavior don't really belong in index.checkout
. We should separate these behaviors properly, but don't have time to do so right now before the initial cloud versioning release
index.checkout
should not be modifying the input new/old indexes (right now we update meta in the "new" index to support worktree push)
With `ThreadPoolExecutor` and the `cats-dogs` dataset:
default remote:
time dvc push -r s3-unversioned
2801 files pushed
dvc push -r s3-unversioned 41.37s user 7.50s system 10% cpu 7:56.26 total
time dvc pull -r s3-unversioned
A cats-dogs/
1 file added and 2800 files fetched
dvc pull -r s3-unversioned 12.03s user 4.40s system 21% cpu 1:14.68 total
version_aware = true
remote:
time dvc push -r s3-versioned
2800 files pushed
dvc push -r s3-versioned 21.65s user 3.40s system 12% cpu 3:13.01 total
time dvc pull -r s3-versioned
A cats-dogs/
1 file added and 2800 files fetched
dvc pull -r s3-versioned 11.19s user 4.03s system 20% cpu 1:15.42 total
Not sure why the versioned remote push performs so much faster than the unversioned one on my machine after these changes; it may be due to the same listing performance problems noted in the `gc` issue iterative/dvc#5961 (comment) (we don't do full remote listing for versioned remotes).
Originally posted by @pmrowla in #246 (comment)
Might as well combine it with usage report daemon from iterative/dvc#693 .