hashdist / hashdist
The HashDist environment management system
Home Page: https://hashdist.github.io/
License: Other
Hashdist for example says:
[hashdist] Building hdf5
[hdf5] Building 7pmm.., follow log with:
[hdf5] tail -f /home/ondrej/repos/python-hpcmp2/bld/hdf5-n-7pmm/build.log
so the hash seems to be 7pmm..., but the directory db/artifacts/7p only contains one symlink:
$ ls -l ~/repos/python-hpcmp2/db/artifacts/7p/
total 0
lrwxrwxrwx 1 ondrej ondrej 22 Feb 18 14:58 mmowfdgralw66cswphspccq23vazfm -> ../../../opt/hdf5/7pmm
Questions: What does 7pmm... mean? What does mmowfdgralw66cswphspccq23vazfm mean? Shouldn't they be the same?
Python .pyc files are relocatable, but they are still irritating in that they contain the absolute path used during compilation.
This path is only displayed if you remove the corresponding .py file. Since we would never do such a thing, it won't ever be shown to the user in practice, but it's still there in the file.
The Python compile modules (py_compile, compileall) have options that let you provide the path yourself instead of the absolute path. By introducing a crash in importer.py, running
python -m compileall -d '$hdist/nose/...' importer.py
and then deleting importer.py, I'm able to produce:
...
File "$hdist/nose/tljo/lib/python2.7/lib/site-packages/nose/importer.py", line 13, in <module>
ZeroDivisionError: integer division or modulo by zero
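For reference, a minimal sketch of doing the same from Python via py_compile's dfile option (the placeholder path is just an example):

import py_compile

# Compile importer.py, embedding the given placeholder path in the .pyc
# instead of the absolute source path (the path shown is hypothetical).
py_compile.compile('importer.py', dfile='$hdist/nose/importer.py')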
I'm not quite sure how I want to set up the download tests (optional, enabled by an environment variable?), so I'm putting it off a bit.
We need *.so files to be relocatable. Build systems usually cannot do this (see http://hashdist.readthedocs.org/en/latest/building.html#unix-dynamic-libraries), so we should support
hdist build-postprocess --relative-rpath
(see hashdist/cli/build_tools_cli.py).
patchelf is a GPL C++ program available at http://nixos.org/patchelf.html
It currently does not support the $ORIGIN feature, simply because it sanity-checks its input. So we must patch patchelf to support this:
(master) ~/code/hpcmp2/opt/hdf5/mrvr/lib $ patchelf --set-rpath='${ORIGIN}/../../../szip/dpys/lib' libhdf5.so
stat: No such file or directory
(master) ~/code/hpcmp2/opt/hdf5/mrvr/lib $ patchelf --print-rpath libhdf5.so
/home/dagss/code/hpcmp2/opt/szip/dpys/lib
I think there's a case to be made for treating ${ORIGIN} and $ORIGIN properly (in the patchelf C++ source itself), as documented in man ld.so; if we just add a --force flag, it could backfire if we want to use --shorten-rpath, which we want to, as it can speed up load times a lot (if you want to do a simple build, which you do, you just add everything to RPATH and then shorten it afterwards).
Tasks:
1) Patch patchelf to support $ORIGIN.
2) Make hdist build-postprocess --relative-rpath invoke $PATCHELF/bin/patchelf and replace all absolute RPATHs that point into the HDIST_IMPORTS with relative ones (assume that $PATCHELF is set in the environment). RPATHs pointing to something on the user's system, e.g., /opt/intel/mkl/lib64, should not be replaced with relative paths. os.path.relpath is your friend; see the sketch below.
3) Have hdist build-postprocess --relative-rpath also invoke --shorten-rpath while we're at it. (Perhaps it can be combined with the --set-rpath.)
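A minimal sketch of the relpath logic, assuming we can tell artifact paths from system paths by a list of artifact roots (all names here are hypothetical):

import os

def make_relative_rpath(rpath_entry, lib_dir, artifact_roots):
    # Only rewrite RPATHs that point into an imported artifact; paths on the
    # user's system, e.g. /opt/intel/mkl/lib64, are left alone.
    if not any(rpath_entry.startswith(root) for root in artifact_roots):
        return rpath_entry
    return os.path.join('${ORIGIN}', os.path.relpath(rpath_entry, lib_dir))

# make_relative_rpath('/home/u/opt/szip/dpys/lib', '/home/u/opt/hdf5/mrvr/lib', ['/home/u/opt'])
# -> '${ORIGIN}/../../../szip/dpys/lib'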
The various hashdist stores should be versioned, and also either a) constructed in a race-safe manner, or b) created in a separate setup step and not created-if-nonexisting.
Currently, file downloads have too many opportunities to fail suddenly without giving any message to the user, which is obviously very bad.
This ticket collects the issues that need fixing to make downloading robust, and how to go about fixing them.
To work on this:
Issues:
Currently our tests pass only with Python 2.6 and 2.7. We should aim for 2.6+ and 3.3+.
Travis tests all versions, but the Python 3 ones are allowed to fail.
Using the 2to3 tool seems like a bad idea, given that we don't really have an installation phase. It will be much more convenient if we can use "six".
Subversion has a repository UUID ("svn info" prints it), so one suggestion for key format is to use:
svn:2c0bc1af-c665-0410-8482-af9a87a0766a/branches/mybranch/dir/subdir@543
While the URI is much more canonical than with git and hg, using the UUID gets around sometimes having to move the SVN server, using different ports, different access methods (svn vs. svn+ssh vs. http) etc.
Since SVN isn't distributed, we can't download the entire repo and then refer to commits as with git; but here are some ideas:
svn has a nice "export" command where one can simply replace the UUID with the server access method and name in the above key and then one is done:
svn export svn://server.host.com/branches/mybranch/dir/subdir@543 target
This should be a very good MVP. The disadvantage is that for every new revision one re-downloads the entire contents from the server.
Instead of "svn export", one does "svn checkout" the first time to a directory under .hdist/src/svn/..../r543. When getting another revision, take the closest one, copy it to a temporary name, "svn update -r r843", and finally atomic mv to "r843" and get the contents from there.
So similar to what we do with git, but play to svn's strengths by only getting the intermediate revisions we care about.
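As a rough sketch of the export idea (the helper and its arguments are hypothetical):

def svn_export_command(key, server):
    # Key format: 'svn:<repo-uuid>/<path-in-repo>@<rev>'; substitute the
    # configured server access method and name for the UUID.
    body = key[len('svn:'):]
    uuid, _, path_and_rev = body.partition('/')
    return 'svn export %s/%s target' % (server, path_and_rev)

# svn_export_command('svn:2c0bc1af-c665-0410-8482-af9a87a0766a/branches/mybranch/dir/subdir@543',
#                    'svn://server.host.com')
# -> 'svn export svn://server.host.com/branches/mybranch/dir/subdir@543 target'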
Currently, the strategy for making the artifact directory is (this is in core/build_store.py):
Note: In this setting, os.path.exists and similar are banned; one must check for existence by attempting to create the directory, and only if the creation succeeded does one own it. This avoids races.
The problem is if:
a) a build terminates abruptly (machine power off, so that the builder doesn't get to remove the artifact directory during stack unwind), or
b) two builds of the same artifact are launched at the same time
In either case, you'll end up with both "zlib/a5df" and "zlib/a5dfg" for the same build artifact (one perhaps partially finished, and only one of them referenced from the "$db_root").
The principle is: You should only end up with "zlib/a5df" and "zlib/a5dfg" if they are actually two different build artifacts with differing build.json.
A common strategy would be to build into "zlib/a5df.tmp", then once the build is done rename the directory (which is an atomic operation). However, this requires support from build systems for building against a non-existent prefix and installing somewhere else -- automake supports the "DESTDIR" environment variable, but it is still an extra burden when creating builds that we would like to avoid.
So it seems like the best strategy here is explicit locking/raising an error.
Note: What decides whether an artifact is built is the presence of "db/zlib/a5/df...". This issue describes the case where such a file was not found, but there is still a collision for the build artifact directory.
What to do:
1) Rather than attempting "mkdir" to create the artifact dir, first create another directory using tempfile.mkdtemp in the same parent directory ("zlib/a5df.tmpXDSFSD"), with a file "id" containing the full artifact id and a newline, "zlib/a5dfghhjkasdfasdf...\n".
Then atomically rename it to the real name, so that the directory is created atomically with the "id" file already in it.
2) If the directory already exists, first check the "id" file to see if the id is different -- if it is, lengthen the hash as today. But if it is the same, then instead of continuing, raise an error informing about it, and say it could happen because of a race or because of an earlier aborted build.
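A minimal sketch of both steps, with hypothetical names and no cleanup of the temporary directory on failure:

import errno
import os
import tempfile

def create_artifact_dir(parent, name, artifact_id):
    # Step 1: build the directory with its "id" file under a temporary name,
    # then atomically rename it into place.
    tmp = tempfile.mkdtemp(prefix=name + '.tmp', dir=parent)
    with open(os.path.join(tmp, 'id'), 'w') as f:
        f.write(artifact_id + '\n')
    target = os.path.join(parent, name)
    try:
        os.rename(tmp, target)
    except OSError as e:
        if e.errno not in (errno.EEXIST, errno.ENOTEMPTY):
            raise
        # Step 2: the directory already exists; compare the ids.
        with open(os.path.join(target, 'id')) as f:
            existing_id = f.read().strip()
        if existing_id == artifact_id:
            raise RuntimeError('%s exists; possibly a race or an earlier '
                               'aborted build, remove it to retry' % target)
        return None  # different id: genuine collision, lengthen the hash
    return target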
The user can recover by explicitly removing it:
chmod +w -R opt/zlib/a5df
rm -r opt/zlib/a5df
Code to change: hashdist/cli/main.py, hashdist/hdist_logging.py
Currently one always gets a stack trace when something goes wrong, which is not user-friendly. We should gradually move towards a strategy to give clear error messages.
In hdist_logging.py, make it maintain a flag "has_error_occurred", set to True on the first call to error().
In main.py, catch all exceptions. If the logger has the error_occurred flag set, simply exit with a nonzero status code and silence the stack trace (the logger has already printed the error message).
Still, if HDIST_DEBUG=1 in os.environ, the exception should not be intercepted at all.
If error_occurred is NOT set, print the stack trace to the logger or sys.stderr (using the traceback module), THEN print info to the user saying "this exception has not been translated to a human-friendly error message, please file an issue pasting this stack trace".
PS: Be very careful not to dump anything to sys.stdout, since the hdist tool's stdout is often piped as input to other programs.
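A minimal sketch of the main.py side, assuming the logger grows an error_occurred attribute as described:

import os
import sys
import traceback

def run_with_error_handling(logger, func):
    # Hypothetical wrapper around the actual command dispatch.
    if os.environ.get('HDIST_DEBUG') == '1':
        return func()  # do not intercept the exception at all
    try:
        return func()
    except Exception:
        if not logger.error_occurred:
            # No translated error message exists yet; show the raw trace.
            traceback.print_exc(file=sys.stderr)  # never to sys.stdout!
            sys.stderr.write('This exception has not been translated to a '
                             'human-friendly error message, please file an '
                             'issue pasting this stack trace.\n')
        sys.exit(1)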
Optionally, of course, but we want to encourage pkgcfg and discourage libtool
As of abf5d00, some of the tests fail. Are these known errors?
E.......S....F.............................F...........Downloading /tmp/foo/garbage.tar.gz...
curl: (3) <url> malformed
Downloading http://localhost:999/foo.tar.gz...
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (7) couldn't connect to host
..From /tmp/tmpN54wqw
* branch master -> FETCH_HEAD
From /tmp/tmpN54wqw
* branch devel -> FETCH_HEAD
.From /tmp/tmpN54wqw
* branch devel -> FETCH_HEAD
* branch master -> FETCH_HEAD
...From /tmp/tmpN54wqw
* branch master -> FETCH_HEAD
From /tmp/tmpN54wqw
* branch master -> FETCH_HEAD
...
gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
..........
======================================================================
ERROR: hashdist.cli.test.test_build_tools.test_symlinks
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
self.test(*self.arg)
File "/home/ondrej/repos/hashdist/hashdist/cli/test/test_build_tools.py", line 39, in test_symlinks
sh.hdist('create-links', '--key=section1/section2', 'build.json', _env=env)
File "/home/ondrej/repos/hashdist/hashdist/deps/sh.py", line 648, in __call__
return RunningCommand(cmd, call_args, stdin, stdout, stderr)
File "/home/ondrej/repos/hashdist/hashdist/deps/sh.py", line 268, in __init__
self.wait()
File "/home/ondrej/repos/hashdist/hashdist/deps/sh.py", line 272, in wait
self._handle_exit_code(self.process.wait())
File "/home/ondrej/repos/hashdist/hashdist/deps/sh.py", line 281, in _handle_exit_code
self.process.stderr
ErrorReturnCode_1:
RAN: '/home/ondrej/repos/hashdist/bin/hdist create-links --key=section1/section2 build.json'
STDOUT:
STDERR:
Traceback (most recent call last):
File "/home/ondrej/repos/hashdist/bin/hdist", line 5, in <module>
from hashdist.cli.main import main
ImportError: No module named hashdist.cli.main
======================================================================
FAIL: hashdist.core.test.test_build_store.test_hash_prefix_collision
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
self.test(*self.arg)
File "/home/ondrej/repos/hashdist/hashdist/core/test/test_build_store.py", line 89, in decorated
return func(tempdir, sc, bldr, config)
File "/home/ondrej/repos/hashdist/hashdist/core/test/test_build_store.py", line 209, in test_hash_prefix_collision
assert x[:1] in hashparts
AssertionError
======================================================================
FAIL: hashdist.core.test.test_run_job.test_run_job_environment
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
self.test(*self.arg)
File "/home/ondrej/repos/hashdist/hashdist/core/test/test_build_store.py", line 89, in decorated
return func(tempdir, sc, bldr, config)
File "/home/ondrej/repos/hashdist/hashdist/core/test/test_run_job.py", line 53, in test_run_job_environment
'BAZ': 'BAZ'}
AssertionError
----------------------------------------------------------------------
Ran 74 tests in 46.021s
FAILED (SKIP=1, errors=1, failures=2)
Currently GNU tar with the --strip-components option is needed; it would be better to use the built-in tarfile module.
There's prior art in "distlib" (in util.py) for supporting several extraction schemes:
Note that we support a "strip=1" feature so that one does not need to go "cd zlib-1.2.3" after extraction. To support this, the code would likely look something like this (this seems to be OK looking briefly at Python/Lib/tarfile.py):
def remove_prefix(tarinfo, strip=1):
    parts = tarinfo.path.split('/')
    if len(parts) <= strip:
        raise ValueError('too few path components in %r' % tarinfo.path)
    tarinfo.path = '/'.join(parts[strip:])
    return tarinfo  # extractall needs the modified TarInfo back

tar = tarfile.open("sample.tar.gz")
tar.extractall(members=(remove_prefix(tarinfo) for tarinfo in tar))
tar.close()
Note: in test_source_cache.py, make_mock_archive/make_temporary_tarball should be modified to put items into a sub-directory to test the strip feature.
Pasting here as directed:
ondrej@hawk:~/repos/hashdist(fix31v2)$ PYTHONPATH=. bin/hdist fetchgit https://github.com/hashdist/hashdist.git master
Uncaught exception:
Traceback (most recent call last):
File "/home/ondrej/repos/hashdist/hashdist/cli/main.py", line 114, in main
retcode = args.subcommand_handler(ctx, args)
File "/home/ondrej/repos/hashdist/hashdist/cli/source_cache_cli.py", line 35, in run
key = store.fetch_git(args.repository, args.rev)
File "/home/ondrej/repos/hashdist/hashdist/core/source_cache.py", line 193, in fetch_git
return GitSourceCache(self).fetch_git(repository, rev)
File "/home/ondrej/repos/hashdist/hashdist/core/source_cache.py", line 414, in fetch_git
commit = self._resolve_remote_rev(repository, rev)
File "/home/ondrej/repos/hashdist/hashdist/core/source_cache.py", line 369, in _resolve_remote_rev
(rev, repository))
SourceNotFoundError: "master" resolves to multiple branches/tags in "https://github.com/hashdist/hashdist.git"
This exception has not been translated to a human-friendly error message,
please file an issue at https://github.com/hashdist/hashdist/issues pasting
this stack trace.
What does this error mean? I just wanted to download hashdist.git.
The loader installation itself relies on a recipe, and it would be more convenient if that recipe could specify postprocess --launcher-shebangs. In other words, delay raising an exception for $LOADER until it is actually needed.
In 02b8ed8 we changed from urllib2 to shelling out to curl for downloading, in order to get nice statistics, progress, etc. This should be switched back.
The relevant code is in SourceCache._download_and_hash. When an error happens (404 or similar), one should first log the error, then re-raise the exception (see #29).
Just switching it over to urllib2 is easy; the problem is that when downloading a huge file, one should probably update a progress meter (something like "34% (4 MB of 100 MB)"). To integrate a progress meter:
In hashdist/hdist_logging.py, add "start_progress(msg)", "update_progress(percentage)", and "stop_progress()" methods. The first sets "self.progress_msg", update_progress prints "\r{self.progress_msg}{percentage-message}", and the last emits "done\n" instead of the {percentage-message}.
BUT, the progress meter should not be dumped to backing log files ("raw streams"); there, only the start/end messages should be emitted, with no "\r" characters.
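A minimal sketch of the three methods, ignoring the raw-stream handling (the class name is hypothetical):

import sys

class ProgressMethodsSketch(object):
    def start_progress(self, msg):
        self.progress_msg = msg

    def update_progress(self, percentage_message):
        # Terminal only; backing log files must never see the "\r" updates.
        sys.stderr.write('\r%s%s' % (self.progress_msg, percentage_message))
        sys.stderr.flush()

    def stop_progress(self):
        sys.stderr.write('\r%sdone\n' % self.progress_msg)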
I have pushed an initial .travis.yml in ef43205; it tests Python 2.6, 2.7, 3.2, and 3.3, but all except Python 2.7 are allowed to fail on Travis. The goal is to always have a green light; as we add support for more Python versions, the tests will start working and we remove the "allowed to fail" flag.
Currently, there is only one test failure in Python 2.7:
https://travis-ci.org/hashdist/hashdist/jobs/4908412
Any idea what is wrong?
Currently the build artifact path, e.g., "$artifact_root/zlib/45fg", is configurable by the user.
This works well for local builds (they are resolved by looking up the full hash in "$db_root/45/fg...", and even if you change the configuration in the middle there's no problem). However, since build artifacts refer to each other with relative paths (e.g., in the RPATH), then if you want to redistribute the artifacts to another machine you really need to use the same format.
So we should probably pick one standard path, with a strategy for how much of the hash to include.
Options:
Some numbers
The chance of collision with k entries in a hash space of size n is approximately

import numpy as np

def f(k, n):
    return 1 - np.exp(-k * (k - 1) / 2. / n)

So for a number of packages, each with a number of builds, and a "numchars"-long base32 hash, you have

def g(numpkg, numbld, numchars):
    return 1 - (1 - f(numbld, 2**(5 * numchars)))**numpkg
and the probability with 1000 packages, 10000 builds each, of having a collision is
In [47]: g(1000, 10000, 8)
Out[47]: 0.044451910685290308
In [48]: g(1000, 10000, 10)
Out[48]: 4.4403494215305983e-05
In [49]: g(1000, 10000, 12)
Out[49]: 4.336375614144572e-08
Note: I'm assuming one does some form of collision detection using the full hash, so that we are not concerned with the "security" aspect here. (Since you can't trust the contents of a build artifact from the hash alone anyway, there's zero security to leverage from the hash already, so we're only concerned with accidental collision.)
With 8, 10, 12, and all characters, the paths are respectively:
opt/zlib/mmowfdgr
opt/zlib/mmowfdgral
opt/zlib/mmowfdgralw6
opt/zlib/mmowfdgralw66cswphspccq23vazfm7p
When shelling out to git, git currently inherits stderr. Instead, stderr should be read and forwarded to self.logger.debug().
This should be rather simple...
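For instance, a rough sketch (the helper name is hypothetical):

import subprocess

def run_git(logger, *args):
    # Read git's stderr instead of inheriting it, and forward each line
    # to the debug log.
    proc = subprocess.Popen(('git',) + args,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    for line in err.splitlines():
        logger.debug(line)
    return proc.returncode, out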
This is in response to hashdist/hashstack-old#2
I would like this to happen at the Hashdist level, working from the key. I.e., you don't go for Python-2.7.3.tar.bz2, but tar.bz2:ojsfpyi4wfj23q7ufcvpdea7yvq2g5gd, and check a list of mirror servers for that:
http://mirror1/hashdist/src/tar.bz2/oj/sfpyi4wfj23q7ufcvpdea7yvq2g5gd.tar.bz2
http://mirror2/hashdist/src/tar.bz2/oj/sfpyi4wfj23q7ufcvpdea7yvq2g5gd.tar.bz2
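A hypothetical sketch of expanding a key into that mirror layout:

def mirror_urls(key, mirrors):
    # 'tar.bz2:ojsfpyi...' -> ['http://mirror1/hashdist/src/tar.bz2/oj/sfpyi....tar.bz2', ...]
    type_, _, hash_ = key.partition(':')
    return ['%s/hashdist/src/%s/%s/%s.%s' % (mirror, type_, hash_[:2], hash_[2:], type_)
            for mirror in mirrors]

# mirror_urls('tar.bz2:ojsfpyi4wfj23q7ufcvpdea7yvq2g5gd', ['http://mirror1', 'http://mirror2'])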
It would be good to have a tool for this, e.g., you run
hdist fetch-to-remote http://python.org/ftp/python/2.7.3/Python-2.7.3.tar.bz2
and in addition to downloading the file locally, the file is scp-ed to the tarball servers of choice (configured in ~/.hashdistconfig).
Also, instead of using 7pmm.. or dpys.., use the first 7 characters of the hash, with the git color.
Pasting here as directed:
ondrej@hawk:~/repos/hashdist(fix31v2)$ PYTHONPATH=. bin/hdist fetchgit https://github.com/hashdist/hashdist.git git:65df19fbe5b9f9de56e6eecea77caa478002aa20
Uncaught exception:
Traceback (most recent call last):
File "/home/ondrej/repos/hashdist/hashdist/cli/main.py", line 114, in main
retcode = args.subcommand_handler(ctx, args)
File "/home/ondrej/repos/hashdist/hashdist/cli/source_cache_cli.py", line 35, in run
key = store.fetch_git(args.repository, args.rev)
File "/home/ondrej/repos/hashdist/hashdist/core/source_cache.py", line 193, in fetch_git
return GitSourceCache(self).fetch_git(repository, rev)
File "/home/ondrej/repos/hashdist/hashdist/core/source_cache.py", line 414, in fetch_git
commit = self._resolve_remote_rev(repository, rev)
File "/home/ondrej/repos/hashdist/hashdist/core/source_cache.py", line 358, in _resolve_remote_rev
"characters" % rev)
SourceNotFoundError: no rev 'git:65df19fbe5b9f9de56e6eecea77caa478002aa20'; note that when using a git SHA1 commit hash one needs to use all 40 characters
This exception has not been translated to a human-friendly error message,
please file an issue at https://github.com/hashdist/hashdist/issues pasting
this stack trace.
A builder dying in progress (or a race) will currently block other builders with a NotImplementedError, because the full artifact path is already taken.
There are two strategies for this:
My current favourite is 2.
Currently hashdist/core/run_job.py hard-codes the build environment according to certain rules (i.e., it sets up HDIST_CFLAGS by adding -I$ARTIFACT/include for all dependency artifacts, and so on).
However, this is not flexible enough. E.g., we want all build dependencies that are Python packages to add themselves to PYTHONPATH -- but if and only if you ask them to (because using hdist build-profile push is another option, and in that case you don't want PYTHONPATH set).
Each artifact has an "artifact.json". Currently this only has an "install" section used for profile installation, but it should gain another section, something like:
"on_import": {
"env_actions": {
"PATH": {
"sep": ":",
"prepend": "$ARTIFACT/bin"
},
"PYTHONPATH": {
"sep": ":",
"prepend": "$ARTIFACT/lib/site-packages/python2.7"
},
"FOOFLAG": { "assign": "some-value" }
}
}
Rules: $ARTIFACT is expanded to the artifact in question (I think this is the only one so far).
What should be modified is get_imports_env in hashdist/core/run_job.py. There's already an in_env flag for each imported artifact that should be honored (i.e., if you want to use "hdist build-profile push" instead, you set in_env to False).
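A minimal sketch of applying the actions for one imported artifact (the helper is hypothetical):

def apply_env_actions(env, artifact_dir, env_actions):
    # Interpret the "on_import"/"env_actions" section shown above.
    for var, action in env_actions.items():
        if 'assign' in action:
            env[var] = action['assign']
        elif 'prepend' in action:
            value = action['prepend'].replace('$ARTIFACT', artifact_dir)
            if var in env:
                env[var] = value + action.get('sep', ':') + env[var]
            else:
                env[var] = value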
I am following the instructions at:
http://hashdist.readthedocs.org/en/latest/tutorial.html
and got:
ondrej@hawk:~/repos/hashdist(master)$ python mystack.py target
Status:
profile/hsrs.. [needs build]
hdf5/2vuj.. [needs build]
virtual:gcc-stack/host (=gcc-stack/vi4w..) [needs build]
virtual:hdist-cli/r0 (=hdist-cli/o5au..) [needs build]
szip/m4pr.. [needs build]
virtual:unix/host (=unix/djnp..) [needs build]
virtual:hdist-cli/r0 (=hdist-cli/o5au..) (see above)
virtual:gcc-stack/host (=gcc-stack/vi4w..) (see above)
zlib/hkcm.. [needs build]
virtual:gcc-stack/host (=gcc-stack/vi4w..),virtual:unix/host (=unix/djnp..) (see above)
virtual:unix/host (=unix/djnp..) (see above)
szip/m4pr..,zlib/hkcm.. (see above)
Build needed
[hdist-cli] Building o5au.., follow log with:
[hdist-cli] tail -f /home/ondrej/.hdist/bld/hdist-cli-r0-o5au/build.log
[gcc-stack] Building vi4w.., follow log with:
[gcc-stack] tail -f /home/ondrej/.hdist/bld/gcc-stack-host-vi4w/build.log
[unix] Building djnp.., follow log with:
[unix] tail -f /home/ondrej/.hdist/bld/unix-host-djnp/build.log
Downloading http://zlib.net/zlib-1.2.6.tar.gz...
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 334 100 334 0 0 2236 0 --:--:-- --:--:-- --:--:-- 4575
Traceback (most recent call last):
File "mystack.py", line 25, in <module>
hr.cli.stack_script_cli(profile)
File "/home/ondrej/repos/hashdist/hashdist/recipes/cli.py", line 61, in stack_script_cli
build_recipes(build_store, source_cache, config, [root_recipe], keep_build=args.keep)
File "/home/ondrej/repos/hashdist/hashdist/recipes/recipes.py", line 302, in build_recipes
_depth_first_build(recipe)
File "/home/ondrej/repos/hashdist/hashdist/recipes/recipes.py", line 286, in _depth_first_build
_depth_first_build(dep_pkg)
File "/home/ondrej/repos/hashdist/hashdist/recipes/recipes.py", line 288, in _depth_first_build
recipe.fetch_sources(source_cache)
File "/home/ondrej/repos/hashdist/hashdist/recipes/recipes.py", line 147, in fetch_sources
fetch.fetch_into(source_cache)
File "/home/ondrej/repos/hashdist/hashdist/recipes/recipes.py", line 23, in fetch_into
source_cache.fetch(self.url, self.key)
File "/home/ondrej/repos/hashdist/hashdist/core/source_cache.py", line 262, in fetch
handler.fetch(url, type, hash)
File "/home/ondrej/repos/hashdist/hashdist/core/source_cache.py", line 539, in fetch
self.fetch_archive(url, type, hash)
File "/home/ondrej/repos/hashdist/hashdist/core/source_cache.py", line 551, in fetch_archive
(url, hash, expected_hash))
RuntimeError: File downloaded from "http://zlib.net/zlib-1.2.6.tar.gz" has hash d3lib55einogrnsam4e3wcehik2wfn5o but expected HtaA96RDXGi2QGcJuwiHQrVit67BpE0Llh6UzzCv-q8
For git, we keep a list of git repositories to sync in ~/.hdistconfig, and if a git commit is not present, all git repos are queried for the commit and the one containing it is fetched.
Simply symlinking scripts to a launcher is inadequate because often enough Python files are both imported and executed.
I guess we should resort to Python-specific multi-line shebangs in this case, like this example:
All other work-in-progress branches should be maintained in our private repositories on GitHub. Exceptions are branches that for some reason cannot be easily merged with master.
n/t
The hashdist repo should not have gh-pages, but rather all pages should be handled by the new hashdist.github.com repo.
An ``hdist rpath`` tool which will take ``HDIST_ABS_PATH`` and turn it into an arbitrary relative path.
The key of some source item (say, an archive) should probably be something stable and documented, not something that depends on a specific source cache instance. source_cache.py should be refactored to separate these concerns more clearly.
Sometimes downloading an archive fetches the wrong file. While that should be fixed at a higher level (e.g., a 404 should raise an error rather than downloading the 404 HTML page!), the most important thing is just a basic validation that what was downloaded is indeed a valid archive.
Preferably, the Python 'tarfile' module should be used for this for maximum portability. (Eventually 'tarfile' should be used for all extraction, but it's slightly complicated to implement the --strip-components feature, so that's a separate ticket.)
There's prior art in "distlib" (in util.py) for supporting extraction of many different file types:
To be done:
Step 1)
The source cache should start using the logger to report errors. The CLI in hashdist/cli/source_cache_cli.py already sets up the right logger, but SourceCache.__init__ needs to take the logger and save it (and it must be passed on from SourceCache.create_from_config). Add a few logger.info("Downloading %s") messages.
PS. Log messages are silenced during unit-testing by default, so to check that this works, do, e.g.,
VERBOSE=1 nosetests --nocapture hashdist/core/test/test_source_cache -m test_basic
There's also a MemoryLogger in test.utils one can use if one wants to unit-test logging (though usually that's a bit overkill).
Step 2)
In hashdist/core/source_cache.py, at the very end of ArchiveSourceCache._download_and_hash, stream through the archive (iterate over all the members and get header info, without extracting) using the Python tarfile module. If there's an exception, capture it and
a) self.logger.error("File downloaded from %s is not an archive" ...)
b) raise SourceNotFoundError("File downloaded from ...")
(At the end of _download_and_hash, the file has been downloaded to a temporary location but not yet moved into its final destination, which is the right place to do the check. Alternatively, the check can be invoked right after the caller of _download_and_hash gets the result back...)
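A minimal sketch of the check, assuming SourceNotFoundError is the exception from hashdist/core/source_cache.py:

import tarfile

def check_is_archive(temp_path, url, logger):
    # Stream through the members without extracting; a corrupt download
    # raises a tarfile.TarError while the headers are being read.
    try:
        tar = tarfile.open(temp_path)
        try:
            for member in tar:
                pass  # reading the headers is enough
        finally:
            tar.close()
    except tarfile.TarError:
        logger.error("File downloaded from %s is not an archive" % url)
        raise SourceNotFoundError("File downloaded from %s is not an archive" % url)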
Step 3)
Write a test like fetch_archive in hashdist/core/test/test_source_cache.py, but fetch a corrupt archive instead:

with temp_dir() as d:
    with open(pjoin(d, 'foo.tar.gz'), 'w') as f:
        f.write('foo') # definitely not a tar.gz archive