conp-pcno / git-annex-remote-globus
A git annex special remote
License: MIT License
The file is there, and I am even logged in to globus (unrelated to the remote mechanism):
$> globus ls 558c54fe-cc41-11e8-8c6a-0a1d4c5c824a:/SenzaiY/YMV01/YMV01_170818/ | grep ampli
amplitudes.npy
$> git annex addurl --debug globus://558c54fe-cc41-11e8-8c6a-0a1d4c5c824a/SenzaiY/YMV01/YMV01_170818/amplitudes.npy
[2020-06-29 22:43:43.768509247] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","show-ref","git-annex"]
[2020-06-29 22:43:43.773170539] process done ExitSuccess
[2020-06-29 22:43:43.773271196] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","show-ref","--hash","refs/heads/git-annex"]
[2020-06-29 22:43:43.777702694] process done ExitSuccess
[2020-06-29 22:43:43.777907788] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","log","refs/heads/git-annex..3458fdc310d2deaeaa80c8dfaefe1698a9682e1c","--pretty=%H","-n1"]
[2020-06-29 22:43:43.790929834] process done ExitSuccess
[2020-06-29 22:43:43.791368807] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","cat-file","--batch"]
[2020-06-29 22:43:43.791908396] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","cat-file","--batch-check=%(objectname) %(objecttype) %(objectsize)"]
[2020-06-29 22:43:43.796621346] chat: /home/yoh/proj/datalad/git-annex-remote-globus/venvs/dev3/bin/git-annex-remote-globus []
[2020-06-29 22:43:44.238477469] git-annex-remote-globus[1] --> VERSION 1
[2020-06-29 22:43:44.2387346] git-annex-remote-globus[1] <-- EXTENSIONS INFO
[2020-06-29 22:43:44.239000521] git-annex-remote-globus[1] --> EXTENSIONS
[2020-06-29 22:43:44.239135367] git-annex-remote-globus[1] <-- PREPARE
[2020-06-29 22:43:44.2393114] git-annex-remote-globus[1] --> GETCONFIG uuid
[2020-06-29 22:43:44.239419178] git-annex-remote-globus[1] <-- VALUE 558c54fe-cc41-11e8-8c6a-0a1d4c5c824a
[2020-06-29 22:43:44.239597611] git-annex-remote-globus[1] --> GETCONFIG fileprefix
[2020-06-29 22:43:44.239709186] git-annex-remote-globus[1] <-- VALUE
[2020-06-29 22:43:44.239849382] git-annex-remote-globus[1] --> GETCONFIG endpoint
[2020-06-29 22:43:44.239962534] git-annex-remote-globus[1] <-- VALUE 558c54fe-cc41-11e8-8c6a-0a1d4c5c824a
[2020-06-29 22:43:44.610854782] git-annex-remote-globus[1] --> PREPARE-SUCCESS
[2020-06-29 22:43:44.611090293] git-annex-remote-globus[1] <-- CLAIMURL globus://558c54fe-cc41-11e8-8c6a-0a1d4c5c824a/SenzaiY/YMV01/YMV01_170818/amplitudes.npy
[2020-06-29 22:43:44.611351451] git-annex-remote-globus[1] --> CLAIMURL-SUCCESS
[2020-06-29 22:43:44.611508727] git-annex-remote-globus[1] <-- CHECKURL globus://558c54fe-cc41-11e8-8c6a-0a1d4c5c824a/SenzaiY/YMV01/YMV01_170818/amplitudes.npy
[2020-06-29 22:43:44.875955308] git-annex-remote-globus[1] --> CHECKURL-CONTENTS 13950872
addurl globus://558c54fe-cc41-11e8-8c6a-0a1d4c5c824a/SenzaiY/YMV01/YMV01_170818/amplitudes.npy (from globus) (to 558c54fe_cc41_11e8_8c6a_0a1d4c5c824a_SenzaiY_YMV01_YMV01_170818_amplitudes.npy) [2020-06-29 22:43:44.876574066] read: git ["--version"]
[2020-06-29 22:43:44.880568781] process done ExitSuccess
[2020-06-29 22:43:44.880762238] chat: git ["--git-dir=.git","--work-tree=.","check-ignore","-z","--stdin","--verbose","--non-matching"]
[2020-06-29 22:43:44.909946917] git-annex-remote-globus[1] <-- TRANSFER RETRIEVE URL-s13950872--globus://558c54fe-cc41-11e8-8c6-1d0e3f16a2fbf978cc33ec614c02e79d .git/annex/tmp/URL-s13950872--globus&c%%558c54fe-cc41-11e8-8c6-1d0e3f16a2fbf978cc33ec614c02e79d
[2020-06-29 22:43:44.910314097] git-annex-remote-globus[1] --> GETURLS URL-s13950872--globus://558c54fe-cc41-11e8-8c6-1d0e3f16a2fbf978cc33ec614c02e79d globus://
[2020-06-29 22:43:44.912313981] git-annex-remote-globus[1] <-- VALUE globus://558c54fe-cc41-11e8-8c6a-0a1d4c5c824a/SenzaiY/YMV01/YMV01_170818/amplitudes.npy
[2020-06-29 22:43:44.912479181] git-annex-remote-globus[1] <-- VALUE
[2020-06-29 22:43:46.3695805] git-annex-remote-globus[1] --> TRANSFER-SUCCESS RETRIEVE URL-s13950872--globus://558c54fe-cc41-11e8-8c6-1d0e3f16a2fbf978cc33ec614c02e79d
[2020-06-29 22:43:46.370252941] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","check-attr","-z","--stdin","annex.backend","annex.numcopies","annex.largefiles","--"]
[2020-06-29 22:43:46.37075713] read: git ["--version"]
[2020-06-29 22:43:46.374468113] process done ExitSuccess
(non-large file; adding content to git repository) ok
(recording state in git...)
[2020-06-29 22:43:46.393727743] feed: xargs ["-0","git","--git-dir=.git","--work-tree=.","--literal-pathspecs","add","--"]
[2020-06-29 22:43:46.443980111] process done ExitSuccess
[2020-06-29 22:43:46.512114758] process done ExitSuccess
[2020-06-29 22:43:46.512947576] process done ExitSuccess
[2020-06-29 22:43:46.51342478] process done ExitSuccess
[2020-06-29 22:43:46.513917556] process done ExitSuccess
[2020-06-29 22:43:46.514354977] process done ExitFailure 1
(dev3) 2 19497 [1].....................................:Mon 29 Jun 2020 10:43:47 PM EDT:.
(git)smaug:/tmp/testdsg[master]git
$> echo $?
0
But that file is actually an HTML page demanding a login. Maybe the credentials record did not work out or something like that; I don't know:
$> file -L 558c54fe_cc41_11e8_8c6a_0a1d4c5c824a_SenzaiY_YMV01_YMV01_170818_amplitudes.npy
558c54fe_cc41_11e8_8c6a_0a1d4c5c824a_SenzaiY_YMV01_YMV01_170818_amplitudes.npy: HTML document, UTF-8 Unicode text, with very long lines
$> grep 'login' 558c54fe_cc41_11e8_8c6a_0a1d4c5c824a_SenzaiY_YMV01_YMV01_170818_amplitudes.npy
<link href="https://fonts.googleapis.com/css?family=Roboto" rel="stylesheet" type="text/css"> <!-- included for google login -->
<body class="login-page">
<h1>Additional logins needed to use 2a9f4.8443.dn.glob.us</h1>
<h2>Use your existing organizational login</h2>
Globus ID is an identity provider operated by Globus. If you don't want to use your existing organizational login (e.g. university, national lab, facility, project, Google) with Globus, you can create an account on Globus ID and use that to log into Globus.
<button id="login-btn" type="submit">Continue</button>
var continue_btn = $('#login-btn');
/* Load / save the selected IdP in localStorage for simple login convenience */
<div id="login_btns_block" class="text-center">
I must confess that the remote configuration is also quite ad hoc: I see no point in all of those endpoint-specific settings just to be able to download URLs (which already carry that information), but they are needed for the current code to function:
$> git show git-annex:remote.log
558c54fe-cc41-11e8-8c6a-0a1d4c5c824a encryption=none endpoint=558c54fe-cc41-11e8-8c6a-0a1d4c5c824a externaltype=globus name=globus type=external uuid=558c54fe-cc41-11e8-8c6a-0a1d4c5c824a timestamp=1593482271.014958164s
But the point is that the code should detect that I am not properly authenticated, instead of reporting TRANSFER-SUCCESS for a login page.
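One way such a check could look is sketched below. It is only an illustration, not the remote's actual code: the helper names `looks_like_html` and `verify_download` are hypothetical, and the idea is simply to compare the downloaded file against the size reported by CHECKURL (13950872 in the log above) and to reject content that starts like an HTML document before answering TRANSFER-SUCCESS.

```python
import os

# Byte prefixes that typically open an HTML document (compared case-insensitively).
HTML_MARKERS = (b"<!doctype html", b"<html")


def looks_like_html(path, probe_bytes=512):
    """Return True if the file at `path` starts like an HTML document."""
    with open(path, "rb") as f:
        head = f.read(probe_bytes).lstrip().lower()
    return head.startswith(HTML_MARKERS)


def verify_download(path, expected_size=None):
    """Raise ValueError if the download has the wrong size or looks like a web page.

    `expected_size` is assumed to be the size the remote reported earlier
    (e.g. in its CHECKURL-CONTENTS reply); pass None to skip the size check.
    """
    actual_size = os.path.getsize(path)
    if expected_size is not None and actual_size != expected_size:
        raise ValueError(
            f"{path}: expected {expected_size} bytes, got {actual_size}"
        )
    if looks_like_html(path):
        raise ValueError(f"{path}: content looks like an HTML page (login prompt?)")
```

The HTML check is a heuristic and would also reject a legitimately annexed HTML file, so the size comparison is probably the more reliable of the two signals whenever the size is known.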
Currently
$> du -scm .git/objects
9 .git/objects
9 total
which is too big for what it should be, because various files which shouldn't be in git (venv, __pycache__) were committed. The testremote branch already has them gone, but IMHO the whole history should be rewritten with git filter-branch (sooner rather than later) to shrink .git/objects.
$> git annex initremote datalad type=external externaltype=datalad encryption=none
...
$> git annex addurl --pathdepth=-1 --debug s3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt
...
[2019-11-08 10:50:13.671664748] git-annex-remote-datalad[1] <-- CLAIMURL s3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt
[2019-11-08 10:50:13.671791652] git-annex-remote-datalad[1] --> DEBUG Encodings: filesystem utf-8, default utf-8
[2019-11-08 10:50:13.67187777] Encodings: filesystem utf-8, default utf-8
[2019-11-08 10:50:13.672111287] git-annex-remote-datalad[1] --> DEBUG Claiming url 's3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt'
[2019-11-08 10:50:13.672210539] Claiming url 's3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt'
[2019-11-08 10:50:13.672270751] git-annex-remote-datalad[1] --> CLAIMURL-SUCCESS
[2019-11-08 10:50:13.67239921] git-annex-remote-datalad[1] <-- CHECKURL s3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt
[2019-11-08 10:50:14.040837149] git-annex-remote-datalad[1] --> CHECKURL-CONTENTS 4250 ds116/sub001/BOLD/task001_run001/QA/fd.txt
addurl s3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt (from datalad) (to ds116/sub001/BOLD/task001_run001/QA/fd.txt) [2019-11-08 10:50:14.041350273] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","symbolic-ref","-q","HEAD"]
[2019-11-08 10:50:14.050977324] process done ExitSuccess
[2019-11-08 10:50:14.051141617] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","show-ref","refs/heads/master"]
[2019-11-08 10:50:14.063291123] process done ExitSuccess
[2019-11-08 10:50:14.063796021] chat: git ["--git-dir=.git","--work-tree=.","check-ignore","-z","--stdin","--verbose","--non-matching"]
[2019-11-08 10:50:14.085449929] git-annex-remote-datalad[1] <-- TRANSFER RETRIEVE URL-s4250--s3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt .git/annex/tmp/URL-s4250--s3&c%%openfmri%ds116%sub001%BOLD%task001_run001%QA%fd.txt
[2019-11-08 10:50:14.087122748] git-annex-remote-datalad[1] --> GETURLS URL-s4250--s3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt http:
[2019-11-08 10:50:14.090879728] git-annex-remote-datalad[1] <-- VALUE
[2019-11-08 10:50:14.09178823] git-annex-remote-datalad[1] --> GETURLS URL-s4250--s3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt https:
[2019-11-08 10:50:14.094266045] git-annex-remote-datalad[1] <-- VALUE
[2019-11-08 10:50:14.094895422] git-annex-remote-datalad[1] --> GETURLS URL-s4250--s3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt s3:
[2019-11-08 10:50:14.098633146] git-annex-remote-datalad[1] <-- VALUE s3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt
[2019-11-08 10:50:14.098871334] git-annex-remote-datalad[1] <-- VALUE
[INFO] Downloading 's3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt' into '.git/annex/tmp/URL-s4250--s3&c%%openfmri%ds116%sub001%BOLD%task001_run001%QA%fd.txt'
[2019-11-08 10:50:14.245314399] git-annex-remote-datalad[1] --> PROGRESS 4250
100% 4.15 KiB 26 KiB/s 0s[2019-11-08 10:50:14.246005387] git-annex-remote-datalad[1] --> TRANSFER-SUCCESS RETRIEVE URL-s4250--s3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt
[2019-11-08 10:50:14.24666711] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","check-attr","-z","--stdin","annex.backend","annex.numcopies","annex.largefiles","--"]
[INFO] Successfully downloaded s3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt into .git/annex/tmp/URL-s4250--s3&c%%openfmri%ds116%sub001%BOLD%task001_run001%QA%fd.txt
ok
(recording state in git...)
...
So the workflow would be:
- CLAIMURL-SUCCESS response to CLAIMURL for those urls which start with globus://[<globus-name>|<globus-uuid>]/<fileprefix>, where <globus-name> (or <globus-uuid>) and <fileprefix> are options of the special remote
- CHECKURL-CONTENTS Size|UNKNOWN Filename response to the CHECKURL query by annex (for those matching URLs)
- TRANSFER RETRIEVE Key File: we could analyze the provided Key. If it is a URL key (i.e. starts with URL-) we could actually avoid using GETURLS (as we did in datalad) and just parse that key to extract the URL and the corresponding path to be retrieved

But overall summary: we should be able to make it work as a proper git annex external special remote with GET/PUT and EXPORT, while also supporting regular annex addurl globus://... functionality (thus datalad addurls could be used to establish an "import" of already existing directories on globus, until git annex provides protocol/support for a proper "import").
BUT I believe that git-annex itself might be the one which "registers the url" for the key upon addurl URL (thus storing the ad-hoc globus:// url in the .web file for the key in the git-annex branch). Ideally we should avoid that url being stored, and rather just store the path to the file (and version info), assuming the globus://[<globus-name>|<globus-uuid>]/<fileprefix> prefix. That would allow for more flexible management (e.g. renaming the globus endpoint, or renaming/moving fileprefix) and minimize storage within the git-annex branch. We might need to clarify that with @joeyh (Q: is it possible for a special remote to announce that a claimed url shouldn't be stored as a url for the file?)
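The URL matching and key parsing described in the workflow above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the remote's implementation: `split_globus_url` and `parse_url_key` are hypothetical helper names, `endpoints` stands for whatever names/UUIDs the remote is configured to accept, and the key layout (`URL[-s<size>]--<url>`) is inferred from the keys visible in the debug logs.

```python
import re
from urllib.parse import urlparse

# Key layout as seen in the logs, e.g. "URL-s4250--s3://openfmri/...".
_URL_KEY = re.compile(r"^URL(?:-s(?P<size>\d+))?--(?P<url>.+)$")


def split_globus_url(url, endpoints, fileprefix=""):
    """Return the path below `fileprefix` if `url` belongs to this remote, else None.

    `endpoints` is the collection of endpoint names/UUIDs this remote accepts.
    """
    parsed = urlparse(url)
    if parsed.scheme != "globus" or parsed.netloc not in endpoints:
        return None
    # Normalise the prefix to "/" or "/<fileprefix>/" so only whole path
    # components can match.
    prefix = "/" + fileprefix.strip("/")
    if prefix != "/":
        prefix += "/"
    if not parsed.path.startswith(prefix):
        return None
    return parsed.path[len(prefix):] or None


def parse_url_key(key):
    """Return (size or None, url) for a URL-backend key, or None for other keys."""
    match = _URL_KEY.match(key)
    if match is None:
        return None
    size = int(match["size"]) if match["size"] else None
    return size, match["url"]
```

A CLAIMURL handler would answer CLAIMURL-SUCCESS whenever `split_globus_url` returns a path, and the same relative path is what could be stored instead of the full url. One caveat for the key parsing: in the globus log above the key reads `URL-s13950872--globus://558c54fe-cc41-11e8-8c6-1d0e...`, so the URL part appears to be shortened and suffixed with a hash for long URLs. Parsing the key would then not recover the full URL, and GETURLS would presumably remain necessary as a fallback.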
Currently the code relies on ambiguous imports (they look global but are local):
from lookup_url import lookup_url
...
from globusclient import GlobusClient
from annexremote import Master
from annexremote import ExportRemote
from annexremote import SpecialRemote
from annexremote import RemoteError, ProtocolError
Those files cannot be deployed (installed) on their own, so they would either need to be moved into some local "proper package" (e.g. git_annex_remote_globus ;) ) or just be absorbed into git-annex-remote-globus. In the former case, even the git-annex-remote-globus code could pretty much be moved into the package (e.g. as git_annex_remote_globus/cmd.py), and the actual git_annex_remote_globus would just have the __main__ portion.
That would make it easier to (eventually) establish git_annex_remote_globus/tests, which would test the functionality within that cmd.py.
I would like to ask you a question about git annex way to manage unused data. I
built a git annex special remote using Globus.org as a datastore and I am
setting up a new datalad dataset retrieving files content from Globus. So I set the files keys present with 1 and register the related urls as
expected. Now, the input parameter (prefix) to initremote was wrong, hence my
registered urls were all wrong, so I run 'git annex enableremote [new_prefix]'.
Did you use "git-annex registerurl" for that, or did the external
special remote use the SETURLPRESENT message?
The only way that initremote/enableremote would affect the url is if a
special remote does use SETURLPRESENT.
If the special remote does use SETURLPRESENT, there is also a SETURLMISSING
that it could use to remove the urls you want to remove. But, that might
not be the best fit for your situation, I think.
There is the git-annex rmurl command, that can remove an url from an
annexed file. Probably what you want, I think.
This is all fine, git annex updates the prefix and builds new, now correct,
urls. As you can imagine at this point, git annex has 2 urls per file (one
wrong and one correct) and does not know what to do and fails.
Hmm, normally when git-annex has more than one url for a single file, it
will try all the urls, and only fails if all of the urls are not
accessible. So, I'm curious what this failure looks like.
Admittedly it would be useful to have a command that clears content of internal
git annex files (in the git annex branch) and provides a clean uninitialized
repository without needing to set to 0 or 1 the whole previous history. For
example, I have the case that the datasets in globus move to different
locations and git annex enableremote must be re-run with the new updated
parameters, and everything should work as expected. So I am thinking of
something like 'git annex enableremote [new_parameter] --override'
Hmm, what data in the git-annex branch would that remove, would it be
all the data about that remote? Including removing information about
what keys are stored in the remote?
There is a way to do that, it's just to run git-annex dead with the
name or uuid of the remote, and then git-annex initremote a new
remote name.
But if you don't want to remove the information about what keys are
stored in the remote, using git-annex rmurl seems like a better
approach.
Would it help to be able to run a pipeline like this to find and delete
them all?
git-annex whereis --format '${file}\0${url}' | grep globus://wrong_prefix/ | git-annex rmurl --batch -z
(Not yet implemented but something like that could be added.)
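Until something like that pipeline exists, the same clean-up could be scripted around the existing `git annex rmurl <file> <url>` command. The sketch below is only an illustration: `plan_rmurl` is a hypothetical helper, and the list of (file, url) pairs is assumed to have been collected by whatever means are available. It only builds the commands and does not run them.

```python
def plan_rmurl(file_urls, wrong_prefix):
    """Build one `git annex rmurl` command per url that starts with `wrong_prefix`.

    `file_urls` is an iterable of (file, url) pairs; urls that do not match
    the prefix are left alone.
    """
    return [
        ["git", "annex", "rmurl", path, url]
        for path, url in file_urls
        if url.startswith(wrong_prefix)
    ]


if __name__ == "__main__":
    # Hypothetical data, only to show the shape of the input.
    pairs = [
        ("a.npy", "globus://wrong_prefix/a.npy"),
        ("a.npy", "globus://right_prefix/a.npy"),
    ]
    for command in plan_rmurl(pairs, "globus://wrong_prefix/"):
        print(" ".join(command))
```

Each returned list can be handed to `subprocess.run(command, check=True)` once the plan has been reviewed, which keeps the destructive step separate from the selection logic.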
ATM it would require the user to figure out which python modules (like wget, which is new to me) need to be installed to get this special remote working.
globus_sdk already depends on requests:
(git)lena:~datalad/GLOBUSannex[testremote]git
$> pip install globus_sdk
Collecting globus_sdk
Using cached https://files.pythonhosted.org/packages/be/c1/ccf8e8ebeb229887e6355e87e9cc9f07e06b1661b3c6e50311566bae9f28/globus_sdk-1.8.0-py2.py3-none-any.whl
Requirement already satisfied: six<2.0.0,>=1.10.0 in /usr/lib/python3/dist-packages (from globus_sdk) (1.12.0)
Requirement already satisfied: pyjwt[crypto]<2.0.0,>=1.5.3 in /usr/lib/python3/dist-packages (from globus_sdk) (1.7.1)
Requirement already satisfied: requests<3.0.0,>=2.9.2 in /usr/lib/python3/dist-packages (from globus_sdk) (2.21.0)
Requirement already satisfied: cryptography>=1.4 in /usr/lib/python3/dist-packages (from pyjwt[crypto]<2.0.0,>=1.5.3->globus_sdk) (2.6.1)
Installing collected packages: globus-sdk
Successfully installed globus-sdk-1.8.0
So IMHO it would be better just to use requests. The wget module seems not to have been maintained since 2015, so even though it is quite minimalistic, I expect it to eventually break with newer python versions.
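As a rough idea of what replacing wget with requests could look like, here is a minimal streaming download sketch. It is an assumption-laden illustration rather than a drop-in patch: the `download` helper and its `session` parameter are hypothetical, and any authentication the Globus HTTPS endpoint requires is left out.

```python
def download(url, dest, session=None, chunk_size=1 << 20):
    """Stream `url` into the file `dest` and return the number of bytes written.

    `session` is expected to behave like a requests.Session; when omitted,
    one is created from the requests package.
    """
    if session is None:
        import requests  # imported here so a prepared session can be passed in instead

        session = requests.Session()
    written = 0
    response = session.get(url, stream=True)
    try:
        # Fail loudly on HTTP errors instead of saving an error page.
        response.raise_for_status()
        with open(dest, "wb") as out:
            for chunk in response.iter_content(chunk_size=chunk_size):
                out.write(chunk)
                written += len(chunk)
    finally:
        response.close()
    return written
```

Accepting a session makes it possible to reuse a connection (and whatever headers were set on it) across many files, and it also lets the function be exercised in tests with a stand-in object instead of real network access.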
To correspond to the name of the main script. Also would be consistent with how https://github.com/Lykos153/git-annex-remote-googledrive is named and I like it -- makes it explicit what this repository is about (not just something annexed to GLOBUS)