git-annex-remote-globus's People

Contributors

gi114

git-annex-remote-globus's Issues

does not detect that downloaded content is actually an HTML page requiring authentication

The file is there, and I am even logged in to Globus (unrelated to the remote mechanism):

$> globus ls 558c54fe-cc41-11e8-8c6a-0a1d4c5c824a:/SenzaiY/YMV01/YMV01_170818/ | grep ampli
amplitudes.npy
The remote seems to happily download the file (although I am not sure what that last ExitFailure 1 is about -- the exit status seems clean):
$> git annex addurl --debug globus://558c54fe-cc41-11e8-8c6a-0a1d4c5c824a/SenzaiY/YMV01/YMV01_170818/amplitudes.npy
[2020-06-29 22:43:43.768509247] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","show-ref","git-annex"]
[2020-06-29 22:43:43.773170539] process done ExitSuccess
[2020-06-29 22:43:43.773271196] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","show-ref","--hash","refs/heads/git-annex"]
[2020-06-29 22:43:43.777702694] process done ExitSuccess
[2020-06-29 22:43:43.777907788] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","log","refs/heads/git-annex..3458fdc310d2deaeaa80c8dfaefe1698a9682e1c","--pretty=%H","-n1"]
[2020-06-29 22:43:43.790929834] process done ExitSuccess
[2020-06-29 22:43:43.791368807] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","cat-file","--batch"]
[2020-06-29 22:43:43.791908396] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","cat-file","--batch-check=%(objectname) %(objecttype) %(objectsize)"]
[2020-06-29 22:43:43.796621346] chat: /home/yoh/proj/datalad/git-annex-remote-globus/venvs/dev3/bin/git-annex-remote-globus []
[2020-06-29 22:43:44.238477469] git-annex-remote-globus[1] --> VERSION 1
[2020-06-29 22:43:44.2387346] git-annex-remote-globus[1] <-- EXTENSIONS INFO
[2020-06-29 22:43:44.239000521] git-annex-remote-globus[1] --> EXTENSIONS
[2020-06-29 22:43:44.239135367] git-annex-remote-globus[1] <-- PREPARE
[2020-06-29 22:43:44.2393114] git-annex-remote-globus[1] --> GETCONFIG uuid
[2020-06-29 22:43:44.239419178] git-annex-remote-globus[1] <-- VALUE 558c54fe-cc41-11e8-8c6a-0a1d4c5c824a
[2020-06-29 22:43:44.239597611] git-annex-remote-globus[1] --> GETCONFIG fileprefix
[2020-06-29 22:43:44.239709186] git-annex-remote-globus[1] <-- VALUE
[2020-06-29 22:43:44.239849382] git-annex-remote-globus[1] --> GETCONFIG endpoint
[2020-06-29 22:43:44.239962534] git-annex-remote-globus[1] <-- VALUE 558c54fe-cc41-11e8-8c6a-0a1d4c5c824a
[2020-06-29 22:43:44.610854782] git-annex-remote-globus[1] --> PREPARE-SUCCESS
[2020-06-29 22:43:44.611090293] git-annex-remote-globus[1] <-- CLAIMURL globus://558c54fe-cc41-11e8-8c6a-0a1d4c5c824a/SenzaiY/YMV01/YMV01_170818/amplitudes.npy
[2020-06-29 22:43:44.611351451] git-annex-remote-globus[1] --> CLAIMURL-SUCCESS
[2020-06-29 22:43:44.611508727] git-annex-remote-globus[1] <-- CHECKURL globus://558c54fe-cc41-11e8-8c6a-0a1d4c5c824a/SenzaiY/YMV01/YMV01_170818/amplitudes.npy
[2020-06-29 22:43:44.875955308] git-annex-remote-globus[1] --> CHECKURL-CONTENTS 13950872
addurl globus://558c54fe-cc41-11e8-8c6a-0a1d4c5c824a/SenzaiY/YMV01/YMV01_170818/amplitudes.npy (from globus) (to 558c54fe_cc41_11e8_8c6a_0a1d4c5c824a_SenzaiY_YMV01_YMV01_170818_amplitudes.npy) [2020-06-29 22:43:44.876574066] read: git ["--version"]
[2020-06-29 22:43:44.880568781] process done ExitSuccess
[2020-06-29 22:43:44.880762238] chat: git ["--git-dir=.git","--work-tree=.","check-ignore","-z","--stdin","--verbose","--non-matching"]

[2020-06-29 22:43:44.909946917] git-annex-remote-globus[1] <-- TRANSFER RETRIEVE URL-s13950872--globus://558c54fe-cc41-11e8-8c6-1d0e3f16a2fbf978cc33ec614c02e79d .git/annex/tmp/URL-s13950872--globus&c%%558c54fe-cc41-11e8-8c6-1d0e3f16a2fbf978cc33ec614c02e79d
[2020-06-29 22:43:44.910314097] git-annex-remote-globus[1] --> GETURLS URL-s13950872--globus://558c54fe-cc41-11e8-8c6-1d0e3f16a2fbf978cc33ec614c02e79d globus://
[2020-06-29 22:43:44.912313981] git-annex-remote-globus[1] <-- VALUE globus://558c54fe-cc41-11e8-8c6a-0a1d4c5c824a/SenzaiY/YMV01/YMV01_170818/amplitudes.npy
[2020-06-29 22:43:44.912479181] git-annex-remote-globus[1] <-- VALUE
[2020-06-29 22:43:46.3695805] git-annex-remote-globus[1] --> TRANSFER-SUCCESS RETRIEVE URL-s13950872--globus://558c54fe-cc41-11e8-8c6-1d0e3f16a2fbf978cc33ec614c02e79d
[2020-06-29 22:43:46.370252941] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","check-attr","-z","--stdin","annex.backend","annex.numcopies","annex.largefiles","--"]
[2020-06-29 22:43:46.37075713] read: git ["--version"]
[2020-06-29 22:43:46.374468113] process done ExitSuccess
(non-large file; adding content to git repository) ok
(recording state in git...)
[2020-06-29 22:43:46.393727743] feed: xargs ["-0","git","--git-dir=.git","--work-tree=.","--literal-pathspecs","add","--"]
[2020-06-29 22:43:46.443980111] process done ExitSuccess
[2020-06-29 22:43:46.512114758] process done ExitSuccess
[2020-06-29 22:43:46.512947576] process done ExitSuccess
[2020-06-29 22:43:46.51342478] process done ExitSuccess
[2020-06-29 22:43:46.513917556] process done ExitSuccess
[2020-06-29 22:43:46.514354977] process done ExitFailure 1
(dev3) 2 19497 [1].....................................:Mon 29 Jun 2020 10:43:47 PM EDT:.
(git)smaug:/tmp/testdsg[master]git
$> echo $?
0

but that file is actually an HTML page demanding a login -- maybe the credentials record didn't work out or something like that, I don't know:

$> file -L 558c54fe_cc41_11e8_8c6a_0a1d4c5c824a_SenzaiY_YMV01_YMV01_170818_amplitudes.npy 
558c54fe_cc41_11e8_8c6a_0a1d4c5c824a_SenzaiY_YMV01_YMV01_170818_amplitudes.npy: HTML document, UTF-8 Unicode text, with very long lines

$> grep 'login' 558c54fe_cc41_11e8_8c6a_0a1d4c5c824a_SenzaiY_YMV01_YMV01_170818_amplitudes.npy     
  <link href="https://fonts.googleapis.com/css?family=Roboto" rel="stylesheet" type="text/css"> <!-- included for google login -->
  <body class="login-page">
        <h1>Additional logins needed to use 2a9f4.8443.dn.glob.us</h1>
      <h2>Use your existing organizational login</h2>
    Globus ID is an identity provider operated by Globus. If you don&#39;t want to use your existing organizational login (e.g. university, national lab, facility, project, Google) with Globus, you can create an account on Globus ID and use that to log into Globus.
<button id="login-btn" type="submit">Continue</button>
        var continue_btn = $('#login-btn');
        /* Load / save the selected IdP in localStorage for simple login convenience */
  <div id="login_btns_block" class="text-center">

I must confess that the remote configuration is also quite ad hoc: I see no point in requiring all of those endpoint-specific settings just to download URLs (which already carry that information), but they are needed for the current code to function:

$> git show git-annex:remote.log                                                              
558c54fe-cc41-11e8-8c6a-0a1d4c5c824a encryption=none endpoint=558c54fe-cc41-11e8-8c6a-0a1d4c5c824a externaltype=globus name=globus type=external uuid=558c54fe-cc41-11e8-8c6a-0a1d4c5c824a timestamp=1593482271.014958164s

but the point is that the code should detect that I am not properly authenticated
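
A minimal sketch of the kind of check that could catch this, assuming the remote knows the size reported by CHECKURL-CONTENTS (13950872 in the log above). The names `looks_like_html`, `verify_download` and `DownloadError` are hypothetical and not part of the current code; in the real remote the error would presumably be reported as a failed transfer instead.

```python
import os


class DownloadError(Exception):
    """Hypothetical error type; stands in for a failed transfer."""


def looks_like_html(path, sniff_bytes=1024):
    """Heuristic: does the beginning of the file look like an HTML page?"""
    with open(path, "rb") as f:
        head = f.read(sniff_bytes).lstrip().lower()
    return head.startswith((b"<!doctype html", b"<html"))


def verify_download(path, expected_size=None):
    """Raise DownloadError if the file is unlikely to be the requested content.

    expected_size is the size announced earlier (if known); a login page
    will almost never have the same size as the real file.
    """
    actual_size = os.path.getsize(path)
    if expected_size is not None and actual_size != expected_size:
        raise DownloadError(
            "size mismatch: expected %d bytes, got %d" % (expected_size, actual_size)
        )
    # Only reached when the size is unknown or happens to match.
    if expected_size is None and looks_like_html(path):
        raise DownloadError("downloaded content looks like an HTML (login?) page")
```

The size comparison is the cheaper and safer test; the HTML sniffing is used only when no size is known, since a legitimately stored HTML file would otherwise be rejected.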

This repo would ideally need to be "rewritten"

Currently

$> du -scm .git/objects 
9	.git/objects
9	total

which is too big for what it should be, because various files which shouldn't be in git (venv, __pycache__) were committed. The testremote branch already has them gone, but IMHO the whole history should be rewritten with git filter-branch (sooner rather than later) to shrink .git/objects

URL support functionality notes

Here is a protocol from running `git annex addurl` on `s3://` url which is handled via datalad special remote:
$> git annex initremote datalad type=external externaltype=datalad encryption=none 
...
$> git annex addurl --pathdepth=-1 --debug s3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt
...
[2019-11-08 10:50:13.671664748] git-annex-remote-datalad[1] <-- CLAIMURL s3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt
[2019-11-08 10:50:13.671791652] git-annex-remote-datalad[1] --> DEBUG Encodings: filesystem utf-8, default utf-8
[2019-11-08 10:50:13.67187777] Encodings: filesystem utf-8, default utf-8
[2019-11-08 10:50:13.672111287] git-annex-remote-datalad[1] --> DEBUG Claiming url 's3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt'
[2019-11-08 10:50:13.672210539] Claiming url 's3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt'
[2019-11-08 10:50:13.672270751] git-annex-remote-datalad[1] --> CLAIMURL-SUCCESS
[2019-11-08 10:50:13.67239921] git-annex-remote-datalad[1] <-- CHECKURL s3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt
[2019-11-08 10:50:14.040837149] git-annex-remote-datalad[1] --> CHECKURL-CONTENTS 4250 ds116/sub001/BOLD/task001_run001/QA/fd.txt
addurl s3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt (from datalad) (to ds116/sub001/BOLD/task001_run001/QA/fd.txt) [2019-11-08 10:50:14.041350273] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","symbolic-ref","-q","HEAD"]
[2019-11-08 10:50:14.050977324] process done ExitSuccess
[2019-11-08 10:50:14.051141617] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","show-ref","refs/heads/master"]
[2019-11-08 10:50:14.063291123] process done ExitSuccess
[2019-11-08 10:50:14.063796021] chat: git ["--git-dir=.git","--work-tree=.","check-ignore","-z","--stdin","--verbose","--non-matching"]

[2019-11-08 10:50:14.085449929] git-annex-remote-datalad[1] <-- TRANSFER RETRIEVE URL-s4250--s3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt .git/annex/tmp/URL-s4250--s3&c%%openfmri%ds116%sub001%BOLD%task001_run001%QA%fd.txt
[2019-11-08 10:50:14.087122748] git-annex-remote-datalad[1] --> GETURLS URL-s4250--s3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt http:
[2019-11-08 10:50:14.090879728] git-annex-remote-datalad[1] <-- VALUE 
[2019-11-08 10:50:14.09178823] git-annex-remote-datalad[1] --> GETURLS URL-s4250--s3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt https:
[2019-11-08 10:50:14.094266045] git-annex-remote-datalad[1] <-- VALUE 
[2019-11-08 10:50:14.094895422] git-annex-remote-datalad[1] --> GETURLS URL-s4250--s3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt s3:
[2019-11-08 10:50:14.098633146] git-annex-remote-datalad[1] <-- VALUE s3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt
[2019-11-08 10:50:14.098871334] git-annex-remote-datalad[1] <-- VALUE 
[INFO] Downloading 's3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt' into '.git/annex/tmp/URL-s4250--s3&c%%openfmri%ds116%sub001%BOLD%task001_run001%QA%fd.txt' 
[2019-11-08 10:50:14.245314399] git-annex-remote-datalad[1] --> PROGRESS 4250
100%  4.15 KiB         26 KiB/s 0s[2019-11-08 10:50:14.246005387] git-annex-remote-datalad[1] --> TRANSFER-SUCCESS RETRIEVE URL-s4250--s3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt
[2019-11-08 10:50:14.24666711] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","check-attr","-z","--stdin","annex.backend","annex.numcopies","annex.largefiles","--"]
[INFO] Successfully downloaded s3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt into .git/annex/tmp/URL-s4250--s3&c%%openfmri%ds116%sub001%BOLD%task001_run001%QA%fd.txt 
ok
(recording state in git...)
...

So the workflow would be

  • react with CLAIMURL-SUCCESS to CLAIMURL for those URLs which start with globus://[<globus-name>|<globus-uuid>]/<fileprefix>, where <globus-name> (or <globus-uuid>) and <fileprefix> are options of the special remote
  • provide a CHECKURL-CONTENTS Size|UNKNOWN Filename response to a CHECKURL query from annex (for those matching URLs)
  • on TRANSFER RETRIEVE Key File we could analyze the provided Key.
    1. If the Key is of the URL backend (starts with URL-), we could actually avoid using GETURLS (as we did in datalad) and just parse that key to extract the URL and the corresponding path to be retrieved
    2. while transferring we need to send back PROGRESS reports
    3. the tricky part here is that for RETRIEVE we would also like to support regular/proper git annex special remote behavior, if data was stored in the layout of a regular special remote. So we might want to check first (specifics yet to be determined) whether a key is available as on a regular annex special remote; and if not (or there is no information) -- run GETURLS, filter for the URLs we care about and try to download using them instead
    4. instead of 1. + 3. -- maybe we should just do GETURLS, regardless of the key, and only if that doesn't provide us any URLs we handle, fall back to the "regular special remote" way.
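
A minimal Python sketch of the CLAIMURL matching and of the key parsing from item 1. The function names (`claims_url`, `url_from_key`, `resolve_url`) are hypothetical and not part of the current code. One caveat visible in the globus log earlier on this page: the URL- key there does not embed the full URL (it looks shortened, with a hash-like suffix), so the sketch trusts a parsed URL only if it still matches a configured endpoint; otherwise the caller would have to fall back to GETURLS as in item 4.

```python
from typing import Optional, Sequence
from urllib.parse import urlsplit


def claims_url(url: str, endpoints: Sequence[str], fileprefix: str = "") -> bool:
    """True if url looks like globus://<endpoint>/<fileprefix>/...

    endpoints holds the accepted endpoint names and/or UUIDs.
    """
    parts = urlsplit(url)
    if parts.scheme != "globus" or parts.netloc not in endpoints:
        return False
    prefix = "/" + fileprefix.strip("/")
    # An empty fileprefix matches every path on the endpoint.
    return prefix == "/" or parts.path == prefix or parts.path.startswith(prefix + "/")


def url_from_key(key: str) -> Optional[str]:
    """Extract the URL portion of a URL-backend key, or None."""
    if not key.startswith("URL-"):
        return None
    _, sep, url = key.partition("--")
    return url if sep and url.startswith("globus://") else None


def resolve_url(key: str, endpoints: Sequence[str], fileprefix: str = "") -> Optional[str]:
    """Return a usable URL parsed from the key, or None to signal that
    GETURLS should be used instead (e.g. for shortened keys)."""
    url = url_from_key(key)
    if url is not None and claims_url(url, endpoints, fileprefix):
        return url
    return None
```

With the shortened key from the log, `resolve_url` returns None because the parsed host part is not a configured endpoint, which is an argument for item 4 as the more robust default.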

Overall summary: we should be able to make it work as a proper git annex external special remote with GET/PUT and EXPORT, while also supporting the regular annex addurl globus://... functionality (thus datalad addurls could be used to establish an "import" of already existing directories on globus, until git annex provides protocol/support for a proper "import").

BUT I believe that git-annex might be the one which "registers" the URL for the key upon addurl URL (thus storing the ad hoc globus:// URL in the git-annex branch, in the .web file for the key). Ideally we should avoid that URL being stored, and rather store just the path to the file (and version info), assuming the globus://[<globus-name>|<globus-uuid>]/<fileprefix> prefix. That would allow for more flexible management (e.g. renaming the globus endpoint, or renaming/moving fileprefix) and minimize storage within the git-annex branch. We might need to clarify that with @joeyh (Q: is it possible for a special remote to announce that a claimed URL shouldn't be stored as a URL for the file?)

local support modules should be either absorbed or moved into a package

currently code relies on ambiguous (looks global but is local) imports:

from lookup_url import lookup_url
...
from globusclient import GlobusClient
from annexremote import Master
from annexremote import ExportRemote
from annexremote import SpecialRemote
from annexremote import RemoteError, ProtocolError

Those files cannot be deployed (installed) on their own, so they would either need to be moved into some local "proper package" (e.g. git_annex_remote_globus ;) ) or just be absorbed into git-annex-remote-globus. In the former case, even the git-annex-remote-globus code could pretty much be moved into the package (e.g. as git_annex_remote_globus/cmd.py), and the actual git-annex-remote-globus script would contain just the __main__ portion.
That would make it easier to (eventually) establish git_annex_remote_globus/tests, which would test the functionality within that cmd.py somehow

Joey-on-the-change-of-FRDR-end-point

I would like to ask you a question about git annex way to manage unused data. I
built a git annex special remote using Globus.org as a datastore and I am
setting up a new datalad dataset retrieving files content from Globus.

So I set the files keys present with 1 and register the related urls as
expected. Now, the input parameter (prefix) to initremote was wrong, hence my
registered urls were all wrong, so I run 'git annex enableremote [new_prefix]'.

Did you use "git-annex registerurl" for that, or did the external
special remote use the SETURLPRESENT message?

The only way that initremote/enableremote would affect the url is if a
special remote does use SETURLPRESENT.

If the special remote does use SETURLPRESENT, there is also a SETURLMISSING
that it could use to remove the urls you want to remove. But, that might
not be the best fit for your situation, I think.

There is the git-annex rmurl command, that can remove an url from an
annexed file. Probably what you want, I think.

This is all fine, git annex updates the prefix and builds new, now correct,
urls. As you can imagine at this point, git annex has 2 urls per file (one
wrong and one correct) and does not know what to do and fails.

Hmm, normally when git-annex has more than one url for a single file, it
will try all the urls, and only fails if all of the urls are not
accessible. So, I'm curious what this failure looks like.

Admittedly it would be useful to have a command that clears content of internal
git annex files (in the git annex branch) and provides a clean uninitialized
repository without needing to set to 0 or 1 the whole previous history. For
example, I have the case that the datasets in globus move to different
locations and git annex enableremote must be re-run with the new updated
parameters, and everything should work as expected. So I am thinking of
something like 'git annex enableremote [new_parameter] --override'

Hmm, what data in the git-annex branch would that remove, would it be
all the data about that remote? Including removing information about
what keys are stored in the remote?

There is a way to do that, it's just to run git-annex dead with the
name or uuid of the remote, and then git-annex initremote a new
remote name.

But if you don't want to remove the information about what keys are
stored in the remote, using git-annex rmurl seems like a better
approach.

Would it help to be able to run a pipeline like this to find and delete
them all?

    git-annex whereis --format '${file}\0${url}' | grep globus://wrong_prefix/ | git-annex rmurl --batch -z

(Not yet implemented but something like that could be added.)

suggestion: move from wget to requests

globus_sdk already depends on requests:

(git)lena:~datalad/GLOBUSannex[testremote]git
$> pip install globus_sdk
Collecting globus_sdk
  Using cached https://files.pythonhosted.org/packages/be/c1/ccf8e8ebeb229887e6355e87e9cc9f07e06b1661b3c6e50311566bae9f28/globus_sdk-1.8.0-py2.py3-none-any.whl
Requirement already satisfied: six<2.0.0,>=1.10.0 in /usr/lib/python3/dist-packages (from globus_sdk) (1.12.0)
Requirement already satisfied: pyjwt[crypto]<2.0.0,>=1.5.3 in /usr/lib/python3/dist-packages (from globus_sdk) (1.7.1)
Requirement already satisfied: requests<3.0.0,>=2.9.2 in /usr/lib/python3/dist-packages (from globus_sdk) (2.21.0)
Requirement already satisfied: cryptography>=1.4 in /usr/lib/python3/dist-packages (from pyjwt[crypto]<2.0.0,>=1.5.3->globus_sdk) (2.6.1)
Installing collected packages: globus-sdk
Successfully installed globus-sdk-1.8.0

so IMHO it would be better just to use requests. The wget module does not seem to have been maintained since 2015, so, even though it is quite minimalistic, I expect it to eventually break with newer Python versions.
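
A minimal sketch of what a requests-based download could look like, with a callback that could feed the PROGRESS reports mentioned in the URL support notes above. The function name `download` and its parameters are hypothetical; authentication headers and the mapping from a globus:// URL to an HTTPS URL are deliberately left out. The `session` argument is any object with a requests-like `get()`, which also makes the function testable without network access.

```python
def download(url, dest, session=None, progress=None, chunk_size=1 << 20, timeout=60):
    """Stream url into the file dest and return the number of bytes written.

    progress, if given, is called with the running byte count after each chunk.
    """
    if session is None:
        import requests  # third-party; already a dependency of globus_sdk

        session = requests.Session()
    done = 0
    # stream=True avoids loading the whole file into memory.
    with session.get(url, stream=True, timeout=timeout) as response:
        response.raise_for_status()  # fail loudly on 4xx/5xx responses
        with open(dest, "wb") as out:
            for chunk in response.iter_content(chunk_size=chunk_size):
                out.write(chunk)
                done += len(chunk)
                if progress is not None:
                    progress(done)
    return done
```

Note that `raise_for_status()` alone would not have caught the login page from the first issue if it was served with a 200 status, so a content or size check would still be needed on top of this.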
