
justintime50 / github-archive


A powerful tool to concurrently clone, pull, or fork user and org repos and gists to create a GitHub archive.

License: MIT License

Python 96.92% Just 3.08%
github clone backup archive git pull repository repo gists tool

github-archive's Introduction

Hey, I'm Justin Hammond 👋

Senior Software Engineer @EasyPost, IT Pro, Tech Enthusiast

I love all things tech. I've been programming for 18+ years, tinkering with electronics for 15+ years, and founding or building tech companies for 10+ years. I'm an open source fanatic, Apple fanboy, and love to explore new tech. I spend my time coding open source projects, tinkering with electronics and new tech products, and consulting teams on how to get things done.

Noteworthy Projects

The following are items that may not be represented on my GitHub profile but are noteworthy in the software space:

github-archive's People

Contributors

danieldjewell, justintime50

github-archive's Issues

Deleting Log Files is Broken

Log files aren't being deleted after 30 days, and the script throws an error at the end of the run:

Traceback (most recent call last):
  File "github_archive.py", line 272, in <module>
    Archive.run()
  File "github_archive.py", line 252, in run
    if file.mtime < Archive.LOG_LIFE:
AttributeError: 'str' object has no attribute 'mtime'
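
The traceback suggests that `file` is a plain filename string, which has no `mtime` attribute. Below is a minimal sketch of one way to prune old logs with the standard library. The `delete_old_logs` helper and the `LOG_LIFE_SECONDS` constant are hypothetical names for illustration, not the project's code, and the 30-day limit is taken from the issue text.

```python
import time
from pathlib import Path

LOG_LIFE_SECONDS = 30 * 24 * 60 * 60  # assumed 30-day retention, per the issue text


def delete_old_logs(log_dir, max_age_seconds=LOG_LIFE_SECONDS, now=None):
    """Delete regular files in log_dir older than max_age_seconds; return their names."""
    now = time.time() if now is None else now
    deleted = []
    for entry in Path(log_dir).iterdir():
        # stat().st_mtime is the modification time in seconds since the epoch
        if entry.is_file() and now - entry.stat().st_mtime > max_age_seconds:
            entry.unlink()
            deleted.append(entry.name)
    return sorted(deleted)
```

Reading the modification time from the filesystem entry, rather than from the name string, avoids the `AttributeError` shown above.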

Will only pull org repos, gives error if user repos are used?

Ok, sorry to be such a pain, I am having issues getting the script to work with user repos included.

For example, trying to download openZFS and the associated personal repos with this command:

GITHUB_ARCHIVE_ORGS="openzfs, ahrens, gmelikov, grwilson, kusumi, prakashsurya" github-archive -uc -up -gc -gp -oc -op

Gives me this error:

GitHub Archive started...
Cloning personal repos...
Pulling personal repos...
Traceback (most recent call last):
File "/usr/local/bin/github-archive", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.9/site-packages/github_archive/cli.py", line 83, in main
CLI()._run()
File "/usr/local/lib/python3.9/site-packages/github_archive/cli.py", line 71, in _run
GithubArchive.run(
File "/usr/local/lib/python3.9/site-packages/github_archive/archive.py", line 59, in run
org_repos = GithubArchive.get_all_org_repos()
File "/usr/local/lib/python3.9/site-packages/github_archive/archive.py", line 123, in get_all_org_repos
all_org_repos.append(Github(GITHUB_TOKEN).get_organization(org.strip()).get_repos())
File "/usr/local/lib/python3.9/site-packages/github/MainClass.py", line 311, in get_organization
headers, data = self.__requester.requestJsonAndCheck("GET", "/orgs/" + login)
File "/usr/local/lib/python3.9/site-packages/github/Requester.py", line 315, in requestJsonAndCheck
return self.__check(
File "/usr/local/lib/python3.9/site-packages/github/Requester.py", line 340, in __check
raise self.__createException(status, responseHeaders, output)
github.GithubException.UnknownObjectException: 404 {"message": "Not Found", "documentation_url": "https://docs.github.com/rest/reference/orgs#get-an-organization"}

If I only include the openzfs repo, it works fine. If I include multiple org repos, it works fine. But if any user repos are included, it fails.

How do I download the personal repos as well? Is there a separate variable for user repos?

Ignore DMCA Repos

Hey, it would be great if the tool could ignore DMCA repos. Currently it fails on them and kills the entire run, skipping the rest of the repos.

Fails to authenticate when token is used?

So was testing actually going live with the pull and while everything works great during the dry run, when I remove the dry run and try running the command for real I get errors.

Running github-archive -t my_token -c -u 0x0ece

Fails with

# Making API calls to GitHub for user repos...
# Cloning missing user repos...
The authenticity of host 'github.com (140.82.113.3)' can't be established.
ECDSA key fingerprint is SHA256:fghd98fgh09dfh7g09d7fgh098d7fg09hd
Are you sure you want to continue connecting (yes/no/[fingerprint])? ^CTraceback (most recent call last):

If I try running it with --https it works. If I use the -v dry run option it works with the token. I can confirm it was using my token as well since if I checked my rate limits with this command it would show that my token was used: https://docs.github.com/en/rest/rate-limit?apiVersion=2022-11-28

I still have 4800 calls remaining BTW, so should not be a rate limit issue.

I tried creating a brand new token on github with all permissions, both the new and classic token as well but the same error. Also tried both org and user and same error on both.

Any ideas?

Small side question, if put into pull mode, it will not download a repo that has not been cloned previously, is this by design?

403 RateLimitExceededException

github.GithubException.RateLimitExceededException: 403 {"message": "API rate limit exceeded for 195.19.90.245. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)", "documentation_url": "https://docs.github.com/rest/overview/resources-in-the-rest-api#rate-limiting"}

Should I just wait?
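
The error message indicates the requests were unauthenticated and counted against the IP address. GitHub's REST API reports the remaining quota and the reset time in the `x-ratelimit-remaining` and `x-ratelimit-reset` response headers (the latter in UTC epoch seconds). The sketch below computes how long to wait from those headers; `seconds_until_reset` is a hypothetical helper, not part of github-archive.

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how long to wait before retrying, based on GitHub rate-limit headers."""
    # Header names are case-insensitive on the wire; normalise to lowercase here.
    lowered = {key.lower(): value for key, value in headers.items()}
    remaining = int(lowered.get("x-ratelimit-remaining", "1"))
    if remaining > 0:
        return 0  # quota left, no need to wait
    reset_epoch = int(lowered["x-ratelimit-reset"])  # UTC epoch seconds
    now = time.time() if now is None else now
    return max(0, int(reset_epoch - now))
```

Waiting until the reset time works, and, as the error message itself notes, authenticated requests get a higher limit.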

Bulk forking

Is it possible to do bulk forking of all repos of an account?

use search params

How can I add filter params to clone repos that are matching my searching params?

Clone Repos Using Fast Forward

When pulling repos, you'll receive the following warning:

warning: Pulling without specifying how to reconcile divergent branches is
discouraged. You can squelch this message by running one of the following
commands sometime before your next pull:

  git config pull.rebase false  # merge (the default strategy)
  git config pull.rebase true   # rebase
  git config pull.ff only       # fast-forward only

You can replace "git config" with "git config --global" to set a default
preference for all repositories. You can also pass --rebase, --no-rebase,
or --ff-only on the command line to override the configured default per
invocation.

To fix this, let's make all clones fast forward.
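
As a rough sketch of the idea, the pull could pass `--ff-only`, one of the options named in git's warning above. The helper names `build_pull_command` and `pull_fast_forward` and the 300-second timeout are assumptions for illustration.

```python
import subprocess


def build_pull_command(repo_path):
    """Return a git pull command that refuses anything but a fast-forward."""
    # --ff-only is one of the options named in git's warning
    return ["git", "-C", repo_path, "pull", "--ff-only"]


def pull_fast_forward(repo_path, timeout=300):
    """Run the pull; raises CalledProcessError if git exits non-zero."""
    subprocess.run(build_pull_command(repo_path), check=True, timeout=timeout)
```

Passing the flag per invocation avoids changing the user's global git configuration.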

Rewrite the Script in Python

Rewriting this script in Python would allow for various things to occur:

  1. Stability
  2. GitHub SDK
  3. Concurrency

I have a working version of the new script already. With a few tweaks to the Forks script, we could rework GitHub Archive to allow for concurrency which could be really exciting.

Clone from Branch Breaks with New "Main" Branches

Summary

With GitHub introducing main as the new default branch name, github-archive breaks for newer repos whose default branch isn't master.

Acceptance Criteria

  • This can be easily fixed by passing the branch flag to the git clone command only if the branch option is passed on the command line to github-archive; by default, it will then pull the default branch for each repo.
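
The acceptance criterion above can be sketched as a small command builder. `build_clone_command` is a hypothetical helper; the project's real function names are not shown in the issue.

```python
def build_clone_command(clone_url, destination, branch=None):
    """Build a git clone command, adding --branch only when one was requested."""
    command = ["git", "clone"]
    if branch:
        # Without --branch, git checks out the remote's default branch
        command += ["--branch", branch]
    command += [clone_url, destination]
    return command
```

For example, `build_clone_command(url, path)` yields a plain clone that works for both `main` and `master` repos, while `build_clone_command(url, path, branch="develop")` pins the branch.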

Can this script handle the new Github pull limits?

Forgive me if this is the wrong way of asking a question, brand new to Github.

I have been using a script to download github projects but with the new pull limits that github implemented I have 2 issues.

1: It will fail to download all the repos for a project (lineageOS or signal for example)

2: It will not let me know that it failed or where it failed (aka, was rejected).

Does this script have a way of dealing with this issue? Can it pick up where it left off? Or set a limit of the number of pulls to do at a time and then wait 6 hours to do more so as to not get cut off?

I am not working with all the projects right away, I am archiving a lot of them as a backup so I might not know that something is missing for some time and it could be a real bummer if it was found missing later.

Planning on setting this script up in a docker when I have some time next week but figured I would ask before I put too much time into it.

Code Refactor

The code is a big blob right now and could use some love. Break it down into smaller, testable units.

Acceptance Criteria

  • Functions are small and easily testable
  • Logic is completely removed from the main() function
  • DRY - we could probably clean this up a bit

Getting 502 error

After doing all the setup and running python3 github_archive.py, I am getting this error:

Sorry, I'm very far from Python, so might be a dumb question.

Traceback (most recent call last):
File "github_archive.py", line 272, in <module>
Archive.run()
File "github_archive.py", line 219, in run
for repo in git_org:
File "/Library/Python/3.7/site-packages/github/PaginatedList.py", line 59, in iter
newElements = self._grow()
File "/Library/Python/3.7/site-packages/github/PaginatedList.py", line 71, in _grow
newElements = self._fetchNextPage()
File "/Library/Python/3.7/site-packages/github/PaginatedList.py", line 201, in _fetchNextPage
"GET", self.__nextUrl, parameters=self.__nextParams, headers=self.__headers
File "/Library/Python/3.7/site-packages/github/Requester.py", line 319, in requestJsonAndCheck
verb, url, parameters, headers, input, self.__customConnection(url)
File "/Library/Python/3.7/site-packages/github/Requester.py", line 342, in __check
raise self.__createException(status, responseHeaders, output)
github.GithubException.GithubException: 502 {"message": "Server Error"}
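
A 502 is a server-side error, so retrying after a pause is a common workaround. The following is a generic sketch of retrying with exponential backoff, not something the project is known to do. `call_with_retries` is a hypothetical helper, and catching a bare `Exception` is a simplification; real code would narrow it to the server-error exception type.

```python
import time


def call_with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on an exception, wait base_delay * 2**n and try again."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))
```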

token not working with users, is this by design?

Using the token only seems to allow downloading your own repos. Is it possible to combine users and a token to download other users' repos? Am I misinterpreting the documentation here?

I get the error

github.GithubException.RateLimitExceededException: 403 {"message": "API rate limit exceeded for xxxxxxxx. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)", "documentation_url": "https://docs.github.com/rest/overview/resources-in-the-rest-api#rate-limiting"}

Cut a release?

The current latest release is buggy without #23. Could you release a new version from the current HEAD of main?

Add new CLI parameter for Github hostname

This would map to the second initialization option shown in the PyGithub readme:

# Github Enterprise with custom hostname
g = Github(base_url="https://{hostname}/api/v3", login_or_token="access_token")

Doesn't work if there are spaces in the location name

Thanks for Github Archive - it cloned all my repos in less time than git clones just one.

One thing I noticed is that the location can't have spaces. I tried enclosing the location in quotes but the effect is the same.

The terminal claims that everything was cloned successfully, but there is nothing at the location specified. This works:

github-archive --users sketchbuch --clone --location ~/Archive/github-archive --token xxxxxx

This does not:

github-archive --users sketchbuch --clone --location "~/Archive/github-archive - 27-08-2023" --token xxxxxx

If it did somehow work, I have no idea where it cloned everything to. There is nothing in ~/Archive, and searching for "github-archive - 27-08-2023" finds nothing in GNOME Files.
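
One possible explanation, offered as an assumption rather than a confirmed diagnosis: the shell only expands an unquoted leading `~`, so the quoted path reaches the program with a literal `~` and may end up relative to the current directory. A sketch of normalising such a path with the standard library follows; `resolve_location` is a hypothetical helper.

```python
import os


def resolve_location(raw_location):
    """Expand a leading ~ and return an absolute path; spaces are preserved as-is."""
    return os.path.abspath(os.path.expanduser(raw_location))
```

If this is the cause, the missing clones might be found under a directory literally named `~` inside the directory the command was run from.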

Error Removing Git Repos that Timed Out On Windows

I'm using the command line below:

github-archive.exe --https -s [user] -t [key] -l c:\git-archive --clone

What happens for me:

  • some very large starred repositories (https://github.com/jgraph/drawio) time out - there is an error during clone
  • this leaves jgraph/drawio/.git folder, with .git being the only item under drawio
  • I cannot tell if that .git is consistent and valid, don't know how to check
  • the next time I run github-archive, it assumes that this repo is OK

I probably wouldn't want to keep drawio locally, because local .git folder was more than 700 MB, but it doesn't look quite right.

And can github-archive "update" those cloned repositories? Is it possible to have up-to-date copy of those repositories?
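
One way to avoid a half-written clone being mistaken for a good one is to remove the destination when the clone fails or times out. This is a sketch under that assumption, with a hypothetical `run_clone_with_cleanup` helper; it is not the project's implementation.

```python
import shutil
import subprocess


def run_clone_with_cleanup(command, destination, timeout):
    """Run a clone command; if it fails or times out, remove the partial destination."""
    try:
        subprocess.run(command, check=True, timeout=timeout)
        return True
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        # ignore_errors avoids raising if the directory was never created
        shutil.rmtree(destination, ignore_errors=True)
        return False
```

On Windows, read-only files under `.git` may prevent removal, and `ignore_errors=True` would hide that, so a real implementation would likely need extra handling there.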

Thanks!

Add a "Dry Run" Flag

Summary

Add the ability to print the repos that would be downloaded without downloading, "dry-run"

Acceptance Criteria

  • Add a "dry run" flag so you can see what would be done without actually doing it.
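
A dry-run flag of this kind can be sketched with `argparse`. The flag name `--dry-run`, the program name, and both function names below are assumptions for illustration; the issue does not specify them.

```python
import argparse


def parse_args(argv=None):
    """Parse the (assumed) --dry-run flag."""
    parser = argparse.ArgumentParser(prog="archive-sketch")
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="list repos that would be archived without touching disk",
    )
    return parser.parse_args(argv)


def archive(repo_names, dry_run):
    """Return the actions taken; in dry-run mode nothing is executed."""
    actions = []
    for name in repo_names:
        if dry_run:
            actions.append(f"would clone {name}")
        else:
            actions.append(f"cloned {name}")  # placeholder for the real git call
    return actions
```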

Attribution

Requested by @8465231

Feature request - pull starred repositories

I don't know how this fits into the picture, but I have a few starred repositories that I'll miss if they go away.

Could you consider adding support for starred repositories?

Transition all Env Variables to CLI Flags

For better control and flexibility, let's move all env variables to CLI flags and rework the whole app to properly pass those variables around as class variables instead of parameters to each function.

Clone if repo doesn't exist / pull if it exists

Hey!
Love this tool, works great for cloning all my repos, I have a question though –
I would like to use this as a way to mirror all my repositories on a server, so I would set up a cron job to periodically run it into the same directory.
I can see there are separate flags for clone and pull, but for my scenario I imagine it would be ideal to have it clone only repositories that don't exist in the directory, and pull any repository that already exist.
Is that something that's already possible and I'm just missing something and if not, would you consider it in scope for your tool / would you accept a PR?
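
The behaviour requested here can be sketched as a per-repo decision based on whether a checkout already exists. `choose_operation` is a hypothetical helper, and treating the presence of a `.git` directory as "already cloned" is an assumption.

```python
import os


def choose_operation(repo_path):
    """Return 'pull' when repo_path already holds a git checkout, else 'clone'."""
    if os.path.isdir(os.path.join(repo_path, ".git")):
        return "pull"
    return "clone"
```

A cron job could then run the same command repeatedly against one directory, cloning new repos and pulling existing ones.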

does not clone private repos

Private repos are not cloned. Looks like the curl call should look like this:

curl -s -H "Authorization: token YOUR_TOKEN_HERE" "https://api.github.com/user/repos?per_page=100"

as posted in this gist comment
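
For reference, the same authenticated call can be sketched in Python with the standard library. The URL and header format are taken from the curl command above; `build_repos_request` is a hypothetical helper that only builds the request and does not send it.

```python
import urllib.request

API_URL = "https://api.github.com/user/repos?per_page=100"  # URL from the issue text


def build_repos_request(token):
    """Build (but do not send) an authenticated request for the token owner's repos."""
    return urllib.request.Request(
        API_URL,
        headers={"Authorization": f"token {token}"},
    )
```

Sending it with `urllib.request.urlopen(build_repos_request(token))` would return the first page of up to 100 repos as JSON.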

FileNotFoundError: [WinError 2] The system cannot find the file specified

Getting errors while pulling labs from some ECCouncil content:
What file would it be looking for?
I installed via pip (today)

github-archive -o codered-by-ec-council --clone --https -v --log_level debug
# GitHub Archive started...

# Making API calls to GitHub for org repos...

codered-by-ec-council repos retrieved!
# Viewing org repos...

codered-by-ec-council/Black-Hat-Python-for-Pentesters
codered-by-ec-council/Micro-Degree-in-Python-Security
codered-by-ec-council/Hands-on-Android-Security
codered-by-ec-council/Ethical-Hacking---Capture-the-Flag-Walkthroughs---v1
codered-by-ec-council/Ethical-Hacking---Capture-the-Flag-Walkthroughs---v2
codered-by-ec-council/Introduction-to-Exploit-Zero-Day-Discovery-and-Development
codered-by-ec-council/Chrome-DevTools-Introduction-2020-Web-Developers-Guide
codered-by-ec-council/AZ-900_Microsoft_Azure_Fundamentals_Certification_2020
codered-by-ec-council/Buil-EU-GDPR-Data-Protection-Compliance-from-Scratch
codered-by-ec-council/Computer-Vision-Face-Recognition-Quick-Starter-in-Python
codered-by-ec-council/Windows-Penetration-Testing-Essentials
codered-by-ec-council/Secure-Full-Stack-MEAN-Developer
codered-by-ec-council/PowerShell-Security-Best-Practices
codered-by-ec-council/The-Comprehensive-Ethical-Hacking-Course
codered-by-ec-council/Secure-Programming-with-Go
codered-by-ec-council/Wireshark-for-Ethical-Hackers
codered-by-ec-council/The-Complete-Mobile-Ethical-Hacking
codered-by-ec-council/Shell-Scripting-with-Bash
codered-by-ec-council/Getting-Started-with-Blazor
codered-by-ec-council/AI-for-Finance
codered-by-ec-council/Data-Analysis-with-Python
codered-by-ec-council/Applied-Statistics-with-Python
codered-by-ec-council/Practical-Bug-Bounty-Hunting-for-Hackers-and-Pentesters
codered-by-ec-council/Python-and-Flask-Application-Development
codered-by-ec-council/Secure-Software-Architecture-and-Design-Patterns-in-Java-EE-Part-1
codered-by-ec-council/Computational-Mathematics-for-Data-Science
codered-by-ec-council/Beyond-the-Basics-Applied-Python-Part-One
codered-by-ec-council/Tunnels-in-Networking
codered-by-ec-council/Unleash-TensorFlow-2.0
codered-by-ec-council/Secure-and-Manage-Windows-Server-2016
codered-by-ec-council/Practical-Visit-to-Data-Mining
codered-by-ec-council/Practical-Applications-of-Machine-Learning
codered-by-ec-council/5G-Security-Deconstructed
codered-by-ec-council/Malware-Analysis-Fundamentals
codered-by-ec-council/Introduction-to-R-Programming
codered-by-ec-council/Ethical-Hacking-Capture-the-Flag-Walkthrough-V3
codered-by-ec-council/Digital-Forensics-for-Pentesters---Hands-on-Learning-
codered-by-ec-council/Microsoft-Security-Compliance-And-Identity-Fundamentals-Exam-Ref-SC-900
codered-by-ec-council/Practical-Spring-Security
codered-by-ec-council/Practical-Cyber-Threat-Intelligence
codered-by-ec-council/Applied-Threat-Hunting
codered-by-ec-council/Securing-Your-Data-Warehouse-with-Azure-Synapse-Analytics
codered-by-ec-council/Hands-on-SQL-for-Data-Science
codered-by-ec-council/Linux-32-bit-Reverse-Engineering
codered-by-ec-council/Installing-and-Mitigating-Linux-Rootkits
codered-by-ec-council/5G-Strategies-for-Businesses
codered-by-ec-council/Secure-Software-Architecture-and-Design-Patterns-in-Java-EE-Part-2
codered-by-ec-council/Cybersecurity-for-Telecom-Attack-and-Defend-Techniques-Tools-and-Frauds
codered-by-ec-council/Network-Automation-in-Python
codered-by-ec-council/Coding-with-Git
codered-by-ec-council/Enterprise-API-for-Advanced-Azure-Developers
codered-by-ec-council/Deep-Learning-Masked-Face-Detection-Recognition
codered-by-ec-council/The-Comprehensive-SQL-Course-2021
codered-by-ec-council/The-Advanced-SQL-Course-2021
codered-by-ec-council/Security-Information-and-Event-Management
codered-by-ec-council/Pattern-Recognition-in-Python
codered-by-ec-council/React-and-Secure-Your-Applications
codered-by-ec-council/Machine-Learning-Using-Python
codered-by-ec-council/Microsoft-Cybersecurity-Pro-Track-Threat-Detection
codered-by-ec-council/ASP.NET-Security
codered-by-ec-council/Data-Analysis-With-R-Masterclass
codered-by-ec-council/Introduction-to-Erlang-Programming
codered-by-ec-council/Secure-Programming-in-Golang
codered-by-ec-council/Digital-Twins-for-Cybersecurity
codered-by-ec-council/Data-Anonymization-Demystified
codered-by-ec-council/Advanced-Deep-Learning-Part-1
codered-by-ec-council/Advanced-Deep-Learning-Part-2
codered-by-ec-council/Hands-on-Python-Web-Scrapping-from-Scratch
codered-by-ec-council/Build-Security-Incident-Response-for-GDPR-Data-Protection
codered-by-ec-council/Ethical-Hacking-Capture-the-Flag-Walkthroughs-v1
codered-by-ec-council/Ethical-Hacking-Capture-the-Flag-Walkthroughs-v2
codered-by-ec-council/Ethical-Hacking-Capture-the-Flag-Walkthroughs-v3
codered-by-ec-council/Manage-Nano-Server-in-Hyper-V
codered-by-ec-council/Learn-Jmeter-from-Scratch-Web-Development
codered-by-ec-council/Microsoft-Cybersecurity-Pro-Track-Security-in-Office-365
codered-by-ec-council/GDPR-Privacy-Data-Protection-CASE-STUDIES-Explained
codered-by-ec-council/Cryptography-Demystified
codered-by-ec-council/Hands-on-Zero-Day-Exploit
codered-by-ec-council/Hands-on-JavaScript-for-Ethical-Hacking
codered-by-ec-council/Learn-Power-BI---Part-1
codered-by-ec-council/Self-Sovereign-Identity-SSI-the-Future-of-Trusted-Transactions
codered-by-ec-council/Big-Data-Analytics-for-Cybersecurity
codered-by-ec-council/Reverse-Engineering
codered-by-ec-council/Hands-on-TinyML
codered-by-ec-council/Entity-Fundamentals-Using-.NET-6
codered-by-ec-council/Digital-Forensics-for-Pentesters-Hands-on-Learning
codered-by-ec-council/Developing-IoT-Solutions-With-Azure-Suite
codered-by-ec-council/OSINT---Open-source-Intelligence
codered-by-ec-council/Applied-Python-for-Professionals
codered-by-ec-council/Hands-on-React-Native
codered-by-ec-council/Hands-on-Binary-Analysis-in-Linux-Part-1
codered-by-ec-council/Linux-Server-Administration-Made-Easy-with-Hands-on-Training
codered-by-ec-council/Applied-Data-Loss-Prevention
codered-by-ec-council/Practical-Introduction-to-Mainframe
codered-by-ec-council/Linux-Administration-With-Ansible
codered-by-ec-council/Mobile-Penetration-Testing-with-Kali-NetHunter
# Cloning missing org repos...

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\ProgramData\Anaconda3\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\ProgramData\Anaconda3\Scripts\github-archive.exe\__main__.py", line 7, in <module>
  File "C:\ProgramData\Anaconda3\lib\site-packages\github_archive\cli.py", line 186, in main
    GithubArchiveCli().run()
  File "C:\ProgramData\Anaconda3\lib\site-packages\github_archive\cli.py", line 182, in run
    github_archive.run()
  File "C:\ProgramData\Anaconda3\lib\site-packages\github_archive\archive.py", line 133, in run
    failed_repos = self.iterate_repos_to_archive(org_repos, CLONE_OPERATION)
  File "C:\ProgramData\Anaconda3\lib\site-packages\github_archive\archive.py", line 306, in iterate_repos_to_archive
    failed_repos = [repo.result() for repo in thread_list if repo.result()]
  File "C:\ProgramData\Anaconda3\lib\site-packages\github_archive\archive.py", line 306, in <listcomp>
    failed_repos = [repo.result() for repo in thread_list if repo.result()]
  File "C:\ProgramData\Anaconda3\lib\concurrent\futures\_base.py", line 438, in result
    return self.__get_result()
  File "C:\ProgramData\Anaconda3\lib\concurrent\futures\_base.py", line 390, in __get_result
    raise self._exception
  File "C:\ProgramData\Anaconda3\lib\concurrent\futures\thread.py", line 52, in run
    result = self.fn(*self.args, **self.kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\github_archive\archive.py", line 373, in archive_repo
    subprocess.run(
  File "C:\ProgramData\Anaconda3\lib\subprocess.py", line 505, in run
    with Popen(*popenargs, **kwargs) as process:
  File "C:\ProgramData\Anaconda3\lib\subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\ProgramData\Anaconda3\lib\subprocess.py", line 1420, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified
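
The traceback ends inside `subprocess.run`, where Windows could not find the executable being launched. One common cause, assumed here rather than confirmed by the issue, is that `git` is not on `PATH`. A sketch of a pre-flight check follows; `require_executable` is a hypothetical helper.

```python
import shutil


def require_executable(name):
    """Return the full path to an executable, or raise a clearer error if missing."""
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(f"{name!r} was not found on PATH; install it or fix PATH")
    return path
```

Calling `require_executable("git")` before starting any clones would turn the opaque `WinError 2` into an actionable message.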

Possible to have the script include org members as well?

I finally got this script running in a docker (ended up finding a python docker with everything set up properly, was a nightmare trying to do that manually as someone still new to linux) and it is really great, much better than my old one.

Only thing I miss from my old script was it would not only crawl through the org but also include all the org members.

Is it possible to add this functionality to this script by chance?

This is the script I used to use: https://github.com/mazen160/GithubCloner

Thanks for your work on this, saves me a ton of time and no way I could do this myself.

Add a check when no parameters are passed

We have a check in place where you must specify an action when a list is present and a list when an action is present but no check to ensure that at least one parameter is passed. This means you can run the tool without specifying a thing and it'll exit successfully without providing details as to why nothing was cloned.
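
The missing check can be sketched as a validation step after argument parsing. The flag set below is illustrative only (`--users` and `--clone` appear in commands elsewhere on this page; `--orgs` and `--pull` are assumed long forms), and `build_parser` and `validate` are hypothetical helpers.

```python
import argparse


def build_parser():
    """Build a parser with an illustrative subset of list and action flags."""
    parser = argparse.ArgumentParser(prog="archive-sketch")
    parser.add_argument("--users", help="comma-separated user names")
    parser.add_argument("--orgs", help="comma-separated org names")
    parser.add_argument("--clone", action="store_true")
    parser.add_argument("--pull", action="store_true")
    return parser


def validate(args):
    """Return an error message if nothing at all was requested, otherwise None."""
    has_list = bool(args.users or args.orgs)
    has_action = bool(args.clone or args.pull)
    if not has_list and not has_action:
        return "nothing to do: pass a list (e.g. --users) and an action (e.g. --clone)"
    return None
```

The caller could print the message and exit with a non-zero status instead of finishing silently.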

Overhaul Logging to use Python Logging Package

This project currently uses a home-grown logger; instead, we should use the logging package and roll over old logs.

Acceptance Criteria

  • Items are logged to console instead of printed with the appropriate logging level for the type of message it is
  • Logged messages are successfully saved to a file
  • Old log files are rolled over to conserve disk space
  • Log file names play nicely with all OS's (Windows especially)
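
The criteria above can be sketched with the standard `logging` package and a `RotatingFileHandler`. The logger name, file name, size limit, and backup count below are all assumptions chosen for illustration.

```python
import logging
import logging.handlers
import os


def build_logger(log_dir, name="archive-sketch", max_bytes=200_000, backups=5):
    """Create a logger that writes to console and to a size-rotated file."""
    os.makedirs(log_dir, exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers if called twice
    # A fixed file name without colons or slashes is valid on Windows too
    file_handler = logging.handlers.RotatingFileHandler(
        os.path.join(log_dir, "archive.log"), maxBytes=max_bytes, backupCount=backups
    )
    for handler in (logging.StreamHandler(), file_handler):
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger
```

With `backupCount=5`, at most six files (`archive.log` plus five numbered backups) are kept, which bounds disk usage.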

Unable to download user repos?

So I was testing out your script again and while it seems to work for orgs, anytime I try to download a user account it gives the following error

  File "/usr/local/bin/github-archive", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.11/site-packages/github_archive/cli.py", line 191, in main
    GithubArchiveCli().run()
  File "/usr/local/lib/python3.11/site-packages/github_archive/cli.py", line 187, in run
    github_archive.run()
  File "/usr/local/lib/python3.11/site-packages/github_archive/archive.py", line 126, in run
    self.users.remove(self.authenticated_username)
ValueError: list.remove(x): x not in list


Is it possible to download user repos with this script?
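
`list.remove` raises `ValueError` when the value is absent, which matches the last line of the traceback. A guarded removal can be sketched as below; `remove_if_present` is a hypothetical helper, and how the project should actually treat the authenticated user is not stated in the issue.

```python
def remove_if_present(items, value):
    """Remove value from items if it is there; return True when something was removed."""
    if value in items:
        items.remove(value)
        return True
    return False
```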

Add Unit Tests

None of this project is currently covered by unit tests. Add a testing suite and coveralls to track coverage.

Timeout and running out of Ram issues on very large Repos, possible to limit number of parallel git calls?

So been using this script a bit now and it is great for smaller repos but ran into a fairly consistent issue with very large repos such as LineageOS (around ~400-500GB and ~2500 sub-repos).

It will first have a lot of timeout errors in the logs no matter how long I set the timeout to. I have also tried increasing the delay on the calls to 20 seconds but this causes this to take forever and I still got timeout errors so I canceled it.

Then it will eventually cause my system with 64GB of ram to run out of memory.

It appears that it is starting a limitless number of git calls as a top on the system shows git commands running as far as I can scroll.

I am thinking that limiting the number of active git calls would fix both issues. Is it possible to set the number of threads / git calls that will be run in parallel?
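
Capping the number of parallel git calls can be sketched with a bounded thread pool. `run_limited` is a hypothetical helper and the default of 4 workers is an arbitrary assumption; the issue does not show how the project schedules its threads.

```python
from concurrent.futures import ThreadPoolExecutor


def run_limited(tasks, max_workers=4):
    """Run callables with at most max_workers executing at once; return their results."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(task) for task in tasks]
        # result() blocks until each task finishes, preserving submission order
        return [future.result() for future in futures]
```

Each task would wrap one `git clone` or `git pull`, so an org with thousands of repos would still only run `max_workers` git processes at a time.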

Windows clone error

I tried to use this tool to clone a GitHub profile, but got an error.
The log and command:
`>github-archive -u "sberbank-ai" --clone

GitHub Archive started...

Making API calls to GitHub for user repos...

Cloning missing user repos...

Failed to clone ai-academy-2019
Command '['git', 'clone', 'git@github.com:sberbank-ai/ai-academy-2019.git', 'C:\Users\richard\github-archive\repos\sberbank-ai\ai-academy-2019']' returned non-zero exit status 128.
Failed to clone DigiTeller
Command '['git', 'clone', 'git@github.com:sberbank-ai/DigiTeller.git', 'C:\Users\richard\github-archive\repos\sberbank-ai\DigiTeller']' returned non-zero exit status 128.
Failed to clone classic-ai
Command '['git', 'clone', 'git@github.com:sberbank-ai/classic-ai.git', 'C:\Users\richard\github-archive\repos\sberbank-ai\classic-ai']' returned non-zero exit status 128.
Failed to clone ai-journey-2019
Command '['git', 'clone', 'git@github.com:sberbank-ai/ai-journey-2019.git', 'C:\Users\richard\github-archive\repos\sberbank-ai\ai-journey-2019']' returned non-zero exit status 128.
Failed to clone digital_peter_aij2020
Command '['git', 'clone', 'git@github.com:sberbank-ai/digital_peter_aij2020.git', 'C:\Users\richard\github-archive\repos\sberbank-ai\digital_peter_aij2020']' returned non-zero exit status 128.
Failed to clone combined_solution_aij2019
Command '['git', 'clone', 'git@github.com:sberbank-ai/combined_solution_aij2019.git', 'C:\Users\richard\github-archive\repos\sberbank-ai\combined_solution_aij2019']' returned non-zero exit status 128.
Failed to clone DetIE
Command '['git', 'clone', 'git@github.com:sberbank-ai/DetIE.git', 'C:\Users\richard\github-archive\repos\sberbank-ai\DetIE']' returned non-zero exit status 128.
Failed to clone Graph-Research
Command '['git', 'clone', 'git@github.com:sberbank-ai/Graph-Research.git', 'C:\Users\richard\github-archive\repos\sberbank-ai\Graph-Research']' returned non-zero exit status 128.
Failed to clone fusion_brain_aij2021
Command '['git', 'clone', 'git@github.com:sberbank-ai/fusion_brain_aij2021.git', 'C:\Users\richard\github-archive\repos\sberbank-ai\fusion_brain_aij2021']' returned non-zero exit status 128.
Failed to clone ControlledNST
Command '['git', 'clone', 'git@github.com:sberbank-ai/ControlledNST.git', 'C:\Users\richard\github-archive\repos\sberbank-ai\ControlledNST']' returned non-zero exit status 128.
Failed to clone data-science-journey-2017
Command '['git', 'clone', 'git@github.com:sberbank-ai/data-science-journey-2017.git', 'C:\Users\richard\github-archive\repos\sberbank-ai\data-science-journey-2017']' returned non-zero exit status 128.
Failed to clone model-zoo
Command '['git', 'clone', 'git@github.com:sberbank-ai/model-zoo.git', 'C:\Users\richard\github-archive\repos\sberbank-ai\model-zoo']' returned non-zero exit status 128.
Failed to clone holdem-challenge
Command '['git', 'clone', 'git@github.com:sberbank-ai/holdem-challenge.git', 'C:\Users\richard\github-archive\repos\sberbank-ai\holdem-challenge']' returned non-zero exit status 128.
Failed to clone ner-bert
Command '['git', 'clone', 'git@github.com:sberbank-ai/ner-bert.git', 'C:\Users\richard\github-archive\repos\sberbank-ai\ner-bert']' returned non-zero exit status 128.
Failed to clone no_flood_with_ai_aij2020
Command '['git', 'clone', 'git@github.com:sberbank-ai/no_flood_with_ai_aij2020.git', 'C:\Users\richard\github-archive\repos\sberbank-ai\no_flood_with_ai_aij2020']' returned non-zero exit status 128.
Failed to clone music-composer
Command '['git', 'clone', 'git@github.com:sberbank-ai/music-composer.git', 'C:\Users\richard\github-archive\repos\sberbank-ai\music-composer']' returned non-zero exit status 128.
Failed to clone htr_datasets
Command '['git', 'clone', 'git@github.com:sberbank-ai/htr_datasets.git', 'C:\Users\richard\github-archive\repos\sberbank-ai\htr_datasets']' returned non-zero exit status 128.
Failed to clone lastochka
Command '['git', 'clone', 'git@github.com:sberbank-ai/lastochka.git', 'C:\Users\richard\github-archive\repos\sberbank-ai\lastochka']' returned non-zero exit status 128.
Failed to clone mchs-wildfire
Command '['git', 'clone', 'git@github.com:sberbank-ai/mchs-wildfire.git', 'C:\Users\richard\github-archive\repos\sberbank-ai\mchs-wildfire']' returned non-zero exit status 128.
Failed to clone no_fire_with_ai_aij2021
Command '['git', 'clone', 'git@github.com:sberbank-ai/no_fire_with_ai_aij2021.git', 'C:\Users\richard\github-archive\repos\sberbank-ai\no_fire_with_ai_aij2021']' returned non-zero exit status 128.
Failed to clone OCR-model
Command '['git', 'clone', 'git@github.com:sberbank-ai/OCR-model.git', 'C:\Users\richard\github-archive\repos\sberbank-ai\OCR-model']' returned non-zero exit status 128.
Failed to clone railway_infrastructure_detection_aij2021
Command '['git', 'clone', 'git@github.com:sberbank-ai/railway_infrastructure_detection_aij2021.git', 'C:\Users\richard\github-archive\repos\sberbank-ai\railway_infrastructure_detection_aij2021']' returned non-zero exit status 128.
Failed to clone Real-ESRGAN
Command '['git', 'clone', 'git@github.com:sberbank-ai/Real-ESRGAN.git', 'C:\Users\richard\github-archive\repos\sberbank-ai\Real-ESRGAN']' returned non-zero exit status 128.
Failed to clone ruGPT3_essays
Command '['git', 'clone', 'git@github.com:sberbank-ai/ruGPT3_essays.git', 'C:\Users\richard\github-archive\repos\sberbank-ai\ruGPT3_essays']' returned non-zero exit status 128.
Failed to clone ru-gpts
Command '['git', 'clone', 'git@github.com:sberbank-ai/ru-gpts.git', 'C:\Users\richard\github-archive\repos\sberbank-ai\ru-gpts']' returned non-zero exit status 128.
Failed to clone ru-dolph
Command '['git', 'clone', 'git@github.com:sberbank-ai/ru-dolph.git', 'C:\Users\richard\github-archive\repos\sberbank-ai\ru-dolph']' returned non-zero exit status 128.
Failed to clone ruGPT3_demos
Command '['git', 'clone', 'git@github.com:sberbank-ai/ruGPT3_demos.git', 'C:\Users\richard\github-archive\repos\sberbank-ai\ruGPT3_demos']' returned non-zero exit status 128.
Failed to clone ru-dalle
Command '['git', 'clone', 'git@github.com:sberbank-ai/ru-dalle.git', 'C:\Users\richard\github-archive\repos\sberbank-ai\ru-dalle']' returned non-zero exit status 128.
Failed to clone ru-prompts
Command '['git', 'clone', 'git@github.com:sberbank-ai/ru-prompts.git', 'C:\Users\richard\github-archive\repos\sberbank-ai\ru-prompts']' returned non-zero exit status 128.
Failed to clone sber-swap
Command '['git', 'clone', 'git@github.com:sberbank-ai/sber-swap.git', 'C:\Users\richard\github-archive\repos\sberbank-ai\sber-swap']' returned non-zero exit status 128.
Failed to clone ru-clip
Command '['git', 'clone', 'git@github.com:sberbank-ai/ru-clip.git', 'C:\Users\richard\github-archive\repos\sberbank-ai\ru-clip']' returned non-zero exit status 128.
Failed to clone sdsj2018-automl
Command '['git', 'clone', 'git@github.com:sberbank-ai/sdsj2018-automl.git', 'C:\Users\richard\github-archive\repos\sberbank-ai\sdsj2018-automl']' returned non-zero exit status 128.
Failed to clone wing
Command '['git', 'clone', 'git@github.com:sberbank-ai/wing.git', 'C:\Users\richard\github-archive\repos\sberbank-ai\wing']' returned non-zero exit status 128.
Failed to clone StackMix-OCR
Command '['git', 'clone', 'git@github.com:sberbank-ai/StackMix-OCR.git', 'C:\Users\richard\github-archive\repos\sberbank-ai\StackMix-OCR']' returned non-zero exit status 128.
Failed to clone sberocr
Command '['git', 'clone', 'git@github.com:sberbank-ai/sberocr.git', 'C:\Users\richard\github-archive\repos\sberbank-ai\sberocr']' returned non-zero exit status 128.
Failed to clone sber-vq-gan
Command '['git', 'clone', '[email protected]:sberbank-ai/sber-vq-gan.git', 'C:\Users\richard\github-archive\repos\sberbank-ai\sber-vq-gan']' returned non-zero exit status 128.
Cleaning up repos...

GitHub Archive complete! Execution time: 0:00:09.308107.`

Also, cloning the same repo over HTTPS works: git clone https://github.com/sberbank-ai/sber-vq-gan.git
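Since the HTTPS clone succeeds while the tool's clone does not, it may help to see why git itself is failing. Exit status 128 is git's generic fatal-error code, so the actual reason is in stderr. Below is a minimal sketch, not part of github-archive: `ssh_to_https` and `clone_with_stderr` are hypothetical helper names, and the sketch assumes remotes of the form `git@github.com:owner/repo.git`.

```python
import subprocess


def ssh_to_https(ssh_url: str) -> str:
    """Turn 'git@github.com:owner/repo.git' into 'https://github.com/owner/repo.git'."""
    prefix = "git@github.com:"
    if not ssh_url.startswith(prefix):
        raise ValueError(f"not a GitHub SSH URL: {ssh_url}")
    return "https://github.com/" + ssh_url[len(prefix):]


def clone_with_stderr(url: str, dest: str) -> subprocess.CompletedProcess:
    """Run 'git clone' and capture stderr so the cause of a failure is visible."""
    return subprocess.run(
        ["git", "clone", url, dest],
        capture_output=True,  # keep stdout/stderr instead of discarding them
        text=True,            # decode output as str
        check=False,          # return the result rather than raising on failure
    )
```

Calling `clone_with_stderr(...)` and printing `result.stderr` when `result.returncode != 0` should show git's own message. If that message points at SSH authentication, retrying with the URL returned by `ssh_to_https(...)` is one way to confirm it.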

Add a "Forks" flag to clone/pull forks

Summary

By default, forks are included in cloning/pulling. This setting cannot be altered. Let's add a flag that allows the user to choose whether to include forks or not.

Acceptance Criteria

  • Add a flag that allows users to include forks in what is cloned/pulled. This will be off by default.
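A rough sketch of how the acceptance criteria could look in code. The flag name `--forks` and the helper `filter_forks` are assumptions for illustration, not the project's actual implementation. The sketch relies only on the boolean `fork` field that the GitHub REST API returns on repository objects, represented here as plain dicts.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a hypothetical --forks flag that is off by default."""
    parser = argparse.ArgumentParser(prog="github-archive")
    parser.add_argument(
        "--forks",
        action="store_true",
        default=False,
        help="Include forks when cloning or pulling repos.",
    )
    return parser


def filter_forks(repos: list, include_forks: bool) -> list:
    """Return all repos if forks are wanted, otherwise drop those marked as forks."""
    if include_forks:
        return list(repos)
    return [repo for repo in repos if not repo.get("fork", False)]
```

With this shape, running the tool without `--forks` would skip forked repos, and passing `--forks` would restore the current behavior of including them.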

Error TypeError: iterate_repos_to_archive() missing 1 required positional argument: 'operation'

Hello,

Here is the command I ran (Python 3.9, Windows):

github-archive.exe -s bybor -t XXXXX -l c:\git-archive --clone

Here is the output:

# GitHub Archive started...

# Making API call to GitHub for starred repos...

# Cloning missing starred repos...

Traceback (most recent call last):
  File "C:\Python39\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Python39\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Python39\Scripts\github-archive.exe\__main__.py", line 7, in <module>
  File "C:\Python39\lib\site-packages\github_archive\cli.py", line 134, in main
    GithubArchiveCli().run()
  File "C:\Python39\lib\site-packages\github_archive\cli.py", line 130, in run
    github_archive.run()
  File "C:\Python39\lib\site-packages\github_archive\archive.py", line 140, in run
    self.iterate_repos_to_archive(starred_repos, CLONE_OPERATION)
TypeError: iterate_repos_to_archive() missing 1 required positional argument: 'operation'

Could you help with what's going wrong?

Thanks!
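For context, this kind of `TypeError` usually means the method was defined with more parameters than the call site passes. The traceback shows a call with two arguments, so the method presumably expects three. The sketch below reproduces the message under that assumption. The `Archive` class and the middle parameter name `context` are hypothetical, and the real signature in `archive.py` may differ.

```python
CLONE_OPERATION = "clone"


class Archive:
    # Assumed signature: three parameters after self ('context' is a made-up name).
    def iterate_repos_to_archive(self, repos, context, operation):
        return [(repo, context, operation) for repo in repos]


archive = Archive()

# Passing only two arguments leaves 'operation' unfilled and raises TypeError.
try:
    archive.iterate_repos_to_archive(["repo-a"], CLONE_OPERATION)
    message = ""
except TypeError as error:
    message = str(error)

# Supplying all three arguments works.
result = archive.iterate_repos_to_archive(["repo-a"], "starred", CLONE_OPERATION)
```

If that is what happened here, the call in `run()` would need the missing argument added, which is a fix on the project side rather than something to change in the command.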
