mtarsel / containeranalysis

An application to query different registries and repos to run analytics on Docker Containers.
Maybe keep the .csv too in case people prefer it
From the wiki
With the shift towards OCP, there is a need to categorize the images from this repo:
https://github.com/openshift/library
Similar to how we do it for ICP images, we would need to gather image, container, and arch information.
Just need to add some spacing
There are some sys.exit() calls spread about which can take an argument for an error message. It would be useful to know what failed in those cases, and to make all of the logging that goes along with them use the same language.
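A minimal sketch of what that could look like (the function, URL handling, and message wording are illustrative, not from the codebase):

```python
import logging
import sys

import requests


def fetch_tags(image_url):
    """Hypothetical example: fail with a descriptive, logged message."""
    response = requests.get(image_url)
    if response.status_code != 200:
        # The same message goes to the log and to stderr via sys.exit(),
        # so every failure path reports in the same language.
        msg = 'fetch_tags failed: HTTP {} from {}'.format(
            response.status_code, image_url)
        logging.error(msg)
        sys.exit(msg)
    return response.json()
```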
Outputs useful information about how the program ran that is less verbose and more concise than container-output.log. metrics.log will include:
As I think of other metrics to add, I'll comment them below
Using a venv, ensure this project executes properly using python3. Modules will likely need to be updated and a new requirements.txt will be created.
High priority because if implemented properly, it has the potential to reduce the runtime by >25%, as well as helping to make #31 quite a bit easier because there would be less info to save between runs.
Currently, requests are made to dockerhub in 4 functions during repo_crawling: get_image_tag_count, get_image_tag_names, get_image_tag, and sometimes get_archs.
get_image_tag_count and get_image_tag_names each make the exact same request:
image_url = ('https://' + regis + '/v2/repositories/' + self.org + '/' +
             self.name + '/tags/?page=1&page_size=100')
get_alot_image_tag_names uses a very similar request, which can be reconfigured using the 'next' field in the response to the get_image_tag_count request.
get_image_tag and get_archs also both make the exact same request:
tag_url = ('https://' + regis + '/v2/repositories/' + self.org + '/' +
           self.name + '/tags/' + image_tag_name + '/')
Just to test, I saved the data from get_image_tag_count and passed it into get_image_tag_names, removing the request from get_image_tag_names. That brought the runtime of python3 get-image-info.py user.yaml -k from 11:09 to 10:19, almost a 9% reduction in runtime already. I am also fairly sure that the info used in get_image_tag and get_archs can be found in the first request, reducing runtime further.
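A rough sketch of the idea, assuming the requests library: a module-level cache keyed by URL means each tags endpoint is hit once per image instead of once per function (names here are illustrative):

```python
import requests

_response_cache = {}  # url -> parsed JSON, shared for the whole crawl


def get_json(url):
    """Fetch url at most once and reuse the parsed body afterwards."""
    if url not in _response_cache:
        _response_cache[url] = requests.get(url).json()
    return _response_cache[url]
```

get_image_tag_count and get_image_tag_names would then both call get_json(image_url) and read the fields they need, and get_image_tag and get_archs would share get_json(tag_url).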
Rather than just results.csv, there would be an archives folder with file names formatted as RESULTS-07-12-2019.csv. The old results from that day would be overwritten, but previous days would not.
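A sketch of that archiving step (folder and function names are illustrative; the date-stamped name follows the format above):

```python
import os
import shutil
from datetime import date


def archive_results(results_path='results.csv', archive_dir='archives'):
    """Copy results.csv into archives/ under a date-stamped name."""
    os.makedirs(archive_dir, exist_ok=True)
    # Same-day runs reuse the same name and overwrite it;
    # previous days keep their own files.
    dated_name = date.today().strftime('RESULTS-%m-%d-%Y.csv')
    shutil.copy(results_path, os.path.join(archive_dir, dated_name))
```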
[ENHANCEMENT]
Currently there is no way to see how long the application has been running or how close it is to finishing. A simple progress bar would fix that.
Not just dockerhub.com
From the wiki
To reduce run time for testing, add an option to use local values.yaml files and skip the process of pulling them from the interwebs.
See 4cffb18
This is helpful for testing a single application rather than waiting for all of them to finish and output. The code in tests.py is outdated and needs to be changed to better utilize existing functions instead of copied code.
This issue will be marked complete once it is possible to test a single application with proper output.
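A minimal sketch of what the single-app flag could look like (load_apps and the app.name attribute are placeholders for whatever the project already uses to load applications):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('user_yaml')
parser.add_argument('--test', metavar='APP_NAME',
                    help='only run the crawl for a single application')
args = parser.parse_args()

apps = load_apps(args.user_yaml)  # placeholder for the existing loader
if args.test:
    # Narrow the run to the one requested application.
    apps = [app for app in apps if app.name == args.test]
```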
Currently there are a lot of nested for loops that could each be turned into an individual stage. For instance, to get, extract, and parse the values.yaml, the deepest (sorta) stack trace looks like this:
(get-image-info.py) main -> parse_index_yaml -> (indexparser.py) get_tarfile -> obtain_values_yaml -> get_app_info -> parse_image_repo
and the flow currently looks like this:
for main_image:
    download, open, extract tarfile
    for item in tarfile:
        if (values.yaml):
            find images, tags, repos in values.yaml
            for each potential_repo:
                if potential_repo is dict:
                    for item in dict:
                        if item is dict:
                            for sub_item in item:
                                if correct_format:
                                    append(sub_item) to app_obj
for repos in app_obj:
    clean the repos
I think that could be cleaned up considerably if you did something like:
# Stage 1: download and extract
for each main_image:
    download, open, extract tarfile
# Stage 2: find values.yaml
for each main_image:
    find values.yaml
    find images, tags, repos in values.yaml
# Stage 3: get potential repos
for each potential_repo:
    if potential_repo is dict:
        for item in dict:
            if item is dict:
                for sub_item in item:
                    if correct_format:
                        append(sub_item) to app_obj
# Stage 4: clean the repos
This is a bigger undertaking and might take a while, but this way the code is more modular and easier to read, with fewer traces of "what calls what calls what?"
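In code, the stages could become plain top-level functions that each take and return simple lists. A sketch, where extract_tarfile, parse_values_yaml, correct_format, and clean_repo are placeholders for the existing logic:

```python
def download_and_extract(main_images):
    """Stage 1: download, open, and extract each tarfile."""
    return [extract_tarfile(image) for image in main_images]


def find_values_yamls(extracted):
    """Stage 2: locate each values.yaml and pull out images/tags/repos."""
    return [parse_values_yaml(item) for item in extracted]


def collect_repos(potential_repos, app_obj):
    """Stage 3: walk the nested dicts for correctly formatted repos."""
    for repo in potential_repos:
        if isinstance(repo, dict):
            for item in repo.values():
                if isinstance(item, dict):
                    for sub_item in item.values():
                        if correct_format(sub_item):  # placeholder check
                            app_obj.append(sub_item)


def clean_repos(app_obj):
    """Stage 4: normalize every collected repo string."""
    return [clean_repo(repo) for repo in app_obj]
```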
Since #40 is going to stay "private", imo the best way to share the results would be to post them in our Slack channel whenever there is a diff or something is very wrong.
Make everything PEP8 and increase in-code documentation (comments)
This task depends on #7
The Product name is obtained by grabbing the first line of the README for each Application. Sometimes there are comments in the README which are not seen once the markdown is rendered, so we don't actually get the Product name in the results.csv.
Since there are only a few applications with this problem, #7 should be closed and then we can utilize the single test cases to save ourselves some time.
From the wiki
Since we are no longer using the GitHub API token, there is no need to have a Travis secret environment variable, and thus there need not be a distinction between PR builds and non-PR builds.
There exists a dashboard with architecture information. This task is focused on:
This issue is a different type of cross-validation than #39 in that this task is about verifying our results against a different data set (GSA dashboard) and not verifying our own past results.
Rather than downloading the Application's tarball from GitHub and extracting just the files we need, we should just use the GitHub API and download the raw values.yaml and Chart.yaml into Applications/{app_name}/.
This should reduce run time by removing all the file system calls. It should be noted this will increase the number of requests to GitHub, so authentication will be mandatory to run this project to avoid the API's rate limit.
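A sketch of fetching raw files through the GitHub contents API (the function name and arguments are placeholders; the Accept header asks GitHub to return the raw file body, and the token satisfies the authentication requirement above):

```python
import os

import requests


def download_raw_file(owner, repo, path, dest_dir, token):
    """Fetch one file via the GitHub contents API and save it locally."""
    url = 'https://api.github.com/repos/{}/{}/contents/{}'.format(
        owner, repo, path)
    headers = {'Accept': 'application/vnd.github.raw',
               'Authorization': 'token {}'.format(token)}
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    os.makedirs(dest_dir, exist_ok=True)
    with open(os.path.join(dest_dir, os.path.basename(path)), 'wb') as f:
        f.write(response.content)
```

Each application's values.yaml and Chart.yaml would be saved straight into its Applications/{app_name}/ directory, with no tarball extraction in between.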
Since the majority of the runtime comes from crawling dockerhub, what if we were able to just save all of the information we pulled from dockerhub locally? A couple of ideas to do this:
This should reduce the runtime even further than #13 did
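One way to sketch this: persist the per-URL responses from #13's in-memory cache to a JSON file between runs (the file name and function names are illustrative):

```python
import json
import os

import requests

CACHE_FILE = 'dockerhub-cache.json'  # illustrative name


def load_cache():
    """Read the saved dockerhub responses from the previous run, if any."""
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            return json.load(f)
    return {}


def save_cache(cache):
    """Persist every response fetched this run for the next one."""
    with open(CACHE_FILE, 'w') as f:
        json.dump(cache, f)


def get_json(url, cache):
    """Serve from the local cache when possible; hit dockerhub otherwise."""
    if url not in cache:
        cache[url] = requests.get(url).json()
    return cache[url]
```

Tags do change over time, so the cache would need some expiry policy or a flag to force a refresh.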
Need to update tests and .travis.yml to avoid environment variables when running a PR build on Travis
e.g.: crawl git repos, parsing READMEs
From the wiki
progress bar: [### ] %done TIME
logs: ------TIME------
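A minimal sketch of that bar format (bar width and timing source are guesses):

```python
import sys
import time


def show_progress(done, total, start_time, width=20):
    """Render '[###   ]  30% 0:01:42' on a single updating line."""
    filled = int(width * done / total)
    bar = '#' * filled + ' ' * (width - filled)
    elapsed = int(time.time() - start_time)
    sys.stdout.write('\r[{}] {:3.0f}% {}:{:02d}:{:02d}'.format(
        bar, 100 * done / total, elapsed // 3600,
        (elapsed % 3600) // 60, elapsed % 60))
    sys.stdout.flush()
```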
There can be way more Travis tests utilizing the new --test feature to ensure proper outputs, as well as creating individual app objects and using those to test individual functions
Alert the user if there was a change from the last archived result to the current result, and what that change is.
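A sketch of the comparison, assuming the date-stamped archives from the archiving issue above (function name is illustrative):

```python
import difflib


def alert_on_changes(current_path, last_archive_path):
    """Report line-level changes between the last archive and this run."""
    with open(last_archive_path) as old, open(current_path) as new:
        diff = list(difflib.unified_diff(
            old.readlines(), new.readlines(),
            fromfile=last_archive_path, tofile=current_path))
    if diff:
        print('Results changed since the last archived run:')
        print(''.join(diff), end='')
    return diff
```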
Caught in travis-ci cron build on my fork:
https://travis-ci.com/1ethanhansen/ContainerAnalysis/builds/131578728
parse_repos_1() needs to be updated.
Not a lot of context for this one, more of a stretch goal. Not even sure how we would go about it.