mtarsel / containeranalysis

An application to query different registries and repos to run analytics on Docker Containers.
Maybe keep the .csv too in case people prefer it
From the wiki
With the shift towards OCP, there is a need to categorize the images from this repo:
https://github.com/openshift/library
Similar to how we do it for ICP images, we would need to gather image, container, and arch information.
Just need to add some spacing
There are some sys.exit() calls spread about which can take an argument for an error message. It would be useful to know what failed in those cases, and to make all of the logging that goes along with them use the same language.
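A minimal sketch of what that could look like (the function, URL handling, and message wording are illustrative, not from the codebase):

```python
import logging
import sys

import requests


def fetch_tags(image_url):
    """Hypothetical example: fail with a descriptive, logged message."""
    response = requests.get(image_url)
    if response.status_code != 200:
        # The same message goes to the log and to stderr via sys.exit(),
        # so every failure path reports in the same language.
        msg = 'fetch_tags failed: HTTP {} from {}'.format(
            response.status_code, image_url)
        logging.error(msg)
        sys.exit(msg)
    return response.json()
```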
Outputs useful information about how the program ran that is less verbose and more concise than container-output.log. metrics.log will include:
As I think of other metrics to add, I'll comment them below
Using a venv, ensure this project executes properly using python3. Modules will likely need to be updated and a new requirements.txt will be created.
High priority because if implemented properly, it has the potential to reduce the runtime by >25%, as well as helping to make #31 quite a bit easier because there would be less info to save between runs.
Currently, requests are made to dockerhub in 4 functions during repo_crawling: get_image_tag_count, get_image_tag_names, get_image_tag, and sometimes get_archs.
get_image_tag_count and get_image_tag_names each make the exact same request:
image_url = ('https://' + regis + '/v2/repositories/' + self.org + '/' +
             self.name + '/tags/?page=1&page_size=100')
get_alot_image_tag_names uses a very similar request, which can be reconfigured using the 'next' field in the response to the get_image_tag_count request.
get_image_tag and get_archs also both make the exact same request:
tag_url = ('https://' + regis + '/v2/repositories/' + self.org + '/' +
           self.name + '/tags/' + image_tag_name + '/')
Just to test, I saved the data from get_image_tag_count and passed it into get_image_tag_names, removing the request from get_image_tag_names. That brought the runtime of python3 get-image-info.py user.yaml -k from 11:09 to 10:19, almost a 9% reduction in runtime already. I am also fairly sure that the info used in get_image_tag and get_archs can be found in the first request, reducing runtime further.
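A rough sketch of the idea, assuming the requests library: a module-level cache keyed by URL means each tags endpoint is hit once per image instead of once per function (names here are illustrative):

```python
import requests

_response_cache = {}  # url -> parsed JSON, shared for the whole crawl


def get_json(url):
    """Fetch url at most once and reuse the parsed body afterwards."""
    if url not in _response_cache:
        _response_cache[url] = requests.get(url).json()
    return _response_cache[url]
```

get_image_tag_count and get_image_tag_names would then both call get_json(image_url) and read the fields they need, and get_image_tag and get_archs would share get_json(tag_url).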
Rather than just results.csv, there would be an archives folder with file names formatted as RESULTS-07-12-2019.csv. The old results from that day would be overwritten, but previous days would not.
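A sketch of that archiving step (folder and function names are illustrative; the date-stamped name follows the format above):

```python
import os
import shutil
from datetime import date


def archive_results(results_path='results.csv', archive_dir='archives'):
    """Copy results.csv into archives/ under a date-stamped name."""
    os.makedirs(archive_dir, exist_ok=True)
    # Same-day runs reuse the same name and overwrite it;
    # previous days keep their own files.
    dated_name = date.today().strftime('RESULTS-%m-%d-%Y.csv')
    shutil.copy(results_path, os.path.join(archive_dir, dated_name))
```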
[ENHANCEMENT]
Currently there is no way to see how long the application has been running or how close it is to finishing. A simple progress bar would fix that.
Not just dockerhub.com
From the wiki
To reduce run time for testing, add an option to use local values.yaml files and skip the process of pulling them from the interwebs.
See 4cffb18
This is helpful for testing a single application rather than waiting for all of them to finish and output. The code in tests.py is outdated and needs to be changed to better utilize existing functions instead of copied code.
This issue will be marked complete once it is possible to test a single application with proper output.
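A minimal sketch of what the single-app flag could look like (load_apps and the app.name attribute are placeholders for whatever the project already uses to load applications):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('user_yaml')
parser.add_argument('--test', metavar='APP_NAME',
                    help='only run the crawl for a single application')
args = parser.parse_args()

apps = load_apps(args.user_yaml)  # placeholder for the existing loader
if args.test:
    # Narrow the run to the one requested application.
    apps = [app for app in apps if app.name == args.test]
```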
Currently there are a lot of nested for loops that could each be turned into an individual stage. For instance, to get, extract, and parse the values.yaml, the deepest (sorta) stack trace looks like this:
(get-image-info.py) main -> parse_index_yaml -> (indexparser.py) get_tarfile -> obtain_values_yaml -> get_app_info -> parse_image_repo
and the flow currently looks like this:
for main_image:
    download, open, extract tarfile
    for item in tarfile:
        if (values.yaml):
            find images, tags, repos in values.yaml
            for each potential_repo:
                if potential_repo is dict:
                    for item in dict:
                        if item is dict:
                            for sub_item in item:
                                if correct_format:
                                    append(sub_item) to app_obj
for repos in app_obj:
    clean the repos
I think that could be cleaned up considerably if you did something like:
# Stage 1: download and extract
for each main_image:
    download, open, extract tarfile
# Stage 2: find values.yaml
for each main_image:
    find values.yaml
    find images, tags, repos in values.yaml
# Stage 3: get potential repos
for each potential_repo:
    if potential_repo is dict:
        for item in dict:
            if item is dict:
                for sub_item in item:
                    if correct_format:
                        append(sub_item) to app_obj
# Stage 4: clean the repos
This is a bigger undertaking and might take a while, but this way the code is more modular and easier to read, with fewer traces of "what calls what calls what?"
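In code, the stages could become plain top-level functions that each take and return simple lists. A sketch, where extract_tarfile, parse_values_yaml, correct_format, and clean_repo are placeholders for the existing logic:

```python
def download_and_extract(main_images):
    """Stage 1: download, open, and extract each tarfile."""
    return [extract_tarfile(image) for image in main_images]


def find_values_yamls(extracted):
    """Stage 2: locate each values.yaml and pull out images/tags/repos."""
    return [parse_values_yaml(item) for item in extracted]


def collect_repos(potential_repos, app_obj):
    """Stage 3: walk the nested dicts for correctly formatted repos."""
    for repo in potential_repos:
        if isinstance(repo, dict):
            for item in repo.values():
                if isinstance(item, dict):
                    for sub_item in item.values():
                        if correct_format(sub_item):  # placeholder check
                            app_obj.append(sub_item)


def clean_repos(app_obj):
    """Stage 4: normalize every collected repo string."""
    return [clean_repo(repo) for repo in app_obj]
```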
Since #40 is going to stay "private", imo the best way to share the results would be to post them in our Slack channel whenever there is a diff or something is very wrong.
Make everything PEP8 and increase in-code documentation (comments)
This task depends on #7
The Product name is obtained by grabbing the first line of the README for each Application. Sometimes there are comments in the README which are not seen once the markdown is rendered, so we don't actually get the Product name in the results.csv.
Since there are only a few applications with this problem, #7 should be closed and then we can utilize the single test cases to save ourselves some time.
From the wiki
Since we are no longer using the GitHub API token, there is no need to have a Travis secret environment variable, and thus there need not be a distinction between PR builds and non-PR builds.
There exists a dashboard with architecture information. This task is focused on:
This issue is a different type of cross-validation than #39 in that this task is about verifying our results against a different data set (GSA dashboard) and not verifying our own past results.
Rather than downloading the Application's tarball from GitHub and extracting just the files we need, we should just use the GitHub API and download the raw values.yaml and Chart.yaml into Applications/{app_name}/.
This should reduce run time by removing all the file system calls. It should be noted this will increase the number of requests to GitHub, so authentication will be mandatory to run this project to avoid the API's rate limit.
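A sketch of fetching raw files through the GitHub contents API (the function name and arguments are placeholders; the Accept header asks GitHub to return the raw file body, and the token satisfies the authentication requirement above):

```python
import os

import requests


def download_raw_file(owner, repo, path, dest_dir, token):
    """Fetch one file via the GitHub contents API and save it locally."""
    url = 'https://api.github.com/repos/{}/{}/contents/{}'.format(
        owner, repo, path)
    headers = {'Accept': 'application/vnd.github.raw',
               'Authorization': 'token {}'.format(token)}
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    os.makedirs(dest_dir, exist_ok=True)
    with open(os.path.join(dest_dir, os.path.basename(path)), 'wb') as f:
        f.write(response.content)
```

Each application's values.yaml and Chart.yaml would be saved straight into its Applications/{app_name}/ directory, with no tarball extraction in between.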
Since the majority of the runtime comes from crawling dockerhub, what if we were able to just save all of the information we pulled from dockerhub locally? A couple of ideas to do this:
This should reduce the runtime even further than #13 did
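One way to sketch this: persist the per-URL responses from #13's in-memory cache to a JSON file between runs (the file name and function names are illustrative):

```python
import json
import os

import requests

CACHE_FILE = 'dockerhub-cache.json'  # illustrative name


def load_cache():
    """Read the saved dockerhub responses from the previous run, if any."""
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            return json.load(f)
    return {}


def save_cache(cache):
    """Persist every response fetched this run for the next one."""
    with open(CACHE_FILE, 'w') as f:
        json.dump(cache, f)


def get_json(url, cache):
    """Serve from the local cache when possible; hit dockerhub otherwise."""
    if url not in cache:
        cache[url] = requests.get(url).json()
    return cache[url]
```

Tags do change over time, so the cache would need some expiry policy or a flag to force a refresh.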
Need to update tests and .travis.yml to avoid environment variables when running a PR build on Travis
e.g.: crawl git repos, parsing READMEs
From the wiki
progress bar: [### ] %done TIME
logs: ------TIME------
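A minimal sketch of that bar format (bar width and timing source are guesses):

```python
import sys
import time


def show_progress(done, total, start_time, width=20):
    """Render '[###   ]  30% 0:01:42' on a single updating line."""
    filled = int(width * done / total)
    bar = '#' * filled + ' ' * (width - filled)
    elapsed = int(time.time() - start_time)
    sys.stdout.write('\r[{}] {:3.0f}% {}:{:02d}:{:02d}'.format(
        bar, 100 * done / total, elapsed // 3600,
        (elapsed % 3600) // 60, elapsed % 60))
    sys.stdout.flush()
```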
There can be way more Travis tests utilizing the new --test feature to ensure proper outputs, as well as creating individual app objects and using those to test individual functions
Alert the user if there was a change from the last archived result to the current result, and what that change is.
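A sketch of the comparison, assuming the date-stamped archives from the archiving issue above (function name is illustrative):

```python
import difflib


def alert_on_changes(current_path, last_archive_path):
    """Report line-level changes between the last archive and this run."""
    with open(last_archive_path) as old, open(current_path) as new:
        diff = list(difflib.unified_diff(
            old.readlines(), new.readlines(),
            fromfile=last_archive_path, tofile=current_path))
    if diff:
        print('Results changed since the last archived run:')
        print(''.join(diff), end='')
    return diff
```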
Caught in travis-ci cron build on my fork:
https://travis-ci.com/1ethanhansen/ContainerAnalysis/builds/131578728
parse_repos_1() needs to be updated.
Not a lot of context for this one, more of a stretch goal. Not even sure how we would go about it.