
ilyafaer / github-scraper


GitHub Scraper is a tool for tracking several repositories within one Google Spreadsheet, making task management and status sharing between teammates easier.

License: Apache License 2.0

Python 100.00%
github tracker issue task-management repository-management

github-scraper's People

Stargazers: 10
Watchers: 3
Forkers: mf2199

github-scraper's Issues

Process percentage

The first table-building iteration can take a lot of time (in the case of very active repositories). As the GitHub API returns the total number of issues matched by a filter, we can compute the percentage of processed issues/PRs and show it in the logs (see the sketch after the snippets below).

See:

def build_table(self):
    """Build list of issues/PRs from given repositories.

    Returns:
        dict:
            Index of issue in format:
            {(issue.number, repo_short_name): github.Issue.Issue}
    """
    issue_index = {}
    repo_names = list(self._repo_names.keys()) + list(self._in_repo_names.keys())

    for repo_name in repo_names:
        repo = self._repos.setdefault(repo_name, gh_client.get_repo(repo_name))
        self._index_closed_prs(repo)

        # process open PRs and issues
        for issue in repo.get_issues():
            id_ = self._build_issues_id(issue, repo)
            if id_:
                issue_index[id_] = issue

    return issue_index

and
def _index_closed_prs(self, repo):
    """Add closed pull requests into PRs index.

    Method remembers last PR's update time and doesn't
    indexate PRs which weren't updated since last
    spreadsheet update.

    Args:
        repo (github.Repository.Repository):
            Repository object.
    """
    pulls = repo.get_pulls(state="closed", sort="updated", direction="desc")
    if pulls.totalCount:
        for pull in pulls:
            if pull.updated_at < self._last_pr_updates.setdefault(
                repo.full_name, datetime.datetime(1, 1, 1)
            ):
                break

            key_phrases = self._try_match_keywords(pull.body)
            for key_phrase in key_phrases:
                self.prs_index.add(repo, self._get_repo_lts(repo), pull, key_phrase)

        self._last_pr_updates[repo.full_name] = pulls[0].updated_at
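
A sketch of how the percentage could be logged, relying on the totalCount attribute that PyGithub exposes on paginated lists; the helper name and log format are only illustrations, not existing Scraper code:

import logging

def log_progress(done, total, repo_name):
    """Log how far issue processing has advanced for a repository."""
    if total:
        percent = 100.0 * done / total
        logging.info(
            "%s: processed %d/%d issues (%.1f%%)", repo_name, done, total, percent
        )

# inside build_table() the loop could then count processed issues:
#
#     issues = repo.get_issues()
#     for num, issue in enumerate(issues, start=1):
#         ...
#         log_progress(num, issues.totalCount, repo_name)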

Speed up

Speed up Scraper by adding a date filter for issues. Issues should be retrieved sorted by "updated", so we can stop reading and processing issues which were updated long ago. The "since" filter can also be used to get only recently changed issues.
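
A minimal sketch of the idea with PyGithub, which supports both server-side sorting and the since parameter on get_issues(); the last_run timestamp is assumed to be stored somewhere by the Scraper:

import datetime

def get_recent_issues(repo, last_run):
    """Yield only issues updated since the previous run, newest first."""
    issues = repo.get_issues(
        state="all", sort="updated", direction="desc", since=last_run
    )
    for issue in issues:
        yield issue

# usage: for issue in get_recent_issues(repo, datetime.datetime(2020, 1, 1)): ...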

Pull requests without related issues?

It may be worth tracking pull requests which are not related to any issue. The problem here is how to show them in the table, how to track them, and how not to technically mix them up with issues.

UPDATE: issues are now identified by their URLs. Issue and PR numbers never collide with each other, so the "PRs without related issues" feature can now be implemented.

Add cleanup function

The user should have an opportunity to tweak the conditions under which an issue is deleted from a table (mostly to avoid overfilling). For example, an issue which was closed within three days without a pull request may not be very interesting to table owners.

It would probably be optimal to call the cleanup function on every issue (after it has received all data updates). fill_funcs.py is a good place for it. The function itself should return a bool: True - delete the issue, False - let the issue stay in the table. Internal Scraper code will then delete issues marked in such fashion.
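
A minimal sketch of such a hook, assuming it lives in fill_funcs.py; the name to_be_deleted and the arguments it takes are illustrative, not the final API:

import datetime

def to_be_deleted(issue, related_prs):
    """Return True if the row should be removed from the table.

    issue is a github.Issue.Issue, related_prs is the list of pull
    requests linked to it (possibly empty).
    """
    if issue.state == "closed" and not related_prs:
        # closed quickly and never got a PR - probably not interesting
        return issue.closed_at - issue.created_at <= datetime.timedelta(days=3)
    return False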

Scraper code should be covered with tests

For now Scraper testing consists only of running it on live data, which can take some time. It would be great to have a set of unit/system tests to be able to easily check Scraper's health when new features arrive.

Tests should be added into new "tests" folder.

While writing the tests, the coverage package should be used to make sure that functions/objects are completely covered with tests.

Class for PRs indexes

Move the PR index functionality into a separate class, as it has become too complex to read as part of other classes.
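
A rough shape such a class could take; the name and fields below are assumptions, not the project's final design:

class PullRequestsIndex:
    """Index of pull requests keyed by the issue they refer to."""

    def __init__(self):
        self._index = {}        # {(issue_number, repo_short_name): [pulls]}
        self.last_updates = {}  # {repo_full_name: last seen PR update time}

    def add(self, repo_short_name, issue_number, pull):
        """Remember that the given PR refers to the given issue."""
        self._index.setdefault((issue_number, repo_short_name), []).append(pull)

    def get_related(self, issue_number, repo_short_name):
        """Return all PRs linked to the issue (possibly an empty list)."""
        return self._index.get((issue_number, repo_short_name), [])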

Speed up with inserting issues one by one

For now issues are inserted monolithically - the whole table in a single request. This takes time and forces Scraper to recalculate all the highlighting, which means tens or hundreds of requests. On top of this, before sending new highlighting requests we have to clear all of the current highlighting.

The best way to implement the speedup is to do the operations one by one: first sort the updated backend table and send requests to move the updated rows. Then insert new rows into the backend table, sort it, and send requests to insert the new rows into the spreadsheet. Then remove deleted rows from the backend table, sort it, and send the row-deletion requests. With this approach the system will also become more stable.
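
One way to organize this is to first split the rows into the three groups and then handle each group with its own small portion of requests. A sketch of that split, assuming both indexes are dicts keyed by (issue_number, repo_short_name) and mapping to row values:

def split_rows(old_index, new_index):
    """Split rows into the three groups to be processed one by one."""
    old_keys, new_keys = set(old_index), set(new_index)

    updated = [key for key in old_keys & new_keys if old_index[key] != new_index[key]]
    inserted = list(new_keys - old_keys)
    deleted = list(old_keys - new_keys)
    return updated, inserted, deleted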

Set cell color with filling function

The user should have the ability to set a cell color while designating its value within a filling function. It's probably better to add a new field to the old_issue object for this purpose.
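
A hedged sketch of how a filling function could use such a field, assuming old_issue behaves like a dict of cell values; the "Priority_color" field is hypothetical and would have to be read by internal Scraper code:

def fill_priority(old_issue, issue):
    """Example filling function: set a cell value and a highlight color."""
    labels = [label.name for label in issue.labels]
    is_bug = "bug" in labels

    old_issue["Priority"] = "High" if is_bug else "Normal"
    # hypothetical new field: RGB color the Scraper would apply to the cell
    old_issue["Priority_color"] = (1.0, 0.8, 0.8) if is_bug else None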

Ignoring specific issues

Some issues may not be wanted to be tracked by Scraper at all. Implement functionality which gives the user the ability to set rules for ignoring issues. This should be located in fill_funcs.py for easier access, and should be called on every table update.
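
A minimal example of what such a rule could look like in fill_funcs.py; the hook name to_be_ignored is hypothetical:

def to_be_ignored(issue):
    """Return True if the issue should not be tracked at all."""
    labels = [label.name for label in issue.labels]
    # skip anything explicitly marked as uninteresting for the table
    return "wontfix" in labels or issue.title.startswith("[meta]")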

Detect external issue-PR relations

For now relations between issues and PRs can be detected only if both objects are in the same repository. It makes sense to search for connections across all of the repositories tracked on a single sheet.

The try_match_keywords() function probably has to be widened to understand whether the referenced issue was posted in another repository.
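
A sketch of what the widened matching could detect, based on GitHub's closing-keyword syntax ("Closes #12" for the same repository, "Closes owner/repo#12" for another one); the function name and return format are illustrative:

import re

RELATION_RE = re.compile(
    r"(close[sd]?|fix(e[sd])?|resolve[sd]?)\s+"
    r"(?P<repo>[\w.-]+/[\w.-]+)?#(?P<number>\d+)",
    re.IGNORECASE,
)

def match_related_issues(pull_body, current_repo_full_name):
    """Return (repo_full_name, issue_number) pairs referenced by a PR body."""
    relations = []
    for match in RELATION_RE.finditer(pull_body or ""):
        repo_name = match.group("repo") or current_repo_full_name
        relations.append((repo_name, int(match.group("number"))))
    return relations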

Send requests in smaller portions

For now we're sending all formatting requests as one single list:

def _apply_formating_data(self, requests):
    """Apply formating data with batch update.

    Args:
        requests (list):
            Dicts, each of which represents single request.
    """
    if requests:
        service.spreadsheets().batchUpdate(
            spreadsheetId=self._id, body={"requests": requests}
        ).execute()

In some repositories the number of requests can reach hundreds or thousands, and processing them on the backend can take time. During this processing, visual artifacts can appear in the spreadsheet (URLs erased, highlighting cleared, etc.), which can look really weird.

It's better to send requests in smaller batches. This practice also reduces the number of lost highlights, because a batch fails as a whole: one failed 300-request batch skips all 300 requests, while failing 1 of 30 batches with 10 requests per batch loses only 10 and spares the other 290.
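
A sketch of chunked sending built on the same batchUpdate call and the same service object as the snippet above; the batch size of 10 is only an example:

def _apply_formatting_data_batched(self, requests, batch_size=10):
    """Send formatting requests in small portions instead of one huge list."""
    for start in range(0, len(requests), batch_size):
        batch = requests[start:start + batch_size]
        # a failed batch now loses at most batch_size requests
        service.spreadsheets().batchUpdate(
            spreadsheetId=self._id, body={"requests": batch}
        ).execute()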

Archive table

Add functionality to move done and closed issues into an archive table to avoid overwhelming the existing tables.

A to_be_archived() function should be added into fill_funcs.py. It should determine whether an issue is to be archived; the function should return a list (row) ready to be inserted into the archive sheet.

For this feature a new constant needs to be added into config.py. It must have the same structure as the other sheet configurations.
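
A sketch of such a hook in fill_funcs.py; the returned columns are an assumption, and a real setup would mirror the archive sheet configuration in config.py:

def to_be_archived(issue, row):
    """Return a row for the archive sheet, or None to keep the issue in place.

    issue is a github.Issue.Issue, row is the current table row (a dict).
    """
    if issue.state != "closed":
        return None
    # illustrative column set
    return [row.get("Issue"), row.get("Repository"), issue.title, str(issue.closed_at)]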

Old issues appear in table with New status

It's probably caused by using the updated_at time for the since filter. It would be better to use the time of the last table update instead: on the first update after start, Scraper analyzes only open issues, so the remembered date equals the update time of the last open issue. Any issue that was closed after that time will then be added into the table as if it were new.
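
A tiny sketch of the proposed fix: remember the moment a repo's table update started and pass that, rather than any issue's updated_at, as the since value on the next run (class and attribute names are assumptions):

import datetime

class UpdateTracker:
    """Remember when each repository's table was last rebuilt."""

    def __init__(self):
        self._last_table_update = {}

    def since_for(self, repo_full_name):
        """Value to pass as 'since' on the next run."""
        return self._last_table_update.get(repo_full_name, datetime.datetime(1, 1, 1))

    def mark_started(self, repo_full_name):
        """Call when the repo's update starts, so later changes aren't missed."""
        self._last_table_update[repo_full_name] = datetime.datetime.utcnow()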

Implement spreadsheet name with attributes

The spreadsheet name is currently used only in the config.py file, which is not very convenient. Some people may want to work with the names of their spreadsheets instead of ids. Thus, it would be good to implement a name attribute for the Spreadsheet() class, definitely with a setter which also changes the spreadsheet name on the service.
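
A sketch of such a setter, assuming the same service object the Scraper uses elsewhere; updateSpreadsheetProperties is the Sheets API request that renames a spreadsheet, while the class body here is only a stub:

class Spreadsheet:
    def __init__(self, id_, name):
        self._id = id_
        self._name = name

    @property
    def name(self):
        """Current spreadsheet title."""
        return self._name

    @name.setter
    def name(self, new_name):
        """Rename the spreadsheet on the Google Sheets service as well."""
        service.spreadsheets().batchUpdate(
            spreadsheetId=self._id,
            body={
                "requests": [{
                    "updateSpreadsheetProperties": {
                        "properties": {"title": new_name},
                        "fields": "title",
                    }
                }]
            },
        ).execute()
        self._name = new_name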

Get rid of [Issue, Repository] ids

Scraper uses the issue number and repository short name as row ids in internal calculations. This is not very convenient, as users have to keep Issue and Repository columns in their tables, which may not be wanted.

Add Sheet class

Sheet processing logic has become large enough to move it into a separate Sheet() class and make Spreadsheet() a little clearer and easier to read.

Statistics table

Functionality to build statistics tables should be implemented, showing how many issues were created/closed, how many PRs were merged, and which other actions were performed during given periods of time.
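
A small sketch of how such counts could be gathered from already-indexed issues and PRs; the monthly granularity is an assumption:

from collections import Counter

def monthly_stats(issues, pulls):
    """Count created/closed issues and merged PRs per calendar month."""
    stats = Counter()
    for issue in issues:
        stats[(issue.created_at.strftime("%Y-%m"), "created")] += 1
        if issue.closed_at:
            stats[(issue.closed_at.strftime("%Y-%m"), "closed")] += 1
    for pull in pulls:
        if pull.merged_at:
            stats[(pull.merged_at.strftime("%Y-%m"), "merged")] += 1
    return stats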
