
ilyafaer / github-scraper


GitHub Scraper is a tool for tracking several repositories within one Google Spreadsheet, making task management and status sharing between teammates easier.

License: Apache License 2.0

Python 100.00%
github tracker issue task-management repository-management

github-scraper's People

Stargazers: 10
Watchers: 3
Forkers: mf2199

github-scraper's Issues

Process percentage

The first table-building iteration can take a lot of time (in the case of very active repositories). As the GitHub API returns the total number of issues matched by a filter, we can compute the percentage of processed issues/PRs and show it in the logs (see the sketch after the snippets below).

See:

def build_table(self):
    """Build list of issues/PRs from given repositories.

    Returns:
        dict:
            Index of issue in format:
            {(issue.number, repo_short_name): github.Issue.Issue}
    """
    issue_index = {}
    repo_names = list(self._repo_names.keys()) + list(self._in_repo_names.keys())

    for repo_name in repo_names:
        repo = self._repos.setdefault(repo_name, gh_client.get_repo(repo_name))
        self._index_closed_prs(repo)

        # process open PRs and issues
        for issue in repo.get_issues():
            id_ = self._build_issues_id(issue, repo)
            if id_:
                issue_index[id_] = issue

    return issue_index

and
def _index_closed_prs(self, repo):
    """Add closed pull requests into PRs index.

    Method remembers last PR's update time and doesn't
    indexate PRs which weren't updated since last
    spreadsheet update.

    Args:
        repo (github.Repository.Repository):
            Repository object.
    """
    pulls = repo.get_pulls(state="closed", sort="updated", direction="desc")
    if pulls.totalCount:
        for pull in pulls:
            if pull.updated_at < self._last_pr_updates.setdefault(
                repo.full_name, datetime.datetime(1, 1, 1)
            ):
                break

            key_phrases = self._try_match_keywords(pull.body)
            for key_phrase in key_phrases:
                self.prs_index.add(repo, self._get_repo_lts(repo), pull, key_phrase)

        self._last_pr_updates[repo.full_name] = pulls[0].updated_at
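
A sketch of how the percentage could be logged, relying on the totalCount attribute that PyGithub exposes on paginated lists; the helper name and log format are only illustrations, not existing Scraper code:

import logging

def log_progress(done, total, repo_name):
    """Log how far issue processing has advanced for a repository."""
    if total:
        percent = 100.0 * done / total
        logging.info(
            "%s: processed %d/%d issues (%.1f%%)", repo_name, done, total, percent
        )

# inside build_table() the loop could then count processed issues:
#
#     issues = repo.get_issues()
#     for num, issue in enumerate(issues, start=1):
#         ...
#         log_progress(num, issues.totalCount, repo_name)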

Speed up

Speed up Scraper by adding a date filter for issues. Issues should be retrieved sorted by "updated", so we can stop reading and processing issues which were updated long ago. The "since" filter can also be used to get only recently changed issues.
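
A minimal sketch of the idea with PyGithub, which supports both server-side sorting and the since parameter on get_issues(); the last_run timestamp is assumed to be stored somewhere by the Scraper:

import datetime

def get_recent_issues(repo, last_run):
    """Yield only issues updated since the previous run, newest first."""
    issues = repo.get_issues(
        state="all", sort="updated", direction="desc", since=last_run
    )
    for issue in issues:
        yield issue

# usage: for issue in get_recent_issues(repo, datetime.datetime(2020, 1, 1)): ...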

Pull requests without related issues?

It may be worth tracking pull requests which are not related to any issue. The problem here is how to show them in the table, how to track them, and how not to technically mix them up with issues.

UPDATE: issues are now identified by their URLs. Issue and PR numbers never collide with each other, so the "PRs without related issues" feature can now be implemented.

Add cleanup function

The user should have an opportunity to tweak the conditions under which an issue is deleted from a table (mostly to avoid overfilling). For example, an issue which was closed within three days without a pull request may not be very interesting to table owners.

It would probably be optimal to call the cleanup function on every issue (after it has received all data updates). fill_funcs.py is a good place for it. The function itself should return a bool: True - delete the issue, False - let the issue stay in the table. Internal Scraper code will then delete issues marked in such fashion.
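
A minimal sketch of such a hook, assuming it lives in fill_funcs.py; the name to_be_deleted and the arguments it takes are illustrative, not the final API:

import datetime

def to_be_deleted(issue, related_prs):
    """Return True if the row should be removed from the table.

    issue is a github.Issue.Issue, related_prs is the list of pull
    requests linked to it (possibly empty).
    """
    if issue.state == "closed" and not related_prs:
        # closed quickly and never got a PR - probably not interesting
        return issue.closed_at - issue.created_at <= datetime.timedelta(days=3)
    return False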

Scraper code should be covered with tests

For now Scraper testing consists only of running it on live data, which can take some time. It would be great to have a set of unit/system tests to be able to easily check Scraper's health when new features arrive.

Tests should be added into new "tests" folder.

While writing the tests, the coverage package should be used to make sure that functions/objects are completely covered with tests.

Class for PRs indexes

Move the PR index functionality into a separate class, as it has become too complex to read as part of other classes.
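
A rough shape such a class could take; the name and fields below are assumptions, not the project's final design:

class PullRequestsIndex:
    """Index of pull requests keyed by the issue they refer to."""

    def __init__(self):
        self._index = {}        # {(issue_number, repo_short_name): [pulls]}
        self.last_updates = {}  # {repo_full_name: last seen PR update time}

    def add(self, repo_short_name, issue_number, pull):
        """Remember that the given PR refers to the given issue."""
        self._index.setdefault((issue_number, repo_short_name), []).append(pull)

    def get_related(self, issue_number, repo_short_name):
        """Return all PRs linked to the issue (possibly an empty list)."""
        return self._index.get((issue_number, repo_short_name), [])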

Speed up with inserting issues one by one

For now issues are inserted monolithically - the whole table in a single request. This takes time and forces Scraper to recalculate all the highlighting, which means tens or hundreds of requests. On top of this, before sending new highlighting requests we have to clear all of the current highlighting.

The best way to implement the speedup is to do the operations one by one: first sort the updated backend table and send requests to move the updated rows. Then insert new rows into the backend table, sort it, and send requests to insert the new rows into the spreadsheet. Then remove deleted rows from the backend table, sort it, and send the row-deletion requests. With this approach the system will also become more stable.
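
One way to organize this is to first split the rows into the three groups and then handle each group with its own small portion of requests. A sketch of that split, assuming both indexes are dicts keyed by (issue_number, repo_short_name) and mapping to row values:

def split_rows(old_index, new_index):
    """Split rows into the three groups to be processed one by one."""
    old_keys, new_keys = set(old_index), set(new_index)

    updated = [key for key in old_keys & new_keys if old_index[key] != new_index[key]]
    inserted = list(new_keys - old_keys)
    deleted = list(old_keys - new_keys)
    return updated, inserted, deleted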

Set cell color with filling function

The user should have the ability to set a cell color while designating its value within a filling function. It's probably better to add a new field to the old_issue object for this purpose.
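
A hedged sketch of how a filling function could use such a field, assuming old_issue behaves like a dict of cell values; the "Priority_color" field is hypothetical and would have to be read by internal Scraper code:

def fill_priority(old_issue, issue):
    """Example filling function: set a cell value and a highlight color."""
    labels = [label.name for label in issue.labels]
    is_bug = "bug" in labels

    old_issue["Priority"] = "High" if is_bug else "Normal"
    # hypothetical new field: RGB color the Scraper would apply to the cell
    old_issue["Priority_color"] = (1.0, 0.8, 0.8) if is_bug else None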

Ignoring specific issues

Some issues may not be wanted to be tracked by Scraper at all. Implement functionality which gives the user the ability to set rules for ignoring issues. This should be located in fill_funcs.py for easier access, and should be called on every table update.
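
A minimal example of what such a rule could look like in fill_funcs.py; the hook name to_be_ignored is hypothetical:

def to_be_ignored(issue):
    """Return True if the issue should not be tracked at all."""
    labels = [label.name for label in issue.labels]
    # skip anything explicitly marked as uninteresting for the table
    return "wontfix" in labels or issue.title.startswith("[meta]")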

Detect external issue-PR relations

For now relations between issues and PRs can be detected only if both objects are in the same repository. It makes sense to search for connections across all of the repositories tracked on a single sheet.

The try_match_keywords() function probably has to be widened to understand whether the referenced issue was posted in another repository.
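
A sketch of what the widened matching could detect, based on GitHub's closing-keyword syntax ("Closes #12" for the same repository, "Closes owner/repo#12" for another one); the function name and return format are illustrative:

import re

RELATION_RE = re.compile(
    r"(close[sd]?|fix(e[sd])?|resolve[sd]?)\s+"
    r"(?P<repo>[\w.-]+/[\w.-]+)?#(?P<number>\d+)",
    re.IGNORECASE,
)

def match_related_issues(pull_body, current_repo_full_name):
    """Return (repo_full_name, issue_number) pairs referenced by a PR body."""
    relations = []
    for match in RELATION_RE.finditer(pull_body or ""):
        repo_name = match.group("repo") or current_repo_full_name
        relations.append((repo_name, int(match.group("number"))))
    return relations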

Send requests in smaller portions

For now we're sending all formatting requests as one single list:

def _apply_formating_data(self, requests):
    """Apply formating data with batch update.

    Args:
        requests (list):
            Dicts, each of which represents single request.
    """
    if requests:
        service.spreadsheets().batchUpdate(
            spreadsheetId=self._id, body={"requests": requests}
        ).execute()

In some repositories the number of requests can reach hundreds or thousands, and processing them on the backend can take time. During this processing, visual artifacts can appear in the spreadsheet (URLs erased, highlighting cleared, etc.), which can look really weird.

It's better to send requests in smaller batches. This practice also reduces the number of lost highlights, because a batch fails as a whole: one failed 300-request batch skips all 300 requests, while failing 1 of 30 batches with 10 requests per batch loses only 10 and spares the other 290.
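
A sketch of chunked sending built on the same batchUpdate call and the same service object as the snippet above; the batch size of 10 is only an example:

def _apply_formatting_data_batched(self, requests, batch_size=10):
    """Send formatting requests in small portions instead of one huge list."""
    for start in range(0, len(requests), batch_size):
        batch = requests[start:start + batch_size]
        # a failed batch now loses at most batch_size requests
        service.spreadsheets().batchUpdate(
            spreadsheetId=self._id, body={"requests": batch}
        ).execute()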

Archive table

Add functionality to move done and closed issues into an archive table to avoid overwhelming the existing tables.

A to_be_archived() function should be added into fill_funcs.py. It should determine whether an issue is to be archived; the function should return a list (row) ready to be inserted into the archive sheet.

For this feature a new constant needs to be added into config.py. It must have the same structure as the other sheet configurations.
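
A sketch of such a hook in fill_funcs.py; the returned columns are an assumption, and a real setup would mirror the archive sheet configuration in config.py:

def to_be_archived(issue, row):
    """Return a row for the archive sheet, or None to keep the issue in place.

    issue is a github.Issue.Issue, row is the current table row (a dict).
    """
    if issue.state != "closed":
        return None
    # illustrative column set
    return [row.get("Issue"), row.get("Repository"), issue.title, str(issue.closed_at)]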

Old issues appear in table with New status

It's probably caused by using the updated_at time for the since filter. It would be better to use the time of the last table update instead: on the first update after start, Scraper analyzes only open issues, so the remembered date equals the update time of the last open issue. Any issue that was closed after that time will then be added into the table as if it were new.
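
A tiny sketch of the proposed fix: remember the moment a repo's table update started and pass that, rather than any issue's updated_at, as the since value on the next run (class and attribute names are assumptions):

import datetime

class UpdateTracker:
    """Remember when each repository's table was last rebuilt."""

    def __init__(self):
        self._last_table_update = {}

    def since_for(self, repo_full_name):
        """Value to pass as 'since' on the next run."""
        return self._last_table_update.get(repo_full_name, datetime.datetime(1, 1, 1))

    def mark_started(self, repo_full_name):
        """Call when the repo's update starts, so later changes aren't missed."""
        self._last_table_update[repo_full_name] = datetime.datetime.utcnow()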

Implement spreadsheet name with attributes

The spreadsheet name is currently used only in the config.py file, which is not very convenient. Some people may want to work with the names of their spreadsheets instead of ids. Thus, it would be good to implement a name attribute for the Spreadsheet() class, definitely with a setter which also changes the spreadsheet name on the service.
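
A sketch of such a setter, assuming the same service object the Scraper uses elsewhere; updateSpreadsheetProperties is the Sheets API request that renames a spreadsheet, while the class body here is only a stub:

class Spreadsheet:
    def __init__(self, id_, name):
        self._id = id_
        self._name = name

    @property
    def name(self):
        """Current spreadsheet title."""
        return self._name

    @name.setter
    def name(self, new_name):
        """Rename the spreadsheet on the Google Sheets service as well."""
        service.spreadsheets().batchUpdate(
            spreadsheetId=self._id,
            body={
                "requests": [{
                    "updateSpreadsheetProperties": {
                        "properties": {"title": new_name},
                        "fields": "title",
                    }
                }]
            },
        ).execute()
        self._name = new_name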

Get rid of [Issue, Repository] ids

Scraper uses the issue number and repository short name as row ids in internal calculations. This is not very convenient, as users have to keep Issue and Repository columns in their tables, which may not be wanted.

Add Sheet class

Sheet processing logic has become large enough to move it into a separate Sheet() class and make Spreadsheet() a little clearer and easier to read.

Statistics table

Functionality to build statistics tables should be implemented, showing how many issues were created/closed, how many PRs were merged, and which other actions were performed during given periods of time.
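
A small sketch of how such counts could be gathered from already-indexed issues and PRs; the monthly granularity is an assumption:

from collections import Counter

def monthly_stats(issues, pulls):
    """Count created/closed issues and merged PRs per calendar month."""
    stats = Counter()
    for issue in issues:
        stats[(issue.created_at.strftime("%Y-%m"), "created")] += 1
        if issue.closed_at:
            stats[(issue.closed_at.strftime("%Y-%m"), "closed")] += 1
    for pull in pulls:
        if pull.merged_at:
            stats[(pull.merged_at.strftime("%Y-%m"), "merged")] += 1
    return stats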
