
harvester's Introduction

Harvester

This is a script for collecting public search results for filename or file-extension queries on GitHub.

It's used by GitHub Linguist contributors to gauge real-world usage of languages and file extensions, especially when submitting new languages for registration on GitHub.

Due to weird access restrictions, the script must be run

  • from a browser context,
  • from a page hosted on github.com,
  • whilst signed-in with a registered GitHub account.

Attempting to load search results from an unauthorised or headless context will fail. See for yourself by opening this page while signed in, and compare it with what an incognito browser window sees.

Usage

  1. Copy harvester.js to your clipboard.

  2. Navigate to a GitHub-hosted page in your browser.
    Remember, the URL's domain must be github.com due to CORS restrictions.

  3. In your browser's console,

    1. Paste the contents of harvester.js
      This defines the commands you'll use in the next step. It may ask for permission to display desktop notifications, which are used to tell you when a harvest has finished.

    2. Run harvest(" … ") to begin a search.
      To search for entire filenames instead of extensions, prepend the query with filename::

      harvest("filename:.bashrc");

      Arguments are optional if Harvester's running from a search results page. E.g., if you have this page open in your browser, harvest(); is the same as

      harvest("extension:asy", "NOT SymbolType");

      For any other page, a query must be specified.

  4. Wait for it to finish.
    This may take a while, depending on how many results there are. You'll see a desktop notification when it finishes.

  5. Run copy(that) in the browser console.
    This copies the collected URLs to your clipboard.
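
Putting these steps together, a typical console session looks roughly like this (an illustrative sketch; the query is only an example, and the wait depends on how many results there are):

// After pasting the contents of harvester.js into the console…
harvest("filename:.bashrc");  // Start the search; a pending Promise is returned while pages are fetched
// …wait for the desktop notification, then:
copy(that);                   // Copy the newline-delimited URL list to the clipboard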

Bookmarklet

If you find yourself using this script often, consider adding bookmarklet.js to your browser's toolbar as a bookmarklet.

Note: This won't work on Firefox.

Ideally, this script would load the latest version of harvester.js and attach it to the page. This isn't possible due to CORS restrictions, so the entire script needs to be embedded as a single URL.

JavaScript interface

Running harvester.js adds three properties to the global context:

window.harvest(query[, searchHack])

The function used for starting a search. It takes two arguments:

  • query
    Your search query: either an extension or a filename.

    harvest("extension:foo"); // Extension
    harvest("filename:foo");  // Filename

    Because extensions are more frequently searched for than filenames, the "extension:" prefix is optional. Ergo, the first line above can be shortened to just this:

    harvest("foo"); // Extension

    This is the format used throughout the rest of this documentation.

  • searchHack
    An optional legitimate search query to include. The default is "NOT nothack" followed by random hex digits:

    "NOT nothack" + Math.random(1e6).toString(16).replace(/\./, "").toUpperCase();

    However, sometimes you want to narrow searches down to files which contain a certain substring. In those cases, you include the second parameter:

    // Match `*.foo` files which contain the word "bar".
    harvest("foo", "bar");

    Sidenote: The "nothack" above is necessitated by the requirement that advanced searches include specific search criteria. This makes site-wide searching of extensions impossible, so this hacky workaround is used instead.
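
To illustrate how the two arguments fit together, here's a rough sketch of the kind of search URL that presumably gets requested (an assumption about the mechanism, not code taken from harvester.js):

// Assumed composition of the final query string (illustrative only)
const query      = "extension:foo";
const searchHack = "NOT nothack" + Math.random().toString(16).slice(2).toUpperCase();
const url        = "https://github.com/search?q="
	+ encodeURIComponent(query + " " + searchHack)
	+ "&type=Code";
// e.g. https://github.com/search?q=extension%3Afoo%20NOT%20nothack…&type=Code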

window.silo

An Object where successful searches are cached, keyed by query. Its contents look like this:

window.silo = {
	"extension:foo": {
		length: 6528,
		
		"/user/repo/blob/6eb5537/path/file.foo": "https://raw.githubusercontent.com/…",
		// … 6527 more results
	},
};

The silo contains a helper method called reap which extracts, sorts, and joins a list of results as a string. It's called internally when accessing window.that to extract a sorted URL list.

The silo exists to provide some way of resuming an interrupted harvest, such as in the case of a lost connection. It isn't some persistent storage mechanism: navigating to another page causes its contents to be lost.
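
Conceptually, reap does something like the following (a simplified sketch of the behaviour described above, not the actual implementation):

// Collect the cached raw-file URLs for a query, sort them, and join them
// into a single newline-delimited string.
function reap(query){
	const entry = window.silo[query] || {};
	return Object.keys(entry)
		.filter(key => key !== "length") // Skip the result-count property
		.map(key => entry[key])          // Keep the raw.githubusercontent.com URLs
		.sort()
		.join("\n");
}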

window.that

Reference to the results of the last successful harvest. Meant for use with the console's copy command:

copy(that);

Which is essentially a shortcut for

copy(silo.reap("extension:foo"));

Downloading files

The list copied by running copy(that); is a plain-text list of URLs that can be passed to wget(1), curl(1), or a similar utility to download files en masse:

# Using `wget` (recommended)
wget -nv -i /path/to/url.list

# Using `curl`
sed -e "s/'/%27/g" /path/to/url.list | xargs -n1 curl -# -O

Alternatively, to preserve the directory hierarchy while downloading files, you can use the following command:

wget -nv -x -i /path/to/url.list

Helpful scripts

Some useful shell commands to help with reporting in-the-wild usage on GitHub:

Listing unique repositories

This reads from urls.log and writes the output to unique-repos.log.

grep < urls.log -iEoe '^https?://raw\.githubusercontent\.com/([^/]+/){2}' \
| sort | uniq | sed -Ee 's,raw\.(github)usercontent,\1,g' > unique-repos.log

Listing unique users

This reads from unique-repos.log and saves to unique-users.log:

grep < unique-repos.log -oE '^https://github\.com\/[^/]+' \
| sort | uniq > unique-users.log

Tallying results

This prints a summary of how many unique users and repositories were found in total.

wc -l unique-{repos,users}.log | grep -vE '\stotal$' | grep -oE '^\s*[0-9]+' |\
xargs printf '\
Unique repos: %s
Unique users: %s
'

The following utilities are also of interest:

  • gh-search
    Opens the URL of an extension/filename search using the system's default browser.

    gh-search -e foo; # Search by extension
    gh-search -f foo; # Search by filename
  • fixext
    Fixes the suffixes added by wget(1) when downloading files with the same name.

    $ fixext foo *
    Renamed: saved.foo.1 -> saved.1.foo

harvester's People

Contributors

alhadis, pchaigno


harvester's Issues

Single request sent

It looks like there's a new bug with the scraper. I tried harvest("extension:h");. I can see a single request for the first page of results, then nothing. No errors are thrown. I haven't found time to debug further yet.

RFC: Reimplementing Harvester as a CLI program

I intend to rewrite Harvester as a dedicated CLI tool running atop Node.js. This implementation will query GitHub's APIs to obtain search results, eliminating the need to copy and paste huge chunks of code into one's dev-console (which is inelegant, runs slowly, and is prone to breakage).

I've opened this issue to collect feedback, ideas, and discuss any issues users foresee with the requirement to use an authentication token (see “Caveats” below).

Features

  1. Concurrent operation:
    Multiple pages of search-results can be fetched simultaneously, reusing a single connection and pooling results either in memory or in a file (probably a CBOR file with a .silo extension).
  2. Downloading actual files:
    1. Faster harvests. Since we can download files concurrently, Harvester will be potentially faster than tools like wget or curl which operate linearly (that is, they wait for each URL to finish transferring before requesting the next).
    2. Duplicate file detection is easier. Especially if GitHub's APIs provide a checksum when querying file metadata.
    3. Smarter subdirectory structure: Files will be downloaded to
      ./user/repo/file.ext rather than file.ext.452 or ./raw.githubusercontent.com/user/repo/branch/full/path/to/file.ext.
  3. Storing harvested URLs in .silo files:
    1. Harvests can be resumed: Aside from resuming an interrupted harvest, users can update a previous harvest without re-downloading files that were already reaped in an earlier run (feature #2 can track the locations of downloaded files, providing a means of tracking which files have been evaluated).
    2. Easier tracking of usage over time: .silo files will contain timestamps of each run, which should make github/linguist#4219 easier to manage.
  4. Ability to filter files with permissive and unambiguous licenses:
    No more poking around to determine if a sample is copylefted or unlicensed!
  5. Easier wrangling of user/repo counts:
    Reports can be produced by running something like
    λ harvest -s foo
    Summary of `.foo` usage:
    * Users: 424
    * Repos: 653
    * Files: 1453 (2428 total)
    * Last updated: 2020-07-22T14:56:28.196Z (3 minutes ago)
    Or even in Markdown format (for pasting into issues/PRs):
    λ harvest -s -m foo
    ## `.foo` usage
    * Users: 424
    * Repos: 653
    * Files: 1453 ([2428 total](https://github.com/advanced/search/link))
    * Last updated: 2020-07-22T14:56:28.196Z (3 minutes ago)
    
    <details><summary><b>1,523 total files:</b></summary>
    
    ```
    file:///1
    file:///2
    file:///3/and/so/forth
    ```
    
    </details>
  • I finally get to write a man page: So I don't lose my shit at Markdown again for being the worst markup format for anything more elaborate than a readme.

Caveats

Things to consider:

  • Users will need to generate an authentication token before they can use GitHub's APIs. The token can be passed to the Harvester executable by setting an environment variable.
  • I actually haven't delved too far into GitHub's API docs. Ergo, there may be limitations to its site-wide file search that I'm unaware of (ones that our NOT nothack trick is designed to circumvent).
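
For a sense of what the CLI would do under the hood, here's a minimal Node.js sketch of an authenticated code-search request (the GITHUB_TOKEN name and the query handling are illustrative assumptions, not a committed design):

// Run as an ES module on Node.js 18+ (built-in fetch).
const token = process.env.GITHUB_TOKEN; // Illustrative variable name
const query = encodeURIComponent("extension:foo NOT nothack"); // The search-term restriction may still apply
const response = await fetch(`https://api.github.com/search/code?q=${query}`, {
	headers: {
		"Accept": "application/vnd.github+json",
		"Authorization": `token ${token}`,
	},
});
const {total_count, items} = await response.json();
console.log(`${total_count} results; first page returned ${items.length} items`);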

Motivation

Currently, Harvester is using ~4-year-old code I slapped together as a quick workaround for an unexpected change in GitHub's handling of search results (specifically, preventing anonymous users from viewing them). Since then, it's required manual maintenance and hotfixes each time GitHub updated their front-end code. The slop I wrote in 2016 was never intended to last as long as it did… I knew it'd need rewriting eventually, but as long as it worked and the fixes were trivial, I considered it low priority.

Recently, real-world issues reminded me I won't always be able to respond to issues or e-mails in a timely fashion… or even at all. Ergo, I'm reevaluating the priority of outstanding tasks like this one.

Footnotes

  •  This is actually the sane thing to do, when you consider race conditions with conflicting filenames, or URLs which influence subsequent server responses.
  •  Markdown's only feature is readability. Prove me wrong.

/cc @pchaigno, @lildude, @smola

Notification permission deprecation in Firefox

Just a heads up: I'm getting the following warning with Firefox 71.0 when I paste the script:

Requesting Notification permission outside a short running user-generated event handler is deprecated and will not be supported in the future.
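
For reference, the pattern Firefox expects is to request permission only from inside a short-running, user-generated event handler, such as a click. A rough sketch of that pattern follows (not a change that's been made to harvester.js; `button` is a placeholder element):

// Only ask for notification permission in direct response to a user gesture.
button.addEventListener("click", () => {
	Notification.requestPermission().then(permission => {
		if (permission === "granted")
			new Notification("Harvest finished");
	});
});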

Instructions to download files

The instructions to download the files in the README download the HTML pages, not the actual files. Not sure whether that was the intention.

I made the following command to download the raw files:

for f in $(cat urls.txt); do dir=$(echo $f | sed 's#https://github.com/##' | sed -r 's#/blob/\w+##' | sed 's#[^/]*\.[hH]$##'); mkdir -p $dir; wget -q --directory-prefix=$dir $(echo $f | sed 's#/blob/#/raw/#'); done

or with correct indentation:

for f in $(cat urls.txt); do
  dir=$(echo $f | sed 's#https://github.com/##' | sed -r 's#/blob/\w+##' | sed 's#[^/]*\.[hH]$##')
  mkdir -p $dir
  wget -q --directory-prefix=$dir $(echo $f | sed 's#/blob/#/raw/#')
done

SyntaxError: Unable to extract total number of results from header

I followed the instructions from the README and get this in the console:

SyntaxError
  columnNumber: 20
  fileName: "harvester.js"
  lineNumber: 222
  message: "Unable to extract total number of results from header"
  stack: "die@debugger eval code:222:20\nnext@debugger eval code:164:11\n"
  title: "Unexpected Markup Error"
  __proto__: Object { stack: "", … }
debugger eval code:80:4
Object { lastPageSnapshot: DocumentFragment […] }

Avatars downloaded during scraping

I wasn't expecting XHR to download images with HTML pages, but it looks like it does download avatars while scraping search results:
[screenshot]

Is this the expected behavior? It doesn't slow down the scraping anyway, but it does amount to a fair bit of content.

Invalid regexp group on Firefox

When I try to paste Harvester's code in the Firefox debug console, I get a SyntaxError: invalid regexp group error. It still works fine with Chromium. I've used Harvester with Firefox from the start, so the error was probably introduced recently. I'm going to investigate...

Hitting rate limit...

When trying to run Harvester I get a rate-limit error, even though I follow the steps on the page.

   Whoa there!
   You have triggered an abuse detection mechanism.
   Please wait a few minutes before you try again;
   in some cases this may take up to an hour.

   Contact Support — GitHub Status — @githubstatus

Is there any way around this, or are there other ways to judge the usage of certain file types before submitting to Linguist?

Harvester not scraping links?

Hello,

I am a bit new to JavaScript and this sort of stuff, so apologies if I missed something. I am using Chrome Version 77.0.3865.90 (Official Build) (64-bit). I have disabled all extensions to rule out anything strange from those.

I am trying to scrape this search query page while waiting for a linguist pull request to go through. I followed the instructions to copy-paste the JavaScript file into my console, and I get undefined.

I run harvest(); as per the README; it displays Promise {<pending>} in the console and starts searching pages. In the network tab, it goes through about 80 pages over about 6 minutes before sending me the notice to use copy(that);

I run that, try to paste what's in my clipboard, and it's blank.

Console shows this (immediately after I paste the script):

undefined
harvest();
Promise {<pending>}
copy(that);
undefined

I have also tried with harvest("mrc","mirc"); (on a new search page) to no avail. The network page URLs all seem to be valid.

Please let me know if I missed anything. Thanks :)
