Single request sent about harvester HOT 19 CLOSED

alhadis avatar alhadis commented on June 3, 2024
Single request sent

from harvester.

Comments (19)

pchaigno avatar pchaigno commented on June 3, 2024 1

Thanks @Alhadis!


Alhadis avatar Alhadis commented on June 3, 2024

It's breaking at line 130 with a TypeError, because it's assuming this pattern will match:

/<h3>\s*(?:We.ve\s+found\s+)?([,\d]+)\s+code\s+results/i

... since it's expecting this markup:

<h3>We've found 282,649,870 code results</h3>

To the surprise of nobody, GitHub have changed their markup (again):

  <div class="d-flex flex-justify-between border-bottom pb-3">
    <h3>
    <span class="d-inline-block v-align-middle">
      Showing 282,649,870 available
      code results
    </span>

This is where I want to scream obscenities at Bootstrap and Grid Systems for not doing the smart thing and identifying a landmark element with a unique ID, but naaaaaah...
/end diatribe

Basically, what I'm saying is that this approach is prone to breaking the next time they decide they want different wording in the heading, or a different HTML tag to enclose the text. There's no foolproof way of extracting the total number of results from the page, which the harvester uses to determine whether it's worth going through the search results again with a different Sort filter applied (skipping that keeps it from taking three times as long when there are fewer than 1000 results, etc).
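
To make the failure concrete, here is a minimal sketch (not the harvester's actual code) that runs the pattern quoted above against the old heading and against the relevant part of the new one. The names `oldPattern`, `oldMarkup`, `newMarkup` and `countResults` are made up for illustration; only the pattern and the markup come from this thread.

```javascript
// Pattern quoted above, written for the old "We've found N code results" heading
const oldPattern = /<h3>\s*(?:We.ve\s+found\s+)?([,\d]+)\s+code\s+results/i;

const oldMarkup = `<h3>We've found 282,649,870 code results</h3>`;
const newMarkup = `<h3>
    <span class="d-inline-block v-align-middle">
      Showing 282,649,870 available
      code results
    </span>`;

// Hypothetical helper: returns the count, or null when the heading isn't recognised
function countResults(html){
	const match = html.match(oldPattern);
	// String.prototype.match returns null on failure, and reading a capture
	// group off null throws a TypeError, so check before using it
	return match ? +match[1].replace(/\D/g, "") : null;
}

console.log(countResults(oldMarkup)); // 282649870
console.log(countResults(newMarkup)); // null
```

With the null check in place, a markup change shows up as a missing count rather than an uncaught exception, though the caller still has to decide what to do when the count is unknown.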

*sigh* I hate frameworks...


Alhadis avatar Alhadis commented on June 3, 2024

For the record, this is what I'd use:

<h1 id="page-title">Showing 282,649,870 available code results</h1>

... which would at least make targeting unique page elements possible, and make extracting a bunch of numerals a lot easier. All of which could still be avoided if GitHub gave us a better way to do what we're currently doing. 😅


pchaigno avatar pchaigno commented on June 3, 2024

Should we be bold and try to match it with the following?

/>[\s\w]*([,\d]+)\s+code\s+results/i


Alhadis avatar Alhadis commented on June 3, 2024

Actually, the correct solution is to parse the loaded page as a detached DOM tree, and query its elements with the correct web interfaces (document.querySelector, element.querySelector), etc. I confess the only reason I went the regex route was because I was in the midst of a Perl coding-sprint when I wrote the original code. 😅 That carried through to the JavaScript-based version, and I wrote it expecting it to become throwaway code.

If I'd known I'd be curating it formally like we are now, I would've done the smarter thing. ;) It'd still require updates whenever GitHub changed their markup, but they'd be as easy as changing a CSS selector (as opposed to carving through layers of messy expressions... which I have a perverse love of doing for some reason).
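
A rough sketch of the selector-based approach described here, under some assumptions: `extractResultCount` is a hypothetical helper, and the `"h3"` selector is only a placeholder for whatever selector matches GitHub's heading at the time. The helper accepts anything that has a `querySelector` method, so a small stand-in object is used below to let the sketch run outside a browser.

```javascript
// Hypothetical helper: `root` is a Document, an Element, or anything else
// that exposes querySelector(); `selector` is the only part that should
// need changing when the markup changes
function extractResultCount(root, selector){
	const heading = root.querySelector(selector);
	if(!heading) return null;
	// Take the first run of digits (with separators) found in the heading text
	const match = heading.textContent.match(/\d[\d,.]*/);
	return match ? +match[0].replace(/\D/g, "") : null;
}

// Stand-in for a parsed document, so the sketch runs without a browser
const fakeDoc = {
	querySelector: selector => selector === "h3"
		? {textContent: "\n      Showing 282,649,870 available\n      code results\n"}
		: null,
};

console.log(extractResultCount(fakeDoc, "h3"));       // 282649870
console.log(extractResultCount(fakeDoc, "#missing")); // null
```

In a browser, a fetched page could be turned into a detached document with `new DOMParser().parseFromString(html, "text/html")` and passed in as `root`.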

PS: I didn't sleep because I was watching the World Cup at 4am, so I'm rambling as usual. I hope Croatia roasts your country on a spit next week. :-) IDEMO HVRATSKA!! 🎆


pchaigno avatar pchaigno commented on June 3, 2024

> It'd still require updates whenever GitHub changed their markup, but they'd be as easy as changing a CSS selector

But I don't want to change it at all! 😁


Alhadis avatar Alhadis commented on June 3, 2024

Hah! Then you'd better hope GitHub's new overlords Microsoft will one day bless us with a proper file-harvesting facility. 😁 Ideally, this repository shouldn't exist and there'd be no need for a nothack at all... =(


Alhadis avatar Alhadis commented on June 3, 2024

Anyway, whenever updates are needed, it'll be as simple as right-clicking the correct element on the page, opening Inspect element and copying the node's CSS selector in the inspector tools (I believe both Firefox and Chrome have this feature, though I only have the latter available at the moment...)


Alhadis avatar Alhadis commented on June 3, 2024

@pchaigno Just curious, does GitHub display this page as 263,443,068, or do they preserve locale-specific stuff like thousands separators (meaning that figure would be formatted to you as "263.443.068" instead)?


pchaigno avatar pchaigno commented on June 3, 2024

I've got English locales 😬


pchaigno avatar pchaigno commented on June 3, 2024

Hm, looks like GitHub.com is English-only anyway. I've just checked with IE on another computer.


Alhadis avatar Alhadis commented on June 3, 2024

I've added both separators:

// This needs to point to the title that says "Showing 263,443,068 code results"
const h3 = $(".codesearch-results > .pl-2 h3");
const match = h3 && h3.textContent.match(/\b([0-9,.\s]+)\s.*?code\s+results/si);
if(match){
	// Strip separators ("," or ".") and whitespace before converting to a number
	resultCount = +(match[1].replace(/\D/g, ""));
}

... which is unlikely to have issues with "1.200" or "1,200", because GitHub probably aren't gonna format that as a floating-point value anyway. :D
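
As a small illustration of that point, the sketch below strips every non-digit character before converting. `parseCount` is a hypothetical name, not something from the harvester. It treats both separator styles the same way, and the last call shows the caveat mentioned above: a real decimal value would be misread.

```javascript
// Hypothetical helper: drop everything that isn't a digit, then convert
function parseCount(text){
	const digits = String(text).replace(/\D/g, "");
	// An empty string would convert to 0, so report NaN instead
	return digits ? +digits : NaN;
}

console.log(parseCount("263,443,068")); // 263443068
console.log(parseCount("263.443.068")); // 263443068
console.log(parseCount("1.200"));       // 1200
console.log(parseCount("1.5"));         // 15 (a decimal is not preserved)
```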


pchaigno avatar pchaigno commented on June 3, 2024

I've just tried your changes. The crawling went as planned but then it tried to get page 101 and failed with the Search-result list not found debug message :-/


Alhadis avatar Alhadis commented on June 3, 2024

Yup, I'm still testing them too. I should snapshot the HTML and leave it on the silo for easier inspection whenever this stuff happens...


Alhadis avatar Alhadis commented on June 3, 2024

Okay, it should be working. Please test and report anything strange or weird-looking.


pchaigno avatar pchaigno commented on June 3, 2024

Works for me!


Alhadis avatar Alhadis commented on June 3, 2024

Awesome, thanks for staying vigilant about it. ^^

Question: Would you find a progress meter helpful? (I could add a real overlaying widget to the page, or use a strictly console-based one that feels a bit closer to a proper command-line.)

It'd be nice to see feedback like

[ ".ext" 62 / 100: ▓▓▓▓▓▓▓▓▒░░░░░░░░░░░░ | 1 pass out of 3? 

(... or something a hell of a lot nicer than that, because that example sucks as, but I'm sure you get the gist of what I'm referring to.) 😁
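
One way the console-based variant could look, as a sketch only: `renderProgress` and all of its parameters are hypothetical, and the bar width of 20 characters is an arbitrary choice.

```javascript
// Hypothetical helper: build a one-line progress readout for the console
function renderProgress(label, page, totalPages, pass, totalPasses, width = 20){
	// Clamp to the 0..1 range; treat an unknown or zero page total as no progress
	const ratio  = totalPages > 0 ? Math.min(Math.max(page / totalPages, 0), 1) : 0;
	const filled = Math.round(ratio * width);
	const bar    = "▓".repeat(filled) + "░".repeat(width - filled);
	return `${label} ${page} / ${totalPages}: ${bar} | pass ${pass} of ${totalPasses}`;
}

console.log(renderProgress('".ext"', 62, 100, 1, 3));
```

Calling it after each page load and logging the result would give a running readout without adding anything to the page itself.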


pchaigno avatar pchaigno commented on June 3, 2024

I'm usually tracking progress through the network tab of Firefox's developer tools. The end of each queried URL has the page number. It's only annoying when there are fewer than a hundred pages and I need to do a search myself to find out how close to the end I am.
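
The same information can be read out of a URL programmatically. This is a sketch under assumptions: `pageFromUrl` is a hypothetical helper, the example URL is made up, and the query parameter name `p` is a guess that may need adjusting to whatever the real search URLs use.

```javascript
// Hypothetical helper: pull the page number out of a search URL's query string
function pageFromUrl(url, param = "p"){
	const value = new URL(url).searchParams.get(param);
	const page  = parseInt(value, 10);
	// A missing or non-numeric parameter parses to NaN; report that as null
	return Number.isNaN(page) ? null : page;
}

console.log(pageFromUrl("https://example.com/search?q=extension%3Agif&p=62")); // 62
console.log(pageFromUrl("https://example.com/search?q=extension%3Agif"));      // null
```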

I'd be much more interested in a way to leave this running at night (on a server with no GUI).


Alhadis avatar Alhadis commented on June 3, 2024

> I'd be much more interested in a way to leave this running at night (on a server with no GUI).

Originally, that was the plan. I wanted a command-line utility that would take as input a filename or extension, and leave me with a folder full of files when it finished running. Something like this:

$ harvest -e ".gif" -d /path/to/save/files

Unfortunately, that became impossible when GitHub decided to restrict code-search results to authenticated users. And that ruined everything.

I could use a headless browser like PhantomJS or a synthesised browser environment using the right Node modules, but that isn't the problem. The problem is not imperilling users who trigger GitHub's abuse detection systems. You've seen how they block access to pages that're loaded too quickly, yes? Well, that's just one of many dead giveaways that a user's account is under the control of an automated process. Headless browsers are even easier to detect, and I don't want to jeopardise users whose accounts get locked and blacklisted automatically because they triggered GitHub's abuse detection monitors.

Furthermore, having this run in the background would require login credentials to be manually entered each time. Even if the user's password were stored in a keychain, 2FA would still require a passcode to be entered, and the whole process would be even slower than simply running it from inside a browser window.

And, of course, if something breaks for any reason, it becomes much harder to investigate why. 😉
