It looks like there's a new bug with the scrapper. I tried h

Single request sent about harvester HOT 19 CLOSED

alhadis commented on June 3, 2024

Single request sent

from harvester.

Comments (19)

pchaigno commented on June 3, 2024

Thanks @Alhadis!

Alhadis commented on June 3, 2024

It's breaking at line 130 with a TypeError, because it's assuming this pattern will match:

/<h3>\s*(?:We.ve\s+found\s+)?([,\d]+)\s+code\s+results/i

... since it's expecting this markup:

<h3>We've found 282,649,870 code results</h3>

To the surprise of nobody, GitHub have changed their markup (again):

  <div class="d-flex flex-justify-between border-bottom pb-3">
    <h3>
    <span class="d-inline-block v-align-middle">
      Showing 282,649,870 available
      code results
    </span>

This is where I want to scream obscenities at Bootstrap and Grid Systems for not doing the smart thing and identifying a landmark element with a unique ID, but naaaaaah...
/end diatribe

Basically, what I'm saying is that this markup is prone to breaking the next time they decide to reword the heading or enclose the text in a different HTML tag. There's no foolproof way of extracting the total number of results from the page, which the harvester uses to decide whether it's worth going through the search results again with a different Sort filter applied (skipping those extra passes keeps it from taking three times as long when there are fewer than 1000 results, etc).
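To make the failure concrete, here is a minimal sketch that runs the pattern quoted above against both versions of the heading. The two markup strings are copied from the snippets in this comment; `oldPattern` and `countResults` are hypothetical names used only for illustration, not the harvester's actual code.

```javascript
// The pattern quoted above, written for the old heading.
const oldPattern = /<h3>\s*(?:We.ve\s+found\s+)?([,\d]+)\s+code\s+results/i;

const oldMarkup = `<h3>We've found 282,649,870 code results</h3>`;
const newMarkup = `<h3>
    <span class="d-inline-block v-align-middle">
      Showing 282,649,870 available
      code results
    </span>`;

// Hypothetical helper: returns the result count, or null when the pattern misses.
function countResults(html, pattern = oldPattern) {
	const match = html.match(pattern);
	// Without this guard, reading match[1] throws a TypeError when match is null.
	if (match === null) return null;
	return Number(match[1].replace(/\D/g, ""));
}

console.log(countResults(oldMarkup)); // 282649870
console.log(countResults(newMarkup)); // null
```

Returning `null` does not fix the scrape, but it turns the crash into a condition the caller can check before deciding whether extra passes are needed.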

*sigh* I hate frameworks...

Alhadis commented on June 3, 2024

For the record, this is what I'd use:

<h1 id="page-title">Showing 282,649,870 available code results</h1>

... which would at least make targeting unique page elements possible, and make extracting a bunch of numerals a lot easier. All of which could still be avoided if GitHub gave us a better way to do what we're currently doing. 😅

pchaigno commented on June 3, 2024

Should we be bold and try to match it with the following?

/>[\s\w]*([,\d]+)\s+code\s+results/i
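A quick check, offered as a sketch rather than a verdict: run against the markup quoted earlier in the thread, this pattern appears not to match, because the word "available" now sits between the number and "code results". Separately, `\w` also matches digits, so the greedy `[\s\w]*` can swallow the leading digits of the count. The `simplified` heading and the `adjusted` pattern below are hypothetical and only illustrate one possible tweak.

```javascript
const proposed = />[\s\w]*([,\d]+)\s+code\s+results/i;

const newMarkup = `<h3>
    <span class="d-inline-block v-align-middle">
      Showing 282,649,870 available
      code results
    </span>`;

// Hypothetical heading with nothing between the number and "code results".
const simplified = `<h3>Showing 282,649,870 code results</h3>`;

console.log(proposed.test(newMarkup));      // false
console.log(simplified.match(proposed)[1]); // ",649,870" (leading "282" was consumed)

// One possible adjustment (an assumption, not the project's pattern):
// skip non-digits lazily, capture from the first digit, and allow one
// optional word such as "available" before "code results".
const adjusted = />[^<\d]*?(\d[,\d]*)\s+(?:\w+\s+)?code\s+results/i;

console.log(newMarkup.match(adjusted)[1]);  // "282,649,870"
console.log(simplified.match(adjusted)[1]); // "282,649,870"
```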

Alhadis commented on June 3, 2024

Actually, the correct solution is to parse the loaded page as a detached DOM tree, and query its elements with the correct web interfaces (document.querySelector, element.querySelector), etc. I confess the only reason I went the regex route was because I was in the midst of a Perl coding-sprint when I wrote the original code. 😅 That carried through to the JavaScript-based version, and I wrote it expecting it to become throwaway code.

If I'd known I'd be curating it formally like we are now, I would've done the smarter thing. ;) It'd still require updates whenever GitHub changed their markup, but they'd be as easy as changing a CSS selector (as opposed to carving through layers of messy expressions... which I have a perverse love of doing for some reason).
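A rough sketch of that DOM-based approach, under stated assumptions: `readResultCount` and `parseDetached` are hypothetical names, and the selector passed in is whatever currently identifies the heading. `DOMParser` is a browser interface, so `parseDetached` is only meant to run inside a page.

```javascript
// Hypothetical helper: `root` is anything exposing querySelector(), such as
// a Document returned by DOMParser or an Element inside one.
function readResultCount(root, selector) {
	const node = root.querySelector(selector);
	if (!node) return null; // The selector no longer matches the page
	// Keep only the digits, so "282,649,870" becomes 282649870.
	const digits = node.textContent.replace(/\D/g, "");
	return digits ? Number(digits) : null;
}

// Browser-only: parse fetched HTML into a detached document.
function parseDetached(html) {
	return new DOMParser().parseFromString(html, "text/html");
}
```

In a browser this could be called as `readResultCount(parseDetached(html), "h3")`. When the markup changes, only the selector argument would need updating.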

PS: I didn't sleep because I was watching the World Cup at 4am, so I'm rambling as usual. I hope Croatia roasts your country on a spit next week. :-) IDEMO HVRATSKA!! 🎆

pchaigno commented on June 3, 2024

It'd still require updates whenever GitHub changed their markup, but they'd be as easy as changing a CSS selector

But I don't want to change it at all! 😁

Alhadis commented on June 3, 2024

Hah! Then you'd better hope GitHub's new overlords Microsoft will one day bless us with a proper file-harvesting facility. 😁 Ideally, this repository shouldn't exist and there'd be no need for a nothack at all... =(

Alhadis commented on June 3, 2024

Anyway, whenever updates are needed, it'll be as simple as right-clicking the correct element on the page, opening Inspect element and copying the node's CSS selector in the inspector tools (I believe both Firefox and Chrome have this feature, though I only have the latter available at the moment...)

Alhadis commented on June 3, 2024

@pchaigno Just curious, does GitHub display this page as 263,443,068, or do they preserve locale-specific stuff like thousands separators (meaning that figure would be formatted to you as "263.443.068" instead)?

pchaigno commented on June 3, 2024

I've got English locales 😬

pchaigno commented on June 3, 2024

Hm, looks like GitHub.com is English-only anyway. I've just checked with IE on another computer.

Alhadis commented on June 3, 2024

I've added both separators:

// This needs to point to the title that says "Showing 263,443,068 code results"
const h3 = $(".codesearch-results > .pl-2 h3");
if(h3 && h3.textContent.match(/\b([0-9,.\s]+)\s.*?code\s+results/si)){
	// Strip separators (commas, dots, spaces) before converting to a number
	resultCount = +(RegExp.$1.replace(/\D/g, ""));
}

... which is unlikely to have issues with "1.200" or "1,200", because GitHub probably aren't gonna format that as a floating-point value anyway. :D
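As an illustration of how that expression treats either separator, here is a small standalone sketch. `parseCount` is a hypothetical wrapper around the same pattern, taking plain text instead of a DOM node, and the sample headings are made up for the example.

```javascript
// Hypothetical wrapper around the pattern above; takes the heading's text.
function parseCount(text) {
	const match = text.match(/\b([0-9,.\s]+)\s.*?code\s+results/si);
	// Every non-digit (comma, dot, whitespace) is dropped before conversion.
	return match ? Number(match[1].replace(/\D/g, "")) : null;
}

console.log(parseCount("Showing 1,200 available code results")); // 1200
console.log(parseCount("Showing 1.200 available code results")); // 1200
console.log(parseCount("Showing 263,443,068 code results"));     // 263443068
console.log(parseCount("Nothing to see here"));                  // null
```

The trade-off mentioned above shows up if a genuine decimal ever appears: `parseCount("1.5 code results")` would return 15, since the dot is treated as a separator.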

pchaigno commented on June 3, 2024

I've just tried your changes. The crawling went as planned, but then it tried to get page 101 and failed with the "Search-result list not found" debug message :-/

Alhadis commented on June 3, 2024

Yup, I'm still testing them too. I should snapshot the HTML and leave it on the silo for easier inspection whenever this stuff happens...

Alhadis commented on June 3, 2024

Okay, it should be working. Please test and report anything strange or weird-looking.

pchaigno commented on June 3, 2024

Works for me!

Alhadis commented on June 3, 2024

Awesome, thanks for staying vigilant about it. ^^

Question: Would you find a progress meter helpful? (I could add a real overlaying widget to the page, or use a strictly console-based one that feels a bit closer to a proper command line.)

It'd be nice to see feedback like

[ ".ext" 62 / 100: ▓▓▓▓▓▓▓▓▒░░░░░░░░░░░░ | 1 pass out of 3?

(... or something a hell of a lot nicer than that, because that example sucks as, but I'm sure you get the gist of what I'm referring to.) 😁
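For what it's worth, a console-based meter along those lines could be sketched roughly as follows. `renderProgress` and its parameters are hypothetical, and the exact layout is only a guess at the format shown above.

```javascript
// Hypothetical formatter for a console progress line.
// label: extension or filename being harvested; done/total: pages fetched;
// pass/passes: current sort pass; width: bar length in characters.
function renderProgress(label, done, total, pass, passes, width = 20) {
	const ratio = total > 0 ? done / total : 0;
	// Clamp so the bar never overflows or goes negative.
	const filled = Math.max(0, Math.min(width, Math.round(ratio * width)));
	const bar = "▓".repeat(filled) + "░".repeat(width - filled);
	return `[ "${label}" ${done} / ${total}: ${bar} | pass ${pass} of ${passes} ]`;
}

console.log(renderProgress(".ext", 62, 100, 1, 3));
```

Printing one such line per fetched page would keep the output readable in the browser console without adding any widget to the page.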

pchaigno commented on June 3, 2024

I'm usually tracking progress through the network tab of Firefox's developer tools. The end of each queried URL has the page number. It's only annoying when there are fewer than a hundred pages and I need to do a search myself to find out how close to the end I am.

I'd be much more interested in a way to leave this running at night (on a server with no GUI).

Alhadis commented on June 3, 2024

I'd be much more interested in a way to leave this running at night (on a server with no GUI).

Originally, that was the plan. I wanted a command-line utility that would take as input a filename or extension, and leave me with a folder full of files when it finished running. Something like this:

$ harvest -e ".gif" -d /path/to/save/files

Unfortunately, that became impossible when GitHub decided to restrict code-search results to authenticated users. And that ruined everything.

I could use a headless browser like PhantomJS or a synthesised browser environment using the right Node modules, but that isn't the problem. The problem is not imperilling users who trigger GitHub's abuse detection systems. You've seen how they block access to pages that're loaded too quickly, yes? Well, that's just one of many dead giveaways that a user's account is under the control of an automated process. Headless browsers are even easier to detect, and I don't want to jeopardise users whose accounts get locked and blacklisted automatically because they triggered GitHub's abuse detection monitors.

Furthermore, having this run in the background would require login credentials to be manually entered each time. Even if the user's password were stored in a keychain, 2FA would still require a passcode to be entered, and the whole process would be even slower than simply running it from inside a browser window.

And, of course, if something breaks for any reason, it becomes much harder to investigate why. 😉
