Comments (19)
Thanks @Alhadis!
from harvester.
It's breaking at line 130 with a TypeError, because it's assuming this pattern will match:
/<h3>\s*(?:We.ve\s+found\s+)?([,\d]+)\s+code\s+results/i
... since it's expecting this markup:
<h3>We've found 282,649,870 code results</h3>
To the surprise of nobody, GitHub have changed their markup (again):
<div class="d-flex flex-justify-between border-bottom pb-3">
<h3>
<span class="d-inline-block v-align-middle">
Showing 282,649,870 available
code results
</span>
This is where I want to scream obscenities at Bootstrap and Grid Systems for not doing the smart thing and identifying a landmark element with a unique ID, but naaaaaah...
/end diatribe
Basically, what I'm saying is that this markup is prone to break the next time they decide they want different wording in the heading, or a different HTML tag to enclose the text. There's no foolproof way of extracting the total number of results from the page, which the harvester uses to determine whether it's worth going through the search results again with a different Sort filter applied (which keeps it from taking three times as long if there are fewer than 1000 results, etc).
*sigh* I hate frameworks...
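To make the role of that result count concrete, here is a minimal sketch of the decision it feeds. The 1000-result threshold and the "three times as long" figure come from the comment above; the function name and the sort labels are hypothetical placeholders, not the harvester's actual code.

```javascript
// Hypothetical sketch: decide which passes a harvest needs.
// Threshold taken from the thread: below it, extra sort orders add nothing.
const RESULT_THRESHOLD = 1000;

// Placeholder labels; the real sort-filter names are not given in this thread.
const SORT_ORDERS = ["default", "sort-a", "sort-b"];

function planPasses(resultCount) {
	if (!Number.isFinite(resultCount) || resultCount < 0) {
		throw new TypeError("Unable to determine the total number of results");
	}
	// Few results: a single pass with the default sort order is enough.
	return resultCount < RESULT_THRESHOLD
		? SORT_ORDERS.slice(0, 1)
		: SORT_ORDERS.slice();
}

console.log(planPasses(640).length);
console.log(planPasses(282649870).length);
```

Throwing when the count is missing mirrors the failure in this issue: if the heading can't be read, the sketch refuses to guess how many passes to run.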
from harvester.
For the record, this is what I'd use:
<h1 id="page-title">Showing 282,649,870 available code results</h1>
... which would at least make targeting unique page elements possible, and make extracting a bunch of numerals a lot easier. All of which could still be avoided if GitHub gave us a better way to do what we're currently doing.
from harvester.
Should we be bold and try to match it with the following?
/>[\s\w]*([,\d]+)\s+code\s+results/i
from harvester.
Actually, the correct solution is to parse the loaded page as a detached DOM tree, and query its elements with the correct web interfaces (document.querySelector, element.querySelector, etc). I confess the only reason I went the regex route was because I was in the midst of a Perl coding-sprint when I wrote the original code.
If I'd known I'd be curating it formally like we are now, I would've done the smarter thing. ;) It'd still require updates whenever GitHub changed their markup, but they'd be as easy as changing a CSS selector (as opposed to carving through layers of messy expressions... which I have a perverse love of doing for some reason).
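A rough sketch of what that selector-based approach might look like. It assumes the page's HTML is already available as a string; the helper names and the fallback selector list are illustrative only, and the selectors would need updating whenever the markup changes.

```javascript
// Sketch only: find the results heading with CSS selectors instead of a regex.
// These selectors are examples based on markup quoted in this thread.
const HEADING_SELECTORS = [
	".codesearch-results h3",
	"h3",
];

// Parse an HTML string into a detached document (browser-only: uses DOMParser).
function parseDetached(html) {
	return new DOMParser().parseFromString(html, "text/html");
}

// Return the whitespace-collapsed text of the first element matched by any
// selector, or null. `root` may be a Document or an Element; both implement
// querySelector().
function findHeadingText(root, selectors = HEADING_SELECTORS) {
	for (const selector of selectors) {
		const node = root.querySelector(selector);
		if (node) return node.textContent.replace(/\s+/g, " ").trim();
	}
	return null;
}

// In a browser:
//   const doc = parseDetached(await (await fetch(url)).text());
//   const heading = findHeadingText(doc);
```

Keeping the selectors in one array means a markup change is a one-line fix, which is the whole point of going this route.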
PS: I didn't sleep because I was watching the World Cup at 4am, so I'm rambling as usual. I hope Croatia roasts your country on a spit next week. :-) IDEMO HRVATSKA!!
from harvester.
It'd still require updates whenever GitHub changed their markup, but they'd be as easy as changing a CSS selector
But I don't want to change it at all!
from harvester.
Hah! Then you'd better hope GitHub's new overlords Microsoft will one day bless us with a proper file-harvesting facility. nothack
at all... =(
from harvester.
Anyway, whenever updates are needed, it'll be as simple as right-clicking the correct element on the page, opening Inspect element and copying the node's CSS selector in the inspector tools (I believe both Firefox and Chrome have this feature, though I only have the latter available at the moment...)
from harvester.
@pchaigno Just curious, does GitHub display this page as 263,443,068, or do they preserve locale-specific stuff like thousands separators (meaning that figure would be formatted to you as "263.443.068" instead)?
from harvester.
I've got English locales
from harvester.
Hm, looks like GitHub.com is English-only anyway. I've just checked with IE on another computer.
from harvester.
I've added both separators:
// This needs to point to the title that says "Showing 263,443,068 code results"
const h3 = $(".codesearch-results > .pl-2 h3");
if(h3 && h3.textContent.match(/\b([0-9,.\s]+)\s.*?code\s+results/si)){
	// Strip every non-digit (commas, dots, spaces) before converting to a number
	resultCount = +(RegExp.$1.replace(/\D/g, ""));
}
... which is unlikely to have issues with "1.200" or "1,200", because GitHub probably aren't gonna format that as a floating-point value anyway. :D
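For illustration, a standalone sketch of the same idea: capture the run of digits and separators in front of "code results", then strip everything that isn't a digit. The function name is hypothetical and the pattern is a simplified stand-in, not the one committed to the harvester. It assumes the count is always a whole number.

```javascript
// Sketch: pull the result count out of the heading text, tolerating ",", "."
// or whitespace as thousands separators. Assumes the count is an integer.
function parseResultCount(text) {
	// Digits and separators, then optionally a few words, then "code results".
	const match = /(\d[\d,.\s]*)\s+(?:\S+\s+)*?code\s+results/i.exec(text);
	if (!match) return null;
	return Number(match[1].replace(/\D/g, ""));
}

console.log(parseResultCount("Showing 282,649,870 available\ncode results"));
console.log(parseResultCount("We've found 1.200 code results"));
console.log(parseResultCount("Nothing to see here"));
```

Returning null rather than throwing leaves it to the caller to decide whether a missing count is fatal.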
from harvester.
I've just tried your changes. The crawling went as planned, but then it tried to get page 101 and failed with the "Search-result list not found" debug message :-/
from harvester.
Yup, I'm still testing them too. I should snapshot the HTML and leave it on the silo for easier inspection whenever this stuff happens...
from harvester.
Okay, it should be working. Please test and report anything strange or weird-looking.
from harvester.
Works for me!
from harvester.
Awesome, thanks for staying vigilant about it. ^^
Question: Would you find a progress meter helpful? (I could add a real overlay widget to the page, or use a strictly console-based one that feels a bit closer to a proper command line.)
It'd be nice to see feedback like
[ ".ext" 62 / 100: ▓▓▓▓▓▓▓▓▒░░░░░░░░░░░░ | 1 pass out of 3 ]
(... or something a hell of a lot nicer than that, because that example sucks, but I'm sure you get the gist of what I'm referring to.)
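A console-only meter along those lines could be sketched roughly as follows. Every name here is made up for the example, and the bar width and glyphs are arbitrary choices.

```javascript
// Sketch of a console-only progress line. All parameter names are illustrative.
function renderProgress({ label, page, totalPages, pass, totalPasses, width = 20 }) {
	// Clamp the completed fraction to the range 0..1.
	const ratio = totalPages > 0 ? Math.min(Math.max(page / totalPages, 0), 1) : 0;
	const filled = Math.round(ratio * width);
	const bar = "▓".repeat(filled) + "░".repeat(width - filled);
	return `[ ${JSON.stringify(label)} ${page} / ${totalPages}: ${bar} | pass ${pass} of ${totalPasses} ]`;
}

console.log(renderProgress({ label: ".ext", page: 62, totalPages: 100, pass: 1, totalPasses: 3 }));
```

Because the function only builds a string, the same output could feed console.log() or an on-page widget.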
from harvester.
I'm usually tracking progress through the network tab of Firefox's developer tools. The end of each queried URL has the page number. It's only annoying when there are fewer than a hundred pages and I need to do a search myself to find out how close to the end I am.
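For what it's worth, reading that page number programmatically is short. This sketch assumes the page is carried in a query parameter named "p"; the parameter name, the example URL, and the function name are assumptions rather than anything confirmed in this thread.

```javascript
// Sketch: read the page number from a search URL.
// Assumption: the page number lives in a query parameter named "p".
function pageFromUrl(url, param = "p") {
	const value = new URL(url).searchParams.get(param);
	const page = Number.parseInt(value ?? "", 10);
	// Reject missing, non-numeric, or non-positive values.
	return Number.isInteger(page) && page > 0 ? page : null;
}

console.log(pageFromUrl("https://github.com/search?q=extension%3Agif&type=Code&p=62"));
```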
I'd be much more interested in a way to leave this running at night (on a server with no GUI).
from harvester.
I'd be much more interested in a way to leave this running at night (on a server with no GUI).
Originally, that was the plan. I wanted a command-line utility that would take as input a filename or extension, and leave me with a folder full of files when it finished running. Something like this:
$ harvest -e ".gif" -d /path/to/save/files
Unfortunately, that became impossible when GitHub decided to restrict code-search results to authenticated users. And that ruined everything.
I could use a headless browser like PhantomJS or a synthesised browser environment using the right Node modules, but that isn't the problem. The problem is not imperilling users who trigger GitHub's abuse detection systems. You've seen how they block access to pages that're loaded too quickly, yes? Well, that's just one of many dead give-aways that a user's account is under the control of an automated process. Headless browsers are even easier to detect, and I don't want to jeopardise users whose accounts get locked and blacklisted automatically because they triggered GitHub's abuse detection monitors.
Furthermore, having this run in the background would require login credentials to be manually entered each time. Even if the user's password were stored in a keychain, 2FA would still require a passcode to be entered, and the whole process would be even slower than simply running it from inside a browser window.
And, of course, if something breaks for any reason, it becomes much harder to investigate why.
from harvester.