Not crawling all links? (spider issue, closed)

buckyroberts commented on May 20, 2024
Not crawling all links?

from spider.

Comments (11)

lwgray commented on May 20, 2024

The purpose of this tool is only to gather links.

[unknown user] commented on May 20, 2024

I understand, but when I crawl thenewboston.com, it gathers ~20-30 links, and Chrome Dev Tools confirms that the page contains links under that domain. The crawler works as expected there: it gathers all the links and writes them to the crawled.txt file. However, when I crawl twitter.com, it doesn't get ANY links. The crawled.txt file contains only http://www.twitter.com and nothing else.

lwgray commented on May 20, 2024

Maybe we can solve this together. I will take a look at it and respond if I find something. buckyroberts might also have a better idea of how to address it.

lwgray commented on May 20, 2024

I submitted a pull request. I was able to get it to work... Let's just hope buckyroberts accepts it.

[unknown user] commented on May 20, 2024

That fixes the issue for Twitter, but now when I crawl a site like https://github.com or https://youtube.com, it has the same problem. I removed the content checker altogether and that seemed to fix the problem, but it could leave issues for other sites down the road.

lwgray commented on May 20, 2024

I doubt there are that many variations, so maybe we could add them to a list and loop through them.
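A minimal sketch of the list-and-loop idea, assuming the content checker compares the response's Content-Type header against an exact string and that the failing sites send a variation such as a charset suffix. The names `HTML_CONTENT_TYPES` and `is_html` are hypothetical and not taken from spider.py, and the list entries are illustrative rather than observed values:

```python
# Hypothetical list of Content-Type values the crawler would treat as HTML.
# The entries are illustrative; real sites may send other variations.
HTML_CONTENT_TYPES = [
    "text/html",
    "text/html; charset=utf-8",
    "text/html;charset=utf-8",
]


def is_html(content_type):
    """Return True if the header value matches one of the known variations."""
    if content_type is None:
        return False
    normalized = content_type.strip().lower()
    # Loop through the known variations until one matches.
    for variation in HTML_CONTENT_TYPES:
        if normalized == variation:
            return True
    return False
```

An alternative that avoids maintaining the list would be to compare only the part of the header before the first `;`, since the charset is a parameter that follows the media type.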

buckyroberts commented on May 20, 2024

Agreed. A lot of sites use different techniques for determining what is a "bot", what is allowed to make a request to their server, how often requests are allowed to be made, etc.

So I am sure we will often come across sites that pose different types of problems, but I like that idea lwgray. Before crawling, we can loop through until we find a spider that works with that specific site. Once we find a variation that is compatible, we will use that.

What we could also do is develop a generic Spider class (like the original one). Then, anytime a specific site has a problem (like Twitter), we could just inherit from the Spider class and override whatever methods we need to make it compatible with that site. That way we won't clutter up spider.py with a bunch of code that attempts to fix all the issues for every site.
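The "loop through until we find a spider that works" idea could be sketched roughly as follows. This is only an illustration under assumptions: the `accepts` method, the `CharsetTolerantSpider` class, and `pick_spider` are hypothetical names, and compatibility is reduced here to a Content-Type check for the sake of a self-contained example.

```python
class Spider:
    """Generic spider: accepts only an exact 'text/html' Content-Type."""

    def accepts(self, content_type):
        return content_type == "text/html"


class CharsetTolerantSpider(Spider):
    """Hypothetical variation that ignores parameters such as charset."""

    def accepts(self, content_type):
        media_type = content_type.split(";", 1)[0].strip().lower()
        return media_type == "text/html"


# Candidate spiders, tried in order from most generic to most specific.
SPIDER_VARIATIONS = [Spider, CharsetTolerantSpider]


def pick_spider(content_type):
    """Return the first spider compatible with the response, or None."""
    for spider_class in SPIDER_VARIATIONS:
        spider = spider_class()
        if spider.accepts(content_type):
            return spider
    return None
```

Once a compatible variation is found for a site, the same instance would be reused for the rest of that crawl.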

Also, thanks for the bug fixes guys!

lwgray commented on May 20, 2024

that does sound better

[unknown user] commented on May 20, 2024

I like the idea of a generic class for a spider. A temporary fix could be a blacklist: instead of the spider checking whether a page's Content-Type is text/html, it could check that the content isn't a PDF, an .exe file, etc. But then again, that brings back the problem where we don't want to maintain a long list of items.
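The blacklist variant might look something like the sketch below. The names `BLOCKED_MEDIA_TYPES`, `BLOCKED_PREFIXES`, and `should_parse` are hypothetical, and the listed media types are only examples of non-HTML content, not a complete set:

```python
# Hypothetical blacklist of media types that are not worth parsing for links.
BLOCKED_MEDIA_TYPES = {
    "application/pdf",
    "application/zip",
    "application/octet-stream",  # generic binary, often used for downloads
    "application/x-msdownload",  # commonly used for .exe files
}
# Whole families of media types that never contain HTML links.
BLOCKED_PREFIXES = ("image/", "audio/", "video/")


def should_parse(content_type):
    """Return False only for responses that are clearly not HTML."""
    if not content_type:
        # Assumption: with no header, try to parse the page anyway.
        return True
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type in BLOCKED_MEDIA_TYPES:
        return False
    return not media_type.startswith(BLOCKED_PREFIXES)
```

The trade-off is the one raised above: a blacklist lets unexpected HTML variations through, but it has to grow whenever a new kind of binary content shows up.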

lwgray commented on May 20, 2024

I am not quite sure how to structure the generic spider... If you explain, I don't mind implementing it.

Cheers,
Larry

buckyroberts commented on May 20, 2024

Basically like it is right now. Then later on, when we find a problem crawling some site (Facebook, for example), we will just make a new class called FacebookSpider that inherits from Spider and change whatever functionality is needed to make it work with Facebook. That way we can solve specific problems for specific sites, without cluttering up the Spider class.
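As a rough sketch of that structure: the thread does not say what a FacebookSpider would actually need to change, so the overridden method here (`request_headers`) and the header values are purely illustrative assumptions, as is the simplified `Spider` base class.

```python
class Spider:
    """Simplified stand-in for the generic spider (not the real spider.py)."""

    def __init__(self, base_url):
        self.base_url = base_url

    def request_headers(self):
        # Hypothetical hook: headers sent with every request.
        return {"User-Agent": "spider"}

    def is_html(self, content_type):
        # Default check, shared by every subclass unless overridden.
        return content_type == "text/html"


class FacebookSpider(Spider):
    """Site-specific spider: overrides only what that site needs."""

    def request_headers(self):
        # Start from the generic headers and adjust one value.
        headers = super().request_headers()
        headers["User-Agent"] = "Mozilla/5.0 (compatible; spider)"
        return headers


spider = FacebookSpider("https://www.facebook.com")
```

Everything not overridden, such as `is_html` in this sketch, is inherited unchanged, so site-specific fixes stay out of the generic class.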
