
simplecrawler's Introduction

Simple web crawler for node.js [UNMAINTAINED]

This project is unmaintained and active projects relying on it are advised to migrate to alternative solutions.


simplecrawler is designed to provide a basic, flexible and robust API for crawling websites. It was written to archive, analyse, and search some very large websites and has happily chewed through hundreds of thousands of pages and written tens of gigabytes to disk without issue.

What does simplecrawler do?

  • Provides a very simple event driven API using EventEmitter
  • Extremely configurable base for writing your own crawler
  • Provides some simple logic for auto-detecting linked resources - which you can replace or augment
  • Automatically respects any robots.txt rules
  • Has a flexible queue system which can be frozen to disk and defrosted
  • Provides basic statistics on network performance
  • Uses buffers for fetching and managing data, preserving binary data (except when discovering links)

Documentation

Installation

npm install --save simplecrawler

Getting Started

Initializing simplecrawler is a simple process. First, you require the module and instantiate it with a single argument. You then configure the properties you like (eg. the request interval), register a few event listeners, and call the start method. Let's walk through the process!

After requiring the crawler, we create a new instance of it. We supply the constructor with a URL that indicates which domain to crawl and which resource to fetch first.

var Crawler = require("simplecrawler");

var crawler = new Crawler("http://www.example.com/");

You can initialize the crawler with or without the new operator. Being able to skip it comes in handy when you want to chain API calls.

var crawler = Crawler("http://www.example.com/")
    .on("fetchcomplete", function () {
        console.log("Fetched a resource!")
    });

By default, the crawler will only fetch resources on the same domain as that in the URL passed to the constructor. But this can be changed through the crawler.domainWhitelist property.
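For example, to let the crawler fetch from additional hostnames (the hostnames below are only placeholders), add them to the whitelist:

crawler.domainWhitelist = ["cdn.example.com", "assets.example.com"];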

Now, let's configure some more things before we start crawling. You probably want to make sure you don't take down your web server, so decrease the concurrency from the default five simultaneous requests and increase the request interval from the default 250 ms like this:

crawler.interval = 10000; // Ten seconds
crawler.maxConcurrency = 3;

You can also define a max depth for links to fetch:

crawler.maxDepth = 1; // Only first page is fetched (with linked CSS & images)
// Or:
crawler.maxDepth = 2; // First page and discovered links from it are fetched
// Or:
crawler.maxDepth = 3; // Etc.

For a full list of configurable properties, see the configuration section.

You'll also need to set up event listeners for the events you want to act on. The fetchcomplete and complete events are good places to start.

crawler.on("fetchcomplete", function(queueItem, responseBuffer, response) {
    console.log("I just received %s (%d bytes)", queueItem.url, responseBuffer.length);
    console.log("It was a resource of type %s", response.headers['content-type']);
});

Then, when you're satisfied and ready to go, start the crawler! It'll run through its queue finding linked resources on the domain to download, until it can't find any more.

crawler.start();

Events

simplecrawler's API is event driven, and there are plenty of events emitted during the different stages of the crawl.

"crawlstart"

Fired when the crawl starts. This event gives you the opportunity to adjust the crawler's configuration, since the crawl won't actually start until the next process tick.
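As a minimal sketch, you could use this event to log the configuration the crawl will run with:

crawler.on("crawlstart", function() {
    console.log("Starting crawl of %s with a %d ms interval", crawler.initialURL, crawler.interval);
});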

"discoverycomplete" (queueItem, resources)

Fired when the discovery of linked resources has completed

Param Type Description
queueItem QueueItem The queue item that represents the document for the discovered resources
resources Array An array of discovered and cleaned URLs

"invaliddomain" (queueItem)

Fired when a resource wasn't queued because of an invalid domain name

Param Type Description
queueItem QueueItem The queue item representing the disallowed URL

"fetchdisallowed" (queueItem)

Fired when a resource wasn't queued because it was disallowed by the site's robots.txt rules

Param Type Description
queueItem QueueItem The queue item representing the disallowed URL

"fetchconditionerror" (queueItem, error)

Fired when a fetch condition returns an error

Param Type Description
queueItem QueueItem The queue item that was processed when the error was encountered
error * The error returned by the fetch condition

"fetchprevented" (queueItem, fetchCondition)

Fired when a fetch condition prevented the queueing of a URL

Param Type Description
queueItem QueueItem The queue item that didn't pass the fetch conditions
fetchCondition function The first fetch condition that returned false

"queueduplicate" (queueItem)

Fired when a new queue item was rejected because another queue item with the same URL was already in the queue

Param Type Description
queueItem QueueItem The queue item that was rejected

"queueerror" (error, queueItem)

Fired when an error was encountered while updating a queue item

Param Type Description
error QueueItem The error that was returned by the queue
queueItem QueueItem The queue item that the crawler tried to update when it encountered the error

"queueadd" (queueItem, referrer)

Fired when an item was added to the crawler's queue

Param Type Description
queueItem QueueItem The queue item that was added to the queue
referrer QueueItem The queue item representing the resource where the new queue item was found

"fetchtimeout" (queueItem, timeout)

Fired when a request times out

Param Type Description
queueItem QueueItem The queue item for which the request timed out
timeout Number The delay in milliseconds after which the request timed out

"fetchclienterror" (queueItem, error)

Fired when a request encounters an unknown error

Param Type Description
queueItem QueueItem The queue item for which the request has errored
error Object The error supplied to the error event on the request

"fetchstart" (queueItem, requestOptions)

Fired just after a request has been initiated

Param Type Description
queueItem QueueItem The queue item for which the request has been initiated
requestOptions Object The options generated for the HTTP request

"cookieerror" (queueItem, error, cookie)

Fired when an error was encountered while trying to add a cookie to the cookie jar

Param Type Description
queueItem QueueItem The queue item representing the resource that returned the cookie
error Error The error that was encountered
cookie String The Set-Cookie header value that was returned from the request

"fetchheaders" (queueItem, response)

Fired when the headers for a request have been received

Param Type Description
queueItem QueueItem The queue item for which the headers have been received
response http.IncomingMessage The http.IncomingMessage for the request's response

"downloadconditionerror" (queueItem, error)

Fired when a download condition returns an error

Param Type Description
queueItem QueueItem The queue item that was processed when the error was encountered
error * The error returned by the download condition

"downloadprevented" (queueItem, response)

Fired when the downloading of a resource was prevented by a download condition

Param Type Description
queueItem QueueItem The queue item representing the resource that was halfway fetched
response http.IncomingMessage The http.IncomingMessage for the request's response

"notmodified" (queueItem, response, cacheObject)

Fired when the crawler's cache was enabled and the server responded with a 304 Not Modified status for the request

Param Type Description
queueItem QueueItem The queue item for which the request returned a 304 status
response http.IncomingMessage The http.IncomingMessage for the request's response
cacheObject CacheObject The CacheObject returned from the cache backend

"fetchredirect" (queueItem, redirectQueueItem, response)

Fired when the server returned a redirect HTTP status for the request

Param Type Description
queueItem QueueItem The queue item for which the request was redirected
redirectQueueItem QueueItem The queue item for the redirect target resource
response http.IncomingMessage The http.IncomingMessage for the request's response

"fetch404" (queueItem, response)

Fired when the server returned a 404 Not Found status for the request

Param Type Description
queueItem QueueItem The queue item for which the request returned a 404 status
response http.IncomingMessage The http.IncomingMessage for the request's response

"fetch410" (queueItem, response)

Fired when the server returned a 410 Gone status for the request

Param Type Description
queueItem QueueItem The queue item for which the request returned a 410 status
response http.IncomingMessage The http.IncomingMessage for the request's response

"fetcherror" (queueItem, response)

Fired when the server returned a status code above 400 that isn't 404 or 410

Param Type Description
queueItem QueueItem The queue item for which the request failed
response http.IncomingMessage The http.IncomingMessage for the request's response

"fetchcomplete" (queueItem, responseBody, response)

Fired when the request has completed

Param Type Description
queueItem QueueItem The queue item for which the request has completed
responseBody String | Buffer If decodeResponses is true, this will be the decoded HTTP response. Otherwise it will be the raw response buffer.
response http.IncomingMessage The http.IncomingMessage for the request's response

"gziperror" (queueItem, responseBody, response)

Fired when an error was encountered while unzipping the response data

Param Type Description
queueItem QueueItem The queue item for which the unzipping failed
responseBody String | Buffer If decodeResponses is true, this will be the decoded HTTP response. Otherwise it will be the raw response buffer.
response http.IncomingMessage The http.IncomingMessage for the request's response

"fetchdataerror" (queueItem, response)

Fired when a resource couldn't be downloaded because it exceeded the maximum allowed size

Param Type Description
queueItem QueueItem The queue item for which the request failed
response http.IncomingMessage The http.IncomingMessage for the request's response

"robotstxterror" (error)

Fired when an error was encountered while retrieving a robots.txt file

Param Type Description
error Error The error returned from getRobotsTxt

"complete"

Fired when the crawl has completed - all resources in the queue have been dealt with

A note about HTTP error conditions

By default, simplecrawler does not download the response body when it encounters an HTTP error status in the response. If you need this information, you can listen to simplecrawler's error events, and through node's native data event (response.on("data",function(chunk) {...})) you can save the information yourself.
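As a sketch of that approach (using fetch404 as the example event), you can collect the body chunks yourself:

crawler.on("fetch404", function(queueItem, response) {
    var chunks = [];

    // The response is a plain http.IncomingMessage, so we can read it directly
    response.on("data", function(chunk) {
        chunks.push(chunk);
    });

    response.on("end", function() {
        var body = Buffer.concat(chunks);
        console.log("404 body for %s was %d bytes", queueItem.url, body.length);
    });
});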

Waiting for asynchronous event listeners

Sometimes you might want simplecrawler to wait while you perform asynchronous tasks in an event listener, instead of having it race off and fire the complete event before you're done - for example, if you're doing your own link discovery using an asynchronous library method.

simplecrawler provides a wait method you can call at any time. It is available via this from inside listeners, and on the crawler object itself. It returns a callback function.

Once you've called this method, simplecrawler will not fire the complete event until either you execute the callback it returns, or a timeout is reached (configured in crawler.listenerTTL, by default 10000 ms.)

Example asynchronous event listener

crawler.on("fetchcomplete", function(queueItem, data, res) {
    // "continue" is a reserved word in JavaScript, so store the callback under another name
    var resume = this.wait();

    // doSomeDiscovery is a placeholder for your own asynchronous discovery logic
    doSomeDiscovery(data, function(foundURLs) {
        foundURLs.forEach(function(url) {
            crawler.queueURL(url, queueItem);
        });

        resume();
    });
});

Configuration

simplecrawler is highly configurable and there's a long list of settings you can change to adapt it to your specific needs.

crawler.initialURL : String

Controls which URL to request first

crawler.host : String

Determines what hostname the crawler should limit requests to (so long as filterByDomain is true)

crawler.interval : Number

Determines the interval at which new requests are spawned by the crawler, as long as the number of open requests is under the maxConcurrency cap.

crawler.maxConcurrency : Number

Maximum request concurrency. If necessary, simplecrawler will increase node's http agent maxSockets value to match this setting.

crawler.timeout : Number

Maximum time we'll wait for headers

crawler.listenerTTL : Number

Maximum time we'll wait for async listeners

crawler.userAgent : String

Crawler's user agent string

Default: "Node/simplecrawler <version> (https://github.com/simplecrawler/simplecrawler)"

crawler.queue : FetchQueue

Queue for requests. The crawler can use any implementation so long as it uses the same interface. The default queue is simply backed by an array.

crawler.respectRobotsTxt : Boolean

Controls whether the crawler respects the robots.txt rules of any domain. This is done both with regards to the robots.txt file, and <meta> tags that specify a nofollow value for robots. The latter only applies if the default discoverResources method is used, though.

crawler.allowInitialDomainChange : Boolean

Controls whether the crawler is allowed to change the host setting if the first response is a redirect to another domain.

crawler.decompressResponses : Boolean

Controls whether HTTP responses are automatically decompressed based on their Content-Encoding header. If true, it will also assign the appropriate Accept-Encoding header to requests.

crawler.decodeResponses : Boolean

Controls whether HTTP responses are automatically character-converted to standard JavaScript strings using the iconv-lite module before being emitted in the fetchcomplete event. The character encoding is interpreted first from the Content-Type header and second from any <meta charset="xxx" /> tags.
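A minimal sketch of enabling this option (the substring length is arbitrary):

crawler.decodeResponses = true;

crawler.on("fetchcomplete", function(queueItem, responseBody) {
    // responseBody is a decoded string here rather than a raw Buffer
    console.log(responseBody.substring(0, 200));
});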

crawler.filterByDomain : Boolean

Controls whether the crawler fetches only URLs where the hostname matches host. Unless you want to be crawling the entire internet, I would recommend leaving this on!

crawler.scanSubdomains : Boolean

Controls whether URLs that point to a subdomain of host should also be fetched.

crawler.ignoreWWWDomain : Boolean

Controls whether to treat the www subdomain as the same domain as host. So if http://example.com/example has already been fetched, http://www.example.com/example won't be fetched as well.

crawler.stripWWWDomain : Boolean

Controls whether to strip the www subdomain entirely from URLs at queue item construction time.

crawler.cache : SimpleCache

Internal cache store. Must implement SimpleCache interface. You can save the site to disk using the built in file system cache like this:

crawler.cache = new Crawler.cache('pathToCacheDirectory');

crawler.useProxy : Boolean

Controls whether an HTTP proxy should be used for requests

crawler.proxyHostname : String

If useProxy is true, this setting controls what hostname to use for the proxy

crawler.proxyPort : Number

If useProxy is true, this setting controls what port to use for the proxy

crawler.proxyUser : String

If useProxy is true, this setting controls what username to use for the proxy

crawler.proxyPass : String

If useProxy is true, this setting controls what password to use for the proxy

crawler.needsAuth : Boolean

Controls whether to use HTTP Basic Auth

crawler.authUser : String

If needsAuth is true, this setting controls what username to send with HTTP Basic Auth

crawler.authPass : String

If needsAuth is true, this setting controls what password to send with HTTP Basic Auth

crawler.acceptCookies : Boolean

Controls whether to save and send cookies or not

crawler.cookies : CookieJar

The module used to store cookies

crawler.customHeaders : Object

Controls what headers (besides the default ones) to include with every request.
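For example (the header names and values below are only illustrative):

crawler.customHeaders = {
    "Accept-Language": "en-GB",
    "X-Requested-With": "simplecrawler"
};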

crawler.domainWhitelist : Array

Controls what domains the crawler is allowed to fetch from, regardless of host or filterByDomain settings.

crawler.allowedProtocols : Array.<RegExp>

Controls what protocols the crawler is allowed to fetch from

crawler.maxResourceSize : Number

Controls the maximum allowed size in bytes of resources to be fetched

Default: 16777216
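For instance, to lower the limit to roughly 5 MiB:

crawler.maxResourceSize = 5 * 1024 * 1024; // 5 MiB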

crawler.supportedMimeTypes : Array.<(RegExp|string)>

Controls what mimetypes the crawler will scan for new resources. If downloadUnsupported is false, this setting will also restrict what resources are downloaded.

crawler.downloadUnsupported : Boolean

Controls whether to download resources with unsupported mimetypes (as specified by supportedMimeTypes)

crawler.urlEncoding : String

Controls what URL encoding to use. Can be either "unicode" or "iso8859"

crawler.stripQuerystring : Boolean

Controls whether to strip query string parameters from URLs at queue item construction time.

crawler.sortQueryParameters : Boolean

Controls whether to sort query string parameters in URLs at queue item construction time.

crawler.discoverRegex : Array.<(RegExp|function())>

Collection of regular expressions and functions that are applied in the default discoverResources method.
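As a sketch, you could replace the defaults with a narrower set of patterns. Note that these expressions are illustrative, not the library's built-in defaults:

crawler.discoverRegex = [
    /\shref\s?=\s?(["']).*?\1/ig, // quoted href attributes
    /\ssrc\s?=\s?(["']).*?\1/ig   // quoted src attributes
];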

crawler.parseHTMLComments : Boolean

Controls whether the default discoverResources should scan for new resources inside of HTML comments.

crawler.parseScriptTags : Boolean

Controls whether the default discoverResources should scan for new resources inside of <script> tags.

crawler.maxDepth : Number

Controls the max depth of resources that the crawler fetches. 0 means that the crawler won't restrict requests based on depth. The initial resource, as well as manually queued resources, are at depth 1. From there, every discovered resource adds 1 to its referrer's depth.

crawler.ignoreInvalidSSL : Boolean

Controls whether to proceed anyway when the crawler encounters an invalid SSL certificate.

crawler.httpAgent : HTTPAgent

Controls what HTTP agent to use. This is useful if you want to configure eg. a SOCKS client.

crawler.httpsAgent : HTTPAgent

Controls what HTTPS agent to use. This is useful if you want to configure eg. a SOCKS client.

Fetch conditions

simplecrawler has a concept called fetch conditions that offers a flexible API for filtering discovered resources before they're put in the queue. A fetch condition is a function that takes a queue item candidate and evaluates (synchronously or asynchronously) whether it should be added to the queue or not. Please note: with the next major release, all fetch conditions will be asynchronous.

You may add as many fetch conditions as you like, and remove them at runtime. simplecrawler will evaluate every fetch condition in parallel until one is encountered that returns a falsy value. If that happens, the resource in question will not be fetched.

This API is complemented by download conditions that determine whether a resource's body data should be downloaded.

crawler.addFetchCondition(callback) ⇒ Number

Adds a callback to the fetch conditions array. simplecrawler will evaluate all fetch conditions for every discovered URL, and if any of the fetch conditions returns a falsy value, the URL won't be queued.

Returns: Number - The index of the fetch condition in the fetch conditions array. This can later be used to remove the fetch condition.

Param Type Description
callback addFetchConditionCallback Function to be called after resource discovery that's able to prevent queueing of resource
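A minimal synchronous sketch that skips PDF files (the extension test is just an example):

crawler.addFetchCondition(function(queueItem, referrerQueueItem) {
    return !queueItem.path.match(/\.pdf$/i);
});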

Crawler~addFetchConditionCallback : function

Evaluated for every discovered URL to determine whether to put it in the queue.

Param Type Description
queueItem QueueItem The resource to be queued (or not)
referrerQueueItem QueueItem The resource where queueItem was discovered
callback function

crawler.removeFetchCondition(id) ⇒ Boolean

Removes a fetch condition from the fetch conditions array.

Returns: Boolean - If the removal was successful, the method will return true. Otherwise, it will throw an error.

Param Type Description
id Number | function The numeric ID of the fetch condition, or a reference to the fetch condition itself. This was returned from addFetchCondition

Download conditions

While fetch conditions let you determine which resources to put in the queue, download conditions offer the same kind of flexible API for determining which resources' data to download. Download conditions support both a synchronous and an asynchronous API, but with the next major release, all download conditions will be asynchronous.

Download conditions are evaluated after the headers of a resource have been downloaded, if that resource returned an HTTP status between 200 and 299. This lets you inspect the content-type and content-length headers, along with all other properties on the queue item, before deciding if you want this resource's data or not.
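For example, a synchronous sketch that only downloads bodies served as HTML:

crawler.addDownloadCondition(function(queueItem, response) {
    return /text\/html/i.test(response.headers["content-type"] || "");
});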

crawler.addDownloadCondition(callback) ⇒ Number

Adds a callback to the download conditions array. simplecrawler will evaluate all download conditions for every fetched resource after the headers of that resource have been received. If any of the download conditions returns a falsy value, the resource data won't be downloaded.

Returns: Number - The index of the download condition in the download conditions array. This can later be used to remove the download condition.

Param Type Description
callback addDownloadConditionCallback Function to be called when the headers of the resource represented by the queue item have been downloaded

Crawler~addDownloadConditionCallback : function

Evaluated for every fetched resource after its headers have been received to determine whether to fetch the resource body.

Param Type Description
queueItem QueueItem The resource to be downloaded (or not)
response http.IncomingMessage The response object as returned by node's http API
callback function

crawler.removeDownloadCondition(id) ⇒ Boolean

Removes a download condition from the download conditions array.

Returns: Boolean - If the removal was successful, the method will return true. Otherwise, it will throw an error.

Param Type Description
id Number | function The numeric ID of the download condition, or a reference to the download condition itself. The ID was returned from addDownloadCondition

The queue

Like any other web crawler, simplecrawler has a queue. It can be directly accessed through crawler.queue and implements an asynchronous interface for accessing queue items and statistics. There are several methods for interacting with the queue, the simplest being crawler.queue.get, which lets you get a queue item at a specific index in the queue.

fetchQueue.get(index, callback)

Get a queue item by index

Param Type Description
index Number The index of the queue item in the queue
callback function Gets two parameters, error and queueItem. If the operation was successful, error will be null.
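For example:

crawler.queue.get(0, function(error, queueItem) {
    if (error) throw error;
    console.log("The first queued URL is", queueItem.url);
});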

All queue methods are in reality synchronous by default, but simplecrawler is built to be able to use different queue implementations that expose the same interface, and those implementations can be asynchronous - which means they could eg. be backed by a database.

Manually adding to the queue

To add items to the queue, use crawler.queueURL.

crawler.queueURL(url, [referrer], [force]) ⇒ Boolean

Queues a URL for fetching after cleaning, validating and constructing a queue item from it. If you're queueing a URL manually, use this method rather than Crawler#queue#add

Returns: Boolean - The return value used to indicate whether the URL passed all fetch conditions and robots.txt rules. With the advent of async fetch conditions, the return value will no longer take fetch conditions into account.
Emits: invaliddomain, fetchdisallowed, fetchconditionerror, fetchprevented, queueduplicate, queueerror, queueadd

Param Type Description
url String An absolute or relative URL. If relative, processURL will make it absolute to the referrer queue item.
[referrer] QueueItem The queue item representing the resource where this URL was discovered.
[force] Boolean If true, the URL will be queued regardless of whether it already exists in the queue or not.
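For example, to queue an extra resource manually (the URL is just a placeholder):

crawler.queueURL("http://www.example.com/sitemap.xml");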

Queue items

When working with simplecrawler, you'll constantly be handed queue items, so it helps to know what's inside them. Here's the formal documentation of the properties that they contain.

QueueItem : Object

QueueItems represent resources in the queue that have been fetched, or will be eventually.

Properties

Name Type Description
id Number A unique ID assigned by the queue when the queue item is added
url String The complete, canonical URL of the resource
protocol String The protocol of the resource (http, https)
host String The full domain/hostname of the resource
port Number The port of the resource
path String The URL path, including the query string
uriPath String The URL path, excluding the query string
depth Number How many steps simplecrawler has taken from the initial page (which is depth 1) to this resource.
referrer String The URL of the resource where the URL of this queue item was discovered
fetched Boolean Has the request for this item been completed? You can monitor this as requests are processed.
status 'queued' | 'spooled' | 'headers' | 'downloaded' | 'redirected' | 'notfound' | 'failed' The internal status of the item.
stateData Object An object containing state data and other information about the request.
stateData.requestLatency Number The time (in ms) taken for headers to be received after the request was made.
stateData.requestTime Number The total time (in ms) taken for the request (including download time.)
stateData.downloadTime Number The total time (in ms) taken for the resource to be downloaded.
stateData.contentLength Number The length (in bytes) of the returned content. Calculated based on the content-length header.
stateData.contentType String The MIME type of the content.
stateData.code Number The HTTP status code returned for the request. Note that this code is 600 if an error occurred in the client and a fetch operation could not take place successfully.
stateData.headers Object An object containing the header information returned by the server. This is the object node returns as part of the response object.
stateData.actualDataSize Number The length (in bytes) of the returned content. Calculated based on what is actually received, not the content-length header.
stateData.sentIncorrectSize Boolean True if the data length returned by the server did not match what we were told to expect by the content-length header.

Queue statistics and reporting

First of all, the queue can provide some basic statistics about the network performance of your crawl so far. This is done live, so don't check it 30 times a second. You can test the following properties:

  • requestTime
  • requestLatency
  • downloadTime
  • contentLength
  • actualDataSize

You can get the maximum, minimum, and average values for each with the crawler.queue.max, crawler.queue.min, and crawler.queue.avg functions respectively.

fetchQueue.max(statisticName, callback)

Gets the maximum value of a stateData property from all the items in the queue. This means you can eg. get the maximum request time, download size etc.

Param Type Description
statisticName String Can be any of the strings in _allowedStatistics
callback function Gets two parameters, error and max. If the operation was successful, error will be null.

fetchQueue.min(statisticName, callback)

Gets the minimum value of a stateData property from all the items in the queue. This means you can eg. get the minimum request time, download size etc.

Param Type Description
statisticName String Can be any of the strings in _allowedStatistics
callback function Gets two parameters, error and min. If the operation was successful, error will be null.

fetchQueue.avg(statisticName, callback)

Gets the average value of a stateData property from all the items in the queue. This means you can eg. get the average request time, download size etc.

Param Type Description
statisticName String Can be any of the strings in _allowedStatistics
callback function Gets two parameters, error and avg. If the operation was successful, error will be null.
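For example, to log the average request latency once the crawl has finished:

crawler.on("complete", function() {
    crawler.queue.avg("requestLatency", function(error, average) {
        if (error) throw error;
        console.log("Average request latency: %d ms", average);
    });
});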

For general filtering or counting of queue items, there are two methods: crawler.queue.filterItems and crawler.queue.countItems. Both take an object comparator and a callback.

fetchQueue.filterItems(comparator, callback)

Filters and returns the items in the queue that match a selector

Param Type Description
comparator Object Comparator object used to filter items. Queue items that are returned need to match all the properties of this object.
callback function Gets two parameters, error and items. If the operation was successful, error will be null and items will be an array of QueueItems.

fetchQueue.countItems(comparator, callback)

Counts the items in the queue that match a selector

Param Type Description
comparator Object Comparator object used to filter items. Queue items that are counted need to match all the properties of this object.
callback function Gets two parameters, error and count. If the operation was successful, error will be null and count will be the number of queue items that matched the comparator.

The object comparator can also contain other objects, so you may filter queue items based on properties in their stateData object as well.

crawler.queue.filterItems({
    stateData: { code: 301 }
}, function(error, items) {
    console.log("These items returned a 301 HTTP status", items);
});
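Counting works the same way - this sketch counts the queue items that have already been fetched:

crawler.queue.countItems({ fetched: true }, function(error, count) {
    if (error) throw error;
    console.log("%d resources have been fetched so far", count);
});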

Saving and reloading the queue (freeze/defrost)

It can be convenient to be able to save the crawl progress and later be able to reload it if your application fails or you need to abort the crawl for some reason. The crawler.queue.freeze and crawler.queue.defrost methods will let you do this.

A word of warning - they are not CPU friendly, as they rely on JSON.parse and JSON.stringify. Use them only when you need to save the queue - don't call them after every request, or your application's performance will suffer badly, since they block heavily. That said, using them when your crawler starts and stops is perfectly reasonable.

Note that the methods themselves are asynchronous, so if you are going to exit the process after you do the freezing, make sure you wait for callback - otherwise you'll get an empty file.
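A sketch of freezing the queue when the crawl finishes (the filename is just an example):

crawler.on("complete", function() {
    crawler.queue.freeze("queue.json", function(error) {
        if (error) throw error;
        // Only exit once the queue has actually been written to disk
        process.exit();
    });
});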

fetchQueue.freeze(filename, callback)

Writes the queue to disk in a JSON file. This file can later be imported using defrost

Param Type Description
filename String Filename passed directly to fs.writeFile
callback function Gets a single error parameter. If the operation was successful, this parameter will be null.

fetchQueue.defrost(filename, callback)

Import the queue from a frozen JSON file on disk.

Param Type Description
filename String Filename passed directly to fs.readFile
callback function Gets a single error parameter. If the operation was successful, this parameter will be null.

Cookies

simplecrawler has an internal cookie jar, which collects and resends cookies automatically and by default. If you want to turn this off, set the crawler.acceptCookies option to false. The cookie jar is accessible via crawler.cookies, and is an event emitter itself.

Cookie events

"addcookie" (cookie)

Fired when a cookie has been added to the jar

Param Type Description
cookie Cookie The cookie that has been added

"removecookie" (cookie)

Fired when one or more cookies have been removed from the jar

Param Type Description
cookie Array.<Cookie> The cookies that have been removed

Link Discovery

simplecrawler's discovery function is made to be replaceable - you can easily write your own that discovers only the links you're interested in.

The method must accept a buffer and a queueItem, and return the resources that are to be added to the queue.

It is quite common to pair simplecrawler with a module like cheerio that can correctly parse HTML and provide a DOM-like API for querying - or even a whole headless browser, like PhantomJS.

The example below demonstrates how one might achieve basic HTML-correct discovery of only link tags using cheerio.

var cheerio = require("cheerio");

crawler.discoverResources = function(buffer, queueItem) {
    var $ = cheerio.load(buffer.toString("utf8"));

    return $("a[href]").map(function () {
        return $(this).attr("href");
    }).get();
};

FAQ/Troubleshooting

There are a couple of questions that pop up more often than others in the issue tracker. If you're having trouble with simplecrawler, please have a look at the list below before submitting an issue.

  • Q: Why does simplecrawler discover so many invalid URLs?

    A: simplecrawler's built-in discovery method is purposefully naive - it's a brute force approach intended to find everything: URLs in comments, binary files, scripts, image EXIF data, inside CSS documents, and more - useful for archiving and use cases where it's better to have false positives than fail to discover a resource.

    It's definitely not a solution for every case, though - if you're writing a link checker or validator, you don't want erroneous 404s throwing errors. Therefore, simplecrawler allows you to tune discovery in a few key ways:

    • You can either add to (or remove from) the crawler.discoverRegex array, tweaking the search patterns to meet your requirements; or
    • Swap out the discoverResources method. Parsing HTML pages is beyond the scope of simplecrawler, but it is very common to combine simplecrawler with a module like cheerio for more sophisticated resource discovery.

    Further documentation is available in the link discovery section.

  • Q: Why did simplecrawler complete without fetching any resources?

    A: When this happens, it is usually because the initial request was redirected to a different domain that wasn't in the crawler.domainWhitelist.

  • Q: How do I crawl a site that requires a login?

    A: Logging in to a site is usually fairly simple and most login procedures look alike. We've included an example that covers a lot of situations, but sadly, there isn't a one true solution for how to deal with logins, so there's no guarantee that this code works right off the bat.

    What we do here is:

    1. fetch the login page,
    2. store the session cookie assigned to us by the server,
    3. extract any CSRF tokens or similar parameters required when logging in,
    4. submit the login credentials.
    var Crawler = require("simplecrawler"),
        url = require("url"),
        cheerio = require("cheerio"),
        request = require("request");
    
    var initialURL = "https://example.com/";
    
    var crawler = new Crawler(initialURL);
    
    request("https://example.com/login", {
        // The jar option isn't necessary for simplecrawler integration, but it's
        // the easiest way to have request remember the session cookie between this
        // request and the next
        jar: true
    }, function (error, response, body) {
        // Start by saving the cookies. We'll likely be assigned a session cookie
        // straight off the bat, and then the server will remember the fact that
        // this session is logged in as user "iamauser" after we've successfully
        // logged in
        crawler.cookies.addFromHeaders(response.headers["set-cookie"]);
    
        // We want to get the names and values of all relevant inputs on the page,
        // so that any CSRF tokens or similar things are included in the POST
        // request
        var $ = cheerio.load(body),
            formDefaults = {},
            // You should adapt these selectors so that they target the
            // appropriate form and inputs
            formAction = $("#login").attr("action"),
            loginInputs = $("input");
    
        // We loop over the input elements and extract their names and values so
        // that we can include them in the login POST request
        loginInputs.each(function(i, input) {
            var inputName = $(input).attr("name"),
                inputValue = $(input).val();
    
            formDefaults[inputName] = inputValue;
        });
    
        // Time for the login request!
        request.post(url.resolve(initialURL, formAction), {
            // We can't be sure that all of the input fields have a correct default
            // value. Maybe the user has to tick a checkbox or something similar in
            // order to log in. This is something you have to find out manually
            // by logging in to the site in your browser and inspecting in the
            // network panel of your favorite dev tools what parameters are included
            // in the request.
            form: Object.assign(formDefaults, {
                username: "iamauser",
                password: "supersecretpw"
            }),
            // We want to include the saved cookies from the last request in this
            // one as well
            jar: true
        }, function (error, response, body) {
            // That should do it! We're now ready to start the crawler
            crawler.start();
        });
    });
    
    crawler.on("fetchcomplete", function (queueItem, responseBuffer, response) {
        console.log("Fetched", queueItem.url, responseBuffer.toString());
    });
  • Q: What does it mean that events are asynchronous?

    A: One of the core concepts of node.js is its asynchronous nature. I/O operations (like network requests) take place outside of the main thread (which is where your code is executed). This is what makes node fast, the fact that it can continue executing code while there are multiple HTTP requests in flight, for example. But to be able to get back the result of the HTTP request, we need to register a function that will be called when the result is ready. This is what asynchronous means in node - the fact that code can continue executing while I/O operations are in progress - and it's the same concept as with AJAX requests in the browser.

  • Q: Promises are nice, can I use them with simplecrawler?

    A: No, not really. Promises are meant as a replacement for callbacks, but simplecrawler is event driven, not callback driven. Using callbacks to any greater extent in simplecrawler wouldn't make much sense, since you normally need to react more than once to what happens in simplecrawler.

  • Q: Something's happening and I don't see the output I'm expecting!

    Before filing an issue, check to see that you're not just missing something by logging all crawler events with the code below:

    var originalEmit = crawler.emit;
    crawler.emit = function(evtName, queueItem) {
        crawler.queue.countItems({ fetched: true }, function(err, completeCount) {
            if (err) {
                throw err;
            }
    
            crawler.queue.getLength(function(err, length) {
                if (err) {
                    throw err;
                }
    
                console.log("fetched %d of %d - %d open requests, %d open listeners",
                    completeCount,
                    length,
                    crawler._openRequests.length,
                    crawler._openListeners);
            });
        });
    
        console.log(evtName, queueItem ? queueItem.url ? queueItem.url : queueItem : null);
        originalEmit.apply(crawler, arguments);
    };

    If you don't see what you need after inserting that code block, and you still need help, please include the output of all the fired events in your issue.

Node Support Policy

Simplecrawler will officially support stable and LTS versions of Node which are currently supported by the Node Foundation.

Currently supported versions:

  • 8.x
  • 10.x
  • 12.x

Current Maintainers

Contributing

Please see the contributor guidelines before submitting a pull request to ensure that your contribution is able to be accepted quickly and easily!

Contributors

simplecrawler has benefited from the kind efforts of dozens of contributors, to whom we are incredibly grateful. We originally listed their individual contributions but it became pretty unwieldy - the full list can be found here.

License

Copyright (c) 2017, Christopher Giffard.

All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  • Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


simplecrawler's Issues

Custom headers

Is there any way to add custom headers without overriding the Crawler.prototype.fetchQueueItem method?

thanks

node 0.10.28 causes several issues

I used node 0.8.26 and the crawler worked fine, but several days ago I upgraded node to 0.10.28 and several problems occurred.

1. In the handleResponse function, every if clause seems to need response.on("data", function(){}); to fire the "end" event of the response. Otherwise the process will freeze.
2. In the handleResponse function, a memory leak (process out of memory error) occurred after crawling for several hours. It is very strange - I didn't see this kind of problem while using node 0.8.26.
3. In https mode, using node 0.10.28, I encountered a "self_signed_cert_in_chain" error. I googled this answer: http://loutilities.wordpress.com/2013/04/08/using-ssl-self-signed-certs-with-node-js/

I spent a whole day trying to solve these issues. Sadly, I had no success.

PhantomJS support/plugin?

A lot of sites render content with JavaScript after page load; crawling without post-rendering the JavaScript therefore makes this library useless for those sites.

I wrote a quick little PhantomJS middleman proxy that served the library the proxy URL to crawl the post-rendered JS pages.

Would be nice to have this support built in! Literally 10-20 lines of PhantomJS and it makes the difference.

Other than that - great library! Fan.

No base href support

I tried to download a site with a base href and found that you don't have base href support.

Link to the documentation:
http://www.w3schools.com/tags/tag_base.asp

If the base href is something like
base href="http://example.com/"

I can set it manually in lib/crawler.js

    return urlMatch
        .map(cleanURL)
        .reduce(function(list,URL) {
            //console.log(queueItem);
            // Ensure URL is whole and complete
            try {
                URL = URI(URL)
                        .absoluteTo(queueItem.protocol+'://'+queueItem.host+'/')
                        .normalize()
                        .toString();

Some ugly patch below (not fully tested)

Crawler.prototype.discoverResources = function(resourceData, queueItem) {
    // Convert to UTF-8
    // TODO: account for text-encoding.
    var resources = [],
        resourceText = resourceData.toString("utf8"),
        crawler = this;

    var baseReg = /base\s+href\s?=\s?['"]+(([^"']+))/ig;
    var baseMatch = baseReg.exec(resourceText);
    var base = (baseMatch) ? baseMatch[2] : queueItem.url;

...

    return urlMatch
        .map(cleanURL)
        .reduce(function(list,URL) {
            //console.log(queueItem);
            // Ensure URL is whole and complete
            try {
                URL = URI(URL)
                        .absoluteTo(base)
                        .normalize()

SimpleCrawler finds no links

I have an issue crawling http://kulturradet.no.

Every now and then, the crawler doesn't move past the front page. Hooking into the discoverycomplete event, I can see that no links are found when this happens. I was baffled by this behavior, as oftentimes it finds 80 links. The HTML code itself has not changed.

I spread some console.log(..) statements through the crawler.js code, and found that the downloaded buffer seems garbled when this happens. When it works well, I get normal HTML from the page when I output the buffer. When the error hits, and no links are found, I get completely garbled output. This is both before and after the .toString(UTF8) conversion has been performed by the discover links function.

A strange thing concerning this error is that there seems to be a timeout involved. The crawler stays in its erroneous state for some 10-15 minutes, and then when I run it again, it finds links just fine. Very strange. Downloading the page in a browser or using curl displays the HTML code just fine, without it being garbled.

Any ideas on what could be causing this? What baffles me is the erratic behavior. Is there some caching involved? Can a network error cause this so that it's cached afterwards, for example?

Calls through self signed certificate returns 599

Making calls to a site with a self-signed certificate returns stateData.code = 599.
Does this have to do with some underlying networking module, or am I doing it wrong?

Obsolete issue
You are able to instruct node to accept these certificates with the following code:

process.env['NODE_TLS_REJECT_UNAUTHORIZED'] = '0';

not working?

var Crawler = require("simplecrawler").Crawler;

var myCrawler = new Crawler("http://www.google.com","/",80,300);

myCrawler.on("fetchcomplete",function(queueItem, responseBuffer, response) {
console.log("I just received %s (%d bytes)",queueItem.url,responseBuffer.length);
console.log("It was a resource of type %s",response.headers['content-type']);
});

myCrawler.start();

Am I missing something?

Manual discovery within async calls

Hi!

I like your crawler, but I'm having trouble solving following problem.

I want to crawl only resources linked with anchors (a elements). So I wrote following sample code (in coffeescript):

Crawler = require "simplecrawler"
$       = require "jquery"
jsdom   = require "jsdom"
url     = require "url"

crawler = new Crawler
crawler.initialProtocol     = "http"
crawler.host                = "nodejs.org"
crawler.initialPort         = 80
crawler.initialPath         = "/"
crawler.discoverResources   = no


crawler.on "crawlstart", ->
  console.log "Crawling inside #{@host}"

crawler.on "fetchstart", (item) ->
  console.log "Looking into #{item.url}"

crawler.on "fetchcomplete", (item, buffer, response) ->
  dom = jsdom.env
    html    : buffer
    scripts : ["http://code.jquery.com/jquery.js"],
    (error, window) ->
      if error then throw error
      $ = window.jQuery
      links = $ "a"

      for link in links
        link = $ link
        target =
          'href': url.resolve item.url, link.prop 'href' # make it absolute
          'url' : url.parse link.prop 'href'             # parsed url

        if target.url.host is crawler.host
          console.log "Adding #{target.href}"
          crawler.queueURL target.href
        else
          console.log "Ommiting #{target.href} (not in #{crawler.host})"

      window.close()

crawler.on "complete", ->
  console.log "That's all there is to see. See you!"


crawler.start()

What I get back is:

Crawling inside nodejs.org
Looking into http://nodejs.org/
That's all there is to see. See you!
Ommiting http://code.google.com/p/v8/ (not in nodejs.org)
Adding http://nodejs.org/dist/v0.8.20/node-v0.8.20.tar.gz
Ommiting file://home/www/crawler/download/ (not in nodejs.org)
Ommiting file://home/www/crawler/api/ (not in nodejs.org)
Ommiting http://github.com/joyent/node (not in nodejs.org)
Ommiting http://www.ebaytechblog.com/2011/11/30/announcing-ql-io/ (not in nodejs.org)
Ommiting http://developer.yahoo.com/blogs/ydn/posts/2011/11/yahoo-announces-cocktails-%E2%80%93-shaken-not-stirred/ (not in nodejs.org)
Ommiting http://thenodefirm.com/ (not in nodejs.org)
Ommiting http://storify.com/ (not in nodejs.org)
Ommiting http://www.youtube.com/watch?v=jo_B4LTHi3I (not in nodejs.org)
Ommiting file://home/www/crawler/about/ (not in nodejs.org)
Ommiting http://search.npmjs.org/ (not in nodejs.org)
Adding http://nodejs.org/api/
Ommiting http://blog.nodejs.org/ (not in nodejs.org)
Ommiting file://home/www/crawler/community/ (not in nodejs.org)
Ommiting file://home/www/crawler/logos/ (not in nodejs.org)
Ommiting http://jobs.nodejs.org/ (not in nodejs.org)
Ommiting file://home/www/crawler/crawler2.coffee (not in nodejs.org)
Ommiting file://download/ (not in nodejs.org)
Ommiting file://about/ (not in nodejs.org)
Ommiting http://search.npmjs.org/ (not in nodejs.org)
Adding http://nodejs.org/api/
Ommiting http://blog.nodejs.org/ (not in nodejs.org)
Ommiting file://community/ (not in nodejs.org)
Ommiting file://logos/ (not in nodejs.org)
Ommiting http://jobs.nodejs.org/ (not in nodejs.org)
Ommiting http://joyent.com/ (not in nodejs.org)
Ommiting file://trademark-policy.pdf/ (not in nodejs.org)
Ommiting https://raw.github.com/joyent/node/v0.8.20/LICENSE (not in nodejs.org)

As you can see, the crawler assumes that its work is complete before discovery is finished. I can understand why (crawler.queue.add is called after the complete event was fired).

Is there an easy way to make it work?

I tried to call crawler.start() after each discovery, before window.close(). It had no effect.

Authentication for Google Sites

My plan is to crawl our school intranet for projects by students that have to do with programming. The intranet is hosted on Google Apps and needs authentication with a Google account. I have that account. But how can I access the intranet with simplecrawler?

Fix 0.8 .domain compatibility issues...

So Node's EventEmitter has now stolen the .domain keyword I was using on the crawler. I've got to either shift all the crawler options into an internal data store and add get/setters to the Crawler object to enable their access, or rename the keyword.

Choices.

Some questions about link discovery

I'm actually having a couple of issues, but first wanted to open a discussion before making changes and pull requesting anything. There are two cases that are currently not working as I want:

  • Hash links internal to a page should not be picked up as crawlable resources
  • When I include Google Analytics, I end up with the faulty "links" of "https://ssl" and "http://www" which create false 404s

For hash links (e.g. <a href="#foo">foo</a>), IMO there's no reason for the crawler to pick these up, and I would personally have the crawler always ignore any link starting with #. It's not a big deal, but in my case, I'm tracking stats around links closely and internal links needlessly clutter my stats. However, if there is a valid use case for the crawler to discover internal links, then perhaps a config like ignoreInternalLinks would be useful (and I think should default to false, although perhaps that would be considered a breaking change).

For the second issue, the problem is that the GA script has this line:

ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';

Because the crawler's discoverRegex property includes /http(s)?\:\/\/[^?\s><\'\"]+/ig, it unfortunately matches the GA script and results in those protocol fragments being "discovered" incorrectly. For this, there are several potential fixes:

  • Change the regex string to require at least one "." character in the url (not sure if that's 100% valid, but would at least fix this case), e.g. http(s)?\:\/\/[^?\s><\'\"]+[.]+
  • Change the regex string to look for a more complete link syntax including "href=", e.g. href=[\s\"\']+http(s)?\:\/\/[^?\s><\'\"]+
  • Declare the discoverRegex as an overrideable config on the crawler's prototype, rather than privately inside the discoverResources function

Personally, I would be in favor of implementing the third option for sure for max flexibility, but also implementing either 1 or 2 (or something similar) to avoid this case by default. As it is, at the moment there's no easy way for me to fix this without editing the source directly or overriding that entire function.

How to sideload the crawler?

At the moment my crawler is fired when I do node server.js, due to the var crawler = require('./server/config/crawler'); inside server.js

The crawler.js code is, as per above, in /server/config/crawler

The problem is that when the server is started I have to wait for all the crawling to be done.

I want to be able to start node server.js normally in the cmd and in another cmd to start the crawler and have all the crawling happen there.

Moreover, I want to be able to add new websites to crawl as the crawler is already running, without needing a restart, even if the js has been modified.

Is all of this possible? How? I'm very new to node.js.

Simplify basic crawls.

It'd be nice to have a one-three line solution for crawling a website. Think about ways to make simplecrawler even simpler to instantiate and run.

Events not cleared in node v0.10

Hi

I'm having a problem using this module with node 0.10 (it works fine in 0.8).
The problem is that the events are fired twice when I'm reusing the variable...

└─┬ [email protected]
  └── [email protected]

Testcase:

var util = require("util");
var Crawler = require("simplecrawler");

var test;

function event() {
  util.log("Are test an instance of Crawler: " + (test instanceof Crawler));

  if(test instanceof Crawler) {
    util.log("Running: " + test.running);
    util.log("Stopping crawler");
    test.stop();
    util.log("Running: " + test.running);

    util.log("Listners:");
    console.log(test.listeners('fetchcomplete'));
    util.log("Removing listeners");
    test.removeAllListeners();
    util.log("Listners:");
    console.log(test.listeners('fetchcomplete'));

    test = null;
  }

  test = Crawler.crawl("http://deewr.gov.au/");

  test.on("fetchcomplete", function datahandler(queueItem){
    util.log(queueItem.url);
  });
}

(function() {
  event();
  setTimeout(event, 500);
})();

0.8.23

$ nvm run v0.8.23 test_case2.js
Running node v0.8.23
6 May 11:17:15 - Are test an instance of Crawler: false
6 May 11:17:15 - Are test an instance of Crawler: true
6 May 11:17:15 - Running: true
6 May 11:17:15 - Stopping crawler
6 May 11:17:15 - Running: false
6 May 11:17:15 - Listners:
[ [Function: datahandler] ]
6 May 11:17:15 - Removing listeners
6 May 11:17:15 - Listners:
[]
6 May 11:17:16 - http://deewr.gov.au/
6 May 11:17:18 - http://deewr.gov.au/rss/deewr-news.xml

0.10.5

$ nvm run 0.10.5 test_case2.js
Running node v0.10.5
6 May 12:39:03 - Are test an instance of Crawler: false
6 May 12:39:03 - Are test an instance of Crawler: true
6 May 12:39:03 - Running: true
6 May 12:39:03 - Stopping crawler
6 May 12:39:03 - Running: false
6 May 12:39:03 - Listners:
[ [Function: datahandler] ]
6 May 12:39:03 - Removing listeners
6 May 12:39:03 - Listners:
[]
6 May 12:39:05 - http://deewr.gov.au/
6 May 12:39:05 - http://deewr.gov.au/
6 May 12:39:07 - http://deewr.gov.au/rss/deewr-videos.xml
6 May 12:39:07 - http://deewr.gov.au/rss/deewr-videos.xml

SimpleCrawler "complete" event not fired

Performing a crawl, SimpleCrawlers "complete" event is not fired.

I created an event handler for "fetchcomplete" that outputs crawler.queue.complete, which matches crawler.queue.length. All elements are complete, but the "complete" event does not fire.

Debugging is kinda hard, as the website I'm crawling (kulturradet.no) has thousands of links. I've tried adding a fetch condition that returns false once the queue grows past 5 elements, and this actually reproduces the error in less time.

I'm wondering if you can explain what actually triggers the "complete" event. Might there be something in my code that blocks it somehow?

Is there somewhere in crawler.js or the other files where I can insert some log statements to help me debug?

I will try to strip down my code to make a minimal example that illustrates the error.

[Feature Request] allow configuring URIjs with iso-8859-1

Hello,

URIjs's default behavior is to use encodeURIComponent/decodeURIComponent, which are UTF-8 based, but some websites use ISO-8859-1 (e.g. in a redirect's Location header). Calling URI.iso8859() makes it switch to escape/unescape. This could be a configuration parameter.
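
A minimal sketch of what that switch looks like when done by the consumer before starting the crawl (package name as published at the time; a crawler-level option would essentially just wrap this call):

var URI = require("URIjs");

// Switch URI.js from encodeURIComponent/decodeURIComponent (UTF-8)
// to escape/unescape (ISO-8859-1) globally.
URI.iso8859();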

Regards,
Thiago Souza

Problem with "addFetchCondition"

I have problems with the addFetchCondition method; when I add this code:

var http = require('http');
var scrawler = require("simplecrawler");
var conditionID = scrawler.addFetchCondition(function(parsedURL) {
    return !parsedURL.path.match(/\.pdf$/i);
});

and run the script, I get this error:

var conditionID = scrawler.addFetchCondition(function(parsedURL) {
                           ^
TypeError: Object function (host,initialPath,initialPort,interval) {
......
        ];
} has no method 'addFetchCondition'
    at Object.<anonymous> (/home/nodetests/wc0/sample0.js:37:28)
    at Module._compile (module.js:449:26)
    at Object.Module._extensions..js (module.js:467:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Module.runMain (module.js:492:10)
    at process.startup.processNextTick.process._tickCallback (node.js:244:9)

It is weird because I can see the method in the crawler.js file.
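
For what it's worth, the other reports in this list call addFetchCondition on a crawler instance rather than on the exported constructor, which looks like the likely fix here. A sketch:

var Crawler = require("simplecrawler");
var crawler = new Crawler("www.example.com");

// addFetchCondition is an instance method, not a static on the module.
var conditionID = crawler.addFetchCondition(function (parsedURL) {
    return !parsedURL.path.match(/\.pdf$/i);
});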

Correct Cache API Pattern

At the moment, the callback is passed a single parameter, which could be null (if an item doesn't exist) or an actual value.

To fit better with node's internal patterns, the first parameter should be an error (or null), and the second parameter should be the result.
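
A minimal self-contained sketch of the proposed error-first shape (all names here are illustrative, not the actual cache backend API):

var store = {}; // stand-in for the real cache backend

function getCacheData(url, callback) {
    // First argument is always an error (or null); second is the result.
    if (!Object.prototype.hasOwnProperty.call(store, url)) {
        return callback(null, null); // "not cached" is a result, not an error
    }
    callback(null, store[url]);
}

getCacheData("http://www.example.com/", function (error, cacheObject) {
    if (error) return console.error(error);
    console.log(cacheObject); // null means the item doesn't exist
});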

Multiple addFetchConditions aren't behaving nicely

Hey there, I want to make sure I'm using this function correctly. It seems that when I create two fetch conditions, only the last one is applied. For instance, this code crawls my site and returns HTML and JS files but not CSS files. I was expecting it to filter out both CSS and JS. Any idea what might be at the root of this?

Thanks!

  var myCrawler = new Crawler("www.pixelhacking.com");

  var jsFetchCondition = myCrawler.addFetchCondition(function (parsedURL) {
    return !parsedURL.path.match(/\.js$/i);
  });

  var cssFetchCondition = myCrawler.addFetchCondition(function (parsedURL) {
    return !parsedURL.path.match(/\.css$/i);
  });

  myCrawler.initialPath = "/";
  myCrawler.initialPort = 80;
  myCrawler.initialProtocol = "http";

  myCrawler.on("fetchcomplete", function (queueItem, responseBuffer, response) {
    console.log("I just received %s (%d bytes)", queueItem.url, responseBuffer.length);
    console.log("It was a resource of type %s", response.headers['content-type']);

    // Do something with the data in responseBuffer
  });

  myCrawler.start();

Ability to limit depth of crawl

It would be awesome if you could specify how deep the crawler should follow links (e.g. only go two levels deep from the initial link). Looking at the code I'm not sure how this would be implemented, or whether the queue-based approach makes it impossible.

Thoughts?

Urls with spaces are not crawled correctly

Any URL that contains spaces (after URL decoding) is not fully parsed: only the first word/part is extracted.

This is because the matching regexes exclude \s right now. I think it's OK to accept \s by removing it from the exclusion pattern (the quote check is enough).
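
Using the bare-URL regex quoted in the Google Analytics report above as an example, the proposed change amounts to this (illustrative only):

// Current: whitespace terminates the match, so "my page.html" stops at "my".
var current  = /http(s)?\:\/\/[^?\s><'"]+/ig;

// Proposed: drop \s from the exclusion class; quotes and brackets still
// terminate the match, which is enough inside href="..." attributes.
var proposed = /http(s)?\:\/\/[^?><'"]+/ig;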

Queue Error on every page

simplecrawler keeps throwing queue errors for every page the crawler visits. The error object is empty and the URL object only contains (for example):
{"protocol":"http","host":"www.reddit.com","port":80,"path":"/r/news/","uriPath":"/r/news/"}
It does this for every URL it finds and adds nothing to the queue.

I've tried this on several websites including http://www.google.com and http://www.reddit.com

Default cookie's domain

If the domain is not set by the server, the cookie's domain will be set to * (the default value). But when the crawler resends the cookies, it will not resend this one because it does not match the domain.

This is just a small error; I think we just need to be able to get the domain the cookie was sent from. We need to pass this domain from Crawler.handleResponse to CookieJar.addFromHeaders and then to Cookie.fromString, and use it if the domain in the cookie is not set.
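
The fallback rule being proposed is simply the following (function name hypothetical, shown only to illustrate the intent):

// Use the cookie's own Domain attribute when present, otherwise fall back
// to the host the response came from.
function resolveCookieDomain(cookieDomain, requestHost) {
    return cookieDomain || requestHost;
}

console.log(resolveCookieDomain(undefined, "www.example.com"));      // "www.example.com"
console.log(resolveCookieDomain(".example.com", "www.example.com")); // ".example.com"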

I was also wondering: why not use a module like "tough-cookie"? It's used by the "request" module and is compliant with RFC 6265.

This would add a dependency, but a small one, and it would move the cookie logic out of this module.

Best regards,

Leeroy

URIjs errors not handled

I've noticed that in Crawler.prototype.processURL, errors thrown by URIjs are not handled and consequently crash the app.

I've solved this by wrapping everything in a try/catch and returning false on error so that the URL is skipped. This works for me, but you may want to add proper URIjs error handling.
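
The workaround described above, done from the outside as a monkey-patch, looks roughly like this (a sketch; the existing processURL body is reused unchanged):

var Crawler = require("simplecrawler");

// Wrap the existing implementation so URIjs parse errors skip the offending
// URL instead of crashing the process.
var originalProcessURL = Crawler.prototype.processURL;

Crawler.prototype.processURL = function () {
    try {
        return originalProcessURL.apply(this, arguments);
    } catch (error) {
        return false; // returning false makes the crawler skip this URL
    }
};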

How to avoid crawling images, pdf and other files, only fetch html pages?

var Crawler = require("simplecrawler").Crawler;
var myCrawler = new Crawler("www.example.com","/");
myCrawler.interval = 5000; // Ten seconds
myCrawler.maxConcurrency = 1;

myCrawler.on("crawlstart",function() {
console.log("Crawl starting");
});

myCrawler.on("fetchstart",function(queueItem) {
console.log("fetchStart",queueItem);
});

myCrawler.on("fetchcomplete",function(queueItem, responseBuffer, response) {
    console.log("I just received %s (%d bytes)",queueItem.url,responseBuffer.length);
    console.log("It was a resource of type %s",response.headers['content-type']);

    // Do something with the data in responseBuffer
});

myCrawler.on("complete",function() {
console.log("Finished!");
});

myCrawler.start(); 

I am using the above code, and the crawler fetches all types of data from the site (.png, .eot, etc. files) when I only need to crawl the site's web pages.
How can I whitelist or blacklist certain content types?
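
One way to do this with the API used elsewhere in these reports is a fetch condition keyed on the file extension. The extension list below is only an example, and URLs without an extension are still fetched:

// Skip anything whose path ends in a typical static-asset extension.
myCrawler.addFetchCondition(function (parsedURL) {
    return !parsedURL.path.match(/\.(png|jpe?g|gif|ico|css|js|pdf|zip|eot|ttf|woff|svg)$/i);
});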

Incorrect redirect location in headers on non-standard ports

Hi Chris, it's me again.
After solving the problem of rejected self-signed SSL certificates, I've encountered another problem in my attempt to crawl an HTTPS server on port 8080.

When I call node's HTTPS.request() directly on https://10.36.15.39:8080/ I get the following headers:

{ server: 'Apache-Coyote/1.1',
  'x-arequestid': '743x55102x1',
  'set-cookie':
   [ 'atlassian.xsrf.token=B9N0-HGM9-4KO8-IV0I|23dc21ea0a16f825c9ce16310014707725b2e041|lout; Path=/; Secure',
     'JSESSIONID=1E14FC99CB147B71B3501B64DBB30376; Path=/; Secure; HttpOnly' ],
  'x-ausername': 'anonymous',
  'x-content-type-options': 'nosniff',
  location: 'https://10.36.15.39:8080/secure/MyJiraHome.jspa',
  'content-type': 'text/html;charset=UTF-8',
  'content-length': '0',
  date: 'Wed, 16 Apr 2014 16:23:31 GMT' }

Please note that the redirect location has the :8080 port.

However, when I call simplecrawler and capture the fetchheaders event, I get the following headers instead:

{ server: 'Apache-Coyote/1.1',
  'x-arequestid': '739x55100x1',
  'set-cookie':
   [ 'atlassian.xsrf.token=B9N0-HGM9-4KO8-IV0I|0da97328326928fd318ce0271f99bc30e523d39a|lout; Path=/; Secure',
     'JSESSIONID=9FE0186A73D259C216F65BDDE69BCE42; Path=/; Secure; HttpOnly' ],
  'x-ausername': 'anonymous',
  'x-content-type-options': 'nosniff',
  location: 'https://10.36.15.39/secure/MyJiraHome.jspa',
  'content-type': 'text/html;charset=UTF-8',
  'content-length': '0',
  date: 'Wed, 16 Apr 2014 16:19:52 GMT' }

Please note that the redirect location doesn't have the port.

Fetcherror is emitted with variable set of parameters

The fetcherror event is emitted with a variable set of parameters. On line
https://github.com/cgiffard/node-simplecrawler/blob/master/lib/crawler.js#L553 it is emitted as crawler.emit("fetcherror",queueItem);
and on line https://github.com/cgiffard/node-simplecrawler/blob/master/lib/crawler.js#L991 as crawler.emit("fetcherror",queueItem,response);
Listening to the fetcherror event could then potentially lead to the crawl not completing, as described in issue #48.
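
Until the two call sites agree, listeners need to treat the second argument as optional, e.g.:

crawler.on("fetcherror", function (queueItem, response) {
    // response is only provided by one of the two emit sites, so guard it.
    var statusCode = response ? response.statusCode : "unknown";
    console.log("Fetch error for %s (status: %s)", queueItem.url, statusCode);
});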

Manually adding to the queue, manual fetching

I'm having trouble understanding how to manually add to the queue and manually tell the queue to fetch. I've looked at queue.js and created my own instance, but I can't determine which part of simplecrawler drives the queue continuously.

Is there a global setting which will turn off automated adding and fetching, allowing me to build my own loops to do these things?

Thank you!

fetchcomplete event is missing

fetchstart is triggered but fetchcomplete is never triggered.
I set:
crawler.timeout = 5000

and it still doesn't help.

Redirect causes crawling to complete prematurely

Trying to crawl the NYTimes website, I found some strange behaviour. The pages seem to redirect to themselves, so the queue is never updated and the crawler stops almost immediately without any fetches or discoveries. Try this URL as the initial point:

http://www.nytimes.com/2013/03/19/business/use-of-generics-produces-an-unusual-drop-in-drug-spending.html

No block is in place, but crawling the website seems impossible. Another strange thing is that some pages can be crawled (http://www.nytimes.com/video for example), but absolutely NOT articles.

Any idea?

Question: How can i trigger complete after finding X results

It is a weird question, but I need to force the complete event after finding 10 files with content type text/html. Is that possible?

Also, another question:
Is it possible to abort the connection on the fetchheaders event, so I can stop downloading files once I know they are there?
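
Not an official answer, but for the first question, one approach with the existing API might be to count matching responses in a fetchcomplete handler and stop the crawler once the threshold is reached. Emitting complete manually is an assumption about how downstream code is wired, not a documented pattern:

var htmlCount = 0;

crawler.on("fetchcomplete", function (queueItem, responseBuffer, response) {
    var contentType = response.headers["content-type"] || "";

    if (contentType.indexOf("text/html") === 0 && ++htmlCount >= 10) {
        crawler.stop();           // stop issuing further requests
        crawler.emit("complete"); // hypothetical: force completion handling to run now
    }
});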

Missing Timer for managing slow requests

There's a timeout value, but for some odd reason, the actual timer that handles that value was never added (whoops!!)

Should be easy - this issue is just to track the progress of the fix.

crawl never starts

Hey,

I have this simple code as a test, but it never fires the fetchcomplete event:

var http = require('http');
var Crawler = require("simplecrawler").Crawler;

//----------------------------------
var mycrawler = new Crawler("www.example.com");

mycrawler.on("fetchcomplete",function(queueItem, responseBuffer, response) {

    console.log("I just received %s (%d bytes)",queueItem.url,responseBuffer.length);
    console.log("It was a resource of type %s",response.headers['content-type']);

    // Do something with the data in responseBuffer
});

mycrawler.start();

http.createServer(function (request, response) {
  response.writeHead(200, {'Content-Type': 'text/plain'});
  response.end('Hello World\n');
}).listen(3000);

using node v0.6.21-pre or v0.8.9

URI path issue

simplecrawler uses slightly different terminology to URIjs. Sorry!

I noticed the simplecrawler URI path includes the query string too. Is there any particular reason for this?
It causes an issue for me when I want to exclude certain files from the queue using addFetchCondition. The problem is that if I use a regex like /\.jpg$/i, it won't match when the target uses a GET parameter for cache-busting, resizing, etc.
If simplecrawler needs to return the path like that, would it be OK to add another property for the path excluding the query string? I mean somewhere here: https://github.com/cgiffard/node-simplecrawler/blob/master/lib/crawler.js#L262
If you think it is OK, I am happy to implement it.
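
In the meantime, a fetch condition can strip the query string itself before matching. A sketch:

crawler.addFetchCondition(function (parsedURL) {
    // parsedURL.path currently includes the query string, so drop it first.
    var pathOnly = parsedURL.path.split("?")[0];
    return !pathOnly.match(/\.jpg$/i);
});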

crawler.queue.defrost(file) still starts crawl from scratch

Following the documentation, I tried to store the queue to a file and resume the crawl later in a new process.

I found that this does not work: the "resumed" crawl still starts from scratch, ignoring which pages were already fetched by the previous run and not using any links discovered in it.

Looking at the source, it seems that the reason for this bug is:

crawler.queue.freeze correctly stores the whole queue object (including the scanIndex) to the file. crawler.queue.defrost pulls the queue items from the defrostedQueue and pushes them to the queue.

But it ignores the scanIndex member of the defrostedQueue and does not adjust its own scanIndex accordingly. I assume the other members need to be addressed as well.

It would be nice if the crawler could be stopped and resumed (with freeze and defrost) as if nothing had happened, continuing the work where the previous run left off.
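
For reference, the intended round trip (freeze and defrost as described above; callback signatures assumed) would be:

// First run: persist the queue before shutting down.
crawler.queue.freeze("queue-snapshot.json", function () {
    process.exit(0);
});

// Later, in a fresh process: restore the queue and continue.
crawler.queue.defrost("queue-snapshot.json", function () {
    // Expectation: scanIndex and the other bookkeeping members are restored
    // here too, so start() resumes rather than re-crawling from scratch.
    crawler.start();
});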

Crawler not visiting other whitelisted domains?

Edit: Hi Chris, didn't realise you were in Australia (possibly at another Govt department?) :-)

Thanks for this tool; it's doing great work for me. I'm using it to look for links to documents that will need to be moved or regenerated as part of a big website content transition (from a crappy old commercial CMS to Drupal, yay!).

I have a crawler setup similar to below:

var crawler = Crawler.crawl("http://www.domain1.gov.au/");

crawler.interval = 1000;
crawler.maxConcurrency = 3;
crawler.scanSubdomains = true;
crawler.domainWhitelist = ['domain1.gov.au', 'domain2.gov.au', 'www.domain1.gov.au', 'www.domain2.gov.au'];

However, the crawler appears to never visit domain2.gov.au (I have it console.log each page it is crawling). I know there are links from domain1 to domain2 because they are different parts of the same government department.

Is there another configuration switch I should have flipped on or off?

proxy authentication

Currently it's impossible to use simplecrawler behind an authenticated proxy!
Is this feature on the roadmap?

Thank you,
Alessandro.

HTTP Statuscode 410 should emit "fetch404" event.

I didn't test it, but I'm pretty sure this inconvenience exists ;o)

HTTP 410 is basically the same as HTTP 404; the former just says that the requested URL will definitely never come back. So both should be handled in the same way. (As far as I know, Google does this too; the only difference is that deindexing of 410 URLs is faster.)
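
Until that lands, a workaround is to watch for 410s on the generic error path as well as on fetch404 (argument shapes assumed from the fetcherror report above):

function handleGone(queueItem, response) {
    console.log("%s is gone (HTTP %d)", queueItem.url, response.statusCode);
}

crawler.on("fetch404", handleGone);

crawler.on("fetcherror", function (queueItem, response) {
    if (response && response.statusCode === 410) {
        handleGone(queueItem, response);
    }
});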

Note: of course this issue is not very important; I just think it could make sense at some point.

Document Fetch Conditions

As I discovered thanks to issue #14, I'd completely forgotten to document fetch conditions. I'd better rectify this ASAP, and add some example code.
