chriswren / grunt-link-checker

Run node-simple-crawler to discover broken links on your website

License: MIT License

JavaScript 100.00%

grunt-link-checker

Run node-simple-crawler to discover broken links on your website.

Getting Started

If you haven't used grunt before, be sure to check out the Getting Started guide, as it explains how to create a gruntfile as well as install and use grunt plugins. Once you're familiar with that process, install this plugin with this command:

npm install grunt-link-checker --save-dev

Then add this line to your project's Gruntfile.js:

grunt.loadNpmTasks('grunt-link-checker');

Documentation

By default, grunt-link-checker finds any broken internal links on the given site. It also finds broken fragment identifiers by using cheerio to ensure that an element exists with the given identifier. You can configure the additional options that node-simplecrawler makes available.

Minimal Usage

The minimal usage of grunt-link-checker runs with a site specified and an optional options.initialPort:

linkChecker: {
  dev: {
    site: 'localhost',
    options: {
      initialPort: 9001
    }
  }
}
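
For context, here is a minimal sketch of a complete Gruntfile.js around the configuration above. The grunt-contrib-connect server target, the `test/fixtures` base directory, and the `checklinks` alias are assumptions for illustration and are not part of this plugin; any server that makes the site reachable on port 9001 would do.

```javascript
// Gruntfile.js: a sketch, assuming grunt-contrib-connect serves the site locally.
function gruntfile(grunt) {
  grunt.initConfig({
    // Assumption: a local static server provided by grunt-contrib-connect.
    connect: {
      server: {
        options: {
          port: 9001,
          base: 'test/fixtures' // hypothetical directory containing the built site
        }
      }
    },
    linkChecker: {
      dev: {
        site: 'localhost',
        options: {
          initialPort: 9001
        }
      }
    }
  });

  grunt.loadNpmTasks('grunt-contrib-connect');
  grunt.loadNpmTasks('grunt-link-checker');

  // Hypothetical alias: start the server, then crawl it.
  grunt.registerTask('checklinks', ['connect:server', 'linkChecker:dev']);
}

module.exports = gruntfile;
```

With this in place, `grunt checklinks` would start the server and then run the `dev` target against it.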

Recommended Usage

In addition to the above config which tests a local version of your site before deployment, you can add an additional target to run post-deployment. This will verify that your assets were deployed correctly and are being resolved correctly after any revisioning or path modifications during deployment:

linkChecker: {
  // Use a large amount of concurrency to speed up check
  options: {
    maxConcurrency: 20
  },
  dev: {
    site: 'localhost',
    options: {
      initialPort: 9001
    }
  },
  postDeploy: {
    site: 'mysite.com'
  }
}
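
Grunt combines task-level options with each target's own options, and target values win on conflict. The sketch below imitates that shallow merge with `Object.assign` to show what each target above ends up with. `effectiveOptions` is a hypothetical helper for illustration, not part of Grunt or this plugin.

```javascript
// Illustration only: mimic how Grunt combines task-level and target-level options.
const linkChecker = {
  options: {
    maxConcurrency: 20
  },
  dev: {
    site: 'localhost',
    options: {
      initialPort: 9001
    }
  },
  postDeploy: {
    site: 'mysite.com'
  }
};

// Hypothetical helper: task-level options first, then target options override them.
function effectiveOptions(taskConfig, targetName) {
  const target = taskConfig[targetName] || {};
  return Object.assign({}, taskConfig.options, target.options);
}

console.log(effectiveOptions(linkChecker, 'dev'));
// { maxConcurrency: 20, initialPort: 9001 }
console.log(effectiveOptions(linkChecker, 'postDeploy'));
// { maxConcurrency: 20 }
```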

Custom options

checkRedirect

Type: Boolean
Default: false

Set this to true to check for redirects.

noFragment

Type: Boolean
Default: false

Set this to true to speed up your test by not verifying fragment identifiers.
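
As a sketch, both flags can be set per target. The target names `redirects` and `quick` and the site values are placeholders chosen for illustration.

```javascript
// Sketch of a linkChecker config using the two custom flags described above.
const linkCheckerConfig = {
  // Placeholder target: also check redirects on the deployed site.
  redirects: {
    site: 'mysite.com',
    options: {
      checkRedirect: true
    }
  },
  // Placeholder target: faster local run that skips fragment verification.
  quick: {
    site: 'localhost',
    options: {
      initialPort: 9001,
      noFragment: true
    }
  }
};

// In a Gruntfile this object would be passed as
// grunt.initConfig({ linkChecker: linkCheckerConfig });
```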

callback

Type: Function

Function that receives the instantiated crawler object so that you can add events or other listeners/config to the crawler.

Here is an example config using the callback option to ignore localhost links which have different ports:

linkChecker: {
  dev: {
    site: 'localhost',
    options: {
      initialPort: 9001,
      callback: function (crawler) {
        crawler.addFetchCondition(function (url) {
          return url.port === '9001';
        });
      }
    }
  }
}
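
The same hook can attach event listeners. The sketch below counts fetched pages and 404s and logs a running total. It assumes the crawler emits simplecrawler's `fetchcomplete` and `fetch404` events with the queue item as the first argument, and that the item exposes a `url` property. `attachProgress` is a hypothetical helper, and a plain `EventEmitter` stands in for the crawler in the demo.

```javascript
const EventEmitter = require('events');

// Hypothetical helper: log a running count of checked and missing pages.
function attachProgress(crawler, log) {
  const stats = { checked: 0, notFound: 0 };
  crawler.on('fetchcomplete', function (queueItem) {
    stats.checked += 1;
    log('checked ' + stats.checked + ': ' + queueItem.url);
  });
  crawler.on('fetch404', function (queueItem) {
    stats.notFound += 1;
    log('404 #' + stats.notFound + ': ' + queueItem.url);
  });
  return stats;
}

// Demo with a stand-in emitter. In a Gruntfile the wiring would look like:
// callback: function (crawler) {
//   attachProgress(crawler, function (msg) { grunt.log.writeln(msg); });
// }
const fakeCrawler = new EventEmitter();
const stats = attachProgress(fakeCrawler, console.log);
fakeCrawler.emit('fetchcomplete', { url: 'http://localhost:9001/' });
fakeCrawler.emit('fetch404', { url: 'http://localhost:9001/missing' });
```

This only reports pages already processed; it makes no attempt to estimate how many remain.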

simplecrawler options

Every option specified in the node-simplecrawler documentation is available:

https://github.com/cgiffard/node-simplecrawler#configuring-the-crawler

Changelog

0.2.0 - Updated dependencies.
0.1.0 - Updated dependencies, changed task name to linkChecker.
0.0.6 - Added logging for initially fetched URL and logged status codes for failed fetches.
0.0.5 - Added error reporting if initial site URL fails.
0.0.4 - Added callback option.
0.0.3 - Fixed repo link in package.json and fixed error reporting for a failed initial URL.
0.0.2 - Added noFragment flag.
0.0.1 - Check to make sure # URLs resolve to content with a corresponding ID.
0.0.0 - Initial release.

grunt-link-checker's Issues

Anchors throwing 404s

I've got a few anchors on my page, and when links to those anchors are being followed, they're getting marked as 404s. A typical example:

<a href="#fast-service" id="fast-service-link" title="Fast service times">Fast service times</a>

As a link to:

<div class="hiccup-right light-grey how-we-are-here-panel" id="fast-service"></div>

And the error:

Resource not found linked from https://[mydomain]/about-us to https://[mydomain]/about-us#fast-service
Status code: 404

Is this expected behaviour?

grunt-link-checker could benefit from a progress indication

grunt-link-checker is really powerful, but it can be slow as it steps through all the pages in a site. It would be nice if there was some kind of progress indication (pages checked/remaining) or similar to give an idea of progress. Not essential, just nice to have.

Check redirects too

Is it possible to check for redirects too?

Check for valid #anchor links

Decouple from Grunt

Hey, this looks great! It would be good if there was a plain Node module though. That would make it a lot more accessible. Let's say there would be a new "link-checker" (core) module.

Then this project would have the new "link-checker" module as a dependency, so this would just wrap it for grunt.

Then I could also make a gulp module if that made sense (maybe a Gulp plugin isn't needed; maybe using the plain Node module with gulp on its own would make most sense, I don't know).

Would you be open to that? I had a quick look over the source and it doesn't seem like it would be hard. I'd be up for helping anyway.

CONTRIBUTING.md needs work

It looks like the CONTRIBUTING.md is copied from other projects. I started trying to fix the links but I guess you'd probably just want to re-write the whole thing and you'd know better than me what should go there.

De-duplicate URL fetching when only difference is fragment identifier

Fatal error: Maximum call stack size exceeded

This is probably because the site I'm crawling has a pretty high number of resources. Still, I wonder whether this is preventable. Am I overlooking an option here?

readme.md

When registering the Grunt task, it took me a while to figure out that you need to enter 'linkChecker'. Perhaps it might be good to include that in the readme? Thanks :-)

Rename task to `linkChecker`

This should make things simpler to work with grunt templates.

    connect: {
      server: {
        options: {
          port: '<%= linkChecker.options.initialPort %>',
          base: 'test/fixtures'
        }
      }
    }

A major version bump should be made though.

Default config throwing error

Hi. When attempting to run a basic setup on a local server, I'm getting this error:

$ grunt checklinks
Running "link-checker:dev" (link-checker) task
Fatal error: Cannot read property 'cyan' of undefined

'link-checker': {
  dev: {
    site: '0.0.0.0:8080'
  }
}

My setup:
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]

Any ideas? I guess maybe it's something to do with the colors dependency? This is a fresh install of the package today.

Fatal error: "name" and "value" are required for setHeader().

I'm getting the above error after attempting to set up this plugin. These are what my files look like...

Gruntfile.js

module.exports = function(grunt) {

    grunt.initConfig({
        pkg: grunt.file.readJSON('package.json'),

        'link-checker': { 
            dev: {
                site: 'http://www.website.com',
                options: {
                    maxConcurrency: 20,
                }
            }
        }
    });

    grunt.loadNpmTasks('grunt-link-checker');

    grunt.registerTask('default', ['link-checker']);

};

package.json

{
  "name": "Website-Link-Crawler",
  "version": "0.0.1",
  "devDependencies": {
    "grunt": "^0.4.5",
    "grunt-link-checker": "0.0.6"
  }
}

Fatal error: Request path contains unescaped characters.

Hi,
The crawler goes through a couple of pages without any problems, but then throws this error. Any ideas as to why that might be the case, and how I could fix it? My configuration looks as follows:

linkChecker: {
  options: {
    maxConcurrency: 10
  },
  postDeploy: {
    site: 'www.radiologen-konstanz.de'
  }
}

Kind regards,
Max

Reports links with anchors (hashes) in the URL as 404

For example, it reports a 404 for this URL:

https://www.twilio.com/docs/api/client#overview

getting 404 for css files

Hi
I'm getting a very strange behavior with the link checker. It tries to get the CSS from the link tag and fails with 404 by trying to make the URL relative.

Resource not found linked from http://myIP:myPORT/products-services/healthcare-credit-card.html to http://myIP:myPORT/products-services/%27/styles/main.css
Status code: 404

I removed the IP since this is client work.
Have you seen anything like this? The CSS resolves fine in the page and all styles work. I was just wondering why this error would appear. It doesn't appear on all pages either, just 3 of them.
Thanks
Joe

Possible to add an ignore / whitelist for accepted errors?

It would be nice to be able to whitelist a set of URLs that are expected to break for various reasons: dynamic-template URLs, hacky graceful-fallback code, etc.

Can I somehow do this with the current implementation?
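
One possible approach, sketched under assumptions: the `callback` option described in the README hands over the crawler, and a fetch condition that returns false keeps matching URLs from being fetched at all, so they are never reported. The sketch assumes the condition receives a parsed URL object with a `path` property, as in simplecrawler's fetch conditions. `ignorePaths` and the two patterns are hypothetical.

```javascript
// Hypothetical helper: skip URLs whose path matches any of the given patterns.
function ignorePaths(crawler, patterns) {
  crawler.addFetchCondition(function (parsedURL) {
    // Returning false tells the crawler not to fetch (and so not to report) the URL.
    return !patterns.some(function (pattern) {
      return pattern.test(parsedURL.path);
    });
  });
}

// Stand-in crawler for the demo; it only records the registered conditions.
const conditions = [];
const fakeCrawler = {
  addFetchCondition: function (fn) {
    conditions.push(fn);
  }
};
ignorePaths(fakeCrawler, [/^\/templates\//, /\{\{.*\}\}/]);

console.log(conditions[0]({ path: '/templates/item.html' })); // false
console.log(conditions[0]({ path: '/about-us' }));            // true
```

In a Gruntfile, the call would go inside `options.callback`, for example `callback: function (crawler) { ignorePaths(crawler, [/^\/templates\//]); }`.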