linkinator's Introduction

๐Ÿฟ linkinator

A super simple site crawler and broken link checker.


Behold my latest inator! The linkinator provides an API and CLI for crawling websites and validating links. It's got a ton of sweet features:

  • 🔥 Easily perform scans on remote sites or local files
  • 🔥 Scan any element that includes links, not just <a href>
  • 🔥 Supports redirects, absolute links, relative links, all the things
  • 🔥 Configure specific regex patterns to skip
  • 🔥 Scan markdown files without transpilation

Installation

npm install linkinator

Not into the whole node.js or npm thing? You can also download a standalone binary that bundles node, linkinator, and anything else you need. See releases.

Command Usage

You can use this as a library, or as a CLI. Let's see the CLI!

$ linkinator LOCATIONS [ --arguments ]

  Positional arguments

    LOCATIONS
      Required. Either the URLs or the paths on disk to check for broken links.
      Supports multiple paths, and globs.

  Flags

    --concurrency
        The number of connections to make simultaneously. Defaults to 100.

    --config
        Path to the config file to use. Looks for `linkinator.config.json` by default.

    --directory-listing
        Include an automatic directory index file when linking to a directory.
        Defaults to 'false'.

    --format, -f
        Return the data in CSV or JSON format.

    --help
        Show this command.

    --include, -i
        List of urls in regexy form to include.  The opposite of --skip.

    --markdown
        Automatically parse and scan markdown if scanning from a location on disk.

    --recurse, -r
        Recursively follow links on the same root domain.

    --retry
        Automatically retry requests that return HTTP 429 responses and include
        a 'retry-after' header. Defaults to false.

    --retry-errors
        Automatically retry requests that return a 5xx or unknown response.

    --retry-errors-count
        How many times should an error be retried?

    --retry-errors-jitter
        Random jitter applied to the error retry delay.

    --server-root
        When scanning a local directory, customize the location on disk
        where the server is started.  Defaults to the path passed in [LOCATION].

    --skip, -s
        List of urls in regexy form to not include in the check.

    --timeout
        Request timeout in ms.  Defaults to 0 (no timeout).

    --url-rewrite-search
        Pattern to search for in urls.  Must be used with --url-rewrite-replace.

    --url-rewrite-replace
        Expression used to replace search content.  Must be used with --url-rewrite-search.

    --verbosity
        Override the default verbosity for this command. Available options are
        'debug', 'info', 'warning', 'error', and 'none'.  Defaults to 'warning'.

Command Examples

You can run a shallow scan of a website for busted links:

npx linkinator https://jbeckwith.com

That was fun. What about local files? The linkinator will stand up a static web server for yinz:

npx linkinator ./docs

But that only gets the top level of links. Let's go deeper and do a full recursive scan!

npx linkinator ./docs --recurse

Aw, snap. I didn't want that to check those links. Let's skip em:

npx linkinator ./docs --skip www.googleapis.com

The --skip parameter will accept any regex! You can do more complex matching, or even tell it to only scan links with a given domain:

linkinator http://jbeckwith.com --skip '^(?!http://jbeckwith.com)'

Maybe you're going to pipe the output to another program. Use the --format option to get JSON or CSV!

linkinator ./docs --format CSV

Let's make sure the README.md in our repo doesn't have any busted links:

linkinator ./README.md --markdown

You know what, we better check all of the markdown files!

linkinator "**/*.md" --markdown

Configuration file

You can pass options directly to the linkinator CLI, or you can define a config file. By default, linkinator will look for a linkinator.config.json file in the current working directory.

All options are optional. It should look like this:

{
  "concurrency": 100,
  "config": "string",
  "recurse": true,
  "skip": "www.googleapis.com",
  "format": "json",
  "silent": true,
  "verbosity": "error",
  "timeout": 0,
  "markdown": true,
  "serverRoot": "./",
  "directoryListing": true,
  "retry": true,
  "retryErrors": true,
  "retryErrorsCount": 3,
  "retryErrorsJitter": 5,
  "urlRewriteSearch": "/pattern/",
  "urlRewriteReplace": "replacement",
}

To load config settings outside the CWD, you can pass the --config flag to the linkinator CLI:

linkinator --config /some/path/your-config.json

GitHub Actions

You can use linkinator as a GitHub Action as well, using JustinBeckwith/linkinator-action:

on:
  push:
    branches:
      - main
  pull_request:
name: ci
jobs:
  linkinator:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: JustinBeckwith/linkinator-action@v1
        with:
          paths: README.md

To see all options or to learn more, visit JustinBeckwith/linkinator-action.

API Usage

linkinator.check(options)

Asynchronous method that runs a site wide scan. Options come in the form of an object that includes:

  • path (string|string[]) - A fully qualified path to the url to be scanned, or the path(s) to the directory on disk that contains files to be scanned. required.
  • concurrency (number) - The number of connections to make simultaneously. Defaults to 100.
  • port (number) - When the path is provided as a local path on disk, the port on which to start the temporary web server. Defaults to a random high-numbered port.
  • recurse (boolean) - By default, all scans are shallow. Only the top level links on the requested page will be scanned. By setting recurse to true, the crawler will follow all links on the page, and continue scanning links on the same domain for as long as it can go. Results are cached, so no worries about loops.
  • retry (boolean|RetryConfig) - Automatically retry requests that respond with an HTTP 429, and include a retry-after header. The RetryConfig option is a placeholder for fine-grained controls to be implemented at a later time, and is only included here to signal forward-compatibility.
  • serverRoot (string) - When scanning a local directory, customize the location on disk where the server is started. Defaults to the path passed in path.
  • timeout (number) - By default, requests made by linkinator do not time out (or follow the settings of the OS). This option (in milliseconds) will fail requests after the configured amount of time.
  • markdown (boolean) - Automatically parse and scan markdown if scanning from a location on disk.
  • linksToSkip (array | function) - An array of regular expression strings that should be skipped, OR an async function that's called for each link with the link URL as its only argument. Return a Promise that resolves to true to skip the link or false to check it (see the sketch after this list).
  • directoryListing (boolean) - Automatically serve a static file listing page when serving a directory. Defaults to false.
  • urlRewriteExpressions (array) - Collection of objects that contain a search pattern and a replacement.
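
For example, here is a minimal sketch combining a few of these options; the URL and the staging-host filter are hypothetical:

const link = require('linkinator');

async function run() {
  const results = await link.check({
    path: 'https://example.com',
    recurse: true,
    timeout: 30000,
    // Skip any link that points at a hypothetical staging host.
    linksToSkip: async url => url.includes('staging.example.com'),
  });
  console.log(`Passed: ${results.passed}`);
}
run();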

linkinator.LinkChecker()

Constructor method that can be used to create a new LinkChecker instance. This is particularly useful if you want to receive events as the crawler crawls. Exposes the following events:

  • pagestart (string) - Provides the url that the crawler has just started to scan.
  • link (object) - Provides an object with
    • url (string) - The url that was scanned
    • state (string) - The result of the scan. Potential values include BROKEN, OK, or SKIPPED.
    • status (number) - The HTTP status code of the request.

Examples

Simple example

const link = require('linkinator');

async function simple() {
  const results = await link.check({
    path: 'http://example.com'
  });

  // To see if all the links passed, you can check `passed`
  console.log(`Passed: ${results.passed}`);

  // Show the list of scanned links and their results
  console.log(results);

  // Example output:
  // {
  //   passed: true,
  //   links: [
  //     {
  //       url: 'http://example.com',
  //       status: 200,
  //       state: 'OK'
  //     },
  //     {
  //       url: 'http://www.iana.org/domains/example',
  //       status: 200,
  //       state: 'OK'
  //     }
  //   ]
  // }
}
simple();

Complete example

In most cases you're going to want to respond to events, as running the check command can kinda take a long time.

const link = require('linkinator');

async function complex() {
  // create a new `LinkChecker` that we'll use to run the scan.
  const checker = new link.LinkChecker();

  // Respond to the beginning of a new page being scanned
  checker.on('pagestart', url => {
    console.log(`Scanning ${url}`);
  });

  // After a page is scanned, check out the results!
  checker.on('link', result => {

    // check the specific url that was scanned
    console.log(`  ${result.url}`);

    // How did the scan go?  Potential states are `BROKEN`, `OK`, and `SKIPPED`
    console.log(`  ${result.state}`);

    // What was the status code of the response?
    console.log(`  ${result.status}`);

    // What page linked here?
    console.log(`  ${result.parent}`);
  });

  // Go ahead and start the scan! As events occur, we will see them above.
  const result = await checker.check({
    path: 'http://example.com',
    // port: 8673,
    // recurse: true,
    // linksToSkip: [
    //   'https://jbeckwith.com/some/link',
    //   'http://example.com'
    // ]
  });

  // Check to see if the scan passed!
  console.log(result.passed ? 'PASSED :D' : 'FAILED :(');

  // How many links did we scan?
  console.log(`Scanned total of ${result.links.length} links!`);

  // The final result will contain the list of checked links, and the pass/fail
  const brokenLinks = result.links.filter(x => x.state === 'BROKEN');
  console.log(`Detected ${brokenLinks.length} broken links.`);
}

complex();

Tips & Tricks

Using a proxy

This library supports proxies via the HTTP_PROXY and HTTPS_PROXY environment variables. This guide provides a nice overview of how to format and set these variables.
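
For example, to send all traffic through a proxy for a single run (the proxy address is hypothetical):

HTTPS_PROXY=http://proxy.example.com:3128 linkinator https://jbeckwith.com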

Globbing

You may have noticed in the example, when using a glob the pattern is encapsulated in quotes:

linkinator "**/*.md" --markdown

Without the quotes, some shells will attempt to expand the glob paths on their own. Various shells (bash, zsh) have different, somewhat unpredictable behaviors when left to their own devices. Using the quotes ensures consistent, predictable behavior by letting the library expand the pattern.

Debugging

Oftentimes when a link fails, it's an easy-to-spot typo, or a clear 404. Other times ... you may need more details on exactly what went wrong. To see a full call stack for the HTTP request failure, use --verbosity DEBUG:

linkinator https://jbeckwith.com --verbosity DEBUG

Controlling Output

The --verbosity flag offers preset options for controlling the output, but you may want more control. Using jq and --format JSON, you can do just that!

linkinator https://jbeckwith.com --verbosity DEBUG --format JSON | jq '.links | .[] | select(.state | contains("BROKEN"))'

License

MIT

linkinator's People

Contributors

0xflotus, alexander-fenster, anshulsahni, bcoe, callmehiphop, chalin, davidhauck, dependabot[bot], gauntface, grumbaut, htmlhero, jakejarvis, johann-s, justinbeckwith, khendrikse, marapper, renovate[bot], rrthomas, serundeputy, trott, xhmikosr, xiaozhenliu-gg5, zeke


linkinator's Issues

Output is shown twice

In process of addressing #81, the output for the scan is now shown twice:

  1. In real time, as the scan is happening
  2. At the end, in an aggregated view grouped by parent

Due to the async/concurrent nature of the scan, there's no way to do the parent-based grouping in real time, and instead it needs to be buffered. As a result, the end of the scan shows the results twice: once linearly, and once grouped by parent. This confused a few folks, but I'm sort of at a loss on how to do it better. Parking this issue here for discussion :)

Show redirected links

Today gaxios quietly follows redirects. Folks may want to see this as a warning, so we should show that info and give an option to make that show up as an error.

Support link checking of Markdown files

linkinator currently 404s and exits on a fresh clone of one of my repos.

Steps:

git clone git@github.com:bnb/jsconf-eu-led.git # will clone into jsconf-eu-led
npx linkinator ./jsconf-eu-led # should run against jsconf-eu-led

[screenshot]

While testing this, it seems that I discovered the localhost URL's port is dynamic rather than static. I'm... not sure why that would happen, considering the code inside the directory is static and not changing between runs:

[screenshot]

404 on Local Host URL When Checking Local HTML Files

I'd love to use linkinator as part of our development pipeline. We build to HTML files that are used on a local development server. I'm trying to include linkinator as part of our gulp tasks, but I sometimes get this as the only result:

[screenshot]

Here are the relevant portions of code:

// (Assumed setup, not shown in the original report.)
const path = require('path');
const { LinkChecker } = require('linkinator');

const linkChecker = new LinkChecker();
let currentFile;
const sanitizeUrl = (url) => url; // placeholder for the reporter's helper

linkChecker.on('pagestart', (fileUrl) => {
  currentFile = sanitizeUrl(fileUrl);
});

linkChecker.on('link', (result) => {
  console.log(result);
});

const task = () => linkChecker.check({
  path: path.resolve(this.epubDir), // this.epubDir is pointing to a local directory
  recurse: true,
});

should linkinator parse non-html files?

I imagine for the most part these would just yield 0 links, since we search for links via cheerio's query selector. However, it would appear we've stumbled onto an edge case where it finds links in JavaScript files with HTML stored as a string. This gets funky because link attributes end up getting double escaped (href=\"hi\" -> href=\\"hi\\"), at which point I believe the HTML parser assumes the quotes were simply omitted.

I'm thinking we probably want to do one of the following

  • Ignore non-HTML files
  • Create an additional CLI flag to blacklist files or file types

WDYT?

Crawl single page applications

👋 Does linkinator support crawling single page applications? I tried it locally using npx, but that didn't seem to work. Just wanted to double check. Thanks!

Feature request

Hello!

I'm looking for a good replacement of https://www.npmjs.com/package/broken-link-checker to use in Bootstrap and came across your module.

It does look pretty simple and does quite a good job. I was wondering if you could add a few options; I haven't checked whether the lib already exposes them, so it may just be a matter of exposing them to the CLI.

  1. An option to skip external links. In our case this speeds things up a lot
  2. An option to control concurrency
  3. Maybe print the total time spent and/or total links that were checked even when an error occurs?
  4. It seems the tool follows redirects. Can you show it in the output too?

Also, a question: what's the depth when --recurse is used?

Thanks in advance!

Relative files issue

BTW, it seems that when a link is relative, it is wrongly detected as broken.

For example:

Scanning http://localhost:5180/docs/4.3/examples/navbar-static/
 [200] https://getbootstrap.com/docs/4.3/examples/navbar-static/
 [404] http://localhost:5180/navbar-top.css
 [200] http://localhost:5180/docs/4.3/examples/navbar-fixed/

navbar-top.css is linked in http://localhost:5180/docs/4.3/examples/navbar-static/ with <link href="navbar-top.css" rel="stylesheet">.

At first I thought it was a Windows specific issue but it seems it fails on Travis CI too: https://travis-ci.org/twbs/bootstrap/jobs/494265231#L1399
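
For context, a relative href like this should resolve against the URL of the page it appears on, which is exactly what the WHATWG URL constructor in Node.js does:

new URL('navbar-top.css', 'http://localhost:5180/docs/4.3/examples/navbar-static/').href
// → 'http://localhost:5180/docs/4.3/examples/navbar-static/navbar-top.css'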

Request: Output to include Display Text

Cool Library - thanks.

Would be great if the output also included the display text of the link, e.g.:

{
  "links": [
    {
      "url": "http://www.iana.org/domains/example",
      "displaytext": "More information...",
      "status": 200,
      "state": "OK"
    }
  ],
  "passed": true
}

For my use cases, it's end users rather than developers who will be reviewing the results first, and they need the display text to review the page and even determine whether the link is still required, before passing it to a dev to fix if needed.

cheers

add support for an optional link filtering function

Hi @JustinBeckwith and maintainers. This is a nice module! Thanks for writing it.

In the current API doc:

linksToSkip (array) - An array of regular expression strings that should be skipped during the scan.

This works, but it can be cumbersome to write exclusion rules just using regular expressions. It would be nifty if this option (or a new option) could be passed a function instead of an array. The function would take the given link URL as input and return a Boolean indicating whether to skip it or not:

filterLinks: (url) => {
  return true // or false
}

I'd be willing to open a PR to add support for this. Would you be open to this change?

Consider using semantic-pull-requests as an alternative to commitlint bot

When I opened my first PR on this repo a few days ago, I was immediately met with this message from the @commitlint bot:

[screenshot of the commitlint bot comment]

That message is friendly enough, but it means I have to revise my commit history in order for my PR to pass all checks. I know how to revise my git commit history, but others may not.

I maintain a GitHub app called Semantic Pull Requests that serves a similar purpose to commitlint, but takes a different approach:

The default behavior of this bot is not to police all commit messages, but rather to ensure that every PR has just enough semantic information to be able to trigger a release when appropriate. The goal is to gather this semantic information in a way that doesn't make life harder for project contributors, especially newcomers who may not know how to amend their git commit history.

I noticed when my PR landed that this project is using semantic-release. That's awesome! The Semantic Pull Requests bot was designed with semantic-release in mind.

Advantages over commitlint:

  • git history doesn't need to be revised
  • a single semantic commit (or semantic pull request title if squashed) is adequate semantic info to trigger a semantic release
  • no lingering PR comments about commit message infractions

Note: I'm biased. This is an endorsement for my pet project, so take my suggestion with a grain of salt. If @commitlint is working well as-is, by all means keep using it!

Consider exposing request failure error details

Configuration

  • Version — 1.5.0

Steps

  1. Run linkinator against a URL which will produce some kind of request error — e.g., a domain with an SSL certificate issue:
npx linkinator https://example.com.s3.amazonaws.com

(Alternatively, run linkinator against a page which links to a URL that will produce the error.)

Current Behavior

This is currently reported with error status 0, and no additional details:

$ npx linkinator https://example.com.s3.amazonaws.com
๐ŸŠโ€โ™‚๏ธ crawling https://example.com.s3.amazonaws.com
  [0] https://example.com.s3.amazonaws.com


ERROR: Detected 1 broken links. Scanned 1 links in 0.294 seconds.

Desired Behavior

It would be helpful, especially in cases where the request error uncovers a previously unknown issue at the URL in question, if any additional information about the request failure could be included in the summary.

(It could be outside the scope of the tool to report information beyond HTTP status code, which I would certainly understand.)

Thanks!

Severity

  • Enhancement

SRI hashes

I was wondering if this makes sense here. If there's an SRI hash provided, check if it matches like a normal browser does.

Maybe it's out of this tool's scope, just thinking out loud here. :)

Empty href isn't detected as broken

I noticed this by accident. We had (still have at the time of writing this) a link on our docs that does nothing:

<a href="" class="text-muted">Previous releases</a> 

Wouldn't it make sense to catch this case?

Thanks!

Use `HEAD` instead of `GET`

Initially we were using HEAD requests to check for status, but some sites were returning 405 errors that would require a retry with a GET. I just smacked it to use GET all the time, but this should be fixed.
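
A minimal sketch of the intended fix, using the gaxios request API mentioned elsewhere in this document (an illustration, not the project's actual implementation):

const { request } = require('gaxios');

// Try a cheap HEAD first; only fall back to a full GET when the
// server rejects HEAD outright (405 Method Not Allowed).
async function checkUrl(url) {
  const res = await request({ url, method: 'HEAD', validateStatus: () => true });
  if (res.status !== 405) {
    return res.status;
  }
  const fallback = await request({ url, method: 'GET', validateStatus: () => true });
  return fallback.status;
}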

The automated release is failing 🚨

🚨 The automated release from the master branch failed. 🚨

I recommend you give this issue a high priority, so other packages depending on you could benefit from your bug fixes and new features.

You can find below the list of errors reported by semantic-release. Each one of them has to be resolved in order to automatically publish your package. I'm sure you can resolve this 💪.

Errors are usually caused by a misconfiguration or an authentication problem. With each error reported below you will find explanation and guidance to help you to resolve it.

Once all the errors are resolved, semantic-release will release your package the next time you push a commit to the master branch. You can also manually restart the failed CI job that runs semantic-release.

If you are not sure how to resolve this, here are some links that can help you:

If those don't help, or if this issue is reporting something you think isn't right, you can always ask the humans behind semantic-release.


Invalid npm token.

The npm token configured in the NPM_TOKEN environment variable must be a valid token allowing to publish to the registry https://registry.npmjs.org/.

If you are using Two-Factor Authentication, make sure the auth-only level is supported. semantic-release cannot publish with the default auth-and-writes level.

Please make sure to set the NPM_TOKEN environment variable in your CI with the exact value of the npm token.


Good luck with your project ✨

Your semantic-release bot 📦🚀

Parallelize http requests

Today all HTTP requests are executed serially. Instead, it could speed things up if we executed requests in a concurrent queue.

Original request: #32
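
A minimal sketch of one way to do this without extra dependencies, using a fixed-size worker pool (the fetchUrl helper and the pool size are hypothetical):

// Run fetchUrl over all urls with at most `limit` requests in flight.
async function runConcurrent(urls, limit, fetchUrl) {
  const results = [];
  let next = 0;
  const workers = Array.from({ length: limit }, async () => {
    // Each worker claims the next unprocessed URL until none remain.
    while (next < urls.length) {
      const i = next++;
      results[i] = await fetchUrl(urls[i]);
    }
  });
  await Promise.all(workers);
  return results;
}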

Add support for XHTML files?

We're hoping to use linkinator as part of our ebook development pipeline to check for dead links in XHTML files. When we run linkinator in our build directory, though, we don't get any results.

Is there any interest in, or work being done on, supporting XHTML files?

Thanks!

False positives with base64 images

While scanning the live site of https://getbootstrap.com/, I get a few false positives for base64 encoded images:

2019-09-24T09:56:01.4549830Z > [email protected] docs-linkinator-prod /home/runner/work/bootstrap/bootstrap
2019-09-24T09:56:01.4551012Z > linkinator https://getbootstrap.com --recurse --silent --skip "^(https://developer.microsoft.com)|(getbootstrap.com.br)"
2019-09-24T09:56:01.4551636Z
2019-09-24T10:01:33.6629369Z   [0] data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAACAAAAAgCAYAAABzenr0AAAAGXRFWHRTb2Z0d2FyZQBBZG9iZSBJbWFnZVJlYWR5ccllPAAAAyRpVFh0WE1MOmNvbS5hZG9iZS54bXAAAAAAADw/eHBhY2tldCBiZWdpbj0i77u/IiBpZD0iVzVNME1wQ2VoaUh6cmVTek5UY3prYzlkIj8+IDx4OnhtcG1ldGEgeG1sbnM6eD0iYWRvYmU6bnM6bWV0YS8iIHg6eG1wdGs9IkFkb2JlIFhNUCBDb3JlIDUuMy1jMDExIDY2LjE0NTY2MSwgMjAxMi8wMi8wNi0xNDo1NjoyNyAgICAgICAgIj4gPHJkZjpSREYgeG1sbnM6cmRmPSJodHRwOi8vd3d3LnczLm9yZy8xOTk5LzAyLzIyLXJkZi1zeW50YXgtbnMjIj4gPHJkZjpEZXNjcmlwdGlvbiByZGY6YWJvdXQ9IiIgeG1sbnM6eG1wPSJodHRwOi8vbnMuYWRvYmUuY29tL3hhcC8xLjAvIiB4bWxuczp4bXBNTT0iaHR0cDovL25zLmFkb2JlLmNvbS94YXAvMS4wL21tLyIgeG1sbnM6c3RSZWY9Imh0dHA6Ly9ucy5hZG9iZS5jb20veGFwLzEuMC9zVHlwZS9SZXNvdXJjZVJlZiMiIHhtcDpDcmVhdG9yVG9vbD0iQWRvYmUgUGhvdG9zaG9wIENTNiAoTWFjaW50b3NoKSIgeG1wTU06SW5zdGFuY2VJRD0ieG1wLmlpZDpFMTZCRDY3REIzRjAxMUUyQUQzREIxQzRENUFFNUM5NiIgeG1wTU06RG9jdW1lbnRJRD0ieG1wLmRpZDpFMTZCRDY3RUIzRjAxMUUyQUQzREIxQzRENUFFNUM5NiI+IDx4bXBNTTpEZXJpdmVkRnJvbSBzdFJlZjppbnN0YW5jZUlEPSJ4bXAuaWlkOkUxNkJENjdCQjNGMDExRTJBRDNEQjFDNEQ1QUU1Qzk2IiBzdFJlZjpkb2N1bWVudElEPSJ4bXAuZGlkOkUxNkJENjdDQjNGMDExRTJBRDNEQjFDNEQ1QUU1Qzk2Ii8+IDwvcmRmOkRlc2NyaXB0aW9uPiA8L3JkZjpSREY+IDwveDp4bXBtZXRhPiA8P3hwYWNrZXQgZW5kPSJyIj8+SM9MCAAAA+5JREFUeNrEV11Ik1EY3s4+ddOp29Q5b0opCgKFsoKoi5Kg6CIhuwi6zLJLoYLopq4qsKKgi4i6CYIoU/q5iDAKs6syoS76IRWtyJ+p7cdt7sf1PGOD+e0c3dygAx/67ZzzPM95/877GYdHRg3ZjMXFxepQKNS6sLCwJxqNNuFpiMfjVs4ZjUa/pmmjeD6VlJS8NpvNT4QQ7mxwjSsJiEQim/1+/9lgMHgIr5ohuxG1WCw9Vqv1clFR0dCqBODElV6v90ogEDjGdYbVjXhpaendioqK07CIR7ZAqE49PT09BPL2PMgTByQGsYiZlQD4uMXtdr+JxWINhgINYhGT2MsKgMrm2dnZXgRXhaHAg5jEJodUAHxux4LudHJE9RdEdA+i3Juz7bGHe4mhE9FNrgwBCLirMFV9Okh5eflFh8PR5nK5nDabrR2BNJlKO0T35+Li4n4+/J+/JQCxhmu5h3uJoXNHPbmWZAHMshWB8l5/ipqammaAf0zPDDx1ONV3vurdidqwAQL+pEc8sLcAe1CCvQ3YHxIW8Pl85xSWNC1hADDIv0rIE/o4J0k3kww4xSlwIhcq3EFFOm7KN/hUGOQkt0CFa5WpNJlMvxBEz/IVQAxg/ZRZl9wiHA63yDYieM7DnLP5CiAGsC7I5sgtYKJGWe2A8seFqgFJrJjEPY1Cn3pJ8/9W1e5VWsFDTEmFrBcoDhZJEQkXuhICMyKpjhahqN21hRYATKfUOlDmkygrR4o4C0VOLGJKrOITKB4jijzdXygBKixyC5TDQdnk/Pz8qRw6oOWGlsTKGOQW6OH6FBWsyePxdOXLTgxiyebILZCjz+GLgMIKnXNzc49YMlcRdHXcSwxFVgTInQhC9G33UhNoJLuqq6t345p9y3eUy8OTk5PjAHuI9uo4b07FBaOhsu0A4Unc+T1TU1Nj3KsSSE5yJ65jqF2DDd8QqWYmAZrIM2VlZTdnZmb6AbpdV9V6ec9znf5Q7HjYumdRE0JOp3MjitO4SFa+cZz8Umqe3TCbSLvdfkR/kWDdNQl5InuTcysOcpFT35ZrbBxx4p3JAHlZVVW1D/634VRt+FvLBgK/v5LV9WS+10xMTEwtRw7XvqOL+e2Q8V3AYIOIAXQ26/heWVnZCVfcyKHg2CBgTpmPmjYM8l24GyaUHyaIh7XwfR9ErE8qHoDfn2LTNAVC0HX6MFcBIP8Bi+6F6cdW/DICkANRfx99fEYFQ7Nph5i/uQiA214gno7K+guhaiKg9gC62+M8eR7XsBsYJ4ilam60Fb7r7uAj8wFyuwM1oIOWgfmDy6RXEEQzJMPe23DXrVS7rtyD3Df8z/FPgAEAzWU5Ku59ZAUAAAAASUVORK5CYII=
2019-09-24T10:01:33.6648275Z   [0] data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAEAAAABACAYAAACqaXHeAAAAGXRFWHRTb2Z0d2FyZQBBZG9iZSBJbWFnZVJlYWR5ccllPAAAAyRpVFh0WE1MOmNvbS5hZG9iZS54bXAAAAAAADw/eHBhY2tldCBiZWdpbj0i77u/IiBpZD0iVzVNME1wQ2VoaUh6cmVTek5UY3prYzlkIj8+IDx4OnhtcG1ldGEgeG1sbnM6eD0iYWRvYmU6bnM6bWV0YS8iIHg6eG1wdGs9IkFkb2JlIFhNUCBDb3JlIDUuMy1jMDExIDY2LjE0NTY2MSwgMjAxMi8wMi8wNi0xNDo1NjoyNyAgICAgICAgIj4gPHJkZjpSREYgeG1sbnM6cmRmPSJodHRwOi8vd3d3LnczLm9yZy8xOTk5LzAyLzIyLXJkZi1zeW50YXgtbnMjIj4gPHJkZjpEZXNjcmlwdGlvbiByZGY6YWJvdXQ9IiIgeG1sbnM6eG1wPSJodHRwOi8vbnMuYWRvYmUuY29tL3hhcC8xLjAvIiB4bWxuczp4bXBNTT0iaHR0cDovL25zLmFkb2JlLmNvbS94YXAvMS4wL21tLyIgeG1sbnM6c3RSZWY9Imh0dHA6Ly9ucy5hZG9iZS5jb20veGFwLzEuMC9zVHlwZS9SZXNvdXJjZVJlZiMiIHhtcDpDcmVhdG9yVG9vbD0iQWRvYmUgUGhvdG9zaG9wIENTNiAoTWFjaW50b3NoKSIgeG1wTU06SW5zdGFuY2VJRD0ieG1wLmlpZDpEQUM1QkUxRUI0MUMxMUUyQUQzREIxQzRENUFFNUM5NiIgeG1wTU06RG9jdW1lbnRJRD0ieG1wLmRpZDpEQUM1QkUxRkI0MUMxMUUyQUQzREIxQzRENUFFNUM5NiI+IDx4bXBNTTpEZXJpdmVkRnJvbSBzdFJlZjppbnN0YW5jZUlEPSJ4bXAuaWlkOkUxNkJENjdGQjNGMDExRTJBRDNEQjFDNEQ1QUU1Qzk2IiBzdFJlZjpkb2N1bWVudElEPSJ4bXAuZGlkOkUxNkJENjgwQjNGMDExRTJBRDNEQjFDNEQ1QUU1Qzk2Ii8+IDwvcmRmOkRlc2NyaXB0aW9uPiA8L3JkZjpSREY+IDwveDp4bXBtZXRhPiA8P3hwYWNrZXQgZW5kPSJyIj8+hfPRaQAAB6lJREFUeNrsW2mME2UYbodtt+2222u35QheoCCYGBQligIJgkZJNPzgigoaTEj8AdFEMfADfyABkgWiiWcieK4S+QOiHAYUj2hMNKgYlEujpNttu9vttbvdw+chU1K6M535pt3ubHCSyezR+b73eb73+t7vrfXsufOW4bz6+vom9/b23ovnNNw34b5xYGAgODg46Mbt4mesVmsWd1qSpHhdXd2fuP/Afcput5/A88xwymcdBgLqenp6FuRyuWV4zu/v759QyWBjxoz5t76+/gun09mK5xFyakoCAPSaTCazNpvNPoYVbh6O1YKGRF0u13sNDQ27QMzfpiAAKj0lnU6/gBVfAZW2WWpwwVzy0IgP3G73FpjI6REhAGA9qVRqA1b9mVoBVyIC2tDi8Xg24+dUzQiAbS/s7Ox8G2o/3mKCC+Zw0efzPQEfcVjYrARX3dbV1bUtHo8fMgt42f+Mp0yUTVQbdWsAHVsikdiHkHaPxcQXQufXgUBgMRxme9U0AAxfH4vFvjM7eF6UkbJS5qoQwEQGA57Ac5JllFyUVZZ5ckUEgMVxsK2jlSYzI+QXJsiyjzNEAJyJAzb/KQa41jJKL8pODMQiTEAymXw5n8/P0IjD3bh7Rgog59aanxiIRTVvV/oj0tnHca/WMrVwODwB3raTGxzkBg/gnZVapFV62Wy2n5AO70HM/5wbJ0QnXyQSaVPDIuNZzY0V3ntHMwxiwHA0Gj2Np7ecIBDgaDAYXKCQJM1DhrgJ3nhulcPbl8j4NmHe46X/g60fwbz3aewjkqFQaAqebWU1AOqyQwt8Id6qEHMc97zu7u7FGGsn7HAiVuosVw7P35C1nccdgSCxop1dHeZswmfHMnxBo6ZTk+jN8dl/vF7vWofDsa+MLN9oEUBMxOb3+1eoEsBVw6Zmua49r8YmhAKDiEPcMwBsxMiqQ+ixzPFxZyqRpXARG/YOr1ObFJ0gUskXBbamcR1OKmMUvDxHRAu8/LmY3jFLMUpFqz9HxG65smYJdyKyECOxDiEAe/p1gjF2oonivZAsxVgl2daa4EQWCW6J55qFAFFZiJWYLxNQy2qOSUzGRsyXCUDIeliwAHEO4WSlWQBRFoZakXcKmCXmyXAKs0Ve9vl8q42WoIYpJU4hV3hKcNs8m9gl7p/xQ73eF5kB4j5mNrWmTJRNwAzqiV1CxjVTZCIkEq+Z1bZFZSN2CenmVAFVy4Plz8xKAGWjjAKFk6lCBMDR/MJjLLMSQNm43xAiQKTaA+9/wewhDjL+JVI1kkTSSOTcKbMTwPqESAot6dn6Fr1gHwVJju6IRuyiByPuUUBAg5DGkAgBmxlvdgIEK9gDkohdY/BJo4CAG0R8miRSsGABkgVQs4KXu098IgUXSSRsFAoKZiVAVDY2WUiiPTjYRi41KwGisrGsLtlsth8Fiwnz2fBkQvWfRtlE3iF2yW63/yCacXZ1dW02GwGyTFaRd4idJnCKHRaCxYRHoG5LTKT6SyiToP1fJHbmAYPYRR0UnZQtMnA6s0zg+GZBlt0Gdo7EPHgpE3Q6nZ8YyLhc8Xj8MJh/aKTAY+5FPAKHLE7RdwuYJZmNwzyCMkBCYyKROJBMJl9B/PXXCjjmCmDOVzH3fiPpObEWGqoKe4EBl8v1hlqsdLvd23mkxHM9pc9kMpmno9HoeTii7ewbHEZPPx1ztLS1tV3AnGuMjiNjvbQFuHw6zDo5By7dTPAQNBgMLrRarTkSls1mnwT7uwp9virx9QzbW/HuV/j5d/b+6jniKlllP8lkeONJDk+dq9GsQTnC4fB1heO0K47Hwe7WdDr9nAKgXwOBwHI+C45Htj1d6sd429TUNEcmUdc+PRaLHcvn87dXW4ugzdsaGxufL94NFv9zi1J7GVbhlvb2dnaJ3SVrxfc+n2+NTsZ7/H7/Mr3g5XdSIHyJSH1PZ+7fToyl2+ErqilgZ4NaLYB9goVGaHjR93Hv1ZrU4XDsFT20kH3PObzbWk0CgG1jacVIUnAQb9F+VexyLMzkpcLv0IJV7AHQIOCAUYHx7v5qgScmYHtTqSAyZLEJTK22Bie4iq3xsqpm4SAf9Hq9a2DnJ4uLK3SEULcdRvp3i3zHySqpficxEdsQc1NrlYXXvR+O7qASSezXB+h1SuUomgg9LL8BUoV4749EIolKh+EiqWmqVEZlDgHks2pxHw7xTqUQw9J5NcAXOK10AGIoZ6Zli6JY6Z1Q461KoZ4NiKLHarW+KDsxlDUPHZ5zPQZqUVDPJsTqb5n9malbpAh8C2XXDLl62+WZIDFRUlNVOiwencnNU3aQEk
L+cDMSoLvZo2fQB7AJssNAuFuvorlDVVkkg2I87+jo2K2QAVphDrfyViK5VqtO34OkaxXCp+7drdDBCAdubm6eidX+2WwqT5komwh4YQLk+H4aE93h8Xg2gvHekQZOGSgLZTLyDTLJ4Lx9/KZWKBSainT4Iy3FqQBfnUZR42PKQFksBr9QKVXCPusD3OiA/RkQ5kP8qV/Jl1WywAp/6+dcmPM2zL1UrUahe4JqfnWWKXIul3uUbfP8njAFLW1OFr3gdFtZ72cNH+PtQT7/brW+NXqJAHh0y9V8/U/A1U7AfwIMAD7mS3pCbuWJAAAAAElFTkSuQmCC
2019-09-24T10:02:22.5537788Z   [0] data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAACgAAAAoCAYAAACM/rhtAAAB+0lEQVR4AcyYg5LkUBhG+1X2PdZGaW3btm3btm3bHttWrPomd1r/2Jn/VJ02TpxcH4CQ/dsuazWgzbIdrm9dZVd4pBz4zx2igTaFHrhvjneVXNHCSqIlFEjiwMyyyOBilRgGSqLNF1jnwNQdIvAt48C3IlBmHCiLQHC2zoHDu6zG1iXn6+y62ScxY9AODO6w0pvAqf23oSE4joOfH6OxfMoRnoGUm+de8wykbFt6wZtA07QwtNOqKh3ZbS3Wzz2F+1c/QJY0UCJ/J3kXWJfv7VhxCRRV1jGw7XI+gcO7rEFFRvdYxydwcPsVsC0bQdKScngt4iUTD4Fy/8p7PoHzRu1DclwmgmiqgUXjD3oTKHbAt869qdJ7l98jNTEblPTkXMwetpvnftA0LLHb4X8kiY9Kx6Q+W7wJtG0HR7fdrtYz+x7iya0vkEtUULIzCjC21wY+W/GYXusRH5kGytWTLxgEEhePPwhKYb7EK3BQuxWwTBuUkd3X8goUn6fMHLyTT+DCsQdAEXNzSMeVPAJHdF2DmH8poCREp3uwm7HsGq9J9q69iuunX6EgrwQVObjpBt8z6rdPfvE8kiiyhsvHnomrQx6BxYUyYiNS8f75H1w4/ISepDZLoDhNJ9cdNUquhRsv+6EP9oNH7Iff2A9g8h8CLt1gH0Qf9NMQAFnO60BJFQe0AAAAAElFTkSuQmCC
2019-09-24T10:02:26.6707210Z   [0] data:image/gif;base64,R0lGODlhAQABAIAAAHd3dwAAACH5BAAAAAAALAAAAAABAAEAAAICRAEAOw==

Skip `irc` links

Currently this is flagged as broken

2019-09-24T10:02:27.6445438Z   [0] irc://irc.freenode.net/%23bootstrap
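
Until that lands, the existing --skip flag accepts a regex, so a scheme-based pattern works around it:

linkinator https://getbootstrap.com --recurse --skip "^irc://"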

Scanning order with recurse

It seems the order of links is quite different when recurse is used.

It's like when a link is discovered in crawl mode, it jumps to that page immediately and then comes back to the original file that referenced the link.

Wouldn't it make more sense to finish each page and then for each link that is discovered, follow it?

Just thinking out loud here, trying to see why I get such different results with and without recurse :)

Example of the current behavior:

> [email protected] linkinator C:\Users\xmr\Desktop\bootstrap
> linkinator _gh_pages/ --skip "^(?!http://localhost)"

๐ŸŠโ€โ™‚๏ธ crawling _gh_pages/
  [200] http://localhost:5773

 Scanning http://localhost:5773
  [SKP] https://getbootstrap.com/
  [404] http://localhost:5773/docs/4.3/dist/css/bootstrap.css
  [200] http://localhost:5773/docs/4.3/assets/css/docs.min.css
  [200] http://localhost:5773/docs/4.3/assets/img/favicons/apple-touch-icon.png
  [200] http://localhost:5773/docs/4.3/assets/img/favicons/favicon-32x32.png
  [200] http://localhost:5773/docs/4.3/assets/img/favicons/favicon-16x16.png
  [200] http://localhost:5773/docs/4.3/assets/img/favicons/manifest.json
  [200] http://localhost:5773/docs/4.3/assets/img/favicons/safari-pinned-tab.svg
  [200] http://localhost:5773/favicon.ico
  [200] http://localhost:5773/
  [200] http://localhost:5773/docs/4.3/getting-started/introduction/
  [200] http://localhost:5773/docs/4.3/examples/
  [SKP] https://themes.getbootstrap.com/
  [SKP] https://expo.getbootstrap.com/
  [SKP] https://blog.getbootstrap.com/
  [200] http://localhost:5773/docs/4.3/
  [SKP] https://getbootstrap.com/docs/4.2/
  [SKP] https://getbootstrap.com/docs/4.0/
  [SKP] https://v4-alpha.getbootstrap.com/
  [SKP] https://getbootstrap.com/docs/3.4/
  [SKP] https://getbootstrap.com/docs/3.3/
  [SKP] https://getbootstrap.com/2.3.2/
  [200] http://localhost:5773/docs/versions/
  [SKP] https://github.com/twbs/bootstrap
  [SKP] https://twitter.com/getbootstrap
  [SKP] https://bootstrap-slack.herokuapp.com/
  [SKP] https://opencollective.com/bootstrap/
  [200] http://localhost:5773/docs/4.3/getting-started/download/
  [SKP] https://www.bootstrapcdn.com/
  [200] http://localhost:5773/docs/4.3/layout/overview/
  [200] http://localhost:5773/docs/4.3/about/overview/
  [200] http://localhost:5773/docs/4.3/about/team/
  [SKP] https://github.com/twbs/bootstrap/graphs/contributors
  [SKP] https://github.com/twbs/bootstrap/blob/master/LICENSE
  [SKP] https://creativecommons.org/licenses/by/3.0/
  [SKP] https://www.google-analytics.com/analytics.js
  [SKP] https://cdn.carbonads.com/carbon.js?serve=CKYIKKJL&placement=getbootstrapcom
  [200] http://localhost:5773/docs/4.3/assets/img/bootstrap-themes.png
  [SKP] https://code.jquery.com/jquery-3.3.1.slim.min.js
  [404] http://localhost:5773/docs/4.3/dist/js/bootstrap.bundle.js
  [200] http://localhost:5773/docs/4.3/assets/js/vendor/anchor.min.js
  [200] http://localhost:5773/docs/4.3/assets/js/vendor/clipboard.min.js
  [200] http://localhost:5773/docs/4.3/assets/js/vendor/bs-custom-file-input.min.js
  [200] http://localhost:5773/docs/4.3/assets/js/src/application.js
  [200] http://localhost:5773/docs/4.3/assets/js/src/search.js
  [200] http://localhost:5773/docs/4.3/assets/js/src/ie-emulation-modes-warning.js

ERROR: Detected 2 broken links. Scanned 47 links in 0.167 seconds.
npm ERR! code ELIFECYCLE
npm ERR! errno 1
npm ERR! [email protected] linkinator: `linkinator _gh_pages/ --skip "^(?!http://localhost)"`
npm ERR! Exit status 1
npm ERR!
npm ERR! Failed at the [email protected] linkinator script.
npm ERR! This is probably not a problem with npm. There is likely additional logging output above.

npm ERR! A complete log of this run can be found in:
npm ERR!     C:\Users\xmr\AppData\Roaming\npm-cache\_logs\2019-02-18T08_58_54_326Z-debug.log

C:\Users\xmr\Desktop\bootstrap>npm run linkinator

> [email protected] linkinator C:\Users\xmr\Desktop\bootstrap
> linkinator _gh_pages --recurse --skip "^(?!http://localhost)"

๐ŸŠโ€โ™‚๏ธ crawling _gh_pages
  [200] http://localhost:5163

 Scanning http://localhost:5163
  [SKP] https://getbootstrap.com/
  [404] http://localhost:5163/docs/4.3/dist/css/bootstrap.css
  [200] http://localhost:5163/docs/4.3/assets/css/docs.min.css
  [200] http://localhost:5163/docs/4.3/assets/img/favicons/apple-touch-icon.png
  [200] http://localhost:5163/docs/4.3/assets/img/favicons/favicon-32x32.png
  [200] http://localhost:5163/docs/4.3/assets/img/favicons/favicon-16x16.png
  [200] http://localhost:5163/docs/4.3/assets/img/favicons/manifest.json
  [200] http://localhost:5163/docs/4.3/assets/img/favicons/safari-pinned-tab.svg
  [200] http://localhost:5163/favicon.ico
  [200] http://localhost:5163/

 Scanning http://localhost:5163/
  [200] http://localhost:5163/docs/4.3/getting-started/introduction/
...

Add Windows CI

It seems that currently on Windows, there are 11 failing tests.

There should be tests on Windows.

> [email protected] pretest C:\Users\xmr\Desktop\linkinator
> npm run compile


> [email protected] compile C:\Users\xmr\Desktop\linkinator
> tsc -p .


> [email protected] test C:\Users\xmr\Desktop\linkinator
> nyc mocha build/test



  linkinator
    1) should perform a basic shallow scan
    2) should only try a link once
    3) should skip links if asked nicely
    4) should report broken links
    5) should handle relative links
    ✓ should handle fetch exceptions
    6) should skip mailto: links
    ✓ should report malformed links as broken
    7) should detect broken image links
    8) should perform a recursive scan
    9) should not recurse non-html files
    10) should not recurse by default
    11) should retry with a GET after a HEAD


  2 passing (126ms)
  11 failing

  1) linkinator
       should perform a basic shallow scan:

      AssertionError [ERR_ASSERTION]: The expression evaluated to a falsy value:

  assert.ok(results.passed)

      + expected - actual

      -false
      +true

      at Context.it (test\test.ts:21:12)
      at process._tickCallback (internal/process/next_tick.js:68:7)

  2) linkinator
       should only try a link once:

      AssertionError [ERR_ASSERTION]: The expression evaluated to a falsy value:

  assert.ok(results.passed)

      + expected - actual

      -false
      +true

      at Context.it (test\test.ts:30:12)
      at process._tickCallback (internal/process/next_tick.js:68:7)

  3) linkinator
       should skip links if asked nicely:

      AssertionError [ERR_ASSERTION]: The expression evaluated to a falsy value:

  assert.ok(results.passed)

      + expected - actual

      -false
      +true

      at Context.it (test\test.ts:40:12)
      at process._tickCallback (internal/process/next_tick.js:68:7)

  4) linkinator
       should report broken links:

      AssertionError [ERR_ASSERTION]: Mocks not yet satisfied:
HEAD http://fake.local:80/
      + expected - actual

      -false
      +true

      at Scope.done (node_modules\nock\lib\scope.js:155:10)
      at Context.it (test\test.ts:57:11)
      at process._tickCallback (internal/process/next_tick.js:68:7)

  5) linkinator
       should handle relative links:

      AssertionError [ERR_ASSERTION]: The expression evaluated to a falsy value:

  assert.ok(results.passed)

      + expected - actual

      -false
      +true

      at Context.it (test\test.ts:65:12)
      at process._tickCallback (internal/process/next_tick.js:68:7)

  6) linkinator
       should skip mailto: links:

      AssertionError [ERR_ASSERTION]: The expression evaluated to a falsy value:

  assert.ok(results.passed)

      + expected - actual

      -false
      +true

      at Context.it (test\test.ts:83:12)
      at process._tickCallback (internal/process/next_tick.js:68:7)

  7) linkinator
       should detect broken image links:

      AssertionError [ERR_ASSERTION]: Input A expected to strictly equal input B:
+ expected - actual

- 1
+ 2
      + expected - actual

      -1
      +2

      at Context.it (test\test.ts:101:12)
      at process._tickCallback (internal/process/next_tick.js:68:7)

  8) linkinator
       should perform a recursive scan:

      AssertionError [ERR_ASSERTION]: Input A expected to strictly equal input B:
+ expected - actual

- 1
+ 5
      + expected - actual

      -1
      +5

      at Context.it (test\test.ts:121:12)
      at process._tickCallback (internal/process/next_tick.js:68:7)

  9) linkinator
       should not recurse non-html files:

      AssertionError [ERR_ASSERTION]: Input A expected to strictly equal input B:
+ expected - actual

- 1
+ 2
      + expected - actual

      -1
      +2

      at Context.it (test\test.ts:130:12)
      at process._tickCallback (internal/process/next_tick.js:68:7)

  10) linkinator
       should not recurse by default:

      AssertionError [ERR_ASSERTION]: Input A expected to strictly equal input B:
+ expected - actual

- 1
+ 2
      + expected - actual

      -1
      +2

      at Context.it (test\test.ts:135:12)
      at process._tickCallback (internal/process/next_tick.js:68:7)

  11) linkinator
       should retry with a GET after a HEAD:

      AssertionError [ERR_ASSERTION]: The expression evaluated to a falsy value:

  assert.ok(results.passed)

      + expected - actual

      -false
      +true

      at Context.it (test\test.ts:148:12)
      at process._tickCallback (internal/process/next_tick.js:68:7)



----------|----------|----------|----------|----------|-------------------|
File      |  % Stmts | % Branch |  % Funcs |  % Lines | Uncovered Line #s |
----------|----------|----------|----------|----------|-------------------|
All files |        0 |        0 |        0 |        0 |                   |
----------|----------|----------|----------|----------|-------------------|
npm ERR! Test failed.  See above for more details.

Local folder with subfolders

@JustinBeckwith: I'm trying to add linkinator to the nodejs.org repo.

The file structure there is like this:

build/
  ar/
    index.html
    <more folders>/
      index.html
  ca/
    index.html
    <more folders>/
      index.html
  en/
    index.html
    <more folders>/
      index.html
  <more folders>/
  static/

So, if I do linkinator build/ --recurse it fails because it doesn't see an index.html. If I do linkinator build/en --recurse then I get false positives because linkinator can't find the root.

Is there a workaround or something you could do about it?

Thanks in advance!
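
One possible workaround (untested here), assuming the --server-root flag described above applies to this layout: serve the whole build directory while pointing the scan at a specific entry file, e.g.:

linkinator build/en/index.html --recurse --server-root build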

Chrome DevTools Links falsely reporting 404 (on GitHub Pages URLs?)

Just ran the following command:

npx linkinator https://nodejs.org/dist/latest-v10.x/docs/api/all.html

This successfully tested the Node.js API docs for 404s and surfaced two real 404s that can be fixed 🎊

There were also 4 reported 404s that were all to the Chrome DevTools website.

These 404s weren't actually 404s, but were reported as such. I'm not sure if this is because they're hosted on GitHub Pages or there's some other factor at play here. Just wanted to report this so you were aware 💖

correct links fail in the docs-test

In this PR: googleapis/nodejs-automl#332
we found that some working links are reported as broken when running the test; if we skip them, others start to fail continually.

  [0] https://console.cloud.google.com/flows/enableapi?apiid=automl.googleapis.com
  [0] https://console.cloud.google.com/cloudshell/open?git_repo=https://github.com/googleapis/nodejs-automl&page=editor&open_in_editor=samples/batch_predict.js,samples/README.md
  [0] https://console.cloud.google.com/cloudshell/open?git_repo=https://github.com/googleapis/nodejs-automl&page=editor&open_in_editor=samples/export_dataset.js,samples/README.md
  [0] https://console.cloud.google.com/cloudshell/open?git_repo=https://github.com/googleapis/nodejs-automl&page=editor&open_in_editor=samples/get_dataset.js,samples/README.md

This is resolved by passing "concurrency": 10 in the linkinator.config.json, though.

--silent seems bugged with JSON format

{
  "format": "json",
  "recurse": true,
  "silent": true,
  "concurrency": 100,
  "skip": "www.googleapis.com"
}
  • `linkinator URL`: all results still returned
  • `linkinator --silent URL`: all results still returned

format=csv seems to work

CLI: Both skip and silent use the same short flag

Description

The CLI uses -s as the short flag for both skip and silent:

    --skip, -s
        List of urls in regexy form to not include in the check.
    --silent, -s
        Only output broken links

Notes

Currently, this seems to favor silent, meaning linkinator ./ -s www.googleapis.com is not recognized as a valid argument.

Personal Notes

This is a great tool, thanks for building it! 🙏

Respect robots

Currently there is no logic to respect robots.txt. This angers the internet.

CI: cache global npm modules

Since you are using npm >= 6, you can safely cache $HOME/.npm/ along with #62. This should speed things up for consequent builds.

I'm not familiar with Cirrus CI, so I don't know how you can invalidate the cache; if it's not possible to do this automatically, just ignore this issue :)

Add verbosity flags

We currently show output for the skipped links, and it's a tad noisy. It may be nice to add a --verbose option, and reduce the total default output.

Feedback from:
#32 (comment)
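
For reference, the --verbosity flag documented above now covers this:

linkinator ./docs --verbosity error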

Different domain edge case

I was trying linkinator against https://getbootstrap.com/ and I'm seeing something weird.

For some reason, even though we don't have those links, linkinator follows https://getbootstrap.com.br.

Could this be a bug because https://getbootstrap.com.br includes https://getbootstrap.com?

Trimmed log:

2019-09-23T15:14:19.0048569Z  Scanning https://getbootstrap.com/docs/4.3/about/translations/
2019-09-23T15:14:19.1317694Z   [200] https://bootstrap.hexschool.com/
2019-09-23T15:14:25.2626872Z   [200] https://code.z01.com/v4/
2019-09-23T15:14:25.5246985Z   [200] https://getbootstrap.com.br/v4/
2019-09-23T15:14:25.5393870Z 
2019-09-23T15:14:25.5396861Z  Scanning https://getbootstrap.com.br/v4/
...
2019-09-23T15:20:20.6278615Z  Scanning https://getbootstrap.com.br/
2019-09-23T15:20:20.6637726Z   [200] https://getbootstrap.com.br/docs/4.1/dist/css/bootstrap.css
2019-09-23T15:20:20.6977145Z   [200] https://getbootstrap.com.br/docs/4.1/assets/css/docs.min.css
2019-09-23T15:20:20.7184472Z   [200] https://getbootstrap.com.br/docs/4.1/assets/img/favicons/apple-touch-icon.png
2019-09-23T15:20:20.7388298Z   [200] https://getbootstrap.com.br/docs/4.1/assets/img/favicons/favicon-32x32.png
2019-09-23T15:20:20.7575299Z   [200] https://getbootstrap.com.br/docs/4.1/assets/img/favicons/favicon-16x16.png
2019-09-23T15:20:20.7781649Z   [200] https://getbootstrap.com.br/docs/4.1/assets/img/favicons/manifest.json
2019-09-23T15:20:20.7987411Z   [200] https://getbootstrap.com.br/docs/4.1/assets/img/favicons/safari-pinned-tab.svg
2019-09-23T15:20:20.8187174Z   [200] https://getbootstrap.com.br/favicon.ico
2019-09-23T15:20:20.8439271Z   [200] https://getbootstrap.com.br/docs/4.1/getting-started/introduction/
2019-09-23T15:20:20.8440351Z 
...

full scan log: https://gist.github.com/XhmikosR/377ccb920f88a4ecd18c21d1d0ae2692

[Enhancement Idea] only show path in CLI output when scanning locally

Hi, I happily discovered linkinator as a better-maintained CLI for broken link checking.

I'll likely switch to using it in any case, but here's one thing that the (less technical) content authors who have to handle the broken-links output would appreciate:

If linkinator is scanning a local directory with a local web server, the CLI output would make more sense to users if the https://localhost:0000/ part were removed from both the page the broken link was found on and the broken link target.

[Feature Request] Add request timeout option

It would be great to have a timeout option, as "gaxios" does not currently set one by default and could potentially hang the crawling process. The timeout could be set to 60 sec by default just in case.

We had an issue with an external link never returning anything, and adding this option fixed it on our end.
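
For reference, the --timeout flag documented above (milliseconds) addresses this; for example, to fail requests after 60 seconds:

linkinator https://example.com --timeout 60000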

BUG "449 more items" invalid JSON output

npx linkinator website with 1000 pages

config

{
  "format": "json",
  "recurse": true,
  "silent": false,
  "concurrency": 25,
  "skip": "www.googleapis.com"
}

expected:

npm -y init
npm install linkinator
npx linkinator http://somewebsite.com/

expected valid json file

actual result:

{ links: [ {
    },
    ... 449 more items
  ],
  passed: false
}

I assume it's a bug in some kind of formatter or something?
