
HTMLProofer

If you generate HTML files, then this tool might be for you!

Project scope

HTMLProofer is a set of tests to validate your HTML output. These tests check if your image references are legitimate, if they have alt tags, if your internal links are working, and so on. It's intended to be an all-in-one checker for your output.

In scope for this project is any well-known and widely-used test for HTML document quality. A major use for this project is continuous integration -- so we must have reliable results. We generally favor correctness over performance. And, if necessary, we should be able to trace this program's detection of HTML errors back to documented best practices or standards, such as W3C specifications.

Third-party modules. We want this product to be useful for continuous integration, so we prefer to avoid subjective tests that are prone to false positives, such as spell checkers and indentation checkers. If you want to work on these items, please see the section on custom tests and consider adding an implementation as a third-party module.

Advanced configuration. Most front-end developers can test their HTML using our command-line program. Advanced configuration requires using Ruby.

Installation

Add this line to your application's Gemfile:

gem 'html-proofer'

And then execute:

$ bundle install

Or install it yourself as:

$ gem install html-proofer

NOTE: When installation speed matters, set NOKOGIRI_USE_SYSTEM_LIBRARIES to true in your environment. This is useful for increasing the speed of your Continuous Integration builds.

What's tested?

Below is a mostly comprehensive list of checks that HTMLProofer can perform.

Images

img elements:

  • Whether all your images have alt tags
  • Whether your internal image references are not broken
  • Whether external images are showing
  • Whether your images are HTTP

Links

a, link elements:

  • Whether your internal links are working
  • Whether your internal hash references (#linkToMe) are working
  • Whether external links are working
  • Whether your links are HTTPS
  • Whether CORS/SRI is enabled

Scripts

script elements:

  • Whether your internal script references are working
  • Whether external scripts are loading
  • Whether CORS/SRI is enabled

Favicon

  • Whether your favicons are valid.

OpenGraph

  • Whether the images and URLs in the OpenGraph metadata are valid.

Usage

You can configure HTMLProofer to run on:

  • a file
  • a directory
  • an array of directories
  • an array of links

It can also be run from the command line.

Using in a script

  1. Require the gem.
  2. Generate some HTML.
  3. Create a new HTMLProofer instance pointed at your output folder.
  4. Call proofer.run on that path.

Here's an example:

require "html-proofer"
require "html/pipeline"
require "find"

# make an out dir
Dir.mkdir("out") unless File.exist?("out")

pipeline = HTML::Pipeline.new([
  HTML::Pipeline::MarkdownFilter,
  HTML::Pipeline::TableOfContentsFilter,
],
gfm: true)

# iterate over files, and generate HTML from Markdown
Find.find("./docs") do |path|
  next unless File.extname(path) == ".md"
  contents = File.read(path)
  result = pipeline.call(contents)

  File.open("out/#{path.split("/").pop.sub(".md", ".html")}", "w") { |file| file.write(result[:output].to_s) }
end

# test your out dir!
HTMLProofer.check_directory("./out").run

Checking a single file

If you simply want to check a single file, use the check_file method:

HTMLProofer.check_file("/path/to/a/file.html").run

Checking directories

If you want to check a directory, use check_directory:

HTMLProofer.check_directory("./out").run

If you want to check multiple directories, use check_directories:

HTMLProofer.check_directories(["./one", "./two"]).run

Checking an array of links

With check_links, you can also pass in an array of links:

HTMLProofer.check_links(["https://github.com", "https://jekyllrb.com"]).run

Swapping information

Sometimes, the information in your HTML is not the same as how your server serves content. In these cases, you can use swap_urls to map the URL in a file to the URL you'd like it to become. For example:

run_proofer(file, :file, swap_urls: { %r{^https://placeholder\.com} => "https://website.com" })

In this case, any link that matches ^https://placeholder\.com will be converted to https://website.com.
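Since swap_urls is documented as a RegExp => String mapping applied via gsub, its effect can be sketched in plain Ruby (this is an illustration of the documented behavior, not HTMLProofer's own code):

```ruby
# Each URL matching the RegExp key is rewritten to the String value via gsub.
swaps = { %r{^https://placeholder\.com} => "https://website.com" }

url = "https://placeholder.com/about"
swaps.each { |pattern, replacement| url = url.gsub(pattern, replacement) }
url # => "https://website.com/about"
```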

A similar swapping process can be done for attributes:

run_proofer(file, :file, swap_attributes: { "img": [["data-src", "src"]] })

In this case, we are telling HTMLProofer that, for any img tag it detects, the data-src attribute should be treated as if it were the src attribute. Since the value is an array of arrays, you can pass in as many attribute swaps as you need for each element.
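The effect of an attribute swap can be sketched in plain Ruby; the hash of attributes below stands in for a parsed HTML element, and the function is an illustration, not HTMLProofer's implementation:

```ruby
# For each configured element, treat the first-named attribute's value
# as if it belonged to the second-named attribute.
swap_attributes = { "img" => [["data-src", "src"]] }

def apply_swaps(tag, attributes, swaps)
  (swaps[tag] || []).each do |from, to|
    attributes[to] = attributes.delete(from) if attributes.key?(from)
  end
  attributes
end

attrs = apply_swaps("img", { "data-src" => "/lazy/photo.png" }, swap_attributes)
# => {"src"=>"/lazy/photo.png"}
```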

Using on the command-line

You'll also get a new program called htmlproofer with this gem. Terrific!

Pass in options through the command-line as flags, like this:

htmlproofer --extensions .html.erb ./out

Use htmlproofer --help to see all command line options.

Special cases for the command-line

For options which require an array of input, surround the value with quotes and don't use any spaces. For example, to exclude an array of HTTP status codes, you might do:

htmlproofer --ignore-status-codes "999,401,404" ./out

For options like --ignore-urls that require an array of regular expressions, you can use a syntax like this:

htmlproofer --ignore-urls "/www.github.com/,/foo.com/" ./out

Since swap_urls is a bit special, you'll pass in pairs of RegEx:String values. Use the escape sequence \: to produce a literal :; htmlproofer will figure out what you mean.

htmlproofer --swap-urls "wow:cow,mow:doh" --extensions .html.erb --ignore-urls www.github.com ./out
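The colon-escaping rule above can be illustrated with a small Ruby sketch of how such pair strings might be split (a hypothetical parser for illustration only, not htmlproofer's actual code):

```ruby
# Split "regex:replacement" pairs on unescaped colons; "\:" yields a literal colon.
def parse_swap_pairs(arg)
  arg.split(",").map do |pair|
    pattern, replacement = pair.split(/(?<!\\):/, 2)
    [Regexp.new(pattern.gsub("\\:", ":")), replacement.gsub("\\:", ":")]
  end.to_h
end

pairs = parse_swap_pairs("wow:cow,mow:doh")
# pairs maps /wow/ => "cow" and /mow/ => "doh"
```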

Some configuration options, such as --typhoeus, --cache, or --swap-attributes, require well-formatted JSON.

Adjusting for a baseurl

If your Jekyll site has a baseurl configured, you'll need to adjust the generated URL validation to cope with that. The easiest way is using the swap_urls option.

For a site.baseurl value of /BASEURL, here's what that looks like on the command line:

htmlproofer --assume-extension ./_site --swap-urls '^/BASEURL/:/'

or in your Rakefile

require "html-proofer"

task :test do
  sh "bundle exec jekyll build"
  options = { swap_urls: { %r{^/BASEURL/} => "/" } }
  HTMLProofer.check_directory("./_site", options).run
end

Using through Docker

If you have trouble installing Ruby/Nokogumbo (or don't want to), the command-line tool can be run through Docker. See klakegg/html-proofer for more information.

Ignoring content

Add the data-proofer-ignore attribute to any tag to exclude it from every check.

<a href="https://notareallink" data-proofer-ignore>Not checked.</a>

This can also apply to parent elements, all the way up to the <html> tag:

<div data-proofer-ignore>
  <a href="https://notareallink">Not checked because of parent.</a>
</div>

Ignoring new files

Say you've got some new files in a pull request, and your tests are failing because links to those files are not live yet. One thing you can do is run a diff against your base branch and explicitly ignore the new files, like this:

directories = ['content']
merge_base = %x(git merge-base origin/production HEAD).chomp
diffable_files = %x(git diff -z --name-only --diff-filter=AC #{merge_base}).split("\0")
diffable_files = diffable_files.select do |filename|
  next true if directories.include?(File.dirname(filename))

  filename.end_with?(".md")
end.map { |f| Regexp.new(File.basename(f, File.extname(f))) }

HTMLProofer.check_directory("./output", { ignore_urls: diffable_files }).run

Configuration

The HTMLProofer constructor takes an optional hash of additional options:

Option Description Default
allow_hash_href If true, assumes href="#" anchors are valid true
allow_missing_href If true, does not flag a tags missing href. In HTML5, this is technically allowed, but could also be human error. false
assume_extension Automatically add specified extension to files for internal links, to allow extensionless URLs (as supported by most servers) .html
checks An array of Strings indicating which checks you want to run ['Links', 'Images', 'Scripts']
check_external_hash Checks whether external hashes exist (even if the webpage exists) true
check_internal_hash Checks whether internal hashes exist (even if the webpage exists) true
check_sri Check that <link> and <script> external resources use SRI false
directory_index_file Sets the file to look for when a link refers to a directory. index.html
disable_external If true, does not run the external link checker false
enforce_https Fails a link if it's not marked as https. true
extensions An array of Strings indicating the file extensions you would like to check (including the dot) ['.html']
ignore_empty_alt If true, ignores images with empty/missing alt tags (in other words, <img alt> and <img alt=""> are valid; set this to false to flag those) true
ignore_files An array of Strings or RegExps containing file paths that are safe to ignore. []
ignore_empty_mailto If true, allows mailto: hrefs which do not contain an email address. false
ignore_missing_alt If true, ignores images with missing alt tags false
ignore_status_codes An array of numbers representing status codes to ignore. []
ignore_urls An array of Strings or RegExps containing URLs that are safe to ignore. This affects all HTML attributes, such as alt tags on images. []
log_level Sets the logging level, as determined by Yell. One of :debug, :info, :warn, :error, or :fatal. :info
only_4xx Only reports errors for links that fall within the 4xx status code range. false
root_dir The absolute path to the directory serving your html-files. ""
swap_attributes JSON-formatted config that maps element names to the preferred attribute to check {}
swap_urls A hash containing key-value pairs of RegExp => String. It transforms URLs that match RegExp into String via gsub. {}

In addition, there are a few "namespaced" options. These are:

  • :typhoeus / :hydra
  • :cache

Configuring Typhoeus and Hydra

Typhoeus is used to make fast, parallel requests to external URLs. You can pass in any of Typhoeus' options for the external link checks with the options namespace of :typhoeus. For example:

HTMLProofer.new("out/", { extensions: [".htm"], typhoeus: { verbose: true, ssl_verifyhost: 2 } })

This sets HTMLProofer's extensions to .htm, and configures Typhoeus to be verbose and to use specific SSL settings. Check the Typhoeus documentation for more information on the options it accepts.

You can similarly pass in a :hydra option with a hash configuration for Hydra.

The default value is:

{
  typhoeus:
  {
    followlocation: true,
    connecttimeout: 10,
    timeout: 30,
  },
  hydra: { max_concurrency: 50 },
}

On the CLI, you can provide the --typhoeus or --hydra arguments to set these configurations. They are parsed using JSON.parse and mapped on top of the default configuration values so that the defaults can be overridden.
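That merge behavior can be sketched in plain Ruby (an illustration of the described JSON.parse-and-merge step, not HTMLProofer's actual code; the option values are examples):

```ruby
require "json"

# Default Typhoeus options, as listed above.
defaults = { "connecttimeout" => 10, "timeout" => 30, "followlocation" => true }

# A CLI value such as --typhoeus='{"timeout": 60, "verbose": true}'.
cli_arg = '{"timeout": 60, "verbose": true}'

# Parse the JSON and layer it over the defaults so they can be overridden.
merged = defaults.merge(JSON.parse(cli_arg))
# => {"connecttimeout"=>10, "timeout"=>60, "followlocation"=>true, "verbose"=>true}
```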

Setting before-request callback

You can provide a block to set some logic before an external link is checked. For example, say you want to provide an authentication token every time a GitHub URL is checked. You can do that like this:

proofer = HTMLProofer.check_directory(item, opts)
proofer.before_request do |request|
  request.options[:headers]["Authorization"] = "Bearer <TOKEN>" if request.base_url == "https://github.com"
end
proofer.run

The Authorization header is being set if and only if the base_url is https://github.com, and it is excluded for all other URLs.

Configuring caching

Checking external URLs can slow your tests down. If you'd like to speed that up, you can enable caching for your external and internal links. Caching simply means to skip link checking for links that are valid for a certain period of time.

You can enable caching for this by passing in the configuration option :cache, with a hash containing a single key, :timeframe. :timeframe defines the length of time the cache will be used before the link is checked again. The format of :timeframe is a hash containing two keys, external and internal. Each of these contains a number followed by a letter indicating the length of time:

  • M means months
  • w means weeks
  • d means days
  • h means hours

For example, passing the following options means "recheck external links older than thirty days":

{ cache: { timeframe: { external: "30d" } } }

And the following options means "recheck internal links older than two weeks":

{ cache: { timeframe: { internal: "2w" } } }

Naturally, to support both internal and external link caching, both keys would need to be provided. The following checks external links every two weeks, but internal links only once a week:

{ cache: { timeframe: { external: "2w", internal: "1w" } } }
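The timeframe strings above can be read as a number times a unit. Here is an illustrative conversion to seconds (a sketch of the documented format, not HTMLProofer's internal logic; a month is approximated as 30 days):

```ruby
# Seconds per unit letter; "M" is approximated as a 30-day month.
UNITS = { "M" => 30 * 24 * 3600, "w" => 7 * 24 * 3600, "d" => 24 * 3600, "h" => 3600 }

def timeframe_to_seconds(spec)
  number, unit = spec.match(/\A(\d+)([Mwdh])\z/).captures
  number.to_i * UNITS.fetch(unit)
end

timeframe_to_seconds("2w") # => 1209600
```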

You can change the filename or the directory where the cache file is kept by also providing the storage_dir key:

{ cache: { cache_file: "stay_cachey.json", storage_dir: "/tmp/html-proofer-cache-money" } }

Links that previously failed are kept in the cache and always rechecked. If they pass, the cache is updated with the new timestamp.

If caching is enabled, HTMLProofer writes to a log file called tmp/.htmlproofer/cache.log. You should probably ignore this folder in your version control system.

On the CLI, you can provide the --cache argument to set the configuration. This is parsed using JSON.parse and mapped on top of the default configuration values so that they can be overridden.

Caching with continuous integration

Enable caching in your continuous integration process. It will make your builds faster.

In GitHub Actions:

Add this step to your build workflow before HTMLProofer is run:

      - name: Cache HTMLProofer
        id: cache-htmlproofer
        uses: actions/cache@v2
        with:
          path: tmp/.htmlproofer
          key: ${{ runner.os }}-htmlproofer

Also make sure that your later step which runs HTMLProofer does not return a failed shell status, because a failed step in GitHub Actions skips all later steps (including the cache save). You can try something like htmlproofer ... || true.
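In a workflow, such a step might look like this (the command and path are illustrative and assume a Bundler-based setup):

```yaml
      - name: Run HTMLProofer
        # "|| true" keeps this step from failing the job, so the
        # later cache-save step still runs.
        run: bundle exec htmlproofer ./_site || true
```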

In Travis:

If you want to enable caching with Travis CI, be sure to add these lines into your .travis.yml file:

cache:
  directories:
  - $TRAVIS_BUILD_DIR/tmp/.htmlproofer

For more information on using HTML-Proofer with Travis CI, see this wiki page.

Logging

HTML-Proofer can be as noisy or as quiet as you'd like. If you set the :log_level option, you can better define the level of logging.

Custom tests

Want to write your own test? Sure, that's possible!

Just create a class that inherits from HTMLProofer::Check. This subclass must define one method called run. This is called on your content and is responsible for performing the validation on whatever elements you like. When you find a problem, call add_failure(message, line: line, content: content) to report the error. line refers to the line number, and content is the node content of the broken element.

If you're working with the element's attributes (as most checks do), you'll also want to call create_element(node) as part of your suite. This constructs an object that contains all the attributes of the HTML element you're iterating on, and can also be used directly to call add_failure(message, element: element).

Here's an example custom test demonstrating these concepts. It reports mailto links that point to octocat@github.com:

class MailToOctocat < HTMLProofer::Check
  def mailto_octocat?
    @link.url.raw_attribute == "mailto:octocat@github.com"
  end

  def run
    @html.css("a").each do |node|
      @link = create_element(node)

      next if @link.ignore?

      return add_failure("Don't email the Octocat directly!", element: @link) if mailto_octocat?
    end
  end
end

Don't forget to include this new check in HTMLProofer's options, for example:

# removes default checks and just runs this one
HTMLProofer.check_directories(["out/"], { checks: ["MailToOctocat"] })

See our list of third-party custom classes and add your own to this list.

Reporting

By default, HTML-Proofer has its own reporting mechanism to print errors at the end of the run. You can choose to use your own reporter by passing in your own subclass of HTMLProofer::Reporter:

proofer = HTMLProofer.check_directory(item, opts)
proofer.reporter = MyCustomReporter.new(logger: proofer.logger)
proofer.run

Your custom reporter must implement a report method containing the behavior you wish to see. The logger kwarg is optional.
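For illustration, here is the shape a reporter might take. StubReporter below stands in for HTMLProofer::Reporter so the sketch is self-contained; the assumption (labeled here) is that the real base class exposes the collected failures to #report:

```ruby
# Stand-in for HTMLProofer::Reporter, so this sketch runs on its own.
class StubReporter
  attr_reader :failures
  def initialize(failures: {})
    @failures = failures
  end
end

# A custom reporter only needs to implement #report.
class PlainTextReporter < StubReporter
  def report
    failures.flat_map do |check_name, items|
      items.map { |failure| "#{check_name}: #{failure}" }
    end
  end
end

reporter = PlainTextReporter.new(failures: { "Links" => ["broken link a", "broken link b"] })
reporter.report
# => ["Links: broken link a", "Links: broken link b"]
```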

Troubleshooting

Here are some brief snippets identifying some common problems that you can work around. For more information, check out our wiki.

Our wiki page on using HTML-Proofer with Travis CI might also be useful.

Ignoring SSL certificates

To ignore SSL certificates, turn off Typhoeus' SSL verification:

HTMLProofer.check_directory("out/", {
  typhoeus: {
    ssl_verifypeer: false,
    ssl_verifyhost: 0,
},
}).run

User-Agent

To change the User-Agent used by Typhoeus:

HTMLProofer.check_directory("out/", {
  typhoeus: {
    headers: { "User-Agent" => "Mozilla/5.0 (compatible; My New User-Agent)" },
  }
}).run

Alternatively, you can specify these options on the command-line with:

htmlproofer --typhoeus='{"headers":{"User-Agent":"Mozilla/5.0 (compatible; My New User-Agent)"}}'

Cookies

Sometimes links fail because they don't have access to cookies. To fix this you can create a .cookies file using the following snippets:

HTMLProofer.check_directory("out/", {
  typhoeus: {
    cookiefile: ".cookies",
    cookiejar: ".cookies",
  }
}).run

htmlproofer --typhoeus='{"cookiefile":".cookies","cookiejar":".cookies"}'

Regular expressions

To exclude URLs using regular expressions, include them between forward slashes and don't quote them:

HTMLProofer.check_directories(["out/"], {
  ignore_urls: [/example.com/],
}).run

Real-life examples

Project Repository Notes
Jekyll's website jekyll/jekyll A separate script calls htmlproofer and this used to be called from Circle CI
Raspberry Pi's documentation raspberrypi/documentation
Squeak's website squeak-smalltalk/squeak.org
Atom Flight Manual atom/flight-manual.atom.io
HTML Website Template fulldecent/html-website-template A starting point for websites, uses a Rakefile and Travis configuration to call preconfigured testing
Project Calico Documentation projectcalico/calico Simple integration with Jekyll and Docker using a Makefile
GitHub does dotfiles dotfiles/dotfiles.github.com Uses the proof-html GitHub action


html-proofer's Issues

Test for valid HTML

Doesn't look like this is a feature yet, but it would be very nice to have.

Getting errors for "//", "mailto:" and "tel:" URLs

git clone git@github.com:hafniatimes/hafniatimes.github.io.git reproduction
cd reproduction
bundle exec jekyll build
gem install html-proofer
htmlproof ./_site

Returns

$ htmlproof ./_site
Running [Links, Images] checks on ./_site on *.html... 

Checking 8 external links...
Ran on 6 files!

./_site/404/index.html: internally linking to //twitter.com/hafniatimes, which does not exist

./_site/articles/2014/06/06/intro.html: internally linking to //twitter.com/hafniatimes, which does not exist

./_site/contact/index.html: internally linking to //twitter.com/hafniatimes, which does not exist

./_site/contact/index.html: mailto: is an invalid URL

./_site/contact/index.html: tel: is an invalid URL

./_site/da/articles/2014/06/06/intro.html: internally linking to //twitter.com/hafniatimes, which does not exist

./_site/da/index.html: internally linking to //twitter.com/hafniatimes, which does not exist

./_site/index.html: internally linking to //twitter.com/hafniatimes, which does not exist

htmlproof 0.7.1 | Error:  HTML-Proofer found 8 failures!

Are these href types supposed to fail?

  • href="//..."
  • href="mailto:..."
  • href="tel:..."

Just wondering. :)

Support links behind auth

Occasionally, at GitHub, we'll link to sites within github.com that are behind auth. For example: [check out this discussion](https://www.github.com/github/secret-internal-repo/issues/23).

We've had to exclude these links by writing them out as HTML and adding data-proofer-ignore. Blah.

I think instead there should be a new config option hash that takes a domain as a key, and an OAuth token as the value, so that these sorts of links can be checked. For example, you'd pass in :domain_auth => { "github.com" => ENV['MACHINE_USER_TOKEN'] }. When HTML::Proofer hits a 404, it'd look the domain up, and try to use the provided token to recheck the link.

/cc @penibelst @parkr Y'all think this makes sense?

Problems with 301s and hash tag refs

Scenario:

  1. Page A links to Page B#some-hash
  2. Page B 301s to another page, Page C
  3. Page C does actually contain some-hash

html-proofer fails, though. It does not follow the redirect in step two; instead, it tries to look for the hash on Page B and complains.

Proofer raises false positive on RSS feeds

If you have a <link> field in an RSS feed, with the tag body being the link, HTML proofer raises an "anchor has no href attribute" error.

e.g. <link>http://ben.balter.com/2014/10/08/why-government-contractors-should-%3C3-open-source/</link>

Expose line number in errors

Not sure how to do it (maybe count \n's?), but would be extremely helpful to know the line number of errors when they're outputted to the console, e.g.:

_site/foo.html: internally linking to _site/bar.html on line 7 which doesn't exist

Redirects don't appear to be handled properly

It could just be me, but I've noticed that HTML::Proofer will (usually?) treat redirects as failures. Is this behavior intentional and, if not, would it be reasonably easy to fix?

Crash when folder named *.html

I just discovered that if I create a folder that ends in .html, like Test.html, the proofer crashes:

htmlproof 1.3.1 | Error: Is a directory @ io_fread - _site/Test.html

URLs with parameters deemed invalid

Have a page with the URL http://dotgov-browser.herokuapp.com/domains?cms=drupal, which HTML proofer complains: ./_site/2014/07/07/analysis-of-federal-executive-domains-part-deux/index.html: (http://dotgov-browser.herokuapp.com/domains?cms=drupal) is an invalid URL.

The URL returns a 50x (my fault), but should still be seen as a valid URL.

False positive on link with url encoded character

I have the following snippet in one of my html files:

<link rel="prefetch" href="data/c%23.csv">

On disk, the linked file is named c#.csv so I am url encoding the number sign character in html.

html-proofer reports the following error when encountering this:

index.html: internally linking to data/c%23.csv, which does not exist

This can't be right. I'm referencing a few more files from the same path and this is the only one that produces an error, so the issue is definitely related to the encoded character.

Failing test: "Links test: fails on redirects if not following"

Looks like something changed on the referenced URL, so the test fails now:

Failures:

  1) Links test fails on redirects if not following
     Failure/Error: output.should match /External link https:\/\/help.github.com\/changing-author-info\/ failed: 301 No error/
       expected "spec/html/proofer/fixtures/links/linkWithRedirect.html: External link http://timclem.wordpress.com/2012/03/01/mind-the-end-of-your-line/ failed: 301 No error\n" to match /External link https:\/\/help.github.com\/changing-author-info\/ failed: 301 No error/
     # ./spec/html/proofer/links_spec.rb:45:in `block (2 levels) in <top (required)>'

Remove double \n in output

Maybe it’s just me, but \n\n is overkill in the log:

[screenshots: log output with double newlines between entries]

I think it’s acceptable that there’s one newline, in the cases where the line is longer than the terminal width, though:

[screenshot: a long log line wrapping at terminal width]

But the current set-up makes it really hard to read the log in one window.

Support Open Graph

The Open Graph protocol requires two properties for every page we could check:

  • og:image - An image URL which should represent your object within the graph.
  • og:url - The canonical URL of your object that will be used as its permanent ID in the graph, e.g., "http://www.imdb.com/title/tt0117500/".

Example:

<meta property="og:url" content="http://www.example.com/" />
<meta property="og:image" content="http://www.example.com/image.png" />

Error when linking to git://, ftp:// etc. URIs

The following HTML produces validation errors:

<a href="git://github.com/mono/mono">Git</a>
<a href="ftp://ftp.example.com">FTP</a>
<a href="irc://irc.gimp.org/mono">IRC</a>
<a href="svn://svn.example.com">SVN</a>

../test.html: internally linking to git://github.com/mono/mono, which does not exist
../test.html: internally linking to ftp://ftp.example.com, which does not exist
../test.html: internally linking to irc://irc.gimp.org/mono, which does not exist
../test.html: internally linking to svn://svn.example.com, which does not exist

I tried passing in --href_ignore git, that changed nothing. I know that the proofer can't really validate the links, but shouldn't it just ignore them then, like it does with mailto: ?

Awesome

Just wanted to say thank you for this great tool! 😍

Hanging on run

Hey,

The most recent version 1.3.3 is hanging. It says it's running, but I've waited a few minutes and nothing happens.

> htmlproof _site
> Running ["Images", "Links", "Scripts"] checks on _site on *.html... 


Thanks for any help.

External embeds

Inspired by the discussion with @parkr about privacy I want to propose an optional checking for external embedded resources: images, styles, scripts, … External embeds lack many things.

  1. Reliability — External servers come, go, and stay. The best current example was published lately: Don’t Use jquery-latest.js.
  2. Speed — Every external connection means a new time consuming connection opening.
  3. Privacy — if you respect your visitors, you don’t let them be tracked.

High quality websites only serve from own hosts. If I migrate an old website, first thing I do is to collect all embedded resources. Then I can really control what happens on my website.

The option also would help big teams with many authors to take care, because lazy authors sometimes embed images from Tumblr instead of uploading to own server.

Scenario:

  • We serve from www.example.com
  • Our assets are assets.example.com

The option must check all external resources (e.g. http://code.jquery.com/jquery-latest.min.js), except your own external server.

How would you design such an option?

"Too many open files" error

Ran into a very curious issue just now, running the tests locally:

~/code/stuff$ be rake cibuild
bundle exec jekyll build --destination _site
Configuration file: /Users/parkermoore/code/stuff/_config.yml
            Source: /Users/parkermoore/code/stuff
       Destination: _site
      Generating... done.
Running [Links, Images] checks on /Users/parkermoore/code/stuff/_site...

rake aborted!
Too many open files - /Users/parkermoore/code/stuff/_site/mirrors/world.html
_tasks/cibuild.rake:5:in `block in <top (required)>'
Tasks: TOP => cibuild
(See full trace by running task with --trace)

The error appears to be Ruby reaching its file descriptor limit. Any way I can limit html-proofer to a certain number of files at a time?

Namespace Typhoeus options

As noted in #113 (comment), Typhoeus is really picky about what it takes in. I'll need to make a breaking release in order to namespace Typhoeus (and other libs!) options. So rather than

HTML::Proofer.new("out/", {:ext => ".htm", :verbose => true, :ssl_verifyhost => 2 })

It should be

HTML::Proofer.new("out/", {:ext => ".htm", :verbose => true, :typhoeus => {:ssl_verifyhost => 2 }})

Threads/Processes?

Sooooo I have a thought. What if we used Process.fork or Thread.new to allow for concurrent link proofs? Which is a better approach? Does Typhoeus do this already and I just don't know it?

Why stop updating broken links?

Travis has a blog. In https://github.com/travis-ci/blog-travis-ci-com/pull/21 somebody finds a broken link and tries to fix it. A Travis guy answers:

Here's a general question: Is the blog meant to be a document to reflect how things are now, or a historical document that announces what was new then?

It makes sense to make corrections to errors for a short while after the article's publication, but at some point we should stop updating them.

The Travis maintainer did not merge the fixed link and left the issue open. It has now been open for two months.

It would be helpful to provide motivation in such situations. I think on images like this:

[image: a "broken link" seal]

Redirected links don't report original href in log

I'm seeing some errors appear for links that don't actually exist in the HTML files specified. This is due to a link in the HTML redirecting to another URL and that end URL being reported in the log rather than the original URL linked. Made it a bit tough to find the broken link in a page with a pile of links.

internal links to element id's are not found

It appears that, at least with 0.27.0, internal links that reference an element id are not found. For example, the two skiplinks in the example below give an error:

...internally linking to #mainMenu, which does not exist
...internally linking to #mainContent, which does not exist

while both obviously do exist, are valid hash-name references, and are focusable elements; the former is the first anchor in the sitenav <nav> element, the latter is the <main> element:

<!DOCTYPE html>
<html lang="nl-NL" class="no-js" id="document">
    <head prefix="og: http://ogp.me/ns# article: http://ogp.me/ns/article#">
        <meta charset="utf-8">
        <title>Homepage</title>
    </head>
    <body>

        <div id="skipLinks" class="skiplinks">
            <a href="#mainMenu" class="skiplink">Naar hoofdnavigatiemenu</a>
            <a href="#mainContent" class="skiplink">Naar hoofdinhoud</a>
        </div>

        <div class="site">
          <header id="siteHeader" class="siteHeader" role="banner">
            <h1 class="title">GeoDienstenCentrum</h1>
            <p class="subtitle">Toegankelijke ruimtelijke informatievoorziening</p>

          <nav id="sitenav" class="site-nav" role="navigation">
              <ul>
                  <li>
                      <a href="/" id="mainMenu">
                          <span aria-hidden="true" data-icon="&#xe60d;"></span>
                          <span>home</span>
                      </a>
                  </li>
                  <li>
                      <a href="/over.html">
                          <span aria-hidden="true" data-icon="&#xe60e;"></span>
                          <span>over</span>
                      </a>
                  </li>
              </ul>
          </nav>
          </header>

          <main id="mainContent" tabindex="-1" role="main">

          <p>Voor advies over en implementatie van toegankelijke ruimtelijke informatie met een
"privacy first" insteek, bij voorkeur op basis van open standaarden, open source software
en open data.</p>

          </main>

        </div>

        <div class="site-footer">
              <span class="rss">
                  <a href="/atom.xml" class="">
                    <span aria-hidden="true" data-icon="&#xe608;"></span>
                    <span class="visually-hidden">Atom feed voor deze site</span>
                  </a>
              </span>
        </div>
        <script src="/js/script.js" charset="utf-8"></script>
    </body>
</html>

Full pages and traces are on Travis-ci: https://travis-ci.org/GeoDienstenCentrum/geodienstencentrum.github.io/builds/33530564
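For illustration only, a regex-based sketch (not html-proofer's actual Nokogiri-based resolver; `hash_targets` is a hypothetical helper) of resolving hash references against both `id` attributes and legacy `<a name>` anchors:

```ruby
# Hedged sketch: a hash reference like "#mainContent" should match
# both id attributes and legacy <a name="..."> anchors anywhere in
# the document. A real checker would walk the DOM instead of using
# regexes.
def hash_targets(html)
  ids   = html.scan(/\bid="([^"]+)"/).flatten
  names = html.scan(/<a\b[^>]*\bname="([^"]+)"/).flatten
  ids + names
end

html = <<~HTML
  <a href="#mainMenu" class="skiplink">skip</a>
  <a href="/" id="mainMenu">home</a>
  <main id="mainContent" tabindex="-1"></main>
HTML

targets = hash_targets(html)
refs    = html.scan(/href="#([^"]+)"/).flatten
missing = refs - targets  # empty: the skip-link target resolves
```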

allow links to sites with self-signed certs?

Upgraded from 1.1.3 to 1.3.0, and it seems the ssl_verifypeer option is no longer supported.

require 'html/proofer'                                                       

task :test do                                                                
  HTML::Proofer.new("./_site", href_ignore: ['#'], ssl_verifypeer: false).run
end                                                                          

After upgrading I have a few new failures of the form:

External link https://blog.patternsinthevoid.net/ failed: 0 Peer certificate cannot be authenticated with given CA certificates

Prose checking

I would like to check the content of elements except code and pre against an arbitrary array of strings. Two use cases come to mind.

  1. Typography. You will never see character sequences like ", --, or !! in a well-typeset book in a European language. People who care about typography could proof their texts:

    HTML::Proofer.new("./_book", {:prose => ["\"", "--", "!!"]}).run
  2. Censorship. Some words cannot be published:

    HTML::Proofer.new("./_vegan", {:prose => ["meat", "fish", "egg"]}).run

The typography use case is more important: if you use a pre-processor like Markdown, you write -- and the renderer converts it to an en-dash (–). Proofer could check whether it renders as expected.

What do you think about it?
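A minimal sketch of the proposed `:prose` check (hypothetical, not an existing html-proofer feature); `<code>` and `<pre>` content is stripped with a regex before scanning, whereas a real implementation would walk the DOM:

```ruby
# Hedged sketch of the proposed :prose option (prose_issues is a
# hypothetical helper). Content of <code> and <pre> blocks is
# removed first, then each forbidden string is searched for in the
# remaining markup.
def prose_issues(html, forbidden)
  text = html.gsub(%r{<(code|pre)[^>]*>.*?</\1>}m, "")
  forbidden.select { |s| text.include?(s) }
end

prose_issues('<p>Use -- carefully.</p><code>a -- b</code>', ["--", "!!"])
# => ["--"]  (the occurrence inside <code> is ignored)
```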

Internally cache status of known URLs

Running html-proofer on my personal site, which isn't huge, can take ~10 minutes. I wonder whether, every time I link to / or /about in the header, it's making an HTTP call for each page. If within one run we've checked a URL and received a 200 status code, we should cache it so we don't keep rechecking the same URLs and can complete the tests in a reasonable time.
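The idea can be sketched as a per-run memo around whatever performs the HTTP check (`UrlCache` is a hypothetical class, not part of the gem):

```ruby
# Hedged sketch: memoize each external URL's result so repeated
# links cost a single request per run.
class UrlCache
  def initialize(&checker)
    @checker = checker
    @results = {}
  end

  def status(url)
    @results.fetch(url) { @results[url] = @checker.call(url) }
  end
end

calls = 0
cache = UrlCache.new { |url| calls += 1; 200 }  # stand-in for a real HTTP check
3.times { cache.status("/about") }
calls  # => 1
```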

Ignore <a href="#"> when checking internal links

Using such anchors is quite a common practice (e.g. by Bootstrap Dropdowns) and generates the following error:

index.html: linking to internal hash # that does not exist

I think they shouldn't trigger an error, as they are just placeholders and not links to a specific part of the document.

in-href JS returns error

_site/index.html: javascript:if(typeof WZXYxe58==typeof alert)WZXYxe58();(function(){var s=document.createElement('link');s.setAttribute('href','/static/css/dyslexia.css');s.setAttribute('rel','stylesheet');s.setAttribute('type','text/css');document.getElementsByTagName('head')[0].appendChild(s);})(); is an invalid URL

https://travis-ci.org/hafniatimes/hafniatimes.github.io/builds/31079849#L319

Not sure whether there’s a specific part that’s failing, or if the script just doesn’t fancy in-href JS, so I defer to you in this matter. 'href','/static/css/dyslexia.css' works just fine, if you click Dyslexia here. :)

It’s probably not the place for html-proofer to inspect a JavaScript href, but it currently seems to be broken either way.
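One possible approach, sketched with a hypothetical helper: treat javascript: (and similar non-navigable schemes) as uncheckable and skip them rather than parse them as URLs and report them as invalid:

```ruby
# Hedged sketch (non_checkable_scheme? is a hypothetical helper,
# not html-proofer's code): hrefs with non-navigable schemes are
# skipped instead of being handed to the URL parser.
def non_checkable_scheme?(href)
  href.to_s.strip.match?(/\A(javascript|mailto|tel):/i)
end

non_checkable_scheme?("javascript:void(0)")        # => true
non_checkable_scheme?("/static/css/dyslexia.css")  # => false
```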

data uris in img tags fails to validate

When using a data URI in an img tag:

<img src="data:image/png;base64, blah">

I get:

bad URI(is not URI?): data:image/png;base64, blah

This should pass validation
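A sketch of the suggested behavior (`checkable_src?` is a hypothetical helper): short-circuit data: URIs before handing the value to `URI.parse`, which is what raises the error above:

```ruby
require "uri"

# Hedged sketch: data URIs are treated as valid without parsing,
# since URI.parse rejects the payload (the "bad URI" error above).
def checkable_src?(src)
  return false if src.strip.start_with?("data:")  # skip, but don't flag
  URI.parse(src)
  true
rescue URI::InvalidURIError
  false
end

checkable_src?("data:image/png;base64, blah")  # => false (skipped)
checkable_src?("/images/logo.png")             # => true
```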

undefined method `version` error

Hi, I was attempting to follow the doc here (http://jekyllrb.com/docs/continuous-integration/) and found an interesting error when running bundle exec htmlproof ./_site.

/Users/me/.rbenv/versions/1.9.3-p547/lib/ruby/gems/1.9.1/gems/html-proofer-1.1.5/bin/htmlproof:11:in `block in <top (required)>': undefined method `version' for nil:NilClass (NoMethodError)
    from /Users/me/.rbenv/versions/1.9.3-p547/lib/ruby/gems/1.9.1/gems/mercenary-0.3.4/lib/mercenary.rb:21:in `program'
    from /Users/me/.rbenv/versions/1.9.3-p547/lib/ruby/gems/1.9.1/gems/html-proofer-1.1.5/bin/htmlproof:10:in `<top (required)>'
    from /Users/me/.rbenv/versions/1.9.3-p547/bin/htmlproof:23:in `load'
    from /Users/me/.rbenv/versions/1.9.3-p547/bin/htmlproof:23:in `<main>'

This happens to me on both CI (Shippable, Ubuntu 12.04) as well as a local environment (OSX 10.9.2, ruby 1.9.3p547). I installed the gem through bundle install.

A quick fix is commenting out line 11 in bin/htmlproof:

p.version Gem::Specification::load(File.join(File.dirname(__FILE__), "..", "html-proofer.gemspec")).version

This suggests that the "html-proofer.gemspec" file is missing, and indeed it is:

$ cd /Users/me/.rbenv/versions/1.9.3-p547/lib/ruby/gems/1.9.1/gems/html-proofer-1.1.5/
$ ls -l
drwxr-xr-x  3 me  staff   102 Aug  7 22:38 bin
drwxr-xr-x  3 me  staff   102 Aug  7 22:23 lib
$ curl -O https://raw.githubusercontent.com/gjtorikian/html-proofer/master/html-proofer.gemspec

I checked and the issue is not there in version 1.1.4, which seems to download all the files in the repository, not just the bin and lib directories. As a result, the issue may stem from 88f6572.
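A defensive variant of line 11 can be sketched as follows (`gem_version` is a hypothetical helper, not the fix that shipped): fall back gracefully when the gemspec is missing from the installed gem instead of raising NoMethodError on nil:

```ruby
require "rubygems"

# Hedged sketch: only read the version from the gemspec when the
# file actually ships with the gem; otherwise report a placeholder.
def gem_version(gemspec_path)
  return "unknown" unless File.exist?(gemspec_path)
  Gem::Specification.load(gemspec_path).version.to_s
end

gem_version("/nonexistent/html-proofer.gemspec")  # => "unknown"
```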

Thank you for writing the gem - it has helped me find many interesting link errors.

Make the output more readable

Follow-up to #71: I propose making the output more readable by using indentation (inspired by npm). Note the issue count at the end of some lines when an issue appears more than once. Examples:

Sorted by path

./_site/blog/a-whisper/index.html (4)
├── image /assets/body/black_down_arrow-c5df63bbaa0639b1295aa92bf32fe9ff.png does not have an alt attribute
├── image /assets/body/rss-c859bf63379b25bc6e44eba6f7a8b5ed.png does not have an alt attribute (2)
└── image /blog/images/waterfall.jpg does not have an alt attribute
./_site/blog/advanced-ratcheting/index.html (3)
├── image /assets/body/black_down_arrow-c5df63bbaa0639b1295aa92bf32fe9ff.png does not have an alt attribute
└── image /assets/body/rss-c859bf63379b25bc6e44eba6f7a8b5ed.png does not have an alt attribute (2)

Sorted by issue

image /assets/body/black_down_arrow-c5df63bbaa0639b1295aa92bf32fe9ff.png does not have an alt attribute (2)
├── ./_site/blog/a-whisper/index.html
└── ./_site/blog/advanced-ratcheting/index.html
image /assets/body/rss-c859bf63379b25bc6e44eba6f7a8b5ed.png does not have an alt attribute (4)
├── ./_site/blog/a-whisper/index.html (2)
└── ./_site/blog/advanced-ratcheting/index.html (2)
image /blog/images/waterfall.jpg does not have an alt attribute
└── ./_site/blog/a-whisper/index.html
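The grouped output above can be sketched roughly like this (`render_grouped` is a hypothetical helper; issues are assumed to be [file, message] pairs; requires Ruby 2.7+ for `Enumerable#tally`):

```ruby
# Hedged sketch of the proposed grouping: issues keyed by file,
# rendered with tree glyphs, repeated messages collapsed into one
# line with a count, as in the examples above.
def render_grouped(issues)
  issues.group_by(&:first).map do |file, group|
    counts = group.map(&:last).tally
    msgs = counts.map { |msg, n| n > 1 ? "#{msg} (#{n})" : msg }
    lines = ["#{file} (#{group.size})"]
    msgs.each_with_index do |msg, i|
      glyph = (i == msgs.size - 1) ? "└── " : "├── "
      lines << glyph + msg
    end
    lines.join("\n")
  end.join("\n")
end
```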

Content negotiation

Proofer should emulate content negotiation for HTML files. We could try to do it like Apache’s MultiViews:

The effect of MultiViews is as follows: if the server receives a request for /some/dir/foo, if /some/dir has MultiViews enabled, and /some/dir/foo does not exist, then the server reads the directory looking for files named foo.*, and effectively fakes up a type map which names all those files, assigning them the same media types and content-encodings it would have if the client had asked for one of them by name. It then chooses the best match to the client's requirements.

That said, I think it’s time for a dedicated Internal class that fakes all the server behaviors we support: DirectoryIndex, MultiViews, followlocation, hashes, etc.

uri = Proofer::Internal.new("path/to/internal/resource", options = {})

if uri.invalid?
  if uri.hash?
    issues << "Hash not found"
  elsif uri.empty?
    issues << "URI empty"
  elsif uri.ugly?
    issues << "URI ugly"
  end
end
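Apache's MultiViews lookup described above can be sketched like so (`resolve_multiviews` is a hypothetical helper): when the exact file is missing, fall back to a sibling sharing the name plus an extension.

```ruby
require "tmpdir"

# Hedged sketch of MultiViews-style resolution for internal files:
# if "some/dir/foo" does not exist, look for "some/dir/foo.*" and
# pick a candidate, roughly as Apache does.
def resolve_multiviews(path)
  return path if File.exist?(path)
  Dir.glob("#{path}.*").min  # deterministic pick among candidates
end
```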

Allow URLs to be ignored by attribute

Would be awesome if you could add a ci-ignore class or something similar to a link for it to be ignored by HTML Proofer.

The biggest use case would be hashes that are handled by JavaScript (e.g., Backbone fragments), but also URLs generated dynamically that wouldn't be practical to add to href_ignore.

I'd imagine it'd be something like:

<a href="#print" class="ci-ignore">Print</a>

Glad to take a pass at it, if there's interest.
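A sketch of the filtering idea (regex-based for illustration only; html-proofer later added a data-proofer-ignore attribute for exactly this): anchor tags carrying the marker are dropped before link checking.

```ruby
# Hedged sketch (checkable_links is a hypothetical helper): reject
# any <a> tag whose attributes contain the ignore marker.
def checkable_links(html, marker = "ci-ignore")
  html.scan(/<a\b[^>]*>/).reject { |tag| tag.include?(marker) }
end

html = '<a href="#print" class="ci-ignore">Print</a><a href="/about">About</a>'
checkable_links(html)  # => ['<a href="/about">']
```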

Checking the srcset attribute

Images can have a srcset attribute:

When authors adapt their sites for high-resolution displays, they often need to be able to use different assets representing the same image. We address this need for adaptive, bitmapped content images by adding a srcset attribute to the img element.
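Checking it would mean splitting the srcset value into its comma-separated candidates and validating each URL like a src attribute (sketch with a hypothetical helper; a data: URI containing commas would need smarter parsing):

```ruby
# Hedged sketch (srcset_urls is a hypothetical helper): a srcset
# value is a comma-separated list of "URL [descriptor]" candidates;
# each URL is extracted and would then be checked like a src.
def srcset_urls(srcset)
  srcset.split(",").map { |candidate| candidate.strip.split(/\s+/).first }
end

srcset_urls("/img/logo.png 1x, /img/logo@2x.png 2x")
# => ["/img/logo.png", "/img/logo@2x.png"]
```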

Warn on permanent redirects (301)

Over time, external links get moved permanently, because nice guys don’t break the web. In most cases that means “the old URL is deprecated, use the new URL”. Can we have an option to output a warning for those links?

There is another case: automatic server-side redirection, when people forget the trailing slash in their internal links. An example is Bootstrap’s main menu: it lists the lazy /components instead of the correct /components/. This wastes a round trip on every click.

Issues with SSL checks

./out/ssl-configuration.html: External link https://www.openssl.org/ failed: 0 Peer certificate cannot be authenticated with given CA certificates

./out/what-is-my-disk-quota.html: External link https://www.npmjs.org/ failed: 0 Peer certificate cannot be authenticated with given CA certificates

./out/what-are-other-good-resources-for-learning-git-and-github.html: External link https://www.codeschool.com/courses/git-real failed: 0 Peer certificate cannot be authenticated with given CA certificates

./out/what-are-other-good-resources-for-learning-git-and-github.html: External link https://www.codeschool.com/ failed: 0 Peer certificate cannot be authenticated with given CA certificates

How to use from command line?

I can install html-proofer.

$ gem install html-proofer
Successfully installed html-proofer-0.6.0

How can I use it from the command line? Neither html-proofer nor htmlproof works here.

Use Commander?

I'm no pro at Ruby development but I think this would be really useful as a command-line executable tool.

Nokogiri dependency brings CI builds to a crawl

I'm just testing html-proofer on a random site of "stuff" I have built, and the Nokogiri dependency slows everything down a lot, as its installation is incredibly slow.

What is the "best practice" around this? Build locally or store vendored versions of gems?
