htmlrulesanitizer's Introduction

HtmlRuleSanitizer

HtmlRuleSanitizer is a whitelist, rule-based HTML sanitizer built on top of the HTML Agility Pack. Use it to clean up HTML and remove malicious content.

var sanitizer = HtmlSanitizer.SimpleHtml5Sanitizer();
string cleanHtml = sanitizer.Sanitize(dirtyHtml);

Without configuration HtmlRuleSanitizer will strip absolutely everything. This ensures that you are in control of what HTML is getting through. It was inspired by the client side parser of the wysihtml5 editor.

Use cases

HtmlRuleSanitizer was designed with the following use cases in mind:

  • Prevent cross-site scripting (XSS) attacks by removing JavaScript and other malicious HTML fragments.
  • Restrict HTML to simple markup in order to allow for easy transformation to other document types without having to deal with all possible HTML tags.
  • Enforce nofollow on links to discourage link spam.
  • Clean up submitted HTML, for example by removing empty tags.
  • Restrict HTML to a limited set of tags, for example in a comment system.

Features

  • CSS class white listing
  • Empty tag removal
  • Tag white listing
  • Tag attribute and CSS class enforcement
  • Tag flattening to simplify document structure while maintaining content
  • Tag renaming
  • Attribute checks (e.g. URL validity) and white listing
  • Attribute quote normalization
  • A fluent style configuration interface
  • HTML entity encoding
  • Comment removal

Usage

Install the HtmlRuleSanitizer NuGet package. Optionally add the following using statement in the file where you intend to use HtmlRuleSanitizer:

using Vereyon.Web;

Basic usage

var sanitizer = HtmlSanitizer.SimpleHtml5Sanitizer();
string cleanHtml = sanitizer.Sanitize(dirtyHtml);

Note: SimpleHtml5Sanitizer returns a rule set that does not allow a full document definition. For full documents, use SimpleHtml5DocumentSanitizer.

Sanitize a document

When dealing with full HTML documents including the html and body tags, use SimpleHtml5DocumentSanitizer:

var sanitizer = HtmlSanitizer.SimpleHtml5DocumentSanitizer();
string cleanHtml = sanitizer.Sanitize(dirtyHtml);

Configuration

The code below demonstrates how to configure a rule set that allows only strong, i and a tags, and that requires links to have a valid URL, be nofollow and open in a new window. In addition, any b tag is renamed to strong, because the two are rendered more or less identically. Empty tags are removed. This is a reasonable starting point for comment processing.

var sanitizer = new HtmlSanitizer();
sanitizer.Tag("strong").RemoveEmpty();
sanitizer.Tag("b").Rename("strong").RemoveEmpty();
sanitizer.Tag("i").RemoveEmpty();
sanitizer.Tag("a").SetAttribute("target", "_blank")
	.SetAttribute("rel", "nofollow")
	.CheckAttributeUrl("href")
	.RemoveEmpty();

string cleanHtml = sanitizer.Sanitize(dirtyHtml);

CSS class whitelisting

Global CSS class whitelisting is achieved as follows where CSS classes are space separated:

sanitizer.AllowCss("legal also-legal");

Custom attribute sanitization

Attribute sanitization can be performed by implementing a custom IHtmlAttributeSanitizer. The code below illustrates a simple custom sanitizer which overrides the attribute value:

class CustomSanitizer : IHtmlAttributeSanitizer
{
    public SanitizerOperation SanitizeAttribute(HtmlAttribute attribute, HtmlSanitizerTagRule tagRule)
    {
        // Override the attribute value and leave the attribute itself in place.
        attribute.Value = "123";
        return SanitizerOperation.DoNothing;
    }
}

The custom sanitizer can then be assigned to the desired attributes as follows:

var sanitizer = new HtmlSanitizer();
var attributeSanitizer = new CustomSanitizer();
sanitizer.Tag("span").SanitizeAttributes("style", attributeSanitizer);

Custom element sanitization

Element sanitization can be performed by implementing a custom IHtmlElementSanitizer, much like custom attribute sanitization. The code below illustrates a custom sanitizer which removes span elements that contain the text "remove me":

var sanitizer = new HtmlSanitizer();
sanitizer.Tag("span").Sanitize(new CustomSanitizer(element =>
{
    return element.InnerText == "remove me"
        ? SanitizerOperation.RemoveTag
        : SanitizerOperation.DoNothing;
}));

Contributing

Contributions are welcome through a GitHub pull request.

Setup

dotnet restore

Tests

Got tests? Yes, see the tests project. It uses xUnit.

cd Web.HtmlSanitizer.Tests/
dotnet test

More information

License

MIT X11

htmlrulesanitizer's People

Contributors

aaubry, cakkermans, dahall, itsdrewmiller, leotsarev, mtriff, speshulk926

htmlrulesanitizer's Issues

How to report security issues

Hello there,

I found that it is possible to craft inputs that bypass HtmlRuleSanitizer and achieve XSS. It is not quite clear how to report them without dumping them in a public GitHub issue, which I'd rather avoid for obvious reasons.
I tried messaging [email protected] but got no response - so is there a better way to get in touch?

Cheers,
leeN

Multi-targeted assemblies and NuGet package

If you are using VS2017 as your IDE, you can convert the project so that it will build against multiple target framework versions (e.g. 3.5, 4.0, 4.5 and .NET Standard 2.0) and automatically package them into a NuGet package. If this is something you'd consider doing, let me know and I'll do the work and submit a Pull Request. This would help on a project I'm doing.

Invalid HTML output when using void tags

When cleaning the HTML below, the sanitizer generates invalid HTML output (closing tags are missing).
Perhaps this is a problem with void tags.
I used HtmlSanitizer.SimpleHtml5Sanitizer() in this example.

INPUT

<p><img src="./x.jpg"></p>
<p><img src="./y.jpg"></p>
<p><img src="./z.jpg"></p>
<p>Tekst<br></p>
<p><svg viewBox="0 0 120 120" xmlns="http://www.w3.org/2000/svg"><rect x="10" y="10" width="100" height="100" rx="15"/></svg></p>
<p><input type="text"></p>

OUTPUT

<p><p><p><p>Tekst<br></p><p><p>

Link to .NETFiddle with this case -> https://dotnetfiddle.net/mvVdbQ

Error when parsing mailto href attributes

The following unit test runs correctly:

        [Fact]
        public void AHrefUrlCheckMailToTest()
        {

	        string result;
	        var sanitizer = new HtmlSanitizer();
	        sanitizer.Tag("a").CheckAttribute("href", HtmlSanitizerCheckType.Url);

	        // Test a mailto URL, which should pass.
	        var input = @"<a href=""mailto:[email protected]?subject=test"">MailTo</a>";
	        var expected = @"<a href=""mailto:[email protected]?subject=test"">MailTo</a>";
	        result = sanitizer.Sanitize(input);
	        Assert.Equal(expected, result);
        }

However, if you have a space in the subject argument:

        [Fact]
        public void AHrefUrlCheckMailToTest()
        {

	        string result;
	        var sanitizer = new HtmlSanitizer();
	        sanitizer.Tag("a").CheckAttribute("href", HtmlSanitizerCheckType.Url);

	        // Test a mailto URL with a space in the subject, which should pass but does not.
	        var input = @"<a href=""mailto:[email protected]?subject=test this"">MailTo</a>";
	        var expected = @"<a href=""mailto:[email protected]?subject=test this"">MailTo</a>";
	        result = sanitizer.Sanitize(input);
	        Assert.Equal(expected, result);
        }

The test fails. You can work around this by using %20 instead of a space in the input string.
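The %20 workaround can be sketched as follows: percent-encode the subject before building the href, so that the value handed to the sanitizer contains no raw spaces. `MailtoLinks.Build` and the address `someone@example.com` are hypothetical names used only for illustration; they are not part of HtmlRuleSanitizer.

```csharp
using System;

static class MailtoLinks
{
    // Hypothetical helper (not part of HtmlRuleSanitizer): builds a mailto href
    // whose subject is percent-encoded, so a space becomes %20.
    public static string Build(string address, string subject)
    {
        return "mailto:" + address + "?subject=" + Uri.EscapeDataString(subject);
    }
}
```

For example, `MailtoLinks.Build("someone@example.com", "test this")` produces `mailto:someone@example.com?subject=test%20this`, which has the same shape as the input of the passing test.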

Fails a basic test case

            var san = HtmlSanitizer.SimpleHtml5Sanitizer();
            foreach (var t in "p br i b tt strong".Split(" "))
            {
                san.Tag(t).RemoveEmpty();
            }
            var s = san.Sanitize("<html><script src=\"abc\"><body><p>ABC<b>abc</b><p>XYZ<b>xyz</p><u><li>abc<li>xyz</li></body></html>");

returns an empty string. Does the class sanitize only HTML fragments rather than full HTML documents? That is not very useful when the HTML comes from external sources beyond our control, because it would require stripping container elements such as <html>, <head> and <body> beforehand.

HtmlRuleSanitizer expects HtmlAgilityPack to be exactly 1.4.9

Hi, when I try to update the NuGet reference to HtmlAgilityPack to the latest stable version, it causes unexpected breakage at run time:

System.IO.FileLoadException: Could not load file or assembly 'HtmlAgilityPack, Version=1.4.9.5, Culture=neutral, PublicKeyToken=bd319b19eaf3b43a' or one of its dependencies. The located assembly's manifest definition does not match the assembly reference. (Exception from HRESULT: 0x80131040)

Stacktrace
Vereyon.Web.HtmlSanitizer.Sanitize(String html)

Maybe I'm doing something wrong, however...

Enforce single quotes instead of double quotes for attributes

Hi,
This library works fine for all my use cases. In some cases, however, I need added HTML attributes to use single quotes instead of double quotes.

For example:
Original string: <a href='www.example.com'>
After setting the target attribute it becomes: <a href='www.example.com' target="_blank">
But I need: <a href='www.example.com' target='_blank'>

Is there an option to do this?

Multithreaded use of .Sanitize ?

Is it safe to create a single instance of HtmlSanitizer, either as a lazy singleton or as a static member field, and then use it from multiple threads (i.e. while processing multiple HTTP requests in parallel)?
Or is the only way to create a new instance for each thread, in my case for each HTTP request?

[BREAKING] No longer possible to override UrlCheckerAttributeSanitizer or construct an instance of it

Previously I was able to create my own sanitizer AllowWhiteListedIframeDomains based on UrlCheckerAttributeSanitizer

It was like:

internal class AllowWhiteListedIframeDomains : UrlCheckerAttributeSanitizer
{
    private AllowWhiteListedIframeDomains() { }
    public static AllowWhiteListedIframeDomains Default { get; private set; } = new AllowWhiteListedIframeDomains();

    protected override bool AttributeUrlCheck(HtmlAttribute attribute)
    {
        var baseResult = base.AttributeUrlCheck(attribute);
        if (!baseResult)
        {
            return false;
        }

        if (attribute.Value.StartsWith("https://music.yandex.ru/iframe/")
            || attribute.Value.StartsWith("https://www.youtube.com/embed/")
            || attribute.Value.StartsWith("https://ok.ru/videoembed/")
            )
        {
            return true;
        }

        return false;
    }
}

Now it's no longer possible, because UrlCheckerAttributeSanitizer defaults to an empty list of allowed schemes and only code internal to the HtmlRuleSanitizer assembly can set the schemes.

Solutions:

  1. If UrlCheckerAttributeSanitizer isn't meant to be used outside of the assembly, make it internal (and remove virtual from AttributeUrlCheck).
  2. Otherwise, either:
     a. make the setter for AllowedUriSchemes publicly accessible (or at least protected internal), or
     b. make it a constructor argument.
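Independently of which solution is chosen, the prefix check itself does not depend on UrlCheckerAttributeSanitizer. A minimal sketch, assuming the same three prefixes as in the class above: `IframeSources` and `IsAllowed` are hypothetical names, not part of HtmlRuleSanitizer. A predicate like this could presumably be called from a custom IHtmlAttributeSanitizer such as the one shown in the README.

```csharp
using System;

static class IframeSources
{
    // Prefixes copied from the AllowWhiteListedIframeDomains class in this issue.
    private static readonly string[] AllowedPrefixes =
    {
        "https://music.yandex.ru/iframe/",
        "https://www.youtube.com/embed/",
        "https://ok.ru/videoembed/",
    };

    // Hypothetical helper: returns true only when the URL starts with one of
    // the whitelisted prefixes. Ordinal comparison avoids culture-dependent matching.
    public static bool IsAllowed(string url)
    {
        if (string.IsNullOrEmpty(url))
        {
            return false;
        }

        foreach (var prefix in AllowedPrefixes)
        {
            if (url.StartsWith(prefix, StringComparison.Ordinal))
            {
                return true;
            }
        }

        return false;
    }
}
```

Note that this sketch only covers the prefix whitelist; it does not replace the scheme validation that the base class performed.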

em tag gets deleted completely

The em tag gets deleted completely, as shown below:

em_replacement

It is supposed to be, but after sanitization the whole text gets deleted.

Bold text

I have included p, em and strong, along with other HTML tags, in the whitelist. It is still getting deleted.

Relative urls in anchor tag throw an exception

Changing your UrlCheckTest() to
var inputIllegal = @"<a href=""../relative.htm"">Relative link</a>";

causes AttributeUrlCheck() to throw an InvalidOperationException when the uri.Scheme property is accessed. You need to add an absolute URI check:
if (!uri.IsWellFormedOriginalString() || !uri.IsAbsoluteUri)
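The guard suggested above can be illustrated with a standalone sketch that uses only System.Uri. `UrlGuard.HasAllowedScheme` is a hypothetical helper, not the library's AttributeUrlCheck(). It shows the ordering that matters: check IsAbsoluteUri before reading Scheme, because Scheme throws InvalidOperationException on a relative Uri.

```csharp
using System;

static class UrlGuard
{
    // Hypothetical helper (not part of HtmlRuleSanitizer): returns true only for
    // absolute URIs whose scheme is in the allowed list. Relative URIs are
    // rejected before Scheme is read, so no exception is thrown for them.
    public static bool HasAllowedScheme(string value, params string[] allowedSchemes)
    {
        if (!Uri.TryCreate(value, UriKind.RelativeOrAbsolute, out var uri))
        {
            return false;
        }

        if (!uri.IsAbsoluteUri)
        {
            return false;
        }

        return Array.Exists(
            allowedSchemes,
            s => string.Equals(s, uri.Scheme, StringComparison.OrdinalIgnoreCase));
    }
}
```

With this ordering, `UrlGuard.HasAllowedScheme("../relative.htm", "http", "https")` returns false instead of throwing. Whether relative links should be rejected or allowed is a separate policy decision that this sketch does not settle.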

Special Characters, Space and Links

Hi There,

The library works fine for me except for some special characters. The problem is that some special characters in the HTML text are converted to an encoded (Unicode) form.
For instance,
This is initial text.
initial2
Sanitized:
sanitized2

Initial:
initial_link
Sanitized:
sanitized_link

Initial:
initial_symbol
Sanitized:
sanitized_symbol

I don't want these changes to happen. Is there any way to prevent this, or to put these characters on a whitelist?

Thanks,
Laxman Mankala
