htmlsanitizer's Introduction

HtmlSanitizer

netstandard2.0 net46

HtmlSanitizer is a .NET library for cleaning HTML fragments and documents from constructs that can lead to XSS attacks. It uses AngleSharp to parse, manipulate, and render HTML and CSS.

Because HtmlSanitizer is based on a robust HTML parser it can also shield you from deliberate or accidental "tag poisoning" where invalid HTML in one fragment can corrupt the whole document leading to broken layout or style.

In order to facilitate different use cases, HtmlSanitizer can be customized at several levels:

  • Configure allowed HTML tags through the property AllowedTags. All other tags will be stripped.
  • Configure allowed HTML attributes through the property AllowedAttributes. All other attributes will be stripped.
  • Configure allowed CSS property names through the property AllowedCssProperties. All other styles will be stripped.
  • Configure allowed CSS at-rules through the property AllowedAtRules. All other at-rules will be stripped.
  • Configure allowed URI schemes through the property AllowedSchemes. All other URIs will be stripped.
  • Configure HTML attributes that contain URIs (such as "src", "href" etc.) through the property UriAttributes.
  • Provide a base URI that will be used to resolve relative URIs against.
  • Cancelable events are raised before a tag, attribute, or style is removed.
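A minimal sketch of a few of these customization points, assuming the Ganss.Xss namespace and that the allow-lists behave as ordinary string sets. The particular tags, attributes, and schemes chosen here are illustrative only, not a recommended policy:

```csharp
using Ganss.Xss;

var sanitizer = new HtmlSanitizer();

// Narrow the tag allow-list to a handful of tags (illustrative choice).
sanitizer.AllowedTags.Clear();
sanitizer.AllowedTags.Add("p");
sanitizer.AllowedTags.Add("b");
sanitizer.AllowedTags.Add("i");
sanitizer.AllowedTags.Add("a");

// Drop inline styles entirely by removing the style attribute from the allow-list.
sanitizer.AllowedAttributes.Remove("style");

// Permit mailto: in addition to the default schemes.
sanitizer.AllowedSchemes.Add("mailto");

// Relative URIs are resolved against the base URI passed to Sanitize().
var sanitized = sanitizer.Sanitize(
    "<p>See <a href=\"/docs\">the docs</a><script>alert(1)</script></p>",
    "https://www.example.com");
```

Starting from the defaults and removing entries, rather than clearing a list, is usually the smaller change; clearing is shown here only to make the effect of the allow-list obvious.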

Usage

Install the HtmlSanitizer NuGet package. Then:

using Ganss.Xss;
var sanitizer = new HtmlSanitizer();
var html = @"<script>alert('xss')</script><div onload=""alert('xss')"""
    + @"style=""background-color: rgba(0, 0, 0, 1)"">Test<img src=""test.png"""
    + @"style=""background-image: url(javascript:alert('xss')); margin: 10px""></div>";
var sanitized = sanitizer.Sanitize(html, "https://www.example.com");
var expected = @"<div style=""background-color: rgba(0, 0, 0, 1)"">"
    + @"Test<img src=""https://www.example.com/test.png"" style=""margin: 10px""></div>";
Assert.Equal(expected, sanitized);

There's an online demo, and there's also a .NET Fiddle you can play with.

More example code and a description of possible options can be found in the Wiki.

Tags allowed by default

a, abbr, acronym, address, area, article, aside, b, bdi, big, blockquote, body, br, button, caption, center, cite, code, col, colgroup, data, datalist, dd, del, details, dfn, dir, div, dl, dt, em, fieldset, figcaption, figure, font, footer, form, h1, h2, h3, h4, h5, h6, head, header, hr, html, i, img, input, ins, kbd, keygen, label, legend, li, main, map, mark, menu, menuitem, meter, nav, ol, optgroup, option, output, p, pre, progress, q, rp, rt, ruby, s, samp, section, select, small, span, strike, strong, sub, summary, sup, table, tbody, td, textarea, tfoot, th, thead, time, tr, tt, u, ul, var, wbr

Attributes allowed by default

abbr, accept-charset, accept, accesskey, action, align, alt, autocomplete, autosave, axis, bgcolor, border, cellpadding, cellspacing, challenge, char, charoff, charset, checked, cite, clear, color, cols, colspan, compact, contenteditable, coords, datetime, dir, disabled, draggable, dropzone, enctype, for, frame, headers, height, high, href, hreflang, hspace, ismap, keytype, label, lang, list, longdesc, low, max, maxlength, media, method, min, multiple, name, nohref, noshade, novalidate, nowrap, open, optimum, pattern, placeholder, prompt, pubdate, radiogroup, readonly, rel, required, rev, reversed, rows, rowspan, rules, scope, selected, shape, size, span, spellcheck, src, start, step, style, summary, tabindex, target, title, type, usemap, valign, value, vspace, width, wrap

Note: to prevent classjacking and interference with classes where the sanitized fragment is to be integrated, the class attribute is disallowed by default. It can be added as follows:

var sanitizer = new HtmlSanitizer();
sanitizer.AllowedAttributes.Add("class");
var sanitized = sanitizer.Sanitize(html);

CSS properties allowed by default

align-content, align-items, align-self, all, animation, animation-delay, animation-direction, animation-duration, animation-fill-mode, animation-iteration-count, animation-name, animation-play-state, animation-timing-function, backface-visibility, background, background-attachment, background-blend-mode, background-clip, background-color, background-image, background-origin, background-position, background-position-x, background-position-y, background-repeat, background-repeat-x, background-repeat-y, background-size, border, border-bottom, border-bottom-color, border-bottom-left-radius, border-bottom-right-radius, border-bottom-style, border-bottom-width, border-collapse, border-color, border-image, border-image-outset, border-image-repeat, border-image-slice, border-image-source, border-image-width, border-left, border-left-color, border-left-style, border-left-width, border-radius, border-right, border-right-color, border-right-style, border-right-width, border-spacing, border-style, border-top, border-top-color, border-top-left-radius, border-top-right-radius, border-top-style, border-top-width, border-width, bottom, box-decoration-break, box-shadow, box-sizing, break-after, break-before, break-inside, caption-side, caret-color, clear, clip, color, column-count, column-fill, column-gap, column-rule, column-rule-color, column-rule-style, column-rule-width, column-span, column-width, columns, content, counter-increment, counter-reset, cursor, direction, display, empty-cells, filter, flex, flex-basis, flex-direction, flex-flow, flex-grow, flex-shrink, flex-wrap, float, font, font-family, font-feature-settings, font-kerning, font-language-override, font-size, font-size-adjust, font-stretch, font-style, font-synthesis, font-variant, font-variant-alternates, font-variant-caps, font-variant-east-asian, font-variant-ligatures, font-variant-numeric, font-variant-position, font-weight, gap, grid, grid-area, grid-auto-columns, grid-auto-flow, grid-auto-rows, grid-column, 
grid-column-end, grid-column-gap, grid-column-start, grid-gap, grid-row, grid-row-end, grid-row-gap, grid-row-start, grid-template, grid-template-areas, grid-template-columns, grid-template-rows, hanging-punctuation, height, hyphens, image-rendering, isolation, justify-content, left, letter-spacing, line-break, line-height, list-style, list-style-image, list-style-position, list-style-type, margin, margin-bottom, margin-left, margin-right, margin-top, mask, mask-clip, mask-composite, mask-image, mask-mode, mask-origin, mask-position, mask-repeat, mask-size, mask-type, max-height, max-width, min-height, min-width, mix-blend-mode, object-fit, object-position, opacity, order, orphans, outline, outline-color, outline-offset, outline-style, outline-width, overflow, overflow-wrap, overflow-x, overflow-y, padding, padding-bottom, padding-left, padding-right, padding-top, page-break-after, page-break-before, page-break-inside, perspective, perspective-origin, pointer-events, position, quotes, resize, right, row-gap, scroll-behavior, tab-size, table-layout, text-align, text-align-last, text-combine-upright, text-decoration, text-decoration-color, text-decoration-line, text-decoration-skip, text-decoration-style, text-indent, text-justify, text-orientation, text-overflow, text-shadow, text-transform, text-underline-position, top, transform, transform-origin, transform-style, transition, transition-delay, transition-duration, transition-property, transition-timing-function, unicode-bidi, user-select, vertical-align, visibility, white-space, widows, width, word-break, word-spacing, word-wrap, writing-mode, z-index

CSS at-rules allowed by default

namespace, style

style refers to style declarations within other at-rules such as @media. Disallowing @namespace while allowing other types of at-rules can lead to errors. Property declarations in @font-face and @viewport are not sanitized.

Note: the style tag is disallowed by default.

URI schemes allowed by default

http, https

Note: Protocol-relative URLs (e.g. //github.com) are allowed by default (as are other relative URLs).

To allow mailto: links:

sanitizer.AllowedSchemes.Add("mailto");

Default attributes that contain URIs

action, background, dynsrc, href, lowsrc, src

Thread safety

The Sanitize() and SanitizeDocument() methods are thread-safe, i.e. you can use these methods on a single shared instance from different threads provided you do not simultaneously set instance or static properties. A typical use case is that you prepare an HtmlSanitizer instance once (i.e. set desired properties such as AllowedTags etc.) from a single thread, then call Sanitize()/SanitizeDocument() from multiple threads.
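The "configure once, then share" pattern described above can be sketched as follows. The input fragments are hypothetical placeholders, and the use of Parallel.For is just one way to exercise the instance from several threads:

```csharp
using System.Linq;
using System.Threading.Tasks;
using Ganss.Xss;

// Configure once, on a single thread, before any concurrent use.
var sanitizer = new HtmlSanitizer();
sanitizer.AllowedAttributes.Add("class");

// Hypothetical inputs standing in for user-submitted fragments.
var inputs = Enumerable.Range(0, 100)
    .Select(i => $"<p class=\"c{i}\">item {i}<script>alert({i})</script></p>")
    .ToArray();

// Share the configured instance across threads; no properties are modified from here on.
var outputs = new string[inputs.Length];
Parallel.For(0, inputs.Length, i => outputs[i] = sanitizer.Sanitize(inputs[i]));
```

Each iteration writes to its own array slot, so the only shared state is the sanitizer itself.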

Text content not necessarily preserved as-is

Please note that as the input is parsed by AngleSharp's HTML parser and then rendered back out, you cannot expect the text content to be preserved exactly as it was input, even if no elements or attributes were removed. Examples:

  • 4 < 5 becomes 4 &lt; 5
  • <SPAN>test</p> becomes <span>test<p></p></span>
  • <span title='test'>test</span> becomes <span title="test">test</span>

On the other hand, although some broken HTML is fixed by the parser, the output might still contain invalid HTML. Examples:

  • <div><li>test</li></div>
  • <ul><br><li>test</li></ul>
  • <h3><p>test</p></h3>

License

MIT License

htmlsanitizer's People

Contributors

304notmodified, admirpajalic, alexbyte, archimed-lefebvre, bjornri, chucklu, dependabot-preview[bot], dependabot[bot], emptygit, intelorca, jawvig, jerriep, lahma, leniency, markashleybell, mganss, naasking, reinaldocoelho, the-nutty, vanillajonathan

htmlsanitizer's Issues

Some questions

Hi,

I have a few questions that are probably obvious to you (the creator), but as a user I'm not sure.

  • Why do we need the uriAttributes white list?
  • Is the uriAttributes list also used for CSS sanitizing (style tag)? If not, how are URLs sanitized in CSS?
  • Why is the class attribute not in the white list? Do you have some examples or references to XSS attacks on this one?
  • Are relative URLs sanitized?
  • How do I add the protocol-relative URL (//) to the white list? Will an empty string work?
  • Are the properties background and background-image safe in CSS as a result of the URL sanitization?
  • Is the content CSS property sanitized, and how?
  • Regular expressions are cool, but they are difficult to read due to all the (subtle) details. Can you add some documentation to the regexes? (I always use RegexOptions.IgnorePatternWhitespace and #)

Thanks!

PS: maybe expand the documentation with above questions/answers?

Quotes on quoted url on background-image from IE9 get encoded instead of replaced

I'm not 100% sure this is a bug, but it's an issue we've faced.
When inserting a background-image style attribute on IE9, the browser always quotes it with double quotes (").
However, when sanitizing, double quotes get translated into single quotes, and the single quotes on background-image get encoded.

Here's a test case:

        [Test]
        public void QuotedBackgroundImageFromIE9()
        {
            // Arrange
            var s = new HtmlSanitizer();

            // Act
            var htmlFragment = "<span style='background-image: url(\"/api/users/defaultAvatar\");'></span>";
            var actual = s.Sanitize(htmlFragment);

            // Assert
            var expected = "<span style=\"background-image: url('/api/users/defaultAvatar')\"></span>";
            Assert.That(actual, Is.EqualTo(expected).IgnoreCase);
        }

I was able to work around this, and will be improving the solution to avoid false matches, but it doesn't feel right:

            s.PostProcessNode +=
                (sender, args) =>
                {
                    if (!args.Node.HasStyle("background-image")) return;
                    args.Node.Style["background-image"] =
                        args.Node.Style["background-image"].Replace("url(%22", "url('").Replace("%22)", "')");
                };

Opinions?

Add a strong name to nuget package

The compiled DLL is missing a strong name, so it is not possible to use it in a strong-named project.

Is it possible to add a strong name during the build?

HtmlSanitizer doesn't sanitize html attribute values

Hi,
I am using the HtmlSanitizer library. My use case is to whitelist some HTML attributes like onmouseover, but I don't want JavaScript such as "alert" as the value of onmouseover. For example, the following is the HTML I want to sanitize:
<h1 onmouseover="alert('This is XSS attack')">XSS</h1>

So I want the sanitization library to check the values of attributes, which is not currently supported. For now I have downloaded the source code and made customizations as per my requirements. Do you have any plans to support attribute value sanitization? Please do let me know.

Thanks
Amit

Data URIs with more than 65519 characters are always stripped

I'm trying to embed images (base64 encoded) in some HTML that's stored in a db. The encoded data is stored in the src attribute. The src attribute is supposed to be included by default, but when I run the sanitizer it removes the src and its contents. The attribute looks something like this: src="data:image/png;base64,iVBORw0KGgoAAAANSU

Is there a way to do this? I even tried adding the src attribute to my sanitizer. Thanks.

HTML5 support

(feature request)
Please support HTML5:

  • HTML5 tags like header, nav, section etc
  • HTML5 attributes, like data-*, autocomplete, novalidate etc

This can of course be done by setting the properties, but it would be convenient if the library supported it directly. This could be the default, or a 'switch' such as a boolean AllowHtml5Tags, a boolean AllowDataAttributes etc
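One way to keep data-* attributes without listing each one is to cancel their removal in the RemovingAttribute event. This is a hedged sketch, assuming an AngleSharp-based version of the library in which the event arguments expose the attribute as e.Attribute with a Name property; the sample markup is made up for illustration:

```csharp
using System;
using Ganss.Xss;

var sanitizer = new HtmlSanitizer();

// Cancel the removal of any attribute whose name starts with "data-".
// All other disallowed attributes are still stripped.
sanitizer.RemovingAttribute += (sender, e) =>
{
    if (e.Attribute.Name.StartsWith("data-", StringComparison.OrdinalIgnoreCase))
    {
        e.Cancel = true;
    }
};

var sanitized = sanitizer.Sanitize("<div data-id=\"42\" onclick=\"alert(1)\">x</div>");
```

Note that cancelling removal keeps the attribute value untouched, so this is only appropriate where data-* values are not later interpreted as script or markup.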

Idea: create Javascript / jQuery port

This library makes heavy use of the CsQuery project, which is a good thing. CsQuery is a port of jQuery.

Wouldn't it be cool if this library were ported to JavaScript / jQuery?

Allowed properties should be ICollection<string>

Hi,

First of all thanks for this library and the quick responses.

I was working with this library and I realized that adding an allowed item (tag, attribute etc) is a bit impractical. This is the result of using IEnumerable for those properties.

For example, I'm adding an attribute to the allowed list with the following code:

var allowedAttributes = htmlSanitizer.AllowedAttributes.ToList();
allowedAttributes.Add(attributename);
htmlSanitizer.AllowedAttributes = allowedAttributes;

But I would prefer:

htmlSanitizer.AllowedAttributes.Add(attributename);

I suggest:

  • change the property signatures from IEnumerable<string> to ICollection<string> for easy adding and removal. This applies to the following properties:
    • AllowedSchemes
    • AllowedTags
    • AllowedAttributes
    • UriAttributes
    • AllowedCssProperties
  • use only HashSet<string> internally (with StringComparer.InvariantCultureIgnoreCase), so no Lists or Arrays. This is for performance reasons and clarity.

Issue with input <svg onload=alert(111)/>

Hi,

I am using your HtmlSanitizer and it works great. But I found an issue with one kind of input.

html = 1

When I give this input, it just does not give a response,

but when I give the encoded input

html = 1<svg%20onload=alert(111)/>

inputs

it works great.

Edit: The issue editor was not accepting my inputs (it seems to treat them as an injection), so I have uploaded an image to show them.

HTMLSanitizer not supporting "face" attribute

Hi!

In our application, we are using an Infragistics WebHtmlEditor, which when you set a font face, you will end up with the following markup : "< p > < font face = "Impact" size = "3" > font face test < /font > < /p >".
After I sanitize this, I end up with : "< p > < font size = "3" > font face test < /font > < /p >". Shouldn't "face" attribute be supported?
I did fix this by using sanitizer.AllowedAttributes.Add("face"), however, my concern is what other tags/attributes, etc might be missing from this list : https://pythonhosted.org/feedparser/html-sanitization.html as we might be having trouble with some other stylings as well.
Thanks

Quoted background-image becomes unbalanced

Sanitizing

<div style="background-image: url('some/random/url.img')"></div>

Removes the first '. This seems like a bug.

This was tested with nuget HtmlSanitizer 2.0.5623.30465

Here's a test case:

var _sanitizer = new Ganss.XSS.HtmlSanitizer();
var html = "<div style=\"background-image: url('some/random/url.img')\"></div>";
Assert.AreEqual("<div style=\"background-image: url('some/random/url.img')\"></div>", _sanitizer.Sanitize(html));

performance - contains on non-set

I noticed that some .Contains operations work on sets, and some on lists.

For performance reasons, .Contains on a set is always preferred over lists where possible. (The break-even point is at around 3-4 items, if I'm correct.)

UriAttributes.Contains currently operates on a non-set. It would be wise to build a set first.

NuGet Package version 3.x unable to install into project..

Hey,

So I am unable to install your 3.x NuGet package. Tried running as Admin and normal user.

Error: "Could not be found in your workspace, or you do not have permission to access it."

Packages:
Microsoft.Bcl.Build.1.0.14
Microsoft.Bcl.Async.1.0.168
AngleSharp 0.9.4
HtmlSanitizer 3.1.79

Any help would be greatly appreciated. It seems I am able to install HtmlSanitizer version 2.x. Is that the stable version, and is version 3.x still in beta?

Allow specific class names

I know you can allow or disallow the class attribute. It would be great if it were possible to specify a list of valid class names. For example, if I wanted to allow the class attribute and only allow the "control" class name, something like this:

var sanitizer = new HtmlSanitizer();
sanitizer.AllowedAttributes.Add("class");
sanitizer.AllowedClassNames.Add("control");
sanitizer.Sanitize("<div class=\"control\">"); // Passes sanitation
sanitizer.Sanitize("<div class=\"tiger\">"); // Should remove class "tiger"

Data URL Scheme Support

RFC 2397 specifies the data URL scheme. I understand that it may lead to an XSS attack when used with the text/html or text/plain media type, but I think a media type like image/png in an img tag should pass, as the browser will try to render it as an image. Or is there any particular reason it doesn't?

I created 4 tests for this issue

[Test]
public void DataUrlSchemeImgTagUsingImageMediaTypeTest()
{
    // Arrange
    var s = new HtmlSanitizer();

    // Act
    var htmlFragment = "<img src=\"data:image/gif;base64,R0lGODdhMAAwAPAAAAAAAP///ywAAAAAMAAwAAAC8IyPqcvt3wCcDkiLc7C0qwyGHhSWpjQu5yqmCYsapyuvUUlvONmOZtfzgFzByTB10QgxOR0TqBQejhRNzOfkVJ+5YiUqrXF5Y5lKh/DeuNcP5yLWGsEbtLiOSpa/TPg7JpJHxyendzWTBfX0cxOnKPjgBzi4diinWGdkF8kjdfnycQZXZeYGejmJlZeGl9i2icVqaNVailT6F5iJ90m6mvuTS4OK05M0vDk0Q4XUtwvKOzrcd3iq9uisF81M1OIcR7lEewwcLp7tuNNkM3uNna3F2JQFo97Vriy/Xl4/f1cf5VWzXyym7PHhhx4dbgYKAAA7\" alt=\"Larry\">";
    var actual = s.Sanitize(htmlFragment);

    // Assert
    var expected = "<img src=\"data:image/gif;base64,R0lGODdhMAAwAPAAAAAAAP///ywAAAAAMAAwAAAC8IyPqcvt3wCcDkiLc7C0qwyGHhSWpjQu5yqmCYsapyuvUUlvONmOZtfzgFzByTB10QgxOR0TqBQejhRNzOfkVJ+5YiUqrXF5Y5lKh/DeuNcP5yLWGsEbtLiOSpa/TPg7JpJHxyendzWTBfX0cxOnKPjgBzi4diinWGdkF8kjdfnycQZXZeYGejmJlZeGl9i2icVqaNVailT6F5iJ90m6mvuTS4OK05M0vDk0Q4XUtwvKOzrcd3iq9uisF81M1OIcR7lEewwcLp7tuNNkM3uNna3F2JQFo97Vriy/Xl4/f1cf5VWzXyym7PHhhx4dbgYKAAA7\" alt=\"Larry\">";
    Assert.That(actual, Is.EqualTo(expected).IgnoreCase);
}

[Test]
public void DataUrlSchemeImgTagUsingImageMediaTypeContainMaliciousJavaScriptTest()
{
    // Arrange
    var s = new HtmlSanitizer();

    // Act
    // base 64 encoded string of <script>alert("Hello");</script> but specified as image/gif
    var htmlFragment = "<img src=\"data:image/gif;base64,PHNjcmlwdD5hbGVydCgiSGVsbG8iKTs8L3NjcmlwdD4=\">";
    var actual = s.Sanitize(htmlFragment);

    // Assert
    var expected = "<img src=\"data:image/gif;base64,PHNjcmlwdD5hbGVydCgiSGVsbG8iKTs8L3NjcmlwdD4=\">";
    Assert.That(actual, Is.EqualTo(expected).IgnoreCase);
}

[Test]
public void DataUrlSchemeImgTagWithNonImageMediaType()
{
    // Arrange
    var s = new HtmlSanitizer();

    // Act
    // base 64 encoded string of <script>alert("Hello");</script>
    var htmlFragment = "<img src=\"data:text/html;base64,PHNjcmlwdD5hbGVydCgiSGVsbG8iKTs8L3NjcmlwdD4=\">";
    var actual = s.Sanitize(htmlFragment);

    // Assert
    var expected = "<img>";
    Assert.That(actual, Is.EqualTo(expected).IgnoreCase);
}

[Test]
public void DataUrlSchemeScriptTagTest()
{
    // Arrange
    var s = new HtmlSanitizer();

    // Act
    // base 64 encoded string of <script>alert("Hello");</script>
    var htmlFragment = "<script src=\"data:text/html;base64,PHNjcmlwdD5hbGVydCgiSGVsbG8iKTs8L3NjcmlwdD4=\"></script>";
    var actual = s.Sanitize(htmlFragment);

    // Assert
    var expected = "";
    Assert.That(actual, Is.EqualTo(expected).IgnoreCase);
}
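For readers who want data: URIs to survive sanitization, a minimal sketch is to add the scheme to AllowedSchemes. The base64 payload below is a short placeholder, not a complete image, and this sketch does not restrict the media type in any way:

```csharp
using Ganss.Xss;

var sanitizer = new HtmlSanitizer();

// Opt in to data: URIs. This admits every media type, not just images,
// so it widens what the sanitizer lets through.
sanitizer.AllowedSchemes.Add("data");

// Placeholder payload for illustration only.
var sanitized = sanitizer.Sanitize(
    "<img src=\"data:image/gif;base64,R0lGODlhAQABAAAAACw=\" alt=\"pixel\">");
```

Because the scheme check alone cannot tell image/gif from text/html, any restriction on media types would have to be added separately.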

Russian text support inside html(for example in the test SanitizeEscapeAttrTest)

Thank you for your answer #29. But it breaks the protection:

    var sanitizer = new HtmlSanitizer();
    var html = @"<div title=""&lt;foo&gt;"">Тест</div>";
    var outputFormatter = new CsQuery.Output.FormatDefault(DomRenderingOptions.RemoveComments | DomRenderingOptions.QuoteAllAttributes, HtmlEncoders.Minimum);
    var actual = sanitizer.Sanitize(html, "", outputFormatter);
    Assert.That(actual, Is.EqualTo(@"<div title=""&lt;foo&gt;"">Тест</div>").IgnoreCase);
  Expected string length 35 but was 29. Strings differ at index 12.
  Expected: "<div title="&lt;foo&gt;">Тест</div>", ignoring case
  But was:  "<div title="<foo>">Тест</div>"
  -----------------------^

Throws exception on multiple recipients in an email.

Sanitize the following HTML with the mailto: scheme enabled:

<a href="mailto:[email protected],[email protected]">Bang Bang</a>

Actual:

System.UriFormatException : Invalid URI: The hostname could not be parsed.
   at System.Uri.CreateHostStringHelper(String str, UInt16 idx, UInt16 end, ref Flags flags, ref String scopeId)
   at System.Uri.CreateHostString()
   at System.Uri.GetComponentsHelper(UriComponents uriComponents, UriFormat uriFormat)
   at System.Uri.GetComponents(UriComponents components, UriFormat format)
   at System.Uri.get_AbsoluteUri()
   at Ganss.XSS.HtmlSanitizer.SanitizeUrl(String url, String baseUrl)
   at Ganss.XSS.HtmlSanitizer.Sanitize(String html, String baseUrl, IOutputFormatter outputFormatter)

Expected:
No exception is thrown.

Sanitize throws MissingMethodException

Before updating to the latest version of HtmlSanitizer that uses AngleSharp, HtmlSanitizer was working in my application. After updating, however, calling Sanitize throws an exception. Here is how I'm using it (in an extension method):

        public static string Sanitize(this string htmlString)
        {
            var sanitizer = new HtmlSanitizer();
            return sanitizer.Sanitize(htmlString);
        }

And here are the details:
System.MissingMethodException was unhandled by user code
HResult=-2146233069
Message=Method not found: 'System.String AngleSharp.IMarkupFormattable.ToHtml(AngleSharp.IMarkupFormatter)'.
Source=HtmlSanitizer
StackTrace:
at Ganss.XSS.HtmlSanitizer.Sanitize(String html, String baseUrl, IMarkupFormatter outputFormatter)
at Tcbcsl.Presentation.Helpers.ExtensionMethods.Sanitize(String htmlString) in C:\Users\Jay\Documents\GitHubVisualStudio\Tcbcsl\Presentation\Helpers\ExtensionMethods.cs:line 119
at lambda_method(Closure , NewsEditModel )
at AutoMapper.Internal.DelegateBasedResolver`2.Resolve(ResolutionResult source)
at AutoMapper.NullReferenceExceptionSwallowingResolver.Resolve(ResolutionResult source)
at AutoMapper.PropertyMap.<>c.<ResolveValue>b__44_0(ResolutionResult current, IValueResolver resolver)
at System.Linq.Enumerable.Aggregate[TSource,TAccumulate](IEnumerable`1 source, TAccumulate seed, Func`3 func)
at AutoMapper.PropertyMap.ResolveValue(ResolutionContext context)
at AutoMapper.Mappers.TypeMapObjectMapperRegistry.PropertyMapMappingStrategy.MapPropertyValue(ResolutionContext context, Object mappedObject, PropertyMap propertyMap)
InnerException:

Use of Html namespace breaks @Html. Intellisense in MVC views

When typing @Html. in a .cshtml file, Visual Studio 2013 Ultimate Update 2 Intellisense shows HtmlSanitizer instead of the HtmlHelper methods when HtmlSanitizer has been added as a reference to the MVC5 Web project. The problem seems to be that the Html property of the view is colliding with the Html namespace of HtmlSanitizer.

A less generic namespace would likely solve the problem.

Russian text support

I have a problem with russian text:

Code:

[Test]
        public void TestRussianText()
        {
            // Arrange
            var s = new HtmlSanitizer();

            // Act
            var htmlFragment = "Тест";
            var actual = s.Sanitize(htmlFragment);

            // Assert
            var expected = htmlFragment;
            Assert.That(actual, Is.EqualTo(expected).IgnoreCase);
        }

Test result:

 Expected string length 4 but was 28. Strings differ at index 0.
  Expected: "Тест", ignoring case
  But was:  "&#1058;&#1077;&#1089;&#1090;"
  -----------^

Changelog

Hi!

I saw there is a new release on nuget. Is there a changelog of the new release(s)?

XHTML - Self Closing Tags

I've just been playing around with a few different HTML sanitizer libraries. This one was looking promising until I realized it changed:

<img src="..." />

to:

<img src="...">

Which is invalid XHTML. Although technically this is allowed in HTML I don't think it looks good and usually most people prefer to use a self closing tag.

V2

Is there an estimate for when version 2 will be released?

"RemovingTag" event isn't fired.

Hi,
After updating to HtmlSanitizer with AngleSharp, we noticed the following behaviour:

When sanitizing text that contains only a "script" or "style" element, the RemovingTag event isn't fired.

Here's a test case:

        [Test]
        public void RemoveEventForNotAllowedTag_ScriptTag()
        {
            RemoveReason? actual = null;
            var s = new HtmlSanitizer();
            s.RemovingTag += (sender, args) =>
            {
                actual = args.Reason;
            };
            s.Sanitize("<script>alert('Hello world!')</script>");
            Assert.That(actual, Is.EqualTo(RemoveReason.NotAllowedTag));
        }

        [Test]
        public void RemoveEventForNotAllowedTag_StyleTag()
        {
            RemoveReason? actual = null;
            var s = new HtmlSanitizer();
            s.RemovingTag += (sender, args) =>
            {
                actual = args.Reason;
            };
            s.Sanitize("<style> body {background-color:lightgrey;}</style>");
            Assert.That(actual, Is.EqualTo(RemoveReason.NotAllowedTag));
        }

Adding another tag to the text works fine.


        [Test]
        public void RemoveEventForNotAllowedTag_ScriptTagAndSpan()
        {
            RemoveReason? actual = null;
            var s = new HtmlSanitizer();
            s.RemovingTag += (sender, args) =>
            {
                actual = args.Reason;
            };
            s.Sanitize("<span>Hi</span><script>alert('Hello world!')</script>");
            Assert.That(actual, Is.EqualTo(RemoveReason.NotAllowedTag));
        }

Thanks

Mailto gets stripped

Hi,

Mailto gets stripped. Is there a way to allow this?

Example:
Input: <a href="mailto:[email protected]">Contact me!</a>
Output: <a>Contact me!</a>

Kind regards,
Paul.

Compatibility with .NET Framework 4.0

When installing the HtmlSanitizer NuGet package into a project targeting .NET Framework 4.0, the command-line tool throws the following error message:

Install-Package : Could not install package 'HtmlSanitizer 1.0.4925.29815'. You are trying to install this package into a project that targets '.NETFramework,Version=v4.0', but the package does not contain any assembly references or content files that are compatible with that framework.

It would be nice if the NuGet package allowed installation into .NET Framework 4.0 projects.

Incompatibility with AngleSharp 0.9.5

While using Sanitize in ASP.NET MVC with AngleSharp 0.9.5, I get the following error:

Could not load type 'AngleSharp.FormatExtensions' from assembly 'AngleSharp, Version=0.9.5.41771, Culture=neutral, PublicKeyToken=e83494dcdc6d31ea'.

The binding is properly set to:

<dependentAssembly>
  <assemblyIdentity name="AngleSharp" publicKeyToken="e83494dcdc6d31ea" culture="neutral" />
  <bindingRedirect oldVersion="0.0.0.0-0.9.5.41771" newVersion="0.9.5.41771" />
</dependentAssembly>

It does not happen with 0.9.4.

55 Failed Tests, 66 Passed

Current version (as of 2015-AUG-10) failed 55 unit tests, but passed 66.

The consistent issues for failures include (SanitizeUnicodeUrlTest() and others):

  1. Empty style tag (which is permitted)
    Expected string length 14 but was 20. Strings differ at index 4.
    Expected: "
    XSS
    ", ignoring case
    But was: "
    XSS
    "
    ---------------^
  2. Closed tag is valid but not expected (JavaScriptIncludeAndAngleBracketsTest() and others):
    Expected string length 4 but was 6. Strings differ at index 3.
    Expected: "
    ", ignoring case
    But was: "
    "
    --------------^
  3. Semi-colon in Style attribute is not expected (DisallowCssPropertyValueTest() and others):
    Expected string length 47 but was 48. Strings differ at index 35.
    Expected: "
    Test
    ", ignoring case
    But was: "
    Test
    "
    ----------------------------------------------^

I am willing to contribute, but the question arises as to the intended results. All tests would pass HTML validation, though they fail "good" HTML syntax verification (i.e. open tags are not considered a syntax failure in HTML). Please let me know if / how I might participate in this project.

Paul
(PS, Code passed Fortify SCA 4.30 scans with 0 issues)

CoreCLR support

Any plans to move to xproj / with CoreCLR support?

AngleSharp has it already, so won't be that difficult I guess.

Sanitize url containing []

Hi @mganss

I'm using your library HtmlSanitizer in my website. It works very well. This week we ran into a problem: HtmlSanitizer removes an image URL from Photobucket.
Input: <a href="http://media.photobucket.com/user/jade95_2010/media/PHOTOGRAPHY-VARIED/cat-face.jpg.html?filters[term]=cat&amp;filters[primary]=images&amp;filters[featured_media]=1220&amp;filters[secondary]=videos&amp;sort=1&amp;o=2" target="">text</a>.
Output: <a target>text</a>

The image URL doesn't pass the uri.IsWellFormedOriginalString() check. Can you explain why this URL is treated as unsafe? Is there any security concern?
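The check mentioned in this report can be tried in isolation with System.Uri alone. The sketch below uses a shortened form of the reported URL (as it looks after HTML entity decoding) and compares it with a variant whose square brackets are percent-encoded; the encoding step is an assumed workaround on the caller's side, not something the library is known to do:

```csharp
using System;

// Shortened form of the href from the report, after &amp; has been decoded to &.
var raw = "http://media.photobucket.com/user/jade95_2010/media/PHOTOGRAPHY-VARIED/cat-face.jpg.html"
    + "?filters[term]=cat&filters[primary]=images";

// Percent-encode the square brackets before the URL reaches the sanitizer.
var escaped = raw.Replace("[", "%5B").Replace("]", "%5D");

// Print whether each candidate parses as an absolute URI and passes the well-formedness check.
foreach (var candidate in new[] { raw, escaped })
{
    var ok = Uri.TryCreate(candidate, UriKind.Absolute, out var uri)
             && uri.IsWellFormedOriginalString();
    Console.WriteLine($"{ok}: {candidate}");
}
```

Running this against the target framework in use shows directly which form the well-formedness check accepts there.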

Thanks.
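A minimal sketch for reproducing the check described above using only the .NET base class library. The class name UriCheck and both helper methods are hypothetical and are not part of HtmlSanitizer. Percent-encoding the square brackets before sanitizing is an assumed workaround, not something the library is confirmed to do.

```csharp
using System;

// Hypothetical helper, not part of HtmlSanitizer.
public static class UriCheck
{
    // Mirrors the check named in the report: the string must parse as an
    // absolute URI and must also be well formed in its original form.
    public static bool IsWellFormed(string url) =>
        Uri.TryCreate(url, UriKind.Absolute, out var uri)
        && uri.IsWellFormedOriginalString();

    // Assumed workaround: percent-encode square brackets.
    // Caution: this would also rewrite brackets in an IPv6 host literal,
    // so it only suits URLs where brackets occur in the path or query.
    public static string EscapeBrackets(string url) =>
        url.Replace("[", "%5B").Replace("]", "%5D");
}
```

Calling UriCheck.IsWellFormed on the original Photobucket URL and again on UriCheck.EscapeBrackets(url) shows whether the unescaped brackets are what trips the check in a given .NET version.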

Unicode being converted to HTMLEntities.

Hi

All international characters are being converted to HTML entities. This is fine if you are protecting on output, but we need to sanitize before the data is stored in the database. There is no real alternative to this for us.

Is there any way that the international characters could be left as they are, so that we don't have to HTML-decode the output, which would introduce security issues and break the HTML when < characters appear?

HtmlSanitizer doesn't recursively sanitize the input data.

Hi,

I am using the HtmlSanitizer library in my project to sanitize input data. It works very nicely for all inputs, and thanks a lot for the library. While testing I came across a scenario in which I want HtmlSanitizer to recursively sanitize the input string, which it currently does not do.

Following is my input string:
<<abc></abc>script>alert('XSS')<<abc></abc>/script>

"abc" is my custom tag, which is not whitelisted. As expected, HtmlSanitizer removes the <abc></abc> tag. But since this is one-level sanitization, the output string still contains malicious input data, which is
<script>alert('XSS')</script>

Sanitization should be iterative: the library should recursively sanitize the input string until nothing more is removed. Do you have any plans to support this? It would be much-needed functionality.

Please do let me know your road map and your thoughts.

Thanks
Amit.

Online Demo

Hi.

HtmlSanitizer sounds cool. Is there any online demo available somewhere? Thanks!

HtmlSanitizer does not Sanitize certain Text

Please consider the following piece of text:

<%25whscheck onmouseover=""alert(1)"">mouseOverThisText""

When this text is inserted into a text field on IE9, it actually causes the JavaScript to fire (IE9 closes the tag). For some reason the library does not sanitize this. Could you please check?

Can't sanitize full documents

We're trying to sanitize full html documents and we're losing the outer parts of the document. From the looks of the code it is automatically wrapping the html text in a body tag before passing it through. Is there any way around this?

We're loading html emails into a browser window which is why they are full documents. We put them into an iframe so they don't mess with the surrounding page. The iframe is sandboxed, but it would be nice to have the peace of mind of knowing we tried to sanitize the html as well.

Enable throwing exception in case of unallowed input

I would really like to be able to have an Exception occur when an unallowed tag occurs, instead of just stripping it. This is useful when you can expect input will NOT contain malicious or unwanted HTML, but you want to make sure and also be notified if it DOES.

Sample cases:

  • If you have no control over input, but it is provided by external party, and could turn out to be malicious or this party to be compromised themselves at some point in the future.
  • Or in the case when you do have control over the input, but disallow or strip HTML in your own system. And then want to be notified when this functionality turns out not to work or has a fail-over, in order to be able to then quickly fix this yourself.

In my system the exception would just bubble up and we would be notified through some health monitoring that we have running.

Note that this issue is not a bug report, but a feature request. If I have time I could fork this project, implement it myself and do a pull request.
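One way to approximate this today is to use the cancelable events mentioned in the introduction: a handler can throw instead of letting the tag be stripped. The sketch below assumes the RemovingTag and RemovingAttribute events and their Tag and Reason properties as found in recent HtmlSanitizer releases. The names DisallowedHtmlException and StrictSanitizer are hypothetical and not part of the library.

```csharp
using System;
using Ganss.Xss;

// Hypothetical exception type used only in this sketch.
public sealed class DisallowedHtmlException : Exception
{
    public DisallowedHtmlException(string message) : base(message) { }
}

// Hypothetical wrapper, not part of HtmlSanitizer.
public static class StrictSanitizer
{
    public static string SanitizeOrThrow(string html)
    {
        var sanitizer = new HtmlSanitizer();

        // Raised before a tag is removed; throwing aborts sanitization.
        sanitizer.RemovingTag += (sender, e) =>
            throw new DisallowedHtmlException(
                $"Disallowed tag <{e.Tag.TagName.ToLowerInvariant()}> ({e.Reason})");

        // Raised before an attribute is removed.
        sanitizer.RemovingAttribute += (sender, e) =>
            throw new DisallowedHtmlException(
                $"Disallowed attribute on <{e.Tag.TagName.ToLowerInvariant()}> ({e.Reason})");

        // Returns the sanitized markup only if nothing had to be removed.
        return sanitizer.Sanitize(html);
    }
}
```

With this wrapper, input consisting only of allowed markup is returned as usual, while input containing a script element raises DisallowedHtmlException, which can then bubble up to whatever monitoring is in place.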

License

Would it be possible to update license.md into text file containing license text instead of a link to wikipedia article? That article could be changed or removed and something more permanent in git is preferable... Thanks.

System.MissingMethod exception with AngleSharp 0.9.4

After updating to AngleSharp 0.9.4, HtmlSanitizer.Sanitize() started throwing a MissingMethodException.

Method not found: 'System.String AngleSharp.IStyleFormattable.ToCss()'
at Ganss.XSS.HtmlSanitizer.SanitizeStyle(IHtmlElement element, String baseUrl)
at Ganss.XSS.HtmlSanitizer.Sanitize(String html, String baseUrl, IMarkupFormatter outputFormatter)

Reverting to AngleSharp 0.9.3 fixed the problem.
