html_sanitize_ex's Introduction

HtmlSanitizeEx

html_sanitize_ex provides a fast and straightforward HTML Sanitizer written in Elixir which lets you include HTML authored by third-parties in your web application while protecting against XSS.

It is the first Hex package to come out of the elixirstatus.com project, where it will be used to sanitize user announcements from the Elixir community.

What can it do?

html_sanitize_ex parses a given HTML string and, depending on the Scrubber used, either strips all HTML tags from it or sanitizes it by allowing only certain HTML elements and attributes to remain.

NOTE: The one thing missing at the moment is support for styles. To add this, we would have to implement a Scrubber for CSS, to prevent nasty CSS hacks using <style> tags and attributes.

Otherwise html_sanitize_ex is a full-featured HTML sanitizer.

Installation

Add html_sanitize_ex as a dependency in your mix.exs file.

defp deps do
  [{:html_sanitize_ex, "~> 1.4"}]
end

After adding it, run mix deps.get in your shell to fetch the new dependency.

The only dependency of html_sanitize_ex is mochiweb which is used to parse HTML.

Usage

Depending on the scrubber you select, it can strip all tags from the given string:

text = "<a href=\"javascript:alert('XSS');\">text here</a>"
HtmlSanitizeEx.strip_tags(text)
# => "text here"

Or allow certain basic HTML elements to remain:

text = "<h1>Hello <script>World!</script></h1>"
HtmlSanitizeEx.basic_html(text)
# => "<h1>Hello World!</h1>"

There are built-in scrubbers that cover common use cases, but you can also easily define custom scrubbers (see the next section).

The following default scrubbing options exist:

HtmlSanitizeEx.basic_html(html)
HtmlSanitizeEx.html5(html)
HtmlSanitizeEx.markdown_html(html)
HtmlSanitizeEx.strip_tags(html)

There is also one scrubber primarily used for testing:

HtmlSanitizeEx.noscrub(html)

Before using a built-in scrubber, you should verify that it behaves the way you expect. The built-in scrubbers are located in /lib/html_sanitize_ex/scrubber.
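As a quick sanity check, you can run the same input through each scrubber in iex and compare the results (the input here is illustrative; the exact output depends on the library version):

```elixir
html = "<h1>Title</h1><script>alert('XSS')</script><p>Body</p>"

HtmlSanitizeEx.strip_tags(html)     # strips every tag, keeping only the text
HtmlSanitizeEx.basic_html(html)     # keeps a small whitelist of basic elements
HtmlSanitizeEx.markdown_html(html)  # keeps elements commonly produced by Markdown
HtmlSanitizeEx.html5(html)          # keeps a larger HTML5 whitelist
HtmlSanitizeEx.noscrub(html)        # parses and re-serializes without scrubbing
```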

Custom Scrubbers

A custom scrubber has the advantage of allowing you to support only the minimum functionality needed for your use case.

With a custom scrubber, you define which tags, attributes, and uri schemes (e.g. https, mailto, javascript, etc.) are allowed. Anything not allowed can then be stripped out.

There are also utility functions to remove CDATA sections and comments, which you will generally want to include.

Here is an example of a custom scrubber which allows only p, h1, and a tags, and restricts the href attribute to only the https and mailto URI schemes. It also removes CDATA sections and comments.

Note that the scrubber should include Meta.strip_everything_not_covered() at the end.

defmodule MyProject.MyScrubber do
  require HtmlSanitizeEx.Scrubber.Meta
  alias HtmlSanitizeEx.Scrubber.Meta

  Meta.remove_cdata_sections_before_scrub()
  Meta.strip_comments()

  Meta.allow_tag_with_these_attributes("p", [])
  Meta.allow_tag_with_these_attributes("h1", [])
  Meta.allow_tag_with_uri_attributes("a", ["href"], ["https", "mailto"])

  Meta.strip_everything_not_covered()
end

Then, you can use the scrubber in your project by giving it as the second argument to Scrubber.scrub/2:

defmodule MyProject.MyModule do
  alias HtmlSanitizeEx.Scrubber
  alias MyProject.MyScrubber

  def sanitize_html(html) do
    Scrubber.scrub(html, MyScrubber)
  end
end
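With the scrubber and helper module in place, sanitizing works like this (hypothetical input; the exact output depends on the rules defined above):

```elixir
MyProject.MyModule.sanitize_html(
  ~s(<h1>Hello</h1><script>alert('XSS')</script><a href="https://example.com/">link</a>)
)
# The h1 and the https link are kept; the script tag is stripped.
```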

A great way to write a custom scrubber is to use the built-in scrubber closest to your use case as a template. The built-in scrubbers are located in /lib/html_sanitize_ex/scrubber.

Contributing

  1. Fork it!
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request

Author

René Föhring (@rrrene)

License

html_sanitize_ex is released under the MIT License. See the LICENSE file for further details.


html_sanitize_ex's Issues

Whitespace for images

I found a bit of a tricky issue with stripping HTML -> plaintext. If I have an image separating 2 paragraphs, they are pushed right up against each other without a space. For example:

iex(3)> HtmlSanitizeEx.strip_tags("<p>I'm a paragraph.</p><img src='xx' /><p>I'm another.</p>")                              
"I'm a paragraph.I'm another."

I would expect there to be a space between them. Is it best to do something like this?:

iex(4)> "<p>I'm a paragraph.</p><img src='xx' /><p>I'm another.</p>" |> String.replace("<img", " <img") |> HtmlSanitizeEx.strip_tags()    
"I'm a paragraph. I'm another."

This use case (HTML to plaintext) might be fraught with pitfalls that exist no matter what, but I'm just not seeing them yet.
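A slightly more general variant of that workaround (a sketch; the tag list and regex are assumptions) is to append a space after closing block-level tags before stripping, then collapse the resulting whitespace:

```elixir
html = "<p>I'm a paragraph.</p><img src='xx' /><p>I'm another.</p>"

html
|> String.replace(~r{</(?:p|div|li|h\d)>}, "\\0 ")  # space after closing block tags
|> HtmlSanitizeEx.strip_tags()
|> String.replace(~r/\s+/, " ")                     # collapse whitespace runs
|> String.trim()
```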

Allow img tag with src as data:image/png;base64

I am converting a canvas to an image to export to PDF from HTML. But

HtmlSanitizeEx.Scrubber.Meta.allow_tag_with_uri_attributes(
  "img",
  ["src"],
  ["http", "https", "mailto", "/", "data:image/png;base64"]
)

removes the src attribute from the img tag.

Strip newlines and white space?

Can this strip newlines and white spaces?

I would expect stripping the following, "\n            Computers", to yield "Computers".

Scrub specific attribute alongside Meta

How do I implement a custom scrubber that uses HtmlSanitizeEx.Scrubber.Meta while still allowing me to define some custom scrubbing for a specific attribute?

Practically, I want to allow <img> tags, but only if their src attribute starts with a specific domain name.

Here is my current scrubber-module:

defmodule MyProject.Scrubber do
  require HtmlSanitizeEx.Scrubber.Meta
  alias HtmlSanitizeEx.Scrubber.Meta

  Meta.remove_cdata_sections_before_scrub
  Meta.strip_comments

  Meta.allow_tag_with_uri_attributes   "a", ["href"], ["http", "https"]
  Meta.allow_tag_with_these_attributes "a", ["name", "title"]

  Meta.allow_tag_with_these_attributes "h1", []
  Meta.allow_tag_with_these_attributes "h2", []
  Meta.allow_tag_with_these_attributes "h3", []
  Meta.allow_tag_with_these_attributes "h4", []

  Meta.allow_tag_with_these_attributes "blockquote", []
  Meta.allow_tag_with_these_attributes "p", []

  Meta.allow_tag_with_these_attributes "strong", []
  Meta.allow_tag_with_these_attributes "em", []
  Meta.allow_tag_with_these_attributes "strike", []
  Meta.allow_tag_with_these_attributes "sup", []
  Meta.allow_tag_with_these_attributes "sub", []

  Meta.allow_tag_with_these_attributes "br", []

  Meta.allow_tag_with_uri_attributes   "img", ["src"], ["https"]
  Meta.allow_tag_with_these_attributes "img", ["alt"]

  Meta.allow_tag_with_these_attributes "ul", []
  Meta.allow_tag_with_these_attributes "ol", []
  Meta.allow_tag_with_these_attributes "li", []

  Meta.allow_tag_with_these_attributes "hr", []

  Meta.strip_everything_not_covered
end
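One possible direction (a sketch only, not verified against the library's macro internals; clause ordering relative to the Meta macros matters) is to write hand-rolled scrub_attribute clauses for the specific attribute before falling back to Meta for everything else. The domain "https://example.com/" is a placeholder:

```elixir
defmodule MyProject.DomainImgScrubber do
  require HtmlSanitizeEx.Scrubber.Meta
  alias HtmlSanitizeEx.Scrubber.Meta

  Meta.remove_cdata_sections_before_scrub
  Meta.strip_comments

  # Hand-written clauses: keep the img src only for a trusted domain.
  def scrub_attribute("img", {"src", "https://example.com/" <> _rest = src}),
    do: {"src", src}

  def scrub_attribute("img", _attribute), do: nil

  Meta.allow_tag_with_these_attributes "img", ["alt"]

  Meta.strip_everything_not_covered
end
```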

@doc attribute being set multiple times for same function/arity

In scrubber/no_scrub.ex there are three clauses for scrub/1, each of which attempts to set the @doc attribute for that function. This produces a compiler warning:

warning: redefining @doc attribute previously set at line 15.

Please remove the duplicate docs. If instead you want to override a previously defined @doc, attach the @doc attribute to a function head:

    @doc """
    new docs
    """
    def scrub(...)

  lib/html_sanitize_ex/scrubber/no_scrub.ex:24: HtmlSanitizeEx.Scrubber.NoScrub.scrub/1

warning: redefining @doc attribute previously set at line 15.

Please remove the duplicate docs. If instead you want to override a previously defined @doc, attach the @doc attribute to a function head:

    @doc """
    new docs
    """
    def scrub(...)

  lib/html_sanitize_ex/scrubber/no_scrub.ex:29: HtmlSanitizeEx.Scrubber.NoScrub.scrub/1

Should be an easy fix - I'll submit a PR.

Add scrubbing option for escaping HTML (instead of stripping)

At the moment it seems there are only options for stripping HTML. I think it would make sense to also add an option to escape HTML, e.g. HtmlSanitizeEx.escape_html(html).

iex> HtmlSanitizeEx.escape_html("<strong>bold?</strong>")
"&lt;strong&gt;bold?&lt;/strong&gt;"

HTML Escaping

Hey @rrrene,

First, thank you for the work in the lib, it's met our needs and working great!

I'm trying to understand expectations and behaviour around HTML entities. It seems that &, <, and > are escaped, while others are not, even when using the NoScrub scrubber:

iex(1)> HtmlSanitizeEx.noscrub("&")
"&amp;"
iex(2)> HtmlSanitizeEx.noscrub("<")
"&lt;"
iex(3)> HtmlSanitizeEx.noscrub(">")
"&gt;"
iex(4)> HtmlSanitizeEx.noscrub("'")
"'"
iex(5)> HtmlSanitizeEx.noscrub("°") 
"°"

Is that intended behaviour or a bug?

Meta.strip_everything_not_covered leaves the content within a <script> tag.

I'm trying to scrape my own blog post here: https://sergiotapia.me/phoenix-framework-uploading-to-amazon-s3-e70657bd2013

Actual HTML is:

<script type="application/ld+json">{"@context":"http://schema.org","@type":"NewsArticle","image":{"@type":"ImageObject","width":848,"height":346,"url":"https://cdn-images-1.medium.com/max/848/1*LJy3CSjxbpcsK165R5zemQ.png"},"datePublished":"2017-08-05T17:17:49.000Z","dateModified":"2017-08-09T11:42:36.889Z","headline":"Phoenix Framework — Direct Uploading to Amazon S3.","name":"Phoenix Framework — Direct Uploading to Amazon S3.","keywords":["Web Development","Elixir","Phoenix Framework","Amazon S3"],"author":{"@type":"Person","name":"Sergio Tapia","url":"https://sergiotapia.me/@sergiocodes"},"creator":["Sergio Tapia"],"publisher":{"@type":"Organization","name":"sergiotapia","url":"https://sergiotapia.me","logo":{"@type":"ImageObject","width":106,"height":60,"url":"https://cdn-images-1.medium.com/max/106/1*ITIwmsAcKr1uJyEGpkEN9Q.jpeg"}},"mainEntityOfPage":"https://sergiotapia.me/phoenix-framework-uploading-to-amazon-s3-e70657bd2013"}</script>

In my custom scrubber I have this:

defmodule HtmlScrubber do
    require HtmlSanitizeEx.Scrubber.Meta
    alias HtmlSanitizeEx.Scrubber.Meta

    @valid_schemes ["http", "https", "mailto"]

    # Removes any CDATA tags before the traverser/scrubber runs.
    Meta.remove_cdata_sections_before_scrub

    Meta.strip_comments

    Meta.allow_tag_with_uri_attributes   "a", ["href"], @valid_schemes
    Meta.allow_tag_with_these_attributes "a", ["name", "title"]

    Meta.allow_tag_with_these_attributes "b", []
    Meta.allow_tag_with_these_attributes "blockquote", []
    Meta.allow_tag_with_these_attributes "br", []
    Meta.allow_tag_with_these_attributes "code", []
    Meta.allow_tag_with_these_attributes "del", []
    Meta.allow_tag_with_these_attributes "em", []
    Meta.allow_tag_with_these_attributes "h1", []
    Meta.allow_tag_with_these_attributes "h2", []
    Meta.allow_tag_with_these_attributes "h3", []
    Meta.allow_tag_with_these_attributes "h4", []
    Meta.allow_tag_with_these_attributes "h5", []
    Meta.allow_tag_with_these_attributes "hr", []
    Meta.allow_tag_with_these_attributes "i", []

    Meta.allow_tag_with_uri_attributes   "img", ["src"], @valid_schemes
    Meta.allow_tag_with_these_attributes "img", ["width", "height", "title", "alt"]

    Meta.allow_tag_with_these_attributes "li", []
    Meta.allow_tag_with_these_attributes "ol", []
    Meta.allow_tag_with_these_attributes "p", []
    Meta.allow_tag_with_these_attributes "pre", []
    Meta.allow_tag_with_these_attributes "span", []
    Meta.allow_tag_with_these_attributes "strong", []
    Meta.allow_tag_with_these_attributes "table", []
    Meta.allow_tag_with_these_attributes "tbody", []
    Meta.allow_tag_with_these_attributes "td", []
    Meta.allow_tag_with_these_attributes "th", []
    Meta.allow_tag_with_these_attributes "thead", []
    Meta.allow_tag_with_these_attributes "tr", []
    Meta.allow_tag_with_these_attributes "u", []
    Meta.allow_tag_with_these_attributes "ul", []

    Meta.strip_everything_not_covered
  end

The content ends up looking like:

Uploading to Amazon S3. – sergiotapia{\"@context\":\"http://schema.o...*snip*

So it's removing the <script> tag, but I'd like it to also remove the contents of the script tag. Any suggestions? I appreciate the help!
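One possible workaround (a sketch only, not verified against the library's internals): add explicit scrub/1 clauses that discard the element together with its children, placed before Meta.strip_everything_not_covered so they match first. The module name is hypothetical:

```elixir
defmodule HtmlScrubberWithScriptRemoval do
  require HtmlSanitizeEx.Scrubber.Meta
  alias HtmlSanitizeEx.Scrubber.Meta

  Meta.remove_cdata_sections_before_scrub
  Meta.strip_comments

  # Drop these elements together with their children, instead of
  # stripping only the tags and keeping the text inside.
  def scrub({"script", _attributes, _children}), do: nil
  def scrub({"style", _attributes, _children}), do: nil

  # ... the allow_tag_with_* rules from the scrubber above go here ...

  Meta.strip_everything_not_covered
end
```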

HtmlSanitizeEx.strip_tags works unpredictably

Hey @rrrene I'll show you 3 cases:

"some\r\ntext" |> HtmlSanitizeEx.strip_tags # => "some\r\ntext"
"some<b>text with break between tags</b>\r\n<i>will remove break</i>" |> HtmlSanitizeEx.strip_tags # => "sometext with break between tagswill remove break"
"some text\r\n<b>break only from one side</b>" |> HtmlSanitizeEx.strip_tags # => "some text\r\nbreak only from one side"

It's not critical, but kinda interesting. WDYT?

Handling of whitespace in the CSS scrubber

First of all thank you for the great library!

I’m trying to understand how the CSS scrubber works: 
If I understand correctly, the regex first groups all non-whitespace characters and then checks whether each group is an allowed keyword or a measured unit.

This works fine for e.g. rgb(0,0,0), because it contains no whitespace. Thus, the string rgb(0,0,0) is passed to measured_unit? as is.

However, for rgb(0, 0, 0), three calls to measured_unit? are made:

1. rgb(0,
2. 0,
3. 0)

Is this intentional? The string still gets matched, because the regex in measured_unit? also matches 0, and 0). But it doesn’t really match the rgb(..) part of the regex, just the last part of the regex that is supposed to match e.g. 5em.

whitespace truncated

Hi,

I noticed the following behaviour :

iex(1)> HtmlSanitizeEx.basic_html("This <b>is</b> <b>an</b> <i>example</i> of <u>space</u> eating.")
"This <b>is</b><b>an</b><i>example</i> of <u>space</u> eating."

The sanitiser is eating whitespace if it is directly followed by another whitelisted tag. I would expect the result to be:

"This <b>is</b> <b>an</b> <i>example</i> of <u>space</u> eating."

Gerard.

Encoded "script" Elements don't get scrubbed

Hello there,

first of all, thank you very much for this super useful library ❤️🙏

I'm running into an "issue" (I'm unsure whether it's an issue or intended behaviour) that some elements don't get scrubbed if their < and > signs are encoded. For example:

# Works as expected
iex(1)> HtmlSanitizeEx.html5("<script>alert('xss');</script>")
"alert('xss');"

# Doesn't work as expected (the "script" tags aren't removed)
iex(2)> HtmlSanitizeEx.html5("&lt;script&gt;alert('xss');&lt;/script&gt;")
"&lt;script&gt;alert('xss');&lt;/script&gt;"

If I render the second string in my html with raw(@safe_content), it becomes <script>alert('xss');</script> again.

Now, I'm unsure about the implications of this. In my case, the string is user input and I render the content as described in my HEEX template because it can contain code snippets. What do you think? Is there a possible vulnerability here or does everything work as intended? :)

Floki and html_sanitize_ex in the same project?

Hello! Thanks for this html sanitizing library, it's awesome and I'm using it a lot...

However, I need Floki included as well for various reasons, and it uses just the mochiweb_html package, which includes only the HTML parser. I can see their reasoning for this, but it causes problems when generating a release; all of the modules conflict:

==> Failed to build release:

Duplicated modules: 
    mochiweb_html specified in mochiweb and mochiweb_html
    mochiweb_charref specified in mochiweb and mochiweb_html
    mochiutf8 specified in mochiweb and mochiweb_html
    mochinum specified in mochiweb and mochiweb_html

You can find mochiweb_html here:

https://github.com/philss/mochiweb_html

I've decided that I'll ask Floki to change over to the full version of mochiweb but I'll leave this here for posterity. Feel free to close this! Thanks.

Multiple pass idea

I wanted to run an idea past you for a problem we ran into at work. We wanted to scrub <a> tags and remove any that ended up with no attributes, so <a href="not valid">okay</a> would turn into okay rather than <a>okay</a>.

We did this by implementing a macro like:

    defmacro strip_tag_that_has_no_scrubbed_attributes(tag_name) do
      quote do
        def scrub({unquote(tag_name), attributes, children}) do
          scrub_attributes(unquote(tag_name), attributes)
          |> case do
            [] ->
              children

            scrubbed ->
              {unquote(tag_name), scrubbed, children}
          end
        end

        defp scrub_attributes(unquote(tag_name), attributes) do
          Enum.map(attributes, fn attr ->
            scrub_attribute(unquote(tag_name), attr)
          end)
          |> Enum.reject(&is_nil(&1))
        end
      end
    end

We ended up having to copy-paste one of the Meta functions to make this work. One idea I was mulling over: if the Traversal took multiple passes and looked for a stable state, you could write functions which each handle a single concern. The cost is more traversals, but a simpler mental model. For example, we could have one scrub function which scrubs the href attribute of <a> links and another which removes <a> tags with empty attributes; the order of operations then wouldn't matter.

Allowed pattern is scrubbed

I need a scrubber similar to the ones shown in the documentation, but when I create one and scrub some text, it also scrubs the pattern I allowed.

# custom scrubber
defmodule LinksOnlyScrubber do
  require HtmlSanitizeEx.Scrubber.Meta
  alias HtmlSanitizeEx.Scrubber.Meta

  Meta.remove_cdata_sections_before_scrub()
  Meta.strip_comments()

  Meta.allow_tag_with_uri_attributes("a", ["href"], ["https", "mailto", "http"])

  Meta.strip_everything_not_covered()
end

Actual result

The tag has been removed

iex> HtmlSanitizeEx.Scrubber.scrub "This is a <a href=\"https://www.youtube.com/\">test</a>", LinksOnlyScrubber
"This is a test"

Expected result

The text should be left untouched

Versions

html_sanitize_ex 1.4.2
mochiweb 2.22.0

Markdown quotes are sanitized in HtmlSanitizeEx.markdown_html/1

Hi there,

so I just figured out, that Markdown quotes are sanitized.

This markdown:

> some quote

should not be scrubbed to

&gt;

I use HtmlSanitizeEx.markdown_html/1 for sanitizing.
Same result with HtmlSanitizeEx.basic_html/1

Or is this expected behavior and I got it wrong?

kind regards!

Is there any way to customize the list of accepted html tags?

Actually I think the whole interface should be changed from:

HtmlSanitizeEx.noscrub(html)
HtmlSanitizeEx.html5(html)

up to

HtmlSanitizeEx.sanitize(html, :noscrub)
HtmlSanitizeEx.sanitize(html, :html5)
HtmlSanitizeEx.sanitize(html, ~w(h1 h2 h3 div))

where the 2nd argument is either an atom selecting an existing strategy or a custom list of accepted tags. What do you think?

camelCase attribute transform error

hi there ~

I am trying to do some custom sanitize logic for the svg tag, but I found that camelCase attributes are transformed to lowercase unexpectedly.

for example:

iex(12)> HtmlSanitizeEx.noscrub(~s(<svg viewBox="0 0 1024" width="20px"></svg>))
"<svg viewbox=\"0 0 1024\" width=\"20px\"></svg>"

viewBox -> viewbox

Also, the svg tag has tons of those camelCase attributes, which makes it very hard to sanitize.

Is there any plan to support syntax like Meta.allow_tag_with_these_attributes("svg", [*])?

No source code in 1.4.1 mix release

ls deps/html_sanitize_ex      
CHANGELOG.md        LICENSE             README.md           hex_metadata.config mix.exs

I don't seem to have a lib folder or any source code besides the mix.exs file. That results in:

iex(1)> HtmlSanitizeEx.strip_tags(File.read!("import-people.html"))
** (UndefinedFunctionError) function HtmlSanitizeEx.strip_tags/1 is undefined (module HtmlSanitizeEx is not available)

HTML5 Options Strips <blockquote>

The HTML5 sanitize scrubber option strips <blockquote>, but the basic_html option does not. I was wondering whether this is by design or an oversight. If it is an oversight, would you accept a PR that adds it? I need to allow blockquotes as well as HTML5 elements in my webpage.

Is there a new release?

Hi, before asking my question I want to thank you for all your efforts.
I saw you committed some bug fixes and added documentation, but the latest version released on hex.pm is from Aug 30, 2021.
Would it be possible to release a new version? I want to use it in my open source project and fetch it from hex.pm.

Thank you in advance

Write an excerpt scrubber?

Hi there @rrrene!

I'm wondering how difficult it would be to write an excerpt scrubber that only takes the first X text nodes inside top-level p's and div's, maybe?

It's difficult to measure these things based on the length of the underlying text, but I think your library will help!

I'd also like to make the excerpts visually appealing at the edge cases by deciding whether to show the next paragraph when a text node is short (i.e. if I have a 255-character limit and the first paragraph is 225 chars while the second is chopped off at only 30 chars, that's pretty pointless, so you could drop short last paragraphs or add some extra chars).

It's obvious to a human, just by looking, whether the whole of the second paragraph should be included in the excerpt; I'd like to get about 80% of the way there :-)

Let me know if you think this is something you want in your sanitizer - it might be considered out of scope!

CaseClauseError when parsing page HTML

iex(1)> HtmlSanitizeEx.html5(HTTPoison.get!("http://www.coolest-gadgets.com/20161026/orange-peel-pro-work/").body)
** (CaseClauseError) no case clause matching: ["//www.facebook.com/plugins/likebox.php?href=http%3A", "//www.facebook.com/plugins/likebox.php?href=http", "%3A", "", "", "", "%"]
    (html_sanitize_ex) lib/html_sanitize_ex/scrubber/html5.ex:63: HtmlSanitizeEx.Scrubber.HTML5.scrub_attribute/2
    (elixir) lib/enum.ex:1255: Enum."-map/2-lists^map/1-0-"/2
    (html_sanitize_ex) lib/html_sanitize_ex/scrubber/html5.ex:64: HtmlSanitizeEx.Scrubber.HTML5.scrub_attributes/2
    (html_sanitize_ex) lib/html_sanitize_ex/scrubber/html5.ex:64: HtmlSanitizeEx.Scrubber.HTML5.scrub/1
    (html_sanitize_ex) lib/html_sanitize_ex/traverser.ex:10: HtmlSanitizeEx.Traverser.traverse/2
    (html_sanitize_ex) lib/html_sanitize_ex/traverser.ex:11: HtmlSanitizeEx.Traverser.traverse/2
    (html_sanitize_ex) lib/html_sanitize_ex/traverser.ex:22: HtmlSanitizeEx.Traverser.traverse/2
    (html_sanitize_ex) lib/html_sanitize_ex/traverser.ex:10: HtmlSanitizeEx.Traverser.traverse/2
    (html_sanitize_ex) lib/html_sanitize_ex/traverser.ex:11: HtmlSanitizeEx.Traverser.traverse/2
    (html_sanitize_ex) lib/html_sanitize_ex/traverser.ex:22: HtmlSanitizeEx.Traverser.traverse/2
    (html_sanitize_ex) lib/html_sanitize_ex/traverser.ex:10: HtmlSanitizeEx.Traverser.traverse/2
    (html_sanitize_ex) lib/html_sanitize_ex/traverser.ex:11: HtmlSanitizeEx.Traverser.traverse/2
    (html_sanitize_ex) lib/html_sanitize_ex/traverser.ex:22: HtmlSanitizeEx.Traverser.traverse/2
    (html_sanitize_ex) lib/html_sanitize_ex/traverser.ex:10: HtmlSanitizeEx.Traverser.traverse/2
    (html_sanitize_ex) lib/html_sanitize_ex/traverser.ex:11: HtmlSanitizeEx.Traverser.traverse/2
    (html_sanitize_ex) lib/html_sanitize_ex/traverser.ex:22: HtmlSanitizeEx.Traverser.traverse/2
    (html_sanitize_ex) lib/html_sanitize_ex/traverser.ex:10: HtmlSanitizeEx.Traverser.traverse/2
    (html_sanitize_ex) lib/html_sanitize_ex/traverser.ex:11: HtmlSanitizeEx.Traverser.traverse/2
    (html_sanitize_ex) lib/html_sanitize_ex/traverser.ex:22: HtmlSanitizeEx.Traverser.traverse/2
    (html_sanitize_ex) lib/html_sanitize_ex/traverser.ex:10: HtmlSanitizeEx.Traverser.traverse/2

It seems that it crashes on:

<iframe src="//www.facebook.com/plugins/likebox.php?href=http%3A%2F%2Fwww.facebook.com%2Fcoolestgadgets&amp;width=300&amp;height=290&amp;colorscheme=light&amp;show_faces=true&amp;border_color=white&amp;stream=false&amp;header=true&amp;appId=131623183581267" scrolling="no" frameborder="0" style="border:none; overflow:hidden; width:300px; height:290px;" allowTransparency="true"></iframe>

Error when using the `allow_tags_with_style_attributes` macro

When I make use of the Meta.allow_tags_with_style_attributes macro in my custom scrubber, I get this error message when I try to compile:

== Compilation error in file lib/testing_elixir_deps.ex ==
** (CompileError) lib/testing_elixir_deps.ex:59: undefined function scrub_css/1 (expected TestingElixirDeps.MyScrubber to define such a function or for it to be imported, but none are available)

** (exit) shutdown: 1
    (mix 1.14.0) lib/mix/tasks/compile.all.ex:78: Mix.Tasks.Compile.All.compile/4
    (mix 1.14.0) lib/mix/tasks/compile.all.ex:59: Mix.Tasks.Compile.All.with_logger_app/2
    (mix 1.14.0) lib/mix/tasks/compile.all.ex:33: Mix.Tasks.Compile.All.run/1
    (mix 1.14.0) lib/mix/task.ex:421: anonymous fn/3 in Mix.Task.run_task/4
    (mix 1.14.0) lib/mix/tasks/compile.ex:134: Mix.Tasks.Compile.run/1
    (mix 1.14.0) lib/mix/task.ex:421: anonymous fn/3 in Mix.Task.run_task/4
    (iex 1.14.0) lib/iex/helpers.ex:108: IEx.Helpers.recompile/1
    iex:17: (file)

Here's the code for the custom scrubber I wrote:

  defmodule MyScrubber do
    require HtmlSanitizeEx.Scrubber.Meta
    alias HtmlSanitizeEx.Scrubber.Meta

    # Removes any CDATA tags before the traverser/scrubber runs.
    Meta.remove_cdata_sections_before_scrub()
    Meta.strip_comments()

    Meta.allow_tags_with_style_attributes(["p", "span", "html", "body"])
    Meta.allow_tags_and_scrub_their_attributes(["b", "i"])

    Meta.strip_everything_not_covered()
  end

You can reproduce the error by creating and using a custom scrubber with the Meta.allow_tags_with_style_attributes macro.

Package unmaintained? Add another maintainer?

First off, this is an awesome package, thank you @rrrene for sharing this with the world!

It appears that this wonderful free piece of software is not actively maintained at the moment, as the last commit was over a year ago (July 2020) and minor pull requests are accumulating. I imagine this is due to René being extremely busy.

I don't think there are any severe problems yet, but I'm concerned the package will begin to fail if/when breaking changes are introduced by major Elixir version upgrades and/or changes to mochiweb 2.22.0 become necessary. It also looks like there are some nice-to-have pull requests pending. Does the package need another maintainer?

Consider allowing class on code tag in basic_html

Scrubbing with basic_html works great as a sanitisation pass after converting markdown. There's one issue, though.

Markdown parsers that support GitHub-flavoured markdown add a class to the code tag with the language name. Currently that class is removed (as all class attributes are). I understand the intent of removing all attributes, but it makes the scrubber unwieldy for working with markdown and requires implementing a custom module.

Experiencing high memory usage during a production compile.

I use html_sanitize_ex in my Elixir Phoenix app. Works great.

I had an issue during a recent deploy where my compile command was getting Killed while compiling html_sanitize_ex. The failure was repeatable.

After some investigation, it seems that RAM usage goes way up while compiling this specific hex package.

I do use a lower-end Linode as a build server. It has 1 GB of RAM with a 512 MB swap disk, but other than this it has served me well, so I consider this a bug.

You can follow along my original thread about the problem here:

https://elixirforum.com/t/thoughts-on-mix-complie-getting-killed-on-my-build-machine-linode/33727/4

Further whitespace issues

As noted in #4, mochiweb_html ignores whitespace between tags. The fix (replacing the input string using a regexp) doesn't really fix the issue, though.

In this example, the whitespace is not only spaces but also a newline. The space character is not the only whitespace character in HTML.

iex(39)> HtmlSanitizeEx.basic_html("<a href=\"almost\">on my mind</a>  <a href=\"almost\">all day long</a>")
"<a href=\"almost\">on my mind</a> <a href=\"almost\">all day long</a>"
iex(40)> HtmlSanitizeEx.basic_html("<a href=\"almost\">on my mind</a>  \n<a href=\"almost\">all day long</a>")
"<a href=\"almost\">on my mind</a><a href=\"almost\">all day long</a>"

In this example, mochiweb properly parses the textarea contents, which are later escaped on output. But since we regex-replaced the space with &#32;, that sequence is also escaped:

iex(50)> HtmlSanitizeEx.html5("<textarea> <script></script></textarea>") 
"<textarea>&amp;#32;&lt;script&gt;&lt;/script&gt;</textarea>"

The first issue could probably be solved with an extended regexp to match all space characters, while the second one could only be solved by making the parser keep all text nodes.

strip_tags escapes ampersands, gt and lt

strip_tags replaces the &, > and < symbols with the corresponding HTML entities.
This is unexpected, since those symbols are not tags, yet they are being altered.
Should escaping be part of a different function instead?

How to declare it?

How do I use this?

use HtmlSanitizeEx fails, so I declared it with alias HtmlSanitizeEx, but I still get:

** (CompileError) lib/inilab/notifier/formatter/helpers.ex:66: undefined function strip_tags/1
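For what it's worth, strip_tags/1 is a public function on the HtmlSanitizeEx module, so it can either be called fully qualified or brought into scope with import; a minimal sketch (module names are hypothetical):

```elixir
defmodule MyApp.Helpers do
  # Either call the function fully qualified ...
  def plain(html), do: HtmlSanitizeEx.strip_tags(html)
end

defmodule MyApp.OtherHelpers do
  # ... or import it so strip_tags/1 is in scope without a module prefix.
  import HtmlSanitizeEx, only: [strip_tags: 1]

  def plain(html), do: strip_tags(html)
end
```

Note that alias only shortens a module name (e.g. A.B.C to C); it never puts functions in scope, which is why the bare strip_tags/1 call still fails.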

Allow img src with basic_html

Hi,

thanks for the library. I use HtmlSanitizeEx.basic_html(some_html_from_editor) to sanitize user input from an editor. An img with a base64 src is removed. How can I allow img tags with a src attribute?

thanks
