HTML Filtering

Suppose you have a web application that displays HTML input from an untrusted source. Webmail applications such as Hotmail fall into this category, since anyone can send an HTML email to a Hotmail user. BBSes and CGI guestbooks that allow people to include HTML tags in their postings also qualify.

If this untrusted HTML includes JavaScript, your web browser will run it. As far as the browser can tell, the JavaScript came from the web application itself, not from some untrusted third party. That's bad, because JavaScript can manipulate and submit forms, so in a webmail situation the script can use the interface to do anything you can do. Someone could send you an email containing JavaScript that, when you view it, makes your browser forward all your stored mail to another address. Or it could just send abusive email from your account. All very bad. See http://httpd.apache.org/info/css-security/ for some background on this issue.

So what's the problem? The web application can just remove all scripting constructs from the untrusted HTML before displaying it, right?

Yes, in theory, but that's harder than you might think. This BugTraq posting gives some examples of ways to hide scripting constructs in HTML. As you can see, it's a lot more complicated than just removing <SCRIPT> tags.
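
To make that concrete, here's a small Perl sketch of a naive blacklist filter, along with a few well-known payloads of the general kind the posting describes (the specific examples are mine, and some only work in older browsers). None of them contains a <SCRIPT> tag, so the filter passes every one of them through untouched:

    use strict;
    use warnings;

    # A naive filter: strip <script> tags and call it done.
    sub naive_filter {
        my ($html) = @_;
        $html =~ s{<\s*/?\s*script[^>]*>}{}gi;
        return $html;
    }

    # No <script> tag in any of these, so they all pass straight through:
    my @payloads = (
        q{<img src="javascript:alert(1)">},            # javascript: URL (older browsers)
        q{<body onload="alert(1)">},                   # event handler attribute
        q{<a href="jav&#x09;ascript:alert(1)">x</a>},  # entity-encoded scheme
        q{<div style="width: expression(alert(1))">},  # CSS expression (old IE)
    );

    print naive_filter($_), "\n" for @payloads;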

I'm in full agreement with the author's conclusions; if your application is going to allow user input to include HTML then you need to fully parse that HTML. You need a list of tags that you wish to allow, a list of attributes that you wish to allow for each tag, and a set of values to allow for each attribute. A blacklist doesn't cut it.
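
To show the shape this takes, here's a minimal whitelist filter sketch built on the standard HTML::Parser module. The tag, attribute and value lists are illustrative placeholders rather than a recommended policy, and a real filter has far more to worry about (comments, entities, URL schemes, CSS):

    use strict;
    use warnings;
    use HTML::Parser;
    use HTML::Entities qw(encode_entities);

    # Allowed tags, allowed attributes per tag, allowed values per attribute.
    my %allow = (
        b => {},
        i => {},
        p => {},
        a => { href => qr{\Ahttps?://}i },   # value pattern for each attribute
    );

    my $out = '';
    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $attr) = @_;
            my $spec = $allow{$tag} or return;       # unknown tag: drop it
            my @keep;
            for my $name (sort keys %$attr) {
                my $pat = $spec->{$name} or next;    # unknown attribute: drop it
                next unless $attr->{$name} =~ $pat;  # disallowed value: drop it
                push @keep, sprintf '%s="%s"',
                    $name, encode_entities( $attr->{$name} );
            }
            $out .= '<' . join( ' ', $tag, @keep ) . '>';
        }, 'tagname, attr' ],
        end_h => [ sub {
            my ($tag) = @_;
            $out .= "</$tag>" if $allow{$tag};
        }, 'tagname' ],
        text_h => [ sub {
            $out .= encode_entities( $_[0] );        # re-escape all text
        }, 'dtext' ],
    );

    my $untrusted = q{<p onclick="evil()">hi <a href="javascript:evil()">x</a></p>};
    $p->parse($untrusted);
    $p->eof;
    print $out, "\n";    # <p>hi <a>x</a></p>

Note that text and attribute values are re-escaped on the way out; forgetting to do that is exactly the kind of implementation error that comes up below.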

Don't think "What needs to be blocked?", because you're bound to miss something, and that will leave your application wide open to scripting attacks. Instead think "What needs to be allowed?", because although you're still bound to miss something, the consequence will be that a couple of obscure tags fail to display correctly, rather than your application being rendered insecure.

Not convinced? Think you can just add the constructs in the BugTraq posting above to your list of strings to delete from untrusted HTML? Well, that won't do it. Those are just some examples; there are many more techniques.

I'm quite good at bypassing HTML filters; so far I've found this type of hole in a total of 11 public webmail systems that try to filter untrusted HTML. I'm aware of several techniques that I've never seen published. Despite this, when I saw that BugTraq posting, more than half of the techniques it describes were new to me. The conclusion is clear: there are plenty more techniques out there. As new browser versions appear with new undocumented features, new script hiding techniques will appear with them. It's a security nightmare.

One way out of the nightmare is simply to refuse to display HTML from an untrusted source at all. Another is to force users of the application to turn off scripting support in their browsers. I've seen both of these methods used successfully in security-aware webmail systems (Hushmail and Acmemail respectively).

Or, you can use a whitelist-based parsing filter. Just don't make any mistakes :) Of the 11 vulnerable systems I mentioned above, one had a whitelist-based filter, but I was able to get JavaScript through because of an error in the filter's implementation.
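
As a hypothetical illustration (the function and payload here are mine, not the actual hole) of how such mistakes typically look: the whitelist check passes on the decoded attribute value, which is then re-emitted without escaping.

    use strict;
    use warnings;

    # BUGGY: the whitelist check passes, but the value is re-emitted
    # without escaping, so a quote in the value breaks out of the attribute.
    sub emit_link_buggy {
        my ($href) = @_;
        return '' unless $href =~ m{\Ahttps?://}i;   # whitelist check: OK
        return qq{<a href="$href">};                 # no escaping on output
    }

    # An HTML parser hands the filter the *decoded* attribute value, so
    # <a href='http://x/&quot; onclick=&quot;evil()'> decodes to:
    my $decoded = 'http://x/" onclick="evil()';

    print emit_link_buggy($decoded), "\n";
    # <a href="http://x/" onclick="evil()">   -- script injected after all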

If your application is written in Perl, then building in a strong whitelist-based parsing script filter is easy: just use the HTML::StripScripts module. You can play with this module on my script filter test page if you like.
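
Here's a minimal usage sketch based on the module's documented interface (HTML::StripScripts::Parser is the companion module that couples the filter to HTML::Parser; the policy settings below are examples to adapt, not a recommendation):

    use strict;
    use warnings;
    use HTML::StripScripts::Parser;

    my $hss = HTML::StripScripts::Parser->new(
        {
            Context   => 'Flow',  # block and inline markup, no <html>/<body>
            AllowSrc  => 0,       # drop constructs that load external content
            AllowHref => 1,       # permit href attributes with safe values
        }
    );

    my $untrusted = q{<p onmouseover="steal()">hello</p><script>steal()</script>};

    $hss->parse($untrusted);
    $hss->eof;
    print $hss->filtered_document, "\n";

The module ships with its own built-in whitelist of tags, attributes and attribute values; see its documentation for how to tighten or extend that policy.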

