Cleaning user input/output with JavaScript and node.js

Fri, Jan 14, 2011

I was looking for a good function or library that I could use to sanitize HTML with JavaScript.

Like many before me, I've spent a good number of programming hours/days/weeks in the past trying to write my own regular expressions to do this, and I wouldn't wish that on my worst enemy. XSS techniques are numerous and sophisticated, and you'll be fighting an uphill battle if you think you can "roll your own".
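To make the "uphill battle" concrete, here is a minimal sketch of the kind of regex filter I mean. The function name `naiveStripScripts` and the sample payloads are made up for illustration; they are not taken from any library.

```javascript
// A naive "roll your own" filter (hypothetical, for illustration only):
// remove <script>...</script> blocks with a single regular expression.
function naiveStripScripts(html) {
  return html.replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, '');
}

// A few well-known payload shapes. Only the first one is actually neutralized.
const samples = [
  // Plain script tag: removed entirely.
  '<script>alert(1)</script>',
  // Event handler attribute: no <script> tag at all, so it passes untouched.
  '<img src=x onerror=alert(1)>',
  // javascript: URL in a link: also passes untouched.
  '<a href="javascript:alert(1)">click</a>',
  // Nested tags: removing the inner pair reassembles a working script tag.
  '<scr<script></script>ipt>alert(1)</scr<script></script>ipt>',
];

for (const sample of samples) {
  console.log(JSON.stringify(naiveStripScripts(sample)));
}
```

The last sample is the nasty one: the filter itself turns harmless-looking text into `<script>alert(1)</script>`. Every patch you add for one of these cases tends to open another, which is why a maintained, whitelist-based sanitizer is the better bet.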

I wanted to do it on the client side as well as on the server side (node.js). After some wasted time googling around, I posted a question on Stack Overflow and found out that there are a number of good libraries out there.

There are some other helpful questions on the subject on stackoverflow.com, and a few users linked to this little gem: Google Caja. It's actually a library intended to help website owners embed third-party JavaScript and HTML into their sites in a "safe way".

That's not exactly what I needed; however, it does have a stand-alone function/object for cleaning HTML output:

Caja's HTML sanitizer

It's not super-well documented, but it has a few comments that are good enough. I hope that helps somebody.

You can also see the documentation for how you define the white lists for various tags etc. here: Caja's whitelists

HTML sanitization specifically for node.js

If you're using node.js (like me), there are a couple of resources to look into:
  • Node Validator: A GitHub user called "chriso" has written a validation and sanitization library for node.js called "node-validator"; you can check out his project on GitHub. It works on both the client and the server side, and it handles both sanitization and validation. Chriso's library is actually pretty awesome: it has dedicated checks for emails, IPs and URLs, so it's pretty comprehensive. It's available through npm, so installation should be easy.
  • Caja HTML Sanitizer (for node.js): A Stack Overflow user called "theSmaw" packaged the Caja HTML sanitizer as a node.js package, under the confusing title of Caja-HTML-Sanitizer.
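As a rough idea of what the email/IP/URL checks from the Node Validator entry look like in code, here is a small sketch. It assumes the package as it is currently published on npm under the name `validator`, which exposes plain functions such as `isEmail`, `isIP`, `isURL` and `escape`; the API at the time of writing may well have looked different, so treat this as an illustration rather than the library's documented usage. The helper name `describeInput` is my own.

```javascript
// Assumption: `npm install validator` has been run. If the package is
// missing, the sketch degrades gracefully instead of crashing.
let validator = null;
try {
  validator = require('validator');
} catch (err) {
  validator = null; // package not available in this environment
}

// Hypothetical helper: run a few of the library's checks on one string.
function describeInput(value) {
  if (!validator) {
    return null; // nothing to report without the library
  }
  return {
    isEmail: validator.isEmail(value),
    isIP: validator.isIP(value),
    isURL: validator.isURL(value),
    // escape() replaces HTML-significant characters with entities,
    // which helps when the string is echoed back into a page.
    escaped: validator.escape(value),
  };
}

console.log(describeInput('test@example.com'));
console.log(describeInput('<b>hello</b>'));
```

Note that escaping is not the same thing as sanitizing: it makes a string safe to display as text, but if you want to allow some HTML through, you still need a whitelist-based sanitizer like the Caja one.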