Fri, Jan 14, 2011
Like many before me, I've spent a good number of programming hours/days/weeks in the past trying to write my own regular expressions to sanitize HTML. I wouldn't wish that on my worst enemy. XSS techniques are many and advanced, and you'll be fighting an uphill battle if you think you can "roll your own".
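To make that concrete, here is a minimal sketch of the kind of hand-rolled filter I mean, together with a few inputs that get past it. The function name `naiveStripScripts` and the payloads are made up for illustration and are not taken from any library.

```javascript
// Hypothetical hand-rolled filter: strips <script>...</script> blocks with a regex.
// Illustration only; do not use this to sanitize real input.
function naiveStripScripts(html) {
  return html.replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, '');
}

// Illustrative payloads (assumed examples, not an exhaustive list).
const payloads = [
  '<script>alert(1)</script>',                 // the case the regex was written for
  '<img src=x onerror="alert(1)">',            // script in an event handler attribute
  '<a href="javascript:alert(1)">click</a>',   // script in a URL scheme
  '<scr<script></script>ipt>alert(1)</scr<script></script>ipt>', // nested tags
];

// Print each payload next to what the filter returns for it.
for (const p of payloads) {
  console.log(JSON.stringify(p), '->', JSON.stringify(naiveStripScripts(p)));
}
```

Only the first payload is neutralised. The event handler and the `javascript:` URL pass through untouched, and the last input is reassembled into a working script tag by the filter itself.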
I wanted to do it both on the client side and on the server side (node.js). After some wasted time googling around, I posted a question on Stack Overflow and found out that there are a number of good libraries out there.
That's not exactly what I needed; however, it does have a stand-alone function / object for cleaning html_output:
It's not super-well documented, but it has a few comments that are good enough. I hope that helps somebody.
You can also see the documentation on how to define the whitelists for various tags etc. here: Caja's whitelists
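The general idea behind a whitelist is to list what is allowed and reject everything else. Here is a minimal sketch of that idea. The `WHITELIST` object and the `isAllowed` helper are hypothetical and do not reflect Caja's actual whitelist format, so refer to its documentation for the real thing.

```javascript
// Hypothetical whitelist: tag name -> attributes permitted on that tag.
// This is NOT Caja's format, only an illustration of the concept.
const WHITELIST = {
  a: ['href', 'title'],
  b: [],
  i: [],
  p: [],
  img: ['src', 'alt'],
};

// Returns true only if the tag is listed and, when an attribute is given,
// that attribute is listed for the tag. Anything unknown is rejected.
function isAllowed(tag, attr) {
  const t = String(tag).toLowerCase();
  // Own-property check so inherited names like "constructor" are not accepted.
  if (!Object.prototype.hasOwnProperty.call(WHITELIST, t)) return false;
  if (attr === undefined) return true;
  return WHITELIST[t].includes(String(attr).toLowerCase());
}

console.log(isAllowed('a', 'href'));    // listed tag, listed attribute
console.log(isAllowed('a', 'onclick')); // listed tag, unlisted attribute
console.log(isAllowed('script'));       // unlisted tag
```

Attribute values still need their own check (an `href` can carry a `javascript:` URL, for instance), which is part of why a maintained library is the better choice.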
HTML sanitation specifically for node.js
If you're (like me) using node.js, there are a couple of resources to look into:
- Node Validator: A GitHub user called "chriso" has made an HTML validator for node.js called "node-validator"; you can check out his project on GitHub. It works on both the client and the server side, and it handles both sanitation and validation. Chriso's library is actually pretty awesome: it has dedicated filters for emails, IPs and URLs, so it's pretty comprehensive. It's available through npm, so installation should be easy.
- Caja HTML Sanitizer (for node.js): A Stack Overflow user called "theSmaw" packaged the Caja HTML sanitizer as a node.js package, under the confusing title of Caja-HTML-Sanitizer.