HTML escaping isn’t hard to do. You just convert < into < and a few other conversions. .NET provides a very simple way to do it:
var encodedContent = HttpUtility.HtmlEncode(userEnteredContent);
ASP.NET MVC makes it pretty easy too:
<p> <%= Html.Encode(ViewData.Model.UserEnteredContent) %> </p>
But the tricky bit is remembering to do it every single time. If we forget in once place in the whole app, we’ve introduced a security hole. It’s also pretty hard to test automatically that we’ve covered every possible instance – we could write a test for each part of the markup that we know contains user-entered content, but we’re just as likely to miss a test as to miss a call to Html.Encode(). Plus, it would be pretty boring and time-consuming.
I’m very much a fan of making it easier to do the right thing than to do the wrong thing. On my current project, we decided to look for ways to make HTML escaping happen automatically. This way, we’d be secure by default, and have to explicitly mark out those parts where we want to allow HTML in user content (e.g. if we had a facility where users could enter HTML formatted comments). We discussed a few options.
Encoding on the way in
We could choose to encode the user’s data on the way in, and store it in the database in escaped form. This could be implemented using a filter, or with some code in Global.asax.cs. However, this feels wrong, for a few reasons:
- HTML encoding is a rendering problem, so the solution should be in rendering.
- We aren’t storing the data as the user intended it, possibly reducing our options for using this data in the future (e.g. consumption by non-HTML fronted systems).
- If you’ve encoded on the way in, you mustn’t then encode on the way out. Otherwise you will double encode and the user will see all sorts of HTML entities like > when they typed >. Many ASP.NET MVC helper controls (e.g. <%= Html.ActionLink(ViewData.SomethingTheUserEntered, "action", "controller") %>) escape the text you pass them, so you’d have to rewrite all the helpers without the encoding.
There’s a longer thread about the rights and wrongs of encoding on the way in on the MSDN forums. At the moment, I’m firmly in the wrong camp.
Encoding on the way out
Having decided not to encode on the way in, we considered two ways to make encoding happen by default.
- Between the Controller and the View: Use a filter to encode all strings in the ViewData object passed from the controller to the view
- In the View: Make the <%= %> syntax automagically do HTML escaping. Oh, but don’t do it when I’m generating tags, as in <%= Html.ActionLink(…) %>
Option 1 is fiddly and imperfect. I wrote a filter to examine the contents of the ViewData dictionary and the ViewData.Model, using reflection to search for read-write properties of type string, and replace their value with the encoded version. Not only is this very fiddly (think of all the recursive reflection), particularly when it comes to collections, but it can’t encode read-only properties or methods on the objects passed to the view. In hindsight, I shouldn’t be surprised that this is difficult – like encoding on the way in, this is a reponsibility being executed in the wrong place. This is a view concern and belongs in the view.
Option 2 isn’t trivial either – we need to hook into the ASP.NET compiler architecture to override <%= %> and need to find a way not to escape genuine HTML generated inside <%= %>. Fortunately we found Steve Sanderson’s blog post (good find Muhammad!) with a working solution.
Steve’s post is about a year old, which in ASP.NET MVC terms is positively ancient (Preview 2?), so it’s no longer the MVC toolkit that generates HTML, but the HtmlHelper class at the core of MVC. I didn’t fancy rewriting all of the MVC HTML generators (it was tedious enough going through the codebase removing Html.Encode(…)), so for now our views explicitly cast to Steve’s RawHtml class when necessary:
<span id="home"> <%= (RawHtml) Html.ActionLink(ViewData.Model.SomeUserEnteredContent, "action", "controller") %> <%= ViewData.Model.SomeOtherUserEnteredContent %> </span>
SomeUserEnteredContent will be encoded by Html.ActionLink, and SomeOtherUserEnteredContent will be encoded by <%= %>. The <a> tag emitted by Html.ActionLink will not be encoded.
We also considered the possibility that HTML encoding could be a domain concern. In my opinion, this could be the case for an application that was heavy on user-entered marked-up text, such as a blog engine or an issue tracker where all the text fields are Textile formatted. Then each different field knows whether to allow HTML markup through, and if so, which HTML tags are considered safe, and this is part of the domain model of the application.
This seems like the most efficient way to make it easy to do the right thing. I’d be very interested to hear any other views, especially if you think this is not the right thing to do.
Finally, I’m surprised by how little coverage this has received within the ASP.NET MVC community and example code that you see out there on the web. Both that forum thread and Steve’s blog post date back a year. It is just assumed that everyone knows about this?