Friday, December 21, 2007

Output Character Encoding for Security - PHP

Objective: Allow untrusted data to be displayed by the browser without the threat of malicious JavaScript (i.e. cross-site scripting) or injected HTML characters.

Here's a quick snippet of code that can be used in PHP to encode text that will be displayed in the browser. Since we are using multiple layers of security (i.e. defense in depth), the output text should already have been vetted with a whitelist filter when it was originally supplied to the page (see previous article).

This code will perform two actions.

  1. The supplied data is converted to UTF-8 with utf8_encode. Unless specifically required, there is no reason for the code to deal with the nuances of other character encodings. Note that utf8_encode assumes its input is ISO-8859-1, so you may not want to take this step if you are supporting international text that arrives in another encoding (including text that is already UTF-8).

  2. We then apply the PHP function htmlentities. This converts the supplied text into HTML entities, which are displayed accurately on the page but are not interpreted by the browser as markup. This prevents the supplied text from being treated as valid HTML or JavaScript.
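
Step 1 carries a caveat worth illustrating: utf8_encode treats every input as ISO-8859-1, so text that is already UTF-8 gets encoded a second time (the function is also deprecated as of PHP 8.2). One possible alternative, a sketch rather than a drop-in replacement, converts only when the input is not already valid UTF-8. The helper name to_utf8 is hypothetical, and the code assumes the mbstring extension is available and that any non-UTF-8 input is ISO-8859-1.

```php
<?php
// Hypothetical helper (not from the original article): convert to UTF-8 only
// when the input is not already valid UTF-8. Assumes the mbstring extension
// is loaded and that non-UTF-8 input is ISO-8859-1.
function to_utf8(string $data): string {
    if (mb_check_encoding($data, 'UTF-8')) {
        return $data; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($data, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "caf\xE9";     // "café" encoded as ISO-8859-1
$utf8   = "caf\xC3\xA9"; // "café" encoded as UTF-8

var_dump(to_utf8($latin1) === $utf8); // bool(true): converted
var_dump(to_utf8($utf8) === $utf8);   // bool(true): not encoded twice
```

The validity check is a heuristic: some ISO-8859-1 byte sequences also happen to be valid UTF-8, so knowing the real source encoding is preferable to guessing.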



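function encode($dirty_data) {
    // Convert ISO-8859-1 input to UTF-8 (utf8_encode assumes Latin-1 input).
    $utf8_dirty = utf8_encode($dirty_data);
    // Encode HTML special characters, including both quote styles, as entities.
    // The explicit 'UTF-8' charset matches the conversion above.
    $encoded_data = htmlentities($utf8_dirty, ENT_QUOTES, 'UTF-8');
    return $encoded_data;
}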
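
The ENT_QUOTES flag matters when encoded data ends up inside a quoted HTML attribute. The sketch below compares it with ENT_COMPAT, which leaves single quotes alone. The attribute payload and variable names are made up for illustration, and the input is assumed to already be UTF-8.

```php
<?php
// Illustrative payload: tries to break out of a single-quoted attribute.
$payload = "x' onmouseover='alert(1)";

// ENT_COMPAT encodes double quotes only, so the single quotes survive.
$compat = htmlentities($payload, ENT_COMPAT, 'UTF-8');

// ENT_QUOTES encodes both single and double quotes.
$quotes = htmlentities($payload, ENT_QUOTES, 'UTF-8');

echo "<input value='$compat'>\n"; // attribute is closed early by the payload
echo "<input value='$quotes'>\n"; // payload stays inside the attribute value
```

With ENT_QUOTES each single quote becomes &#039;, so the payload cannot terminate the attribute and introduce an event handler.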


Examples:

I've created a basic page which accepts a URL argument, applies the encode function, and then displays the data on the page.

(snippet)

<?php echo encode($_GET['arg'] ?? ''); ?>

The following URL contains a basic cross-site scripting attack against this page.

http://localhost/javascript/encoding.php?arg=%3Cscript%3Ealert('xss')%3C/script%3E

With the encode function applied, the supplied data is safely displayed on the screen as:

<script>alert('xss')</script>

Viewing the source of the page will show that the characters have been safely encoded as follows:

&lt;script&gt;alert(&#039;xss&#039;)&lt;/script&gt;
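
The round trip in this example can be reproduced from the command line. The sketch below pulls the arg value out of the example URL much as PHP would when populating $_GET (URL-decoding it), then applies the htmlentities step. The helper name encode_html is hypothetical; it omits the UTF-8 conversion step, so it assumes the input is already UTF-8.

```php
<?php
// Hypothetical helper: only the htmlentities step of encode(), assuming UTF-8 input.
function encode_html(string $dirty_data): string {
    return htmlentities($dirty_data, ENT_QUOTES, 'UTF-8');
}

$url = "http://localhost/javascript/encoding.php?arg=%3Cscript%3Ealert('xss')%3C/script%3E";

// parse_str URL-decodes the query string, much as PHP does when filling $_GET.
parse_str(parse_url($url, PHP_URL_QUERY), $params);

echo $params['arg'], "\n";              // the raw, decoded attack string
echo encode_html($params['arg']), "\n"; // the entity-encoded form sent to the browser
```

The first line printed is what the attacker supplied; the second is what appears in the page source, which the browser renders as text instead of executing.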

This code can be used to safely display user data in the client browser. It encodes characters to prevent HTML modification and cross-site scripting attacks. However, it does not allow the user to supply any HTML tags, such as <b></b> for bold or <i></i> for italics. If rich text formatting is desired, I would recommend a more robust filtering solution. Take a look at the OWASP AntiSammy project for more information on safely accepting rich text formatting.

-Michael Coates