Friday, April 18, 2008

Stripping Non-Printable Characters

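Here's a quick way to strip non-printable characters in PHP, which is handy for cleaning data before putting it in a database. The pattern keeps carriage returns, newlines, tabs, printable ASCII (\x20-\x7E), and the high bytes \xA0-\xFF, and replaces everything else with a space.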

$val = preg_replace('/[^\r\n\t\x20-\x7E\xA0-\xFF]/', ' ', $val);


Because this is a Perl-compatible regular expression, you can use it in your language of choice so long as it supports PCRE.
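
To make the behavior concrete, here is a minimal sketch that wraps the pattern above in a helper and runs it on a made-up string. The function name replace_nonprintable and the sample input are assumptions for illustration, not part of the original tip.

```php
<?php
// Hypothetical helper wrapping the pattern from this post.
function replace_nonprintable(string $val): string
{
    // Keep CR, LF, TAB, printable ASCII (\x20-\x7E) and bytes \xA0-\xFF;
    // replace every other byte with a single space.
    return preg_replace('/[^\r\n\t\x20-\x7E\xA0-\xFF]/', ' ', $val);
}

// Made-up input containing NUL (\x00), BEL (\x07) and DEL (\x7F) bytes.
$dirty = "abc\x00def\x07ghi\x7Fjkl\tmno\n";

var_dump(replace_nonprintable($dirty) === "abc def ghi jkl\tmno\n"); // bool(true)
```

Note that without the /u modifier the pattern works on bytes, not characters, so on UTF-8 input it can replace bytes in the \x80-\x9F range that belong to a multibyte character.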

UPDATE:

Steve Laniel reminded me that you can use a POSIX character class to do roughly the same thing:

$val = preg_replace( '/[^[:print:]]/', '', $val );

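This is a lot simpler for most cases, although the two patterns are not equivalent. In the default C locale, [:print:] covers only \x20-\x7E, so this version also removes \r, \n, \t, and the bytes \xA0-\xFF that the first pattern keeps. It also deletes the matched characters instead of replacing them with a space.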
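
The difference between the two patterns is easiest to see side by side. The sketch below runs both on the same made-up input; the variable names and the sample string are assumptions for illustration, and the result for [:print:] assumes the default C locale.

```php
<?php
// Made-up input: a tab, a newline, a NUL byte, and the Latin-1 byte \xE9.
$val = "a\tb\nc\x00d\xE9";

// First pattern: keeps \t, \n and \xE9, and turns the NUL byte into a space.
$first = preg_replace('/[^\r\n\t\x20-\x7E\xA0-\xFF]/', ' ', $val);

// POSIX class: assuming the default C locale, only \x20-\x7E survives,
// and everything else is deleted outright.
$second = preg_replace('/[^[:print:]]/', '', $val);

var_dump($first === "a\tb\nc d\xE9");
var_dump($second === "abcd");
```

If line breaks or tabs matter in your data, the first pattern (or a class such as [^[:print:]\r\n\t]) is the safer choice.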

2 comments:

Steve Laniel said...

Doesn't the [:print:] POSIX character class capture the printable characters? So then the nonprintable ones would be [^[:print:]], and you could delete them with

$val = preg_replace( '/[^[:print:]]/', '', $val )

No?

Travis Whitton said...

Good point... I guess I'm a sucker for doing things the hard way. I'll update the tip because this is definitely an easier approach. Thanks!