Agile Power: Java RegExp - Filter unicode character subset

Task: Find a specific subset of Unicode characters in a Java string. In the given case I was asked to prevent the upload of a malformed CSV file through a servlet. Before the file is written to the database, the submitted file has to be checked for specified illegal characters such as Chr(13).

Solution: Use Java regular expressions.

Java offers a wide range of possibilities to search within a string. I've used the following code in my solution:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    boolean containsInvalidCharacters(String s) {
        // Use a compiled pattern if you check several times for the same pattern
        Pattern p = Pattern.compile("[\u0000-\u0020[\u0100-\uFFFF]]");
        Matcher m = p.matcher(s);
        // find() returns true if any character of the class occurs anywhere in s
        return m.find();
    }

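The pattern matches every Unicode character in the ranges 0x0000 - 0x0020 and 0x0100 - 0xFFFF, so the method returns true as soon as the string contains one of them. Note that the first range includes the space (0x0020) as well as tab, line feed and carriage return. Because find() searches for the pattern anywhere in the string, no .* or + is needed in addition to the pattern.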
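To illustrate how the check behaves on a few inputs, here is a minimal sketch that compiles the same character class once into a constant. The class name CsvCharacterCheck, the constant INVALID and the helper firstInvalidIndex are hypothetical names chosen for this example and are not part of the original solution. The pattern is written with regex-level escapes here, which java.util.regex.Pattern interprets itself.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CsvCharacterCheck {

    // Hypothetical constant: same character class as in the post,
    // compiled once and reused for every check.
    static final Pattern INVALID =
            Pattern.compile("[\\u0000-\\u0020[\\u0100-\\uFFFF]]");

    // True if s contains at least one character from the class.
    static boolean containsInvalidCharacters(String s) {
        return INVALID.matcher(s).find();
    }

    // Hypothetical helper: index of the first invalid character, or -1 if none.
    static int firstInvalidIndex(String s) {
        Matcher m = INVALID.matcher(s);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        System.out.println(containsInvalidCharacters("abc;123"));  // false
        System.out.println(containsInvalidCharacters("abc 123"));  // true (space)
        System.out.println(containsInvalidCharacters("abc\r123")); // true (Chr(13))
        System.out.println(firstInvalidIndex("ab\tcd"));           // 2
    }
}
```

Returning the index of the first hit can be handy when the servlet should tell the user where the rejected character sits in the uploaded data.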

Nasty Traps: Be aware of the .matches() method, which is provided by both the Matcher and the String class. This method only succeeds if the entire string matches the pattern. If you just want to check for the occurrence of a substring, use the .find() method of the Matcher.
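The difference can be seen in a small sketch. The class name MatchesVersusFind and the sample input are made up for this illustration; the calls themselves are the standard Matcher.matches(), Matcher.find() and String.matches() methods.

```java
import java.util.regex.Pattern;

class MatchesVersusFind {

    // Same character class as in the post, compiled once.
    static final Pattern INVALID =
            Pattern.compile("[\\u0000-\\u0020[\\u0100-\\uFFFF]]");

    public static void main(String[] args) {
        String input = "a b"; // contains exactly one space (0x0020)

        // matches() requires the whole input to match the pattern.
        System.out.println(INVALID.matcher(input).matches()); // false

        // find() succeeds if any part of the input matches.
        System.out.println(INVALID.matcher(input).find()); // true

        // String.matches() also tests the whole input only.
        System.out.println(input.matches("[\\u0000-\\u0020]")); // false

        // To get a hit with matches(), the pattern must cover the rest of the input too.
        System.out.println(input.matches(".*[\\u0000-\\u0020].*")); // true
    }
}
```

Keep in mind that by default the dot does not match line terminators, so the last variant would need Pattern.DOTALL for multi-line input, which is one more reason to prefer find() here.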

Helpful links:

java.util.regex Pattern Class (some examples)

ASCII & ANSI Table (for reference)

ISO 8859 - Latin-1 (closely related to the Windows-1252 code page commonly used on Windows machines)

Wednesday, February 4, 2009
