Wednesday, February 4, 2009

Java RegExp - Filter Unicode character subset

Task: Find a specific subset of Unicode characters in a Java string. In this case I was asked to prevent the upload of a malformed CSV file through a servlet. Before the file is written to the database, its content has to be checked for specified illegal characters such as Chr(13) (carriage return).

Solution: Usage of Java regular expressions.

Java offers a wide range of possibilities to search within a string. I've used the following code in my solution:

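import java.util.regex.Pattern;

// Compile the pattern once and reuse it, since the same check runs several times.
// The doubled backslashes leave the escapes to the regex engine instead of the Java compiler.
private static final Pattern INVALID_CHARACTERS =
        Pattern.compile("[\\u0000-\\u0020[\\u0100-\\uFFFF]]");

// Returns true if s contains at least one character from the illegal ranges.
static boolean containsInvalidCharacters(String s) {
    return INVALID_CHARACTERS.matcher(s).find();
}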
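The pattern matches every Unicode character in the ranges 0x0000 - 0x0020 and 0x0100 - 0xFFFF, so the method returns true as soon as the string contains one of them. Because find() searches for the pattern anywhere in the input, no .* or + is needed in addition to the pattern.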
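
To illustrate which inputs this character class flags, here is a small self-contained sketch. The class name InvalidCharDemo and the sample strings are made up for illustration; the character class itself is the one from the post, written with doubled backslashes.

```java
import java.util.regex.Pattern;

final class InvalidCharDemo {
    // Same character class as in the post, compiled once and reused.
    private static final Pattern INVALID =
            Pattern.compile("[\\u0000-\\u0020[\\u0100-\\uFFFF]]");

    static boolean containsInvalidCharacters(String s) {
        return INVALID.matcher(s).find();
    }

    public static void main(String[] args) {
        // Plain ASCII without control characters or spaces passes the check.
        System.out.println(containsInvalidCharacters("abc;123"));    // false
        // Chr(13), the carriage return, lies in the first range.
        System.out.println(containsInvalidCharacters("abc\r123"));   // true
        // The space character is U+0020 and is therefore flagged as well.
        System.out.println(containsInvalidCharacters("abc 123"));    // true
        // U+00E9 lies below U+0100, so Latin-1 letters are allowed.
        System.out.println(containsInvalidCharacters("caf\u00E9"));  // false
        // U+0142 lies in the second range.
        System.out.println(containsInvalidCharacters("z\u0142oty")); // true
    }
}
```

Note that the first range also covers line feed (0x000A) and the space character, so this check only suits input where those are not expected, for example a single field value.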

Nasty Traps: Be aware of the .matches() method, which is provided by both the Matcher and the String class. It returns true only if the entire string matches the pattern. If you just want to check for the occurrence of a substring, use the .find() method of Matcher.
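
The difference can be seen in a short sketch. The class name FindVersusMatches, its helper methods and the sample string are hypothetical; only Pattern, Matcher.find(), Matcher.matches() and String.matches() are standard Java API.

```java
import java.util.regex.Pattern;

final class FindVersusMatches {
    // Pattern for a single carriage return, Chr(13).
    private static final Pattern CR = Pattern.compile("\\r");

    // True if a carriage return occurs anywhere in s.
    static boolean foundAnywhere(String s) {
        return CR.matcher(s).find();
    }

    // True only if s consists of exactly one carriage return.
    static boolean matchesWhole(String s) {
        return CR.matcher(s).matches();
    }

    public static void main(String[] args) {
        String line = "abc\r123";
        System.out.println(foundAnywhere(line));              // true
        System.out.println(matchesWhole(line));               // false
        // String.matches() behaves like Matcher.matches().
        System.out.println(line.matches("\\r"));              // false
        // To get a hit with matches(), the pattern must cover the whole input.
        System.out.println(line.matches("(?s).*\\r.*"));      // true
    }
}
```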

