How to filter string for unwanted characters using regex?

2022-09-01 08:09:07

Basically, I am wondering if there is a handy class or method to filter a String for unwanted characters. The output of the method should be the 'cleaned' String, i.e.:

String dirtyString = "This contains spaces which are not allowed";

String result = cleaner.getCleanedString(dirtyString);

The expected result would be:

"Thiscontainsspaceswhicharenotallowed"

A better example:

String reallyDirty = " this*is#a*&very_dirty&String";

String result = cleaner.getCleanedString(reallyDirty);

I expect the result to be:

"thisisaverydirtyString"

because I told the cleaner that ' ', '*', '#', '&' and '_' are dirty characters. I could solve this with a whitelist or blacklist array of chars, but I don't want to reinvent the wheel.

I was wondering whether something already exists that can 'clean' strings using a regex, so that I don't have to write it myself.

Addition: If you think cleaning a String could be done differently or better, I'm all ears as well, of course.

Another addition: It is not only for spaces, but for any kind of character.


Answer 1

Edited based on your update:

// removes every character that is not an ASCII letter or digit
String result = dirtyString.replaceAll("[^a-zA-Z0-9]", "");
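
To make the two regex approaches concrete, here is a minimal sketch using only java.util.regex. The class name RegexCleaner and its method names are made up for illustration; the blacklist character set is taken from the question, with \s standing in for the plain space so that tabs and newlines are covered too. Precompiling the Pattern is optional, but it avoids recompiling the regex on every call.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class RegexCleaner {
    // Blacklist: any whitespace plus the characters the question calls dirty.
    private static final Pattern DIRTY = Pattern.compile("[\\s*#&_]");
    // Whitelist: matches everything that is NOT an ASCII letter or digit.
    private static final Pattern NOT_ALNUM = Pattern.compile("[^a-zA-Z0-9]");

    // Removes only the blacklisted characters.
    static String removeDirty(String input) {
        return DIRTY.matcher(input).replaceAll("");
    }

    // Keeps only ASCII letters and digits.
    static String keepAlphanumeric(String input) {
        return NOT_ALNUM.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // prints thisisaverydirtyString
        System.out.println(removeDirty(" this*is#a*&very_dirty&String"));
        // prints Thiscontainsspaceswhicharenotallowed
        System.out.println(keepAlphanumeric("This contains spaces which are not allowed"));
    }
}
```

The blacklist leaves anything you did not list untouched, while the whitelist also strips punctuation and non-ASCII letters, so which one fits depends on what counts as dirty in your input.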

Answer 2

If you're using Guava in your project (and if you're not, I believe you should consider it), the CharMatcher class handles this very nicely.

Your first example might be:

// Guava 19+; older versions use the constant CharMatcher.WHITESPACE instead
result = CharMatcher.whitespace().removeFrom(dirtyString);

while your second might be:

result = CharMatcher.anyOf(" *#&_").removeFrom(reallyDirty);
// or alternatively
result = CharMatcher.noneOf(" *#&_").retainFrom(reallyDirty);

or, if you want to be more flexible with whitespace (tabs etc.), you can combine matchers rather than writing your own:

// Guava 19+; older versions use the constant CharMatcher.WHITESPACE instead
CharMatcher illegal = CharMatcher.whitespace().or(CharMatcher.anyOf("*#&_"));
result = illegal.removeFrom(reallyDirty);

or you might instead specify the legal characters, which, depending on your requirements, might be one of:

// alternatives, pick one (Guava 19+; older versions use the constants JAVA_LETTER and ASCII)
CharMatcher legal = CharMatcher.javaLetter(); // based on Character.isLetter
CharMatcher legal = CharMatcher.ascii().and(CharMatcher.javaLetter()); // only letters which are also ASCII, as in your examples
CharMatcher legal = CharMatcher.inRange('a', 'z'); // lowercase only
CharMatcher legal = CharMatcher.inRange('a', 'z').or(CharMatcher.inRange('A', 'Z')); // either case

followed, as above, by legal.retainFrom(dirtyString).

Very nice, powerful API.
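
If adding Guava is not an option, roughly the same retain/remove idea can be sketched with plain JDK streams. This is only an approximation of what CharMatcher offers, not its implementation; the class name CharFilter and both method names are hypothetical.

```java
import java.util.function.IntPredicate;

// Hypothetical JDK-only stand-in for the retainFrom/removeFrom idea.
final class CharFilter {
    // Keeps only the code points accepted by the given predicate.
    static String retain(String input, IntPredicate legal) {
        return input.codePoints()
                .filter(legal)
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }

    // Drops every code point that occurs in illegalChars.
    static String remove(String input, String illegalChars) {
        return retain(input, cp -> illegalChars.indexOf(cp) < 0);
    }

    public static void main(String[] args) {
        // prints thisisaverydirtyString
        System.out.println(retain(" this*is#a*&very_dirty&String", Character::isLetter));
        // prints Thiscontainsspaceswhicharenotallowed
        System.out.println(remove("This contains spaces which are not allowed", " "));
    }
}
```

Predicates compose with and(), or() and negate() on IntPredicate, which gives a similar feel to combining matchers, though without Guava's predefined character classes.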