Regular Expression Study Guide (6)----Dot (Any Character)
2012-01-17 11:36
The Dot Matches (Almost) Any Character
In regular expressions, the dot or period is one of the most commonly used metacharacters. Unfortunately, it is also the most commonly misused metacharacter.
The dot matches a single character, without caring what that character is. The only exceptions are newline characters. In all regex flavors discussed in this tutorial, the dot will not match a newline character by default. So by default, the dot is short for the negated character class [^\n] (UNIX regex flavors) or [^\r\n] (Windows regex flavors).
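As a minimal sketch of this default behavior, here is a check written with Python's standard re module. Python is used only for illustration, and the helper name dot_matches and the sample strings are made up for this example. Python's dot behaves like [^\n], so it still matches a carriage return.

```python
import re

def dot_matches(text):
    """Return True if the pattern a.b matches somewhere in text."""
    return re.search(r"a.b", text) is not None

print(dot_matches("a-b"))   # the dot matches the dash
print(dot_matches("a\nb"))  # the dot refuses the newline by default
print(dot_matches("a\rb"))  # Python's dot is [^\n], so \r is accepted
```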
This exception exists mostly for historical reasons. The first tools that used regular expressions were line-based. They would read a file line by line, and apply the regular expression separately to each line. The effect is that with these tools, the string could never contain newlines, so the dot could never match them.
Modern tools and languages can apply regular expressions to very large strings or even entire files. All regex flavors discussed here have an option to make the dot match all characters, including newlines. In RegexBuddy, EditPad Pro or PowerGREP, you simply tick the checkbox labeled "dot matches newline".
In Perl, the mode where the dot also matches newlines is called "single-line mode". This is a bit unfortunate, because it is easy to mix up this term with "multi-line mode". Multi-line mode only affects anchors, and single-line mode only affects the dot. You can activate single-line mode by adding an s after the regex code, like this: m/^regex$/s;.
Other languages and regex libraries have adopted Perl's terminology. When using the regex classes of the .NET framework, you activate this mode by specifying RegexOptions.Singleline, such as in Regex.Match("string", "regex", RegexOptions.Singleline).
In all programming languages and regex libraries I know, activating single-line mode has no effect other than making the dot match newlines. So if you expose this option to your users, please give it a clearer label, as was done in RegexBuddy, EditPad Pro and PowerGREP.
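To make the difference between the two modes concrete, here is a small Python sketch. It assumes Python's re module, where the flag corresponding to single-line mode is called re.DOTALL and multi-line mode is re.MULTILINE. Those names are Python's and do not come from the text above, and the sample string is invented.

```python
import re

text = "first line\nsecond line"

# Without any flag, the dot cannot cross the newline, so there is no match.
default_match = re.search(r"first.*second", text)

# re.DOTALL is Python's "single-line mode": the dot now matches newlines too.
dotall_match = re.search(r"first.*second", text, re.DOTALL)

# re.MULTILINE changes only the anchors ^ and $; the dot is unaffected.
multiline_match = re.search(r"first.*second", text, re.MULTILINE)
line_starts = re.findall(r"^\w+", text, re.MULTILINE)

print(default_match)         # None
print(dotall_match.group())  # spans both lines
print(multiline_match)       # None
print(line_starts)           # ['first', 'second']
```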
JavaScript (before ECMAScript 2018 introduced the s flag) and VBScript do not have an option to make the dot match line break characters. In those languages, you can use a character class such as [\s\S] to match any character. This character class matches a character that is either a whitespace character (including line break characters), or a character that is not a whitespace character. Since all characters are either whitespace or non-whitespace, this character class matches any character.
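The [\s\S] trick works in any flavor that supports these shorthand classes. The sketch below uses Python rather than JavaScript or VBScript, purely for illustration, and the sample markup is made up.

```python
import re

text = "<p>\nhello\n</p>"

# [\s\S] matches whitespace or non-whitespace, which covers every character.
any_char = re.search(r"<p>[\s\S]*</p>", text)

# The plain dot cannot cross the line breaks, so this attempt fails.
dot_only = re.search(r"<p>.*</p>", text)

print(any_char.group() == text)  # True
print(dot_only)                  # None
```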
Use The Dot Sparingly
The dot is a very powerful regex metacharacter. It allows you to be lazy. Put in a dot, and everything will match just fine when you test the regex on valid data. The problem is that the regex will also match in cases where it should not match. If you are new to regular expressions, some of these cases may not be so obvious at first.
I will illustrate this with a simple example. Let's say we want to match a date in mm/dd/yy format, but we want to leave the user the choice of date separators. The quick solution is \d\d.\d\d.\d\d. Seems fine at first. It will match a date like 02/12/03 just fine. Trouble is: 02512703 is also considered a valid date by this regular expression. In this match, the first dot matched 5, and the second matched 7. Obviously not what we intended.
\d\d[- /.]\d\d[- /.]\d\d is a better solution. This regex allows a dash, space, dot and forward slash as date separators. Remember that the dot is not a metacharacter inside a character class, so we do not need to escape it with a backslash.
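The two date patterns can be compared side by side. This is a small Python sketch: the variable names loose and strict and the list of samples are chosen for this example, while the regexes are the ones from the text.

```python
import re

loose = re.compile(r"\d\d.\d\d.\d\d")             # dot as separator
strict = re.compile(r"\d\d[- /.]\d\d[- /.]\d\d")  # explicit separator class

samples = ["02/12/03", "02-12-03", "02.12.03", "02 12 03", "02512703"]

for s in samples:
    # fullmatch requires the whole string to fit the pattern.
    print(s, bool(loose.fullmatch(s)), bool(strict.fullmatch(s)))
```

Only the last sample separates the two patterns: the dot version accepts 02512703, while the character class version rejects it.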
This regex is still far from perfect. It matches 99/99/99 as a valid date. [0-1]\d[- /.][0-3]\d[- /.]\d\d is a step ahead, though it will still match 19/39/99. How perfect you want your regex to be depends on what you want to do with it. If you are validating user input, it has to be perfect. If you are parsing data files from a known source that generates its files in the same way every time, our last attempt is probably more than sufficient to parse the data without errors. You can find a better regex to match dates in the example section.
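When input must be validated strictly, one possible approach (an assumption of this sketch, not something the tutorial prescribes) is to let the regex pick out the fields and leave the range checking to a date library. The function name is_valid_date is hypothetical. The pattern is the last one from the text with capturing groups added.

```python
import re
from datetime import datetime

# The last pattern from the text, with groups around month, day and year.
pattern = re.compile(r"([0-1]\d)[- /.]([0-3]\d)[- /.](\d\d)")

def is_valid_date(text):
    """Return True if text is a real mm/dd/yy date (hypothetical helper)."""
    m = pattern.fullmatch(text)
    if m is None:
        return False
    month, day, year = m.groups()
    try:
        # strptime rejects impossible values such as month 19 or February 30.
        datetime.strptime(f"{month}/{day}/{year}", "%m/%d/%y")
    except ValueError:
        return False
    return True

print(bool(pattern.fullmatch("19/39/99")))  # the regex alone accepts this
print(is_valid_date("19/39/99"))            # the date check rejects it
print(is_valid_date("02/12/03"))
```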
Use Negated Character Sets Instead of the Dot
I will explain this in depth when I present the repeat operators star and plus, but the warning is important enough to mention it here as well. I will illustrate with an example.
Suppose you want to match a double-quoted string. Sounds easy. We can have any number of any character between the double quotes, so ".*" seems to do the trick just fine. The dot matches any character, and the star allows the dot to be repeated any number of times, including zero. If you test this regex on Put a "string" between double quotes, it will match "string" just fine. Now go ahead and test it on
Houston, we have a problem with "string one" and "string two". Please respond.
Ouch. The regex matches "string one" and "string two". Definitely not what we intended. The reason for this is that the star is greedy.
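The greedy match is easy to reproduce. Here is a minimal Python sketch using the test sentence from the text. The variable names are chosen for this example.

```python
import re

text = 'Houston, we have a problem with "string one" and "string two". Please respond.'

# The greedy star runs to the end of the line, then backtracks only as far
# as the last double quote, so the match spans both quoted strings.
greedy = re.search(r'".*"', text).group()

print(greedy)  # "string one" and "string two"
```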
In the date-matching example, we improved our regex by replacing the dot with a character class. Here, we will do the same. Our original definition of a double-quoted string was faulty. We do not want any number of any character between the quotes. We want any number of characters that are not double quotes or newlines between the quotes. So the proper regex is "[^"\r\n]*".
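Applied to the same test sentence, the negated character class finds each quoted string separately. The sketch below uses Python's re.findall, and the variable names are made up for this example.

```python
import re

text = 'Houston, we have a problem with "string one" and "string two". Please respond.'

# [^"\r\n]* cannot run past a closing quote, so each string is matched alone.
quoted = re.findall(r'"[^"\r\n]*"', text)

print(quoted)  # ['"string one"', '"string two"']
```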