
R Language: A First Look at Regular Expressions v0.1

2017-12-30 21:25
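Environment

Basic character classes

Simple character class subtraction

Shorthand character classes

POSIX character classes

Quantifiers (code placeholder)

Greedy vs. lazy matching

Capture groups

Anchors (to do)

Lookaround (to do)

Worked examples (to do)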

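---------- Update log ----------

2017.12.30 Code placeholders added && happy new year !! [manual happy emoticon]

2017.12.31 Re-typeset and reordered using Markdown; revised the example text

---------- Divider ----------

Reposting is welcome; please credit the source: http://write.blog.csdn.net/mdeditor#!postId=78939587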

Environment

version


##                _
## platform       x86_64-w64-mingw32
## arch           x86_64
## os             mingw32
## system         x86_64, mingw32
## status
## major          3
## minor          4.3
## year           2017
## month          11
## day            30
## svn rev        73796
## language       R
## version.string R version 3.4.3 (2017-11-30)
## nickname       Kite-Eating Tree


Sys.getlocale()


## [1] "LC_COLLATE=Chinese (Simplified)_China.936;LC_CTYPE=Chinese (Simplified)_China.936;LC_MONETARY=Chinese (Simplified)_China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_China.936"


library(stringr)
str_extract_all("abbccc","b")


## [[1]]
## [1] "b" "b"


Basic character classes

txt <- "\t1 If it doesn't challenge you, it won't change you._\n\t2 如果一件事不能难住你,它也不能改变你。_\n\t3 THE MOST DIFFICULT THING IS THE DECISION TO ACT, THE REST IS MERELY TENACITY._\n\t4 最困难的是下决定,剩下的只是坚持。_"
cat(txt)


##  1 If it doesn't challenge you, it won't change you._
##  2 如果一件事不能难住你,它也不能改变你。_
##  3 THE MOST DIFFICULT THING IS THE DECISION TO ACT, THE REST IS MERELY TENACITY._
##  4 最困难的是下决定,剩下的只是坚持。_


str_extract_all(txt,"[a-z]") # lowercase letters


## [[1]]
##  [1] "f" "i" "t" "d" "o" "e" "s" "n" "t" "c" "h" "a" "l" "l" "e" "n" "g"
## [18] "e" "y" "o" "u" "i" "t" "w" "o" "n" "t" "c" "h" "a" "n" "g" "e" "y"
## [35] "o" "u"


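str_extract_all(txt,"[A-Z]") # uppercase letters


## [[1]]
##  [1] "I" "T" "H" "E" "M" "O" "S" "T" "D" "I" "F" "F" "I" "C" "U" "L" "T"
## [18] "T" "H" "I" "N" "G" "I" "S" "T" "H" "E" "D" "E" "C" "I" "S" "I" "O"
## [35] "N" "T" "O" "A" "C" "T" "T" "H" "E" "R" "E" "S" "T" "I" "S" "M" "E"
## [52] "R" "E" "L" "Y" "T" "E" "N" "A" "C" "I" "T" "Y"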


str_extract_all(txt,"[0-9]")


## [[1]]
## [1] "1" "2" "3" "4"


str_extract_all(txt,"[a-zA-Z]")


## [[1]]
##  [1] "I" "f" "i" "t" "d" "o" "e" "s" "n" "t" "c" "h" "a" "l" "l" "e" "n"
## [18] "g" "e" "y" "o" "u" "i" "t" "w" "o" "n" "t" "c" "h" "a" "n" "g" "e"
## [35] "y" "o" "u" "T" "H" "E" "M" "O" "S" "T" "D" "I" "F" "F" "I" "C" "U"
## [52] "L" "T" "T" "H" "I" "N" "G" "I" "S" "T" "H" "E" "D" "E" "C" "I" "S"
## [69] "I" "O" "N" "T" "O" "A" "C" "T" "T" "H" "E" "R" "E" "S" "T" "I" "S"
## [86] "M" "E" "R" "E" "L" "Y" "T" "E" "N" "A" "C" "I" "T" "Y"


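OK, I know reading the letters one at a time is tiring, so let's jump ahead and learn what `+` does in a regular expression: `+` means the element just before it must match at least once. For example, `a+` matches `a`, `aa`, `aaa`, and so on.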

str_extract_all(txt,"[a-zA-z]+")


## [[1]]
##  [1] "If"        "it"        "doesn"     "t"         "challenge"
##  [6] "you"       "it"        "won"       "t"         "change"
## [11] "you"       "_"         "_"         "THE"       "MOST"
## [16] "DIFFICULT" "THING"     "IS"        "THE"       "DECISION"
## [21] "TO"        "ACT"       "THE"       "REST"      "IS"
## [26] "MERELY"    "TENACITY"  "_"         "_"


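That reads much better. Note, though, that underscores show up in the result. This is not a side effect of merging a-z with A-Z: the pattern above actually ends in `A-z` (lowercase z), and the range from `A` to `z` also covers the six ASCII characters between `Z` and `a`, namely `[`, `\`, `]`, `^`, `_` and the backtick. Written as `[a-zA-Z]+`, the pattern would not match the underscores, and most other regex engines behave the same way. This typo is easy to make, so watch out for it.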
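To see where the underscores come from, here is a minimal sketch on a made-up string (`s` is a hypothetical example, not part of the text above). It assumes an ASCII-ordered range, which is how the pattern above behaved: `A-z` spans code points 65 to 122 and so includes the six characters between `Z` and `a`, while `A-Z` does not.

```r
library(stringr)

# The six ASCII characters that sit between "Z" (90) and "a" (97)
between <- intToUtf8(91:96, multiple = TRUE)
between

# Hypothetical sample string containing underscores
s <- "If it_works, fine_"

# Typo'd range A-z: underscores fall inside the range and are matched
with_typo <- str_extract_all(s, "[a-zA-z]+")[[1]]
with_typo   # "If" "it_works" "fine_"

# Intended range A-Z: underscores split the words instead
fixed <- str_extract_all(s, "[a-zA-Z]+")[[1]]
fixed       # "If" "it" "works" "fine"
```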

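Since [a-z] and [A-Z] can be merged into [a-zA-Z], the obvious next step is to keep adding things inside `[]`. For example, the English sentences are broken up by spaces, apostrophes, half-width commas and periods, so let's add the space, `'`, `,` and `.` to the class: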

str_extract_all(txt,"[a-zA-z ',.]+")


## [[1]]
## [1] " If it doesn't challenge you, it won't change you._"
## [2] " "
## [3] "_"
## [4] " THE MOST DIFFICULT THING IS THE DECISION TO ACT, THE REST IS MERELY TENACITY._"
## [5] " "
## [6] "_"


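As you can see, the English sentences are now extracted whole. (The trailing underscores come along because the class still uses `A-z`, and the lone spaces are the ones that follow the 2 and the 4 on the Chinese lines.)

That's enough English; on to Chinese at last.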

str_extract_all(txt,"[\u4e00-\u9fa5]+")


## [[1]]
## [1] "如果一件事不能难住你" "它也不能改变你"       "最困难的是下决定"
## [4] "剩下的只是坚持"


What on earth are `\u4e00` and `\u9fa5`? Some bewitching little gremlins? My guess is that they are character codes of some kind, so let's test:

iconv("\u4e00","UTF-8","GBK")


## [1] "一"


iconv("\u9fa5","UTF-8","GBK")


## [1] "龥"


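Right, they are indeed character codes: `\u4e00` and `\u9fa5` are escapes for the Unicode code points U+4E00 and U+9FA5 (UTF-8 is just one way of encoding those code points as bytes). The logic is the same as for a-z: take the characters at the two ends of a commonly used range of CJK ideographs as the endpoints, and every character in between is matched.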
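A small base-R sketch of the difference between a code point and its UTF-8 bytes. `utf8ToInt()`, `sprintf()`, `charToRaw()` and `enc2utf8()` are base R functions; the variable names are only for illustration.

```r
# "\u4e00" is an escape for the character with Unicode code point U+4E00
first_char <- "\u4e00"
last_char  <- "\u9fa5"

# Code points as integers, formatted in the usual U+XXXX notation
code_points <- c(utf8ToInt(first_char), utf8ToInt(last_char))
labels <- sprintf("U+%04X", code_points)
labels       # "U+4E00" "U+9FA5"

# The UTF-8 encoding of U+4E00 is three bytes, which differ from the code point
utf8_bytes <- as.character(charToRaw(enc2utf8(first_char)))
utf8_bytes   # "e4" "b8" "80"
```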

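By the same logic, we should also be able to write the pattern as `[一-龥]`. Let's try it. You don't recognize that last character? Pay attention, class: it is pronounced yù (yu4). Don't ask how I know; I only just looked it up myself.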

str_extract_all(txt,"[一-龥]+")


## [[1]]
## [1] "如果一件事不能难住你" "它也不能改变你"       "最困难的是下决定"
## [4] "剩下的只是坚持"


The experiment works: `[一-龥]` correctly returns the result we wanted.

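From experience (not that I have much), character classes also obey a kind of "associative law": `[[a-z][A-Z]]` = `[a-zA-Z]`. Just remember not to leave off the outermost `[]`. Note that nesting classes like this is a feature of the ICU engine that stringr uses; not every regex flavor accepts it.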
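A hedged sketch of that equivalence, assuming stringr's ICU engine (other flavors may reject nested classes) and a made-up string `s`. It also shows what happens when the outer `[]` is left off: the pattern then means "one lowercase letter followed by one uppercase letter".

```r
library(stringr)

s <- "aB cd EF"  # hypothetical sample

# Nested classes are unioned, so both patterns match the same runs of letters
nested <- str_extract_all(s, "[[a-z][A-Z]]+")[[1]]
merged <- str_extract_all(s, "[a-zA-Z]+")[[1]]
identical(nested, merged)   # both give "aB" "cd" "EF"

# Without the outer brackets: two classes in sequence, not a union
in_sequence <- str_extract_all(s, "[a-z][A-Z]")[[1]]
in_sequence   # "aB"
```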


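One metacharacter deserves special mention: `.`, the half-width period. It matches almost anything, namely any single character except a newline. For example, `a.c` can match "abc", "acc", "a啊c" and "a c", but not "ac" or "abbc". A quick test: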

txt_dot <- "ac,abc,abbc,acc,accc,a c" # separate variable, so txt keeps the sentences used in the next section
str_extract_all(txt_dot,"a.c")


## [[1]]
## [1] "abc" "acc" "acc" "a c"


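Simple character class subtraction

Going a step further, to exclude certain characters from a class, put a minus sign `-` between two bracketed classes inside an outer `[]`. This set difference is supported by the ICU engine behind stringr (ICU's documentation writes the operator as `--`). For example, to drop the lowercase vowels `aeiou` (uppercase vowels stay, since they are not listed):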


str_extract_all(txt,"[[a-zA-z ',.]-[aeiou]]")


## [[1]]
##   [1] " " "I" "f" " " "t" " " "d" "s" "n" "'" "t" " " "c" "h" "l" "l" "n"
##  [18] "g" " " "y" "," " " "t" " " "w" "n" "'" "t" " " "c" "h" "n" "g" " "
##  [35] "y" "." "_" " " "_" " " "T" "H" "E" " " "M" "O" "S" "T" " " "D" "I"
##  [52] "F" "F" "I" "C" "U" "L" "T" " " "T" "H" "I" "N" "G" " " "I" "S" " "
##  [69] "T" "H" "E" " " "D" "E" "C" "I" "S" "I" "O" "N" " " "T" "O" " " "A"
##  [86] "C" "T" "," " " "T" "H" "E" " " "R" "E" "S" "T" " " "I" "S" " " "M"
## [103] "E" "R" "E" "L" "Y" " " "T" "E" "N" "A" "C" "I" "T" "Y" "." "_" " "
## [120] "_"
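As a cross-check on a smaller, made-up string (`s` is hypothetical), the sketch below compares the subtraction form used above with a consonant class written out by hand. It assumes the ICU engine behind stringr; the single `-` between nested classes is taken from the example above rather than from a portable standard.

```r
library(stringr)

s <- "hello world"  # hypothetical sample

# Lowercase letters minus the lowercase vowels
by_subtraction <- str_extract_all(s, "[[a-z]-[aeiou]]")[[1]]

# The same set spelled out by hand: b-d, f-h, j-n, p-t, v-z
by_listing <- str_extract_all(s, "[b-df-hj-np-tv-z]")[[1]]

by_subtraction                          # "h" "l" "l" "w" "r" "l" "d"
identical(by_subtraction, by_listing)   # TRUE
```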


str_extract_all(txt,"[a-zA-z ',.]-[aeiou]") # a common mistake: take a look, what went wrong?


## [[1]]
## character(0)


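From the examples above, it might seem that regex uses the minus sign to mean "not". Let's test that. As the three results show, it does not: outside a class `-` is a literal hyphen, at the start of a class it is also a literal hyphen, and negation is written with `^` at the start of a class.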

str_extract_all(txt,"-[aeiou]")


## [[1]]
## character(0)


str_extract_all(txt,"[-aeiou]")


## [[1]]
##  [1] "i" "o" "e" "a" "e" "e" "o" "u" "i" "o" "a" "e" "o" "u"


str_extract_all(txt,"[^aeiou]")


## [[1]]
##   [1] "\t" "1"  " "  "I"  "f"  " "  "t"  " "  "d"  "s"  "n"  "'"  "t"  " "
##  [15] "c"  "h"  "l"  "l"  "n"  "g"  " "  "y"  ","  " "  "t"  " "  "w"  "n"
##  [29] "'"  "t"  " "  "c"  "h"  "n"  "g"  " "  "y"  "."  "_"  "\n" "\t" "2"
##  [43] " "  "如" "果" "一" "件" "事" "不" "能" "难" "住" "你" "," "它" "也"
##  [57] "不" "能" "改" "变" "你" "。" "_"  "\n" "\t" "3"  " "  "T"  "H"  "E"
##  [71] " "  "M"  "O"  "S"  "T"  " "  "D"  "I"  "F"  "F"  "I"  "C"  "U"  "L"
##  [85] "T"  " "  "T"  "H"  "I"  "N"  "G"  " "  "I"  "S"  " "  "T"  "H"  "E"
##  [99] " "  "D"  "E"  "C"  "I"  "S"  "I"  "O"  "N"  " "  "T"  "O"  " "  "A"
## [113] "C"  "T"  ","  " "  "T"  "H"  "E"  " "  "R"  "E"  "S"  "T"  " "  "I"
## [127] "S"  " "  "M"  "E"  "R"  "E"  "L"  "Y"  " "  "T"  "E"  "N"  "A"  "C"
## [141] "I"  "T"  "Y"  "."  "_"  "\n" "\t" "4"  " "  "最" "困" "难" "的" "是"
## [155] "下" "决" "定" "," "剩" "下" "的" "只" "是" "坚" "持" "。" "_"


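Shorthand character classes

Next come some notations you are likely to run into. Watch out for escaping in R: in a regular expression `\d` means a digit, but in an R string `\` is itself the escape character, so it has to be escaped with another `\`. As a result, `\d` ends up being written as `\\d`.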
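A minimal sketch of the double escaping, using base R plus `str_extract()`; the sample string is made up. The R string literal `"\\d+"` is typed with two backslashes but holds only one, and that shorter text is what the regex engine receives.

```r
library(stringr)

pattern <- "\\d+"

# Four characters between the quotes in source code, three in memory: \ d +
n_chars <- nchar(pattern)
n_chars               # 3

# writeLines() shows the string as the regex engine sees it
writeLines(pattern)   # prints \d+

room <- str_extract("room 42", pattern)
room                  # "42"
```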


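str_extract_all(txt,"\\d") # digits; equivalent to [0-9] for ASCII text
str_extract_all(txt,"\\D") # non-digits; equivalent to [^\\d]
str_extract_all(txt,"\\w") # letters, digits, underscore, and Chinese characters (other engines may cover only ASCII letters, digits and underscore)
str_extract_all(txt,"\\W") # anything \w does not match (punctuation, \t, \n, spaces, etc.)
str_extract_all(txt,"\\s") # whitespace (space, \t, \n, etc.), not only the space character
str_extract_all(txt,"\\S") # non-whitespace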
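Since the calls above are shown without output, here is a small sketch on a made-up string so the classes can be compared side by side. It assumes stringr's ICU engine, where `\w` is Unicode-aware; `\u4e2d` is the character 中, written as an escape to keep the snippet ASCII-only.

```r
library(stringr)

# letter, digit, underscore, Chinese character, space, tab, "!"
s <- "a1_\u4e2d \t!"

digits     <- str_extract_all(s, "\\d")[[1]]   # "1"
word_chars <- str_extract_all(s, "\\w")[[1]]   # "a" "1" "_" and the Chinese character
non_word   <- str_extract_all(s, "\\W")[[1]]   # " " "\t" "!"
spaces     <- str_extract_all(s, "\\s")[[1]]   # " " "\t"
```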


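POSIX character classes

Another notation you may come across is the POSIX character class, used inside a bracket expression. What each one does is noted in the comments (I have not tried them all in R, so there may be inaccuracies); I won't go into more detail.

[:alnum:] # letters and digits

[:alpha:] # letters

[:blank:] # space and tab

[:cntrl:] # control characters

[:digit:] # digits

[:graph:] # visible characters (everything except whitespace and control characters)

[:lower:] # lowercase letters

[:print:] # like [:graph:], but also includes the space character (my understanding: printable characters)

[:punct:] # punctuation

[:space:] # all whitespace ([:blank:], newline, carriage return, etc.)

[:upper:] # uppercase letters

[:xdigit:] # hexadecimal digits (0-9a-fA-F)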
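A hedged sketch of how these are typically written in practice: the class name goes inside an outer bracket expression, e.g. `[[:digit:]]`, which is the spelling that also works in base R. The sample string is made up, and the results assume stringr's ICU engine, whose definition of `[:punct:]` may differ slightly from other implementations.

```r
library(stringr)

s <- "Tel: 555-0101, OK?"  # hypothetical sample

numbers <- str_extract_all(s, "[[:digit:]]+")[[1]]
uppers  <- str_extract_all(s, "[[:upper:]]")[[1]]
puncts  <- str_extract_all(s, "[[:punct:]]")[[1]]

numbers   # "555" "0101"
uppers    # "T" "O" "K"
puncts    # ":" "-" "," "?"
```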

Quantifiers (code placeholder)

txt <- "aaaaaabbbbbccccc"
# *
str_extract(txt,"a*") # * is greedy: the element before * repeats 0 or more times


## [1] "aaaaaa"


str_extract(txt,"a*c") # this is what "repeated 0 times" looks like


## [1] "c"


# +
str_extract(txt,"a+") # + matches at least once


## [1] "aaaaaa"


str_extract(txt,"a+c") # this is what "at least once" means: no a is directly followed by c, so there is no match


## [1] NA


# ? works best combined with a group
txt <- "世界是你们的,也是我们的,但终究是他们的,而不是他的"
# given a piece of text, find every "他" or "他们"
str_extract_all(txt,"他(们)?")


## [[1]]
## [1] "他们" "他"


txt <- "我有一堆手机尾号,比如6666,1349,6688,8888,1666,6661,7788,请为我挑出吉祥号"
str_extract_all(txt,"6{4}") # four of a kind: 6666


## [[1]]
## [1] "6666"


str_extract_all(txt,"6{4}|8{4}") # four of a kind: 6666 or 8888


## [[1]]
## [1] "6666" "8888"


str_extract_all(txt,"6{2}8{2}") # aabb


## [[1]]
## [1] "6688"


str_extract_all(txt,"6{3,}.") # at least three 6s, but the second match "666," is missing its leading digit (it comes from 1666)


## [[1]]
## [1] "6666," "666,"  "6661"


str_extract_all(txt,"[0-9]?6{3,}.") # at least three 6s, with an optional leading digit


## [[1]]
## [1] "6666," "1666," "6661"


str_extract_all(txt,"(6|7){2}8{2}")


## [[1]]
## [1] "6688" "7788"


txt <- "我新进了一批吉祥号9999,999,99999,但是999999我想自己保留"
str_extract_all(txt,"9{3,}")


## [[1]]
## [1] "9999"   "999"    "99999"  "999999"


str_extract_all(txt,"9{3,5}")


## [[1]]
## [1] "9999"  "999"   "99999" "99999"


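Greedy vs. lazy matching

Regex quantifiers are greedy by default.

`?` on its own means "repeat 0 or 1 times", but placed after another quantifier it switches that quantifier to lazy (minimal) matching:

*? repeat any number of times, but as few as possible

+? repeat 1 or more times, but as few as possible

?? repeat 0 or 1 times, but as few as possible

{n,m}? repeat n to m times, but as few as possible

{n,}? repeat at least n times, but as few as possible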
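The rules listed above can be sketched on a made-up string of four a's, comparing greedy quantifiers with their lazy counterparts:

```r
library(stringr)

s <- "aaaa"  # hypothetical sample

greedy_plus  <- str_extract(s, "a+")       # "aaaa": as many as possible
lazy_plus    <- str_extract(s, "a+?")      # "a": as few as possible, but at least 1
greedy_range <- str_extract(s, "a{2,3}")   # "aaa"
lazy_range   <- str_extract(s, "a{2,3}?")  # "aa"
lazy_open    <- str_extract(s, "a{2,}?")   # "aa"
```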

For example, a web scraper wants to pull the content of particular HTML nodes:

txt <- "<p>这是第一段文字</p><p>这是第二段文字</p>"
str_extract_all(txt,"<p>(.*)</p>") # starts at <p>, ends at </p>, any number of . in between; the two nodes are not split apart


## [[1]]
## [1] "<p>这是第一段文字</p><p>这是第二段文字</p>"


str_extract_all(txt,"<p>(.*?)</p>") # * is greedy, *? is the shortest match


## [[1]]
## [1] "<p>这是第一段文字</p>" "<p>这是第二段文字</p>"
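If only the text inside the tags is wanted, the parentheses already capture it. A minimal sketch using stringr's `str_match_all()`, which returns one matrix per input string with the full match in column 1 and each capture group in the following columns (the sample string and variable names here are illustrative):

```r
library(stringr)

html <- "<p>first paragraph</p><p>second paragraph</p>"  # hypothetical sample

m <- str_match_all(html, "<p>(.*?)</p>")[[1]]
full_tags  <- m[, 1]   # whole matches, tags included
inner_text <- m[, 2]   # contents of the first capture group
inner_text             # "first paragraph" "second paragraph"
```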


Capture groups

txt <- "6666,6767,676767,6768,6868,010101"
str_extract_all(txt,"(67)\\1")


## [[1]]
## [1] "6767" "6767"


str_extract_all(txt,"([0-9]{2})\\1")


## [[1]]
## [1] "6666" "6767" "6767" "6868" "0101"
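A back-reference such as `\\1` matches the same text that group 1 just captured, not merely another string that fits the group's pattern. The sketch below, on two made-up numbers, contrasts it with plain repetition of the group:

```r
library(stringr)

nums <- c("6767", "6768")  # hypothetical samples

# \\1 must repeat exactly what ([0-9]{2}) captured
same_pair <- str_detect(nums, "([0-9]{2})\\1")
same_pair   # TRUE FALSE

# {2} only repeats the pattern, so any four digits pass
any_pairs <- str_detect(nums, "([0-9]{2}){2}")
any_pairs   # TRUE TRUE
```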


txt <- "0101,012012,00110011,010101,01010101"
str_extract_all(txt,"([0-9]{2,})\\1") # in the last item, 0101 is captured as the group


## [[1]]
## [1] "0101"     "012012"   "00110011" "0101"     "01010101"


str_extract_all(txt,"([0-9]{2,})+")


## [[1]]
## [1] "0101"     "012012"   "00110011" "010101"   "01010101"


txt <- "我我说的话有重复重复,不服不服不服你来打我呀打我呀打我呀打我呀"
str_extract_all(txt,"([\u4e00-\u9fa5]+)\\1") # the group captures 我 and \\1 repeats it, giving 我我; since + is greedy, the group for 打我呀 grows to 打我呀打我呀 before \\1 repeats it


## [[1]]
## [1] "我我"                     "重复重复"
## [3] "不服不服"                 "打我呀打我呀打我呀打我呀"


str_extract_all(txt,"([\u4e00-\u9fa5]+?)\\1")


## [[1]]
## [1] "我我"         "重复重复"     "不服不服"     "打我呀打我呀"
## [5] "打我呀打我呀"


str_extract_all(txt,"([\u4e00-\u9fa5]+)\\1+")


## [[1]]
## [1] "我我"                     "重复重复"
## [3] "不服不服不服"             "打我呀打我呀打我呀打我呀"


str_extract_all(txt,"([\u4e00-\u9fa5]+)\\1?") # think about it: what went wrong here?


## [[1]]
## [1] "我我说的话有重复重复"
## [2] "不服不服不服你来打我呀打我呀打我呀打我呀"


Anchors (to do)

Lookaround (to do)

Worked examples (to do)

Tags: regular expressions, R language