R Language: A First Look at Regular Expressions v0.1
2017-12-30 21:25
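Environment
Basic character classes
Simple character class subtraction
Shorthand character classes
POSIX character classes
Quantifiers (code placeholder)
Greedy vs. lazy matching
Capture groups
Anchors (to be written)
Lookaround (to be written)
Worked examples (to be written)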
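———–Update log————
2017.12.30 Code placeholders added && happy new year !! [manual happy emoticon]
2017.12.31 Re-typeset and reordered in Markdown, revised the sample text
———–Divider———
Reposting is welcome; please credit the source: http://write.blog.csdn.net/mdeditor#!postId=78939587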
Environment
version
##                _
## platform       x86_64-w64-mingw32
## arch           x86_64
## os             mingw32
## system         x86_64, mingw32
## status
## major          3
## minor          4.3
## year           2017
## month          11
## day            30
## svn rev        73796
## language       R
## version.string R version 3.4.3 (2017-11-30)
## nickname       Kite-Eating Tree
Sys.getlocale()
## [1] "LC_COLLATE=Chinese (Simplified)_China.936;LC_CTYPE=Chinese (Simplified)_China.936;LC_MONETARY=Chinese (Simplified)_China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_China.936"
library(stringr)
str_extract_all("abbccc","b")
## [[1]]
## [1] "b" "b"
Basic character classes
txt <- "\t1 If it doesn't challenge you, it won't change you._\n\t2 如果一件事不能难住你,它也不能改变你。_\n\t3 THE MOST DIFFICULT THING IS THE DECISION TO ACT, THE REST IS MERELY TENACITY._\n\t4 最困难的是下决定,剩下的只是坚持。_"
cat(txt)
## 1 If it doesn't challenge you, it won't change you._
## 2 如果一件事不能难住你,它也不能改变你。_
## 3 THE MOST DIFFICULT THING IS THE DECISION TO ACT, THE REST IS MERELY TENACITY._
## 4 最困难的是下决定,剩下的只是坚持。_
str_extract_all(txt,"[a-z]") # lowercase letters
## [[1]]
##  [1] "f" "i" "t" "d" "o" "e" "s" "n" "t" "c" "h" "a" "l" "l" "e" "n" "g"
## [18] "e" "y" "o" "u" "i" "t" "w" "o" "n" "t" "c" "h" "a" "n" "g" "e" "y"
## [35] "o" "u"
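str_extract_all(txt,"[A-Z]") # uppercase letters
## [[1]]
##  [1] "I" "T" "H" "E" "M" "O" "S" "T" "D" "I" "F" "F" "I" "C" "U" "L" "T"
## [18] "T" "H" "I" "N" "G" "I" "S" "T" "H" "E" "D" "E" "C" "I" "S" "I" "O"
## [35] "N" "T" "O" "A" "C" "T" "T" "H" "E" "R" "E" "S" "T" "I" "S" "M" "E"
## [52] "R" "E" "L" "Y" "T" "E" "N" "A" "C" "I" "T" "Y"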
str_extract_all(txt,"[0-9]")
## [[1]]
## [1] "1" "2" "3" "4"
str_extract_all(txt,"[a-zA-Z]")
## [[1]]
##  [1] "I" "f" "i" "t" "d" "o" "e" "s" "n" "t" "c" "h" "a" "l" "l" "e" "n"
## [18] "g" "e" "y" "o" "u" "i" "t" "w" "o" "n" "t" "c" "h" "a" "n" "g" "e"
## [35] "y" "o" "u" "T" "H" "E" "M" "O" "S" "T" "D" "I" "F" "F" "I" "C" "U"
## [52] "L" "T" "T" "H" "I" "N" "G" "I" "S" "T" "H" "E" "D" "E" "C" "I" "S"
## [69] "I" "O" "N" "T" "O" "A" "C" "T" "T" "H" "E" "R" "E" "S" "T" "I" "S"
## [86] "M" "E" "R" "E" "L" "Y" "T" "E" "N" "A" "C" "I" "T" "Y"
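OK, OK, I know reading one letter at a time is tiring, so let's jump ahead and learn what `+` does in a regular expression. `+` means the element right before it must match at least once; for example, `a+` can stand for `a`, `aa`, `aaa`, and so on.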
str_extract_all(txt,"[a-zA-z]+")
## [[1]]
##  [1] "If"        "it"        "doesn"     "t"         "challenge"
##  [6] "you"       "it"        "won"       "t"         "change"
## [11] "you"       "_"         "_"         "THE"       "MOST"
## [16] "DIFFICULT" "THING"     "IS"        "THE"       "DECISION"
## [21] "TO"        "ACT"       "THE"       "REST"      "IS"
## [26] "MERELY"    "TENACITY"  "_"         "_"
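That reads much better. Note, though, that underscores show up in the result. This is not a side effect of merging `a-z` with `A-Z`: the pattern above is written `A-z`, and the range from `A` to `z` also covers the ASCII characters that sit between `Z` and `a`, namely `[`, `\`, `]`, `^`, `_` and the backtick. Writing `[a-zA-Z]+` leaves the underscores out, and the same holds in other regex flavors, so watch the case of that last letter.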
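To see where the underscores come from, here is a small sketch comparing the two spellings of the class. The sample string `s` and the variable names are made up for illustration; only `str_extract_all()` from stringr is used.

```r
library(stringr)

# Hypothetical sample text, not taken from the post
s <- "snake_case and CamelCase"

# "A-z" spans ASCII 65..122, which includes "_" (95)
typo_range <- str_extract_all(s, "[a-zA-z]+")[[1]]

# "A-Z" stops at 90, so "_" splits the word
fixed_range <- str_extract_all(s, "[a-zA-Z]+")[[1]]

print(typo_range)   # "snake_case" "and" "CamelCase"
print(fixed_range)  # "snake" "case" "and" "CamelCase"
```

The only difference between the two patterns is the case of the final letter, which is easy to miss when typing quickly.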
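Since `[a-z]` and `[A-Z]` can be merged into `[a-zA-Z]`, the obvious next step is to keep adding things inside the `[]`. The English sentences here are broken up by spaces, apostrophes, commas, and periods, so let's add ` `, `'`, `,`, and `.` to the class: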
str_extract_all(txt,"[a-zA-z ',.]+")
## [[1]]
## [1] " If it doesn't challenge you, it won't change you._"
## [2] " "
## [3] "_"
## [4] " THE MOST DIFFICULT THING IS THE DECISION TO ACT, THE REST IS MERELY TENACITY._"
## [5] " "
## [6] "_"
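As you can see, the English sentences are now extracted whole (the trailing `_` comes along because the class is written `A-z`).
After all that English, it is finally time for Chinese.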
str_extract_all(txt,"[\u4e00-\u9fa5]+")
## [[1]]
## [1] "如果一件事不能难住你" "它也不能改变你"       "最困难的是下决定"
## [4] "剩下的只是坚持"
What on earth are `\u4e00` and `\u9fa5`? Some kind of mysterious character code, I guess. Let's test:
iconv("\u4e00","UTF-8","GBK")
## [1] "一"
iconv("\u9fa5","UTF-8","GBK")
## [1] "龥"
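Yes: these are Unicode escapes, i.e. the code points U+4E00 and U+9FA5. The idea is the same as `a-z`: take the first and last Chinese characters of the block as the endpoints, and the range covers every character in between.
By the same logic we should also be able to write the regex as `[一-龥]`. Let's try it. You don't recognize that last character? Pay attention, class: it is pronounced yù (fourth tone). Don't ask how I know; I just looked it up too.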
str_extract_all(txt,"[一-龥]+")
## [[1]]
## [1] "如果一件事不能难住你" "它也不能改变你"       "最困难的是下决定"
## [4] "剩下的只是坚持"
The experiment works: `[一-龥]` returns exactly the result we wanted.
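A quick way to confirm that the two endpoints are plain Unicode code points is to convert them to integers with base R's `utf8ToInt()`. This is only a sketch; the sample string `s` is invented, and the Chinese characters are written as `\u` escapes so the snippet does not depend on the file encoding.

```r
library(stringr)

# Code points of the two range endpoints
first_cp <- utf8ToInt("\u4e00")  # 19968, i.e. 0x4E00
last_cp  <- utf8ToInt("\u9fa5")  # 40869, i.e. 0x9FA5

# Hypothetical sample: "R", a space, two Chinese characters (U+8BED U+8A00), " regex"
s <- "R \u8bed\u8a00 regex"

# The range keeps only the run of Chinese characters
cjk <- str_extract_all(s, "[\u4e00-\u9fa5]+")[[1]]
print(cjk)
```

Writing the class with escapes or with the literal characters gives the same pattern; the escapes are simply easier to type and survive a non-UTF-8 editor.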
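From experience (not that I have much), character classes also obey a kind of "associative law": nested classes combine as a union, so `[[a-z][A-Z]]` = `[a-zA-Z]`. Just remember not to drop the outermost `[]`. Nesting works in stringr here, but it is not available in every regex flavor.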
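One metacharacter deserves special mention: `.`, the plain period, which matches almost anything. It stands for any single character except a newline. For example, `a.c` can match "abc", "acc", "a啊c", and "a c", but not "ac" or "abbc". Let's check: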
txt2 <- "ac,abc,abbc,acc,accc,a c" # separate variable, so txt still holds the four sentences used below
str_extract_all(txt2,"a.c")
## [[1]]
## [1] "abc" "acc" "acc" "a c"
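The "except a newline" part is easy to verify. The sketch below uses an invented string `s`; `regex(..., dotall = TRUE)` is stringr's option for letting `.` match line breaks as well.

```r
library(stringr)

# Hypothetical sample: one "abc" and one "a", newline, "c"
s <- "abc a\nc"

# By default "." does not cross the line break
default_dot <- str_extract_all(s, "a.c")[[1]]

# With dotall = TRUE, "." also matches "\n"
with_dotall <- str_extract_all(s, regex("a.c", dotall = TRUE))[[1]]

print(default_dot)  # "abc"
print(with_dotall)  # "abc" "a\nc"
```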
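Simple character class subtraction
Going one step further, to exclude certain characters from a class, subtract a second class with a minus sign `-`. For example, to drop the lowercase vowels `aeiou`: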
str_extract_all(txt,"[[a-zA-z ',.]-[aeiou]]")
## [[1]]
##   [1] " " "I" "f" " " "t" " " "d" "s" "n" "'" "t" " " "c" "h" "l" "l" "n"
##  [18] "g" " " "y" "," " " "t" " " "w" "n" "'" "t" " " "c" "h" "n" "g" " "
##  [35] "y" "." "_" " " "_" " " "T" "H" "E" " " "M" "O" "S" "T" " " "D" "I"
##  [52] "F" "F" "I" "C" "U" "L" "T" " " "T" "H" "I" "N" "G" " " "I" "S" " "
##  [69] "T" "H" "E" " " "D" "E" "C" "I" "S" "I" "O" "N" " " "T" "O" " " "A"
##  [86] "C" "T" "," " " "T" "H" "E" " " "R" "E" "S" "T" " " "I" "S" " " "M"
## [103] "E" "R" "E" "L" "Y" " " "T" "E" "N" "A" "C" "I" "T" "Y" "." "_" " "
## [120] "_"
str_extract_all(txt,"[a-zA-z ',.]-[aeiou]") # a common mistake: can you spot what is wrong? (hint: outside the outer [], - is a literal hyphen)
## [[1]]
## character(0)
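Here is the same subtraction on a shorter string, next to the "mistake" form, so the difference is easier to see. This is a sketch under two assumptions: the sample strings are invented, and the single-dash set difference is relied on only because the output above shows it working in stringr; other regex engines may reject it. A portable spelling with explicit ranges is included for comparison.

```r
library(stringr)

# Hypothetical sample text
s <- "hello world"

# Nested class minus nested class: lowercase letters that are not vowels
consonants <- str_extract_all(s, "[[a-z]-[aeiou]]")[[1]]

# Portable alternative: spell out the consonant ranges by hand
portable <- str_extract_all(s, "[b-df-hj-np-tv-z]")[[1]]

# Without the outer brackets the pattern means:
# one lowercase letter, a literal "-", then one vowel
literal_dash <- str_extract_all("x-a hello", "[a-z]-[aeiou]")[[1]]

print(consonants)    # "h" "l" "l" "w" "r" "l" "d"
print(literal_dash)  # "x-a"
```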
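The examples above might suggest that regex uses the minus sign to mean "not". Let's test that guess. It turns out to be wrong: `-[aeiou]` looks for a literal hyphen followed by a vowel, `[-aeiou]` matches a hyphen or a vowel, and negation is written with a leading `^` inside the class.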
str_extract_all(txt,"-[aeiou]")
## [[1]]
## character(0)
str_extract_all(txt,"[-aeiou]")
## [[1]]
##  [1] "i" "o" "e" "a" "e" "e" "o" "u" "i" "o" "a" "e" "o" "u"
str_extract_all(txt,"[^aeiou]")
## [[1]]
##   [1] "\t" "1"  " "  "I"  "f"  " "  "t"  " "  "d"  "s"  "n"  "'"  "t"  " "
##  [15] "c"  "h"  "l"  "l"  "n"  "g"  " "  "y"  ","  " "  "t"  " "  "w"  "n"
##  [29] "'"  "t"  " "  "c"  "h"  "n"  "g"  " "  "y"  "."  "_"  "\n" "\t" "2"
##  [43] " "  "如" "果" "一" "件" "事" "不" "能" "难" "住" "你" "," "它" "也"
##  [57] "不" "能" "改" "变" "你" "。" "_"  "\n" "\t" "3"  " "  "T"  "H"  "E"
##  [71] " "  "M"  "O"  "S"  "T"  " "  "D"  "I"  "F"  "F"  "I"  "C"  "U"  "L"
##  [85] "T"  " "  "T"  "H"  "I"  "N"  "G"  " "  "I"  "S"  " "  "T"  "H"  "E"
##  [99] " "  "D"  "E"  "C"  "I"  "S"  "I"  "O"  "N"  " "  "T"  "O"  " "  "A"
## [113] "C"  "T"  ","  " "  "T"  "H"  "E"  " "  "R"  "E"  "S"  "T"  " "  "I"
## [127] "S"  " "  "M"  "E"  "R"  "E"  "L"  "Y"  " "  "T"  "E"  "N"  "A"  "C"
## [141] "I"  "T"  "Y"  "."  "_"  "\n" "\t" "4"  " "  "最" "困" "难" "的" "是"
## [155] "下" "决" "定" "," "剩" "下" "的" "只" "是" "坚" "持" "。" "_"
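The three patterns are easier to tell apart on a string that actually contains a hyphen. A small sketch with an invented sample string:

```r
library(stringr)

# Hypothetical sample text containing a hyphen
s <- "re-use audio"

# A literal hyphen followed by one vowel
dash_then_vowel <- str_extract_all(s, "-[aeiou]")[[1]]

# Inside a class, a leading "-" is just another member: hyphen OR vowel
dash_or_vowel <- str_extract_all(s, "[-aeiou]")[[1]]

# A leading "^" negates the class: anything except a lowercase vowel
not_vowel <- str_extract_all(s, "[^aeiou]")[[1]]

print(dash_then_vowel)  # "-u"
print(dash_or_vowel)    # "e" "-" "u" "e" "a" "u" "i" "o"
print(not_vowel)        # "r" "-" "s" " " "d"
```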
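Shorthand character classes
Next come some shorthand notations you are likely to run into. Mind the escaping in R: in a regular expression `\d` means a digit, but in an R string `\` is itself the escape character, so it has to be escaped with another `\`. As a result `\d` ends up being written as `\\d`.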
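str_extract_all(txt,"\\d") # digits, same as [0-9]
str_extract_all(txt,"\\D") # non-digits, same as [^\\d]
str_extract_all(txt,"\\w") # letters, digits, underscore, and Chinese characters (other languages may only cover letters, digits, underscore)
str_extract_all(txt,"\\W") # anything \w does not match (punctuation, \t, \n, spaces, etc.)
str_extract_all(txt,"\\s") # whitespace (space, \t, \n, etc.)
str_extract_all(txt,"\\S") # non-whitespace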
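Since the calls above are listed without output, here is a small sketch on a short invented string showing what each shorthand picks up. The last line checks the claim that `\\w` covers Chinese characters in stringr, using a `\u` escape for the character.

```r
library(stringr)

# Hypothetical sample: two letters, a space, two digits, an underscore, a newline
s <- "ab 12_\n"

digits     <- str_extract_all(s, "\\d")[[1]]  # "1" "2"
word_chars <- str_extract_all(s, "\\w")[[1]]  # "a" "b" "1" "2" "_"
spaces     <- str_extract_all(s, "\\s")[[1]]  # " " "\n"
non_word   <- str_extract_all(s, "\\W")[[1]]  # " " "\n"

# A single Chinese character (U+8BED) counts as a word character here
cjk_is_word <- str_detect("\u8bed", "\\w")

print(word_chars)
print(cjk_is_word)
```

Note that `\\s` matches the newline as well as the space, so "whitespace" is a better description than "space".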
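POSIX character classes
Another notation you may run into is the POSIX character class, which is used inside a bracket expression (for example `[[:alnum:]]`). Their meanings are listed below (I have not tried them all in R, so there may be discrepancies); I will not go into more detail.
[:alnum:] # letters and digits
[:alpha:] # letters
[:blank:] # space and tab
[:cntrl:] # control characters
[:digit:] # digits
[:graph:] # visible characters (everything except whitespace and control characters)
[:lower:] # lowercase letters
[:print:] # like [:graph:], but also includes the space character (I read it as "printable characters")
[:punct:] # punctuation
[:space:] # all whitespace characters ([:blank:], newline, carriage return, etc.)
[:upper:] # uppercase letters
[:xdigit:] # hexadecimal digits (0-9a-fA-F)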
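A few of these classes in action, as a rough sketch on an invented string. The double-bracket form `[[:digit:]]` is used because that is the spelling base R's regex functions require and stringr accepts it too; exact membership of classes such as `[:punct:]` can differ between engines, so treat the results as illustrative.

```r
library(stringr)

# Hypothetical sample text
s <- "Tel: 555-0199"

digit_runs <- str_extract_all(s, "[[:digit:]]+")[[1]]  # runs of digits
alpha_runs <- str_extract_all(s, "[[:alpha:]]+")[[1]]  # runs of letters
puncts     <- str_extract_all(s, "[[:punct:]]")[[1]]   # single punctuation marks

print(digit_runs)  # "555" "0199"
print(alpha_runs)  # "Tel"
print(puncts)      # ":" "-"
```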
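Quantifiers (code placeholder)
txt <- "aaaaaabbbbbccccc"
# *
str_extract(txt,"a*") # * is greedy: the element before * repeats 0 or more times
## [1] "aaaaaa"
str_extract(txt,"a*c") # this is what "repeats 0 times" looks like
## [1] "c"
# +
str_extract(txt,"a+") # + matches at least once
## [1] "aaaaaa"
str_extract(txt,"a+c") # this is what "at least once" looks like
## [1] NA
# ? is best served with something else
txt <- "世界是你们的,也是我们的,但终究是他们的,而不是他的"
# given a passage, count how many times "他" or "他们" appears
str_extract_all(txt,"他(们)?")
## [[1]]
## [1] "他们" "他"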
# the string lists phone-number endings and asks us to pick the lucky ones
txt <- "我有一堆手机尾号,比如6666,1349,6688,8888,1666,6661,7788,请为我挑出吉祥号"
str_extract_all(txt,"6{4}") # four in a row: 6666
## [[1]]
## [1] "6666"
str_extract_all(txt,"6{4}|8{4}") # four in a row: 6666 or 8888
## [[1]]
## [1] "6666" "8888"
str_extract_all(txt,"6{2}8{2}") # aabb
## [[1]]
## [1] "6688"
str_extract_all(txt,"6{3,}.") # at least three 6s, but the second match (from 1666) has lost its leading digit
## [[1]]
## [1] "6666," "666,"  "6661"
str_extract_all(txt,"[0-9]?6{3,}.") # at least three 6s, with an optional leading digit
## [[1]]
## [1] "6666," "1666," "6661"
str_extract_all(txt,"(6|7){2}8{2}")
## [[1]]
## [1] "6688" "7788"
# a new batch of lucky numbers
txt <- "我新进了一批吉祥号9999,999,99999,但是999999我想自己保留"
str_extract_all(txt,"9{3,}")
## [[1]]
## [1] "9999"   "999"    "99999"  "999999"
str_extract_all(txt,"9{3,5}")
## [[1]]
## [1] "9999"  "999"   "99999" "99999"
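The three brace forms can be compared side by side on one short string. A minimal sketch with an invented sample; note how `{2,3}` stops after three characters and leaves the fourth `a` of `aaaa` unmatched, just like the last `99999` above.

```r
library(stringr)

# Hypothetical sample: runs of 1, 2, 3 and 4 "a"s
s <- "a aa aaa aaaa"

exactly_two  <- str_extract_all(s, "a{2}")[[1]]    # exactly 2 per match
two_or_more  <- str_extract_all(s, "a{2,}")[[1]]   # 2 or more, as many as possible
two_to_three <- str_extract_all(s, "a{2,3}")[[1]]  # between 2 and 3

print(exactly_two)   # "aa" "aa" "aa" "aa"
print(two_or_more)   # "aa" "aaa" "aaaa"
print(two_to_three)  # "aa" "aaa" "aaa"
```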
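Greedy vs. lazy matching
Regex matching is greedy by default. On its own `?` means "repeat 0 or 1 time", but placed right after another quantifier it switches that quantifier to lazy (minimal) matching:
*? repeat any number of times, but as few as possible
+? repeat 1 or more times, but as few as possible
?? repeat 0 or 1 time, but as few as possible
{n,m}? repeat n to m times, but as few as possible
{n,}? repeat n or more times, but as few as possible
For example, a web scraper wants to grab the content of a particular HTML node: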
txt <- "<p>这是第一段文字</p><p>这是第二段文字</p>"
str_extract_all(txt,"<p>(.*)</p>") # starts with <p>, ends with </p>, any number of . in between; the two nodes are not separated
## [[1]]
## [1] "<p>这是第一段文字</p><p>这是第二段文字</p>"
str_extract_all(txt,"<p>(.*?)</p>") # * is greedy, *? takes the shortest match
## [[1]]
## [1] "<p>这是第一段文字</p>" "<p>这是第二段文字</p>"
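If only the text inside the tags is wanted, one option is to read the capture group instead of the whole match. The sketch below uses stringr's `str_match_all()`, which returns one matrix per input string with the full match in column 1 and each group in the following columns; the sample HTML is invented.

```r
library(stringr)

# Hypothetical sample HTML
html <- "<p>first</p><p>second</p>"

# Greedy: one match that swallows both nodes
greedy <- str_extract_all(html, "<p>.*</p>")[[1]]

# Lazy, reading group 1 (column 2 of the match matrix)
inner_text <- str_match_all(html, "<p>(.*?)</p>")[[1]][, 2]

print(greedy)      # "<p>first</p><p>second</p>"
print(inner_text)  # "first" "second"
```

For real pages an HTML parser is usually the safer tool; the regex version is only meant to show the greedy/lazy difference.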
Capture groups
txt <- "6666,6767,676767,6768,6868,010101"
str_extract_all(txt,"(67)\\1")
## [[1]]
## [1] "6767" "6767"
str_extract_all(txt,"([0-9]{2})\\1")
## [[1]]
## [1] "6666" "6767" "6767" "6868" "0101"
txt <- "0101,012012,00110011,010101,01010101"
str_extract_all(txt,"([0-9]{2,})\\1") # treats 0101 as one group
## [[1]]
## [1] "0101"     "012012"   "00110011" "0101"     "01010101"
str_extract_all(txt,"([0-9]{2,})+")
## [[1]]
## [1] "0101"     "012012"   "00110011" "010101"   "01010101"
txt <- "我我说的话有重复重复,不服不服不服你来打我呀打我呀打我呀打我呀"
str_extract_all(txt,"([\u4e00-\u9fa5]+)\\1") # the group 我 followed by itself gives 我我; because + is greedy, the group for 打我呀 grows to 打我呀打我呀 before it is repeated
## [[1]]
## [1] "我我"                     "重复重复"
## [3] "不服不服"                 "打我呀打我呀打我呀打我呀"
str_extract_all(txt,"([\u4e00-\u9fa5]+?)\\1")
## [[1]]
## [1] "我我"         "重复重复"     "不服不服"     "打我呀打我呀"
## [5] "打我呀打我呀"
str_extract_all(txt,"([\u4e00-\u9fa5]+)\\1+")
## [[1]]
## [1] "我我"                     "重复重复"
## [3] "不服不服不服"             "打我呀打我呀打我呀打我呀"
str_extract_all(txt,"([\u4e00-\u9fa5]+)\\1?") # think about it: what went wrong? (hint: \\1? is allowed to match nothing)
## [[1]]
## [1] "我我说的话有重复重复"
## [2] "不服不服不服你来打我呀打我呀打我呀打我呀"
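Backreferences are also handy in replacements. As a sketch (the sample strings are invented, and `str_replace_all()` is a stringr function not used elsewhere in this post), the snippet below collapses doubled letters by replacing each match with the contents of group 1, and then reproduces the `\\1?` pitfall on a tiny input.

```r
library(stringr)

# Collapse each doubled letter to a single one: "\\1" in the
# replacement stands for whatever group 1 captured
collapsed <- str_replace_all("aabbcd", "([a-z])\\1", "\\1")

# A two-letter group followed by the same two letters
doubled <- str_extract_all("abab cdcd xyz", "([a-z]{2})\\1")[[1]]

# Pitfall: with "\\1?" the greedy group takes the whole run and the
# optional backreference then matches nothing
optional_ref <- str_extract_all("abab", "([a-z]+)\\1?")[[1]]

print(collapsed)     # "abcd"
print(doubled)       # "abab" "cdcd"
print(optional_ref)  # "abab"
```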
Anchors (to be written)
Lookaround (to be written)
Worked examples (to be written)