[Getting and Cleaning data] Week 4
2016-03-17 09:25
Week 4
Editing text variables
Regular expressions
Working with dates
More details can be found in the HTML file here
Week 4
Editing text variables
Important points about text in data sets
Names of variables should be:
- All lower case when possible
- Descriptive (Diagnosis versus Dx)
- Not duplicated
- Free of underscores, dots, and white space
Variables with character values:
- Should usually be made into factor variables (depending on the application)
- Should be descriptive (use TRUE/FALSE instead of 0/1, and Male/Female versus 0/2 or M/F)
Step 1: Fixing character vectors: toupper() and tolower() functions.
# Create the data directory if it does not exist yet
if (!file.exists("./data")) dir.create("./data")
# Download and load the camera data set
fileUrl <- "https://data.baltimorecity.gov/api/views/dz54-2aru/rows.csv?accessType=DOWNLOAD"
download.file(fileUrl, destfile = "./data/cameras.csv")
cameraData <- read.csv("./data/cameras.csv")
names(cameraData)
# Convert all variable names to lower case
tolower(names(cameraData))
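As a minimal sketch that runs without downloading anything, the same conversion can be tried on a hand-made vector. The names in varNames are made up for illustration and are not claimed to be the real columns of the camera data.

```r
# Hypothetical column names, chosen only to illustrate the idea
varNames <- c("Address", "Direction", "Location.1")

lowerNames <- tolower(varNames)  # every letter in lower case
upperNames <- toupper(varNames)  # every letter in upper case

print(lowerNames)
print(upperNames)
```

Digits and punctuation pass through both functions unchanged, so only the letters differ between the two results.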
Step 2: Fixing character vectors: strsplit() function.
Good for automatically splitting variable names.
Important parameters: x and split.
# Split each name on a literal dot (escaped, because "." is a regex metacharacter)
splitNames <- strsplit(names(cameraData), "\\.")
splitNames[[5]]
splitNames[[6]]
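The escaping in the split pattern is easy to get wrong, so here is a small self-contained sketch. The vector varNames is hypothetical; fixed = TRUE is a standard strsplit() argument that treats the pattern as a plain string.

```r
# Hypothetical variable names, one of them containing a dot
varNames <- c("address", "Location.1")

# "\\." escapes the dot so that it is matched literally
splitRegex <- strsplit(varNames, "\\.")

# fixed = TRUE disables regular expressions, so no escaping is needed
splitFixed <- strsplit(varNames, ".", fixed = TRUE)

print(splitRegex[[2]])  # "Location" "1"
```

The result is a list with one character vector per input name, which is why the notes index it with double brackets.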
Step 3: Quick aside: lists.
# A list can hold elements of different types; the third element here is unnamed
myList <- list(letters = c("A", "b", "c"), numbers = 1:3, matrix(1:15, 5))
head(myList)
Step 4: Fixing character vectors: sapply().
Applies a function to each element in a vector or list.
Important parameters: X and FUN.
splitNames[[6]][1]
# Keep only the first piece of each split name
firstElement <- function(x) x[1]
sapply(splitNames, firstElement)
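The following sketch repeats the idea on a hand-built list so that it runs on its own. The contents of splitNames here are hypothetical stand-ins for strsplit() output, and vapply() is shown only as a stricter base R alternative, not as something the course notes use.

```r
# Hypothetical stand-in for the output of strsplit()
splitNames <- list("address", c("Location", "1"))

# Return the first element of a character vector
firstElement <- function(x) x[1]

# sapply() simplifies the per-element results to a character vector
firstParts <- sapply(splitNames, firstElement)

# vapply() does the same but checks that each result is one string
firstPartsChecked <- vapply(splitNames, firstElement, character(1))

print(firstParts)  # "address" "Location"
```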
Step 5: Peer review data
if (!file.exists("./data")) dir.create("./data")
# download data set
fileUrl1 <- "https://dl.dropbox.com/u/7710864/data/reviews-apr29.csv"
fileUrl2 <- "https://dl.dropbox.com/u/7710864/data/solutions-apr29.csv"
download.file(fileUrl1, destfile = "./data/reviews.csv")
download.file(fileUrl2, destfile = "./data/solution.csv")
# load data set
reviews <- read.csv("./data/reviews.csv")
solutions <- read.csv("./data/solution.csv")
# view data set
head(reviews, 2)
head(solutions, 2)
Step 6: Fixing character vectors: sub() (replaces the first match).
names(reviews)
sub("_", "", names(reviews))
Step 7: Fixing character vectors: gsub() (replaces globally).
testName <- "this_is_a_test"
sub("_", "", testName)   # removes only the first underscore
gsub("_", "", testName)  # removes every underscore
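The difference between the two functions only shows up when a string contains more than one match. A small sketch on a vector, where the column names are invented for illustration:

```r
# Hypothetical column names; the second one has two underscores
colNames <- c("time_left", "start_time_utc")

firstOnly <- sub("_", "", colNames)   # drops the first underscore in each name
allGone   <- gsub("_", "", colNames)  # drops every underscore in each name

print(firstOnly)  # "timeleft" "starttime_utc"
print(allGone)    # "timeleft" "starttimeutc"
```

Both functions are vectorized over the input, so each name is processed independently.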
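Step 8: Finding values: grep() and grepl() functions.
grep("Alameda", cameraData$intersection)          # returns the indices of matching elements
table(grepl("Alameda", cameraData$intersection))  # grepl() returns TRUE or FALSE per element
# Keep only the rows whose intersection does not contain "Alameda"
cameraData2 <- cameraData[!grepl("Alameda", cameraData$intersection), ]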
Step 9: More on grep().
grep("Alameda", cameraData$intersection, value = TRUE)  # return the values containing "Alameda"
grep("JeffStreet", cameraData$intersection)             # when nothing matches, grep() returns integer(0)
length(grep("JeffStreet", cameraData$intersection))     # a length of 0 means no match
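To see the return types side by side without the downloaded file, here is a sketch on a made-up vector. The intersection strings are hypothetical and only mimic the shape of the camera data.

```r
# Hypothetical intersection names
intersections <- c("Caton Ave & Benson Ave",
                   "The Alameda & 33rd St",
                   "E 33rd St & The Alameda")

idx  <- grep("Alameda", intersections)                # integer positions of matches
hits <- grepl("Alameda", intersections)               # one logical per element
vals <- grep("Alameda", intersections, value = TRUE)  # the matching strings

# Logical output is convenient for subsetting
kept <- intersections[!hits]

# A pattern that matches nothing gives an empty integer vector
none <- grep("JeffStreet", intersections)

print(idx)           # 2 3
print(length(none))  # 0
```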
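Step 10: More useful string functions
library(stringr)
nchar("Jeffrey Leek")         # number of characters
substr("Jeffrey Leek", 1, 7)  # characters 1 through 7
paste("Jeffrey", "Leek")      # joined with a space
paste0("Jeffrey", "Leek")     # joined with no separator
str_trim("Jeff ")             # strips leading and trailing white space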
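For readers without stringr installed, the sketch below uses base R only. It swaps in trimws(), a base function available since R 3.2.0, in place of str_trim(); that substitution is mine and not part of the original notes.

```r
fullName <- paste("Jeffrey", "Leek")   # "Jeffrey Leek"

nameLength <- nchar(fullName)          # 12
firstName  <- substr(fullName, 1, 7)   # "Jeffrey"
joined     <- paste0("Jeffrey", "Leek")  # "JeffreyLeek"

# Base R alternative to stringr::str_trim()
trimmed <- trimws("Jeff      ")        # "Jeff"

print(c(firstName, joined, trimmed))
```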
Regular expressions
A ‘regular expression’ is a pattern that describes a set of strings. Two types of regular expressions are used in R: extended regular expressions (the default) and Perl-like regular expressions, selected with perl = TRUE. There is also fixed = TRUE, which treats the pattern as a literal string.
Here we consider the extended regular expressions used in grep, grepl, regexpr, gregexpr, sub, gsub, and strsplit.
Most characters, including all letters and digits, are regular expressions that match themselves. Any metacharacter with special meaning may be quoted by preceding it with a backslash. The metacharacters in extended regular expressions are . \ | ( ) [ { ^ $ * + ?, but note that whether these have a special meaning depends on the context.
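A short sketch of how the three modes treat the same pattern. The test strings are made up; the lookahead in the last line is a Perl-only feature used here just to show that perl = TRUE changes the syntax available.

```r
x <- c("a.b", "acb")

# Default (extended) mode: "." is a metacharacter matching any character
regexHits <- grepl("a.b", x)                   # TRUE TRUE

# Escaping the dot makes it literal (the backslash is doubled inside an R string)
escapedHits <- grepl("a\\.b", x)               # TRUE FALSE

# fixed = TRUE treats the whole pattern as a plain string
fixedHits <- grepl("a.b", x, fixed = TRUE)     # TRUE FALSE

# perl = TRUE enables Perl-style syntax such as lookahead
perlHits <- grepl("a(?=\\.)", x, perl = TRUE)  # TRUE FALSE

print(rbind(regexHits, escapedHits, fixedHits, perlHits))
```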
Positions
1: ^ matches the beginning of the string.
2: $ matches the end of the string.
3: \b matches the empty string at either edge of a word.
4: \B matches the empty string provided it is not at an edge of a word.
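The position operators above can be sketched on a few made-up strings. Each backslash is doubled because the pattern is written as an R string.

```r
x <- c("cat", "concatenate", "the cat sat", "cats")

startsWith <- grepl("^cat", x)       # TRUE FALSE FALSE TRUE
endsWith   <- grepl("cat$", x)       # TRUE FALSE FALSE FALSE
wholeWord  <- grepl("\\bcat\\b", x)  # TRUE FALSE TRUE FALSE
insideWord <- grepl("\\Bcat\\B", x)  # FALSE TRUE FALSE FALSE

print(data.frame(x, startsWith, endsWith, wholeWord, insideWord))
```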
Quantifiers
1: * matches at least 0 times.
2: + matches at least 1 time.
3: ? matches at most 1 time.
4: {m} matches exactly m times.
5: {m,} matches at least m times.
6: {n,m} matches between n and m times.
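A quick way to check the quantifiers is to apply each one to the letter b between fixed anchors. The strings are invented for the purpose; a quantifier always applies to the item immediately before it.

```r
x <- c("ac", "abc", "abbc", "abbbc")

zeroOrMore  <- grepl("^ab*c$", x)      # TRUE TRUE TRUE TRUE
oneOrMore   <- grepl("^ab+c$", x)      # FALSE TRUE TRUE TRUE
atMostOne   <- grepl("^ab?c$", x)      # TRUE TRUE FALSE FALSE
exactlyTwo  <- grepl("^ab{2}c$", x)    # FALSE FALSE TRUE FALSE
twoOrMore   <- grepl("^ab{2,}c$", x)   # FALSE FALSE TRUE TRUE
oneToTwo    <- grepl("^ab{1,2}c$", x)  # FALSE TRUE TRUE FALSE

print(data.frame(x, zeroOrMore, oneOrMore, atMostOne, exactlyTwo, twoOrMore, oneToTwo))
```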
Others:
1: [ ] matches any one character listed inside the brackets, e.g. [a-z].
2: [^ ] matches any one character not listed inside the brackets.
3: . matches any single character.
4: | separates alternatives: a|b matches either a or b.
5: \ suppresses the special meaning of a metacharacter in a regular expression.
6: ( ) groups expressions.
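The operators in this list can be tried out with a few one-liners. The inputs are made up; the \\1 and \\2 back-references in the replacement string refer to the groups created by the parentheses.

```r
# Alternation inside a group: whole string must be "cat" or "dog"
isPet <- grepl("^(cat|dog)$", c("cat", "dog", "cow"))  # TRUE TRUE FALSE

# Negated bracket expression: delete everything that is not a lower-case letter
lettersOnly <- gsub("[^a-z]", "", "a1b2c3")            # "abc"

# Groups captured with ( ) can be reused in the replacement
swapped <- sub("([a-z]+) ([a-z]+)", "\\2 \\1", "hello world")  # "world hello"

# An unescaped dot matches every character; an escaped dot matches only "."
anyChar    <- gsub(".", "-", "a.b")    # "---"
literalDot <- gsub("\\.", "-", "a.b")  # "a-b"

print(c(lettersOnly, swapped, anyChar, literalDot))
```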
Character classes:
The [:name:] classes are only valid inside a bracket expression, e.g. [[:digit:]]. The ASCII ranges given below assume the C locale.
1: [:digit:] or \d, equivalent to [0-9].
2: [:lower:], equivalent to [a-z].
3: [:upper:], equivalent to [A-Z].
4: [:alpha:], equivalent to [a-zA-Z] or [[:lower:][:upper:]].
5: [:alnum:], equivalent to [A-Za-z0-9] or [[:digit:][:alpha:]].
6: \w, equivalent to [[:alnum:]_] or [A-Za-z0-9_].
7: \W, equivalent to [^A-Za-z0-9_].
8: [:xdigit:] matches 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f.
9: [:blank:] matches space or tab.
10: [:space:] matches tab, newline, vertical tab, form feed, carriage return, space.
11: \s matches a whitespace character.
12: \S matches a non-whitespace character.
13: [:punct:] matches ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.
14: [:graph:], equivalent to [[:alnum:][:punct:]].
15: [:print:], equivalent to [[:alnum:][:punct:] ] (the graphical characters plus space).
16: [:cntrl:] matches control characters such as \n or \r, i.e. [\x00-\x1F\x7F].
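A few of the classes above, sketched on invented ASCII inputs. Note the double brackets: the outer pair is the bracket expression and the inner pair belongs to the class name.

```r
noDigits <- gsub("[[:digit:]]", "", "a1b2c3")         # "abc"
noPunct  <- gsub("[[:punct:]]", "", "Hello, world!")  # "Hello world"

# Collapse any run of white space (space, tab, newline) into one space
oneSpace <- gsub("[[:space:]]+", " ", "a \t\n b")     # "a b"

# Whole-string checks using anchors plus a class
allLetters <- grepl("^[[:alpha:]]+$", c("abc", "ab1"))  # TRUE FALSE
allHex     <- grepl("^[[:xdigit:]]+$", c("1aF", "1g"))  # TRUE FALSE

print(c(noDigits, noPunct, oneSpace))
```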
R function summary:
1: Identify a match to a pattern: grep(..., value = FALSE), grepl(), stringr::str_detect().
2: Extract a match to a pattern: grep(..., value = TRUE), stringr::str_extract(), stringr::str_extract_all().
3: Locate a pattern within a string, i.e. give the start position of matched patterns: regexpr(), gregexpr(), stringr::str_locate(), stringr::str_locate_all().
4: Replace a pattern: sub(), gsub(), stringr::str_replace(), stringr::str_replace_all().
5: Split a string using a pattern: strsplit(), stringr::str_split().
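The base R functions in this summary can be combined as in the sketch below. The input string is made up, and regmatches() is a base R helper that is not listed in the notes; it is used here to pull out the text that regexpr() and gregexpr() locate.

```r
x <- "phone: 555-1234 and 555-9876"
pattern <- "[0-9]{3}-[0-9]{4}"

# Locate: regexpr() gives the start of the first match, gregexpr() of all matches
firstPos <- regexpr(pattern, x)
startAt  <- as.integer(firstPos)              # 8
matchLen <- attr(firstPos, "match.length")    # 8

# Extract: regmatches() returns the matched text itself
firstMatch <- regmatches(x, firstPos)                  # "555-1234"
allMatches <- regmatches(x, gregexpr(pattern, x))[[1]] # both numbers

# Split on either a comma or a semicolon
pieces <- strsplit("a,b;c", "[,;]")[[1]]               # "a" "b" "c"

print(allMatches)
```

Note that grep(..., value = TRUE) returns the whole matching element, while regmatches() returns only the part of the string that matched.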
Working with dates
Step 1: Starting simple. date() returns a character string that gives the current date and time.
d1 <- date()
d1
class(d1)
Step 2: Date class.
d2 <- Sys.Date()
d2
class(d2)
Step 3: Formatting dates.
%d = day of the month as a number (01-31).
%a = abbreviated weekday.
%A = unabbreviated weekday.
%m = month as a number (01-12).
%b = abbreviated month.
%B = unabbreviated month.
%y = 2 digit year.
%Y = 4 digit year.
format(d2, "%a %b %d")
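Because Sys.Date() changes every day, the sketch below formats a fixed date instead. The date itself is arbitrary. Only the numeric codes have a predictable result; the weekday and month names depend on the session locale, so no output is shown for them.

```r
d <- as.Date("2014-01-12")

isoStyle   <- format(d, "%Y-%m-%d")  # "2014-01-12"
shortStyle <- format(d, "%d/%m/%y")  # "12/01/14"

# Names of weekdays and months follow the current LC_TIME locale
longStyle <- format(d, "%A %d %B %Y")

print(c(isoStyle, shortStyle, longStyle))
```

format() always returns a character string, so the result can no longer be used in date arithmetic.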
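Step 4: Creating dates.
# If as.Date() returns NA below, the month abbreviations are probably being read
# in a non-English locale; saving LC_TIME and switching to "C" avoids that
lct <- Sys.getlocale("LC_TIME")
Sys.setlocale("LC_TIME", "C")
x <- c("1jan1960", "2jan1960", "31mar1960", "30Jul1960")
z <- as.Date(x, "%d%b%Y")
z
z[1] - z[2]              # a time difference in days
as.numeric(z[1] - z[2])  # the same difference as a plain number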
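A locale-independent variant of the same idea, using numeric months so that no locale switch is needed. The input strings are made up for illustration.

```r
# Day/month/year strings with numeric months
x <- c("01/01/1960", "02/01/1960", "31/03/1960")
z <- as.Date(x, format = "%d/%m/%Y")

oneDay <- z[2] - z[1]               # a difftime object
dayGap <- as.numeric(z[3] - z[1])   # 90 (1960 is a leap year)

# A string that is not a valid calendar date parses to NA
bad <- as.Date("31/02/1960", format = "%d/%m/%Y")

print(z)
print(dayGap)
```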
Step 5: Converting to Julian.
weekdays(d2)
months(d2)
julian(d2)  # number of days since the origin, 1970-01-01
Step 6: lubridate package.
library(lubridate)
ymd("20140108")     # year, month, day
mdy("08/04/2013")   # month, day, year
dmy("03-04-2013")   # day, month, year
Step 7: Dealing with time.
ymd_hms("2011-08-03 10:15:03")
ymd_hms("2011-08-03 10:15:03", tz = "Pacific/Auckland")
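The effect of the tz argument can be sketched in base R with as.POSIXct(), which is my substitution for ymd_hms() so that the example has no package dependency. It assumes the system time zone database knows the name "Pacific/Auckland".

```r
# The same clock reading interpreted in two different time zones
clockUtc <- as.POSIXct("2011-08-03 10:15:03", tz = "UTC")
clockAkl <- as.POSIXct("2011-08-03 10:15:03", tz = "Pacific/Auckland")

# They are different instants: Auckland is 12 hours ahead of UTC in August
gapHours <- as.numeric(difftime(clockUtc, clockAkl, units = "hours"))

# Converting for display keeps the instant and changes only the clock reading
aklClock <- format(clockUtc, "%H:%M:%S", tz = "Pacific/Auckland")

print(gapHours)
print(aklClock)
```

In other words, tz tells the parser where the clock reading was taken; it does not shift an existing time.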
Step 8: Some functions have slightly different syntax.
x <- dmy(c("1jan2013", "2jan2013", "31mar2013", "30Jul2013"))
wday(x[1])                # day of the week as a number
wday(x[1], label = TRUE)  # day of the week as a label
ymd("1989 May 17")
mdy("March 12 1975")
dmy(25081985)
ymd("1920/1/2")
ymd_hms(now())
hms("03:22:14")
Step 9: Dealing with a vector of dates.
dt2 <- c("2014-05-14", "2014-09-22", "2014-07-11")
ymd(dt2)
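Once a character vector has been parsed, the usual vector operations work on it. The sketch below uses base as.Date() rather than lubridate, which is my substitution to keep it dependency-free; ISO formatted strings such as these need no format argument.

```r
dt2 <- c("2014-05-14", "2014-09-22", "2014-07-11")
dates <- as.Date(dt2)

sorted   <- sort(dates)               # chronological order
gaps     <- as.numeric(diff(sorted))  # days between consecutive dates: 58 73
span     <- range(dates)              # earliest and latest date
spanDays <- as.numeric(span[2] - span[1])  # 131

print(sorted)
print(gaps)
```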