Data Transformation RStudio


Data Transformation with dplyr :: CHEAT SHEET

dplyr functions work with pipes and expect tidy data. In tidy data:
  - Each variable is in its own column.
  - Each observation, or case, is in its own row.
Pipes: x %>% f(y) becomes f(x, y).

Summarise Cases

These apply summary functions to columns to create a new table of summary statistics. Summary functions take vectors as input and return one value (see back).

summarise(.data, …) Compute table of summaries.
    summarise(mtcars, avg = mean(mpg))

count(x, ..., wt = NULL, sort = FALSE) Count number of rows in each group defined by the variables in … Also tally().
    count(iris, Species)

Manipulate Cases

EXTRACT CASES

Row functions return a subset of rows as a new table.

filter(.data, …) Extract rows that meet logical criteria.
    filter(iris, Sepal.Length > 7)

distinct(.data, ..., .keep_all = FALSE) Remove rows with duplicate values.
    distinct(iris, Species)

sample_frac(tbl, size = 1, replace = FALSE, weight = NULL, .env = parent.frame()) Randomly select fraction of rows.
    sample_frac(iris, 0.5, replace = TRUE)

sample_n(tbl, size, replace = FALSE, weight = NULL, .env = parent.frame()) Randomly select size rows.
    sample_n(iris, 10, replace = TRUE)

slice(.data, …) Select rows by position.
    slice(iris, 10:15)

top_n(x, n, wt) Select and order top n entries (by group if grouped data).
    top_n(iris, 5, Sepal.Width)

Manipulate Variables

EXTRACT VARIABLES

Column functions return a set of columns as a new vector or table.

pull(.data, var = -1) Extract column values as a vector. Choose by name or index.
    pull(iris, Sepal.Length)

select(.data, …) Extract columns as a table. Also select_if().
    select(iris, Sepal.Length, Species)

Use these helpers with select(), e.g. select(iris, starts_with("Sepal")):
    contains(match)      num_range(prefix, range)    :, e.g. mpg:cyl
    ends_with(match)     one_of(…)                   -, e.g. -Species
    matches(match)       starts_with(match)

MAKE NEW VARIABLES

These apply vectorized functions to columns. Vectorized funs take vectors as input and return vectors of the same length as output (see back).
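The difference between a vectorized function and a summary function can be checked directly. The sketch below is only an illustration, assuming dplyr is installed; it reuses the sheet's own mutate() and summarise() examples on the built-in mtcars data, and the variable names (gpm, avg, with_gpm, one_row) are chosen here for illustration.

```r
library(dplyr)

# A vectorized function returns one output value per input value.
gpm <- 1 / mtcars$mpg
length(gpm)        # same length as the input column

# A summary function collapses a whole vector to a single value.
avg <- mean(mtcars$mpg)
length(avg)        # 1

# mutate() therefore keeps every row, while summarise() returns one row.
with_gpm <- mutate(mtcars, gpm = 1 / mpg)
one_row  <- summarise(mtcars, avg = mean(mpg))
nrow(with_gpm)     # 32, the number of rows in mtcars
nrow(one_row)      # 1
```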
mutate(.data, …) Compute new column(s).
    mutate(mtcars, gpm = 1/mpg)

transmute(.data, …) Compute new column(s), drop others.
    transmute(mtcars, gpm = 1/mpg)

mutate_all(.tbl, .funs, …) Apply funs to every column. Use with funs(). Also mutate_if().
    mutate_all(faithful, funs(log(.), log2(.)))
    mutate_if(iris, is.numeric, funs(log(.)))

mutate_at(.tbl, .cols, .funs, …) Apply funs to specific columns. Use with funs(), vars() and the helper functions for select().
    mutate_at(iris, vars(-Species), funs(log(.)))

VARIATIONS of summarise()
    summarise_all() - Apply funs to every column.
    summarise_at() - Apply funs to specific columns.
    summarise_if() - Apply funs to all cols of one type.

Logical and boolean operators to use with filter()
    <     <=    is.na()     %in%    |    xor()
    >     >=    !is.na()    !       &
See ?base::Logic and ?Comparison for help.

ARRANGE CASES

arrange(.data, …) Order rows by values of a column or columns (low to high), use with desc() to order from high to low.
    arrange(mtcars, mpg)
    arrange(mtcars, desc(mpg))

Group Cases

Use group_by() to create a "grouped" copy of a table. dplyr functions will manipulate each "group" separately and then combine the results.
    mtcars %>%
      group_by(cyl) %>%
      summarise(avg = mean(mpg))
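As a small sketch of how pipes, grouping, summarising and arranging fit together (assuming dplyr is installed), the example below extends the sheet's group_by() pipeline on mtcars. The n() count and the variable names piped, direct and by_cyl are additions made here for illustration.

```r
library(dplyr)

# x %>% f(y) is the same call as f(x, y).
piped  <- mtcars %>% filter(mpg > 30)
direct <- filter(mtcars, mpg > 30)
identical(piped, direct)   # TRUE

# Each cyl group is summarised separately, then the results are combined
# into one table with one row per group, ordered from high to low average.
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(avg = mean(mpg), n = n()) %>%
  arrange(desc(avg))

by_cyl$cyl   # 4 6 8
by_cyl$n     # 11 7 14
```

Writing the steps as a pipe keeps them in reading order; the nested form arrange(summarise(group_by(mtcars, cyl), ...), desc(avg)) would give the same table.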
group_by(.data, ..., add = FALSE) Returns copy of table grouped by …
    g_iris <- group_by(iris, Species)

ungroup(x, …) Returns ungrouped copy of table.
    ungroup(g_iris)

ADD CASES

add_row(.data, ..., .before = NULL, .after = NULL) Add one or more rows to a table.
    add_row(faithful, eruptions = 1, waiting = 1)

MAKE NEW VARIABLES (continued)

add_column(.data, ..., .before = NULL, .after = NULL) Add new column(s). Also add_count(), add_tally().
    add_column(mtcars, new = 1:32)

rename(.data, …) Rename columns.
    rename(iris, Length = Sepal.Length)
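The add_row(), add_column() and rename() examples above can be checked with a short script. This is a sketch that assumes both dplyr and tibble are installed (add_row() and add_column() come from tibble); the result names f2, m2 and i2 are hypothetical.

```r
library(dplyr)
library(tibble)

# add_row() appends a case; values are matched to columns by name.
f2 <- add_row(faithful, eruptions = 1, waiting = 1)
nrow(f2) == nrow(faithful) + 1   # TRUE

# add_column() appends a variable; its length must match the row count.
m2 <- add_column(mtcars, new = 1:32)
ncol(m2) == ncol(mtcars) + 1     # TRUE

# rename() takes new_name = old_name.
i2 <- rename(iris, Length = Sepal.Length)
names(i2)[1]                     # "Length"
```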
RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • 844-448-1212 • rstudio.com • Learn more with browseVignettes(package = c("dplyr", "tibble")) • dplyr 0.7.0 • tibble 1.2.0 • Updated: 2019-08
Vector Functions

TO USE WITH MUTATE()

mutate() and transmute() apply vectorized functions to columns to create new columns. Vectorized functions take vectors as input and return vectors of the same length as output.

OFFSETS
    dplyr::lag() - Offset elements by 1
    dplyr::lead() - Offset elements by -1

CUMULATIVE AGGREGATES
    dplyr::cumall() - Cumulative all()
    dplyr::cumany() - Cumulative any()
    cummax() - Cumulative max()
    dplyr::cummean() - Cumulative mean()
    cummin() - Cumulative min()
    cumprod() - Cumulative prod()
    cumsum() - Cumulative sum()

RANKINGS
    dplyr::cume_dist() - Proportion of all values <=
    dplyr::dense_rank() - rank w ties = min, no gaps
    dplyr::min_rank() - rank with ties = min
    dplyr::ntile() - bins into n bins
    dplyr::percent_rank() - min_rank scaled to [0,1]
    dplyr::row_number() - rank with ties = "first"

MATH
    +, -, *, /, ^, %/%, %% - arithmetic ops
    log(), log2(), log10() - logs
    <, <=, >, >=, !=, == - logical comparisons
    dplyr::between() - x >= left & x <= right
    dplyr::near() - safe == for floating point numbers

MISC
    dplyr::case_when() - multi-case if_else()
        iris %>% mutate(Species = case_when(
          Species == "versicolor" ~ "versi",
          Species == "virginica" ~ "virgi",
          TRUE ~ as.character(Species)))
    dplyr::coalesce() - first non-NA values by element across a set of vectors
    dplyr::if_else() - element-wise if() + else()
    dplyr::na_if() - replace specific values with NA
    pmax() - element-wise max()
    pmin() - element-wise min()
    dplyr::recode() - Vectorized switch()
    dplyr::recode_factor() - Vectorized switch() for factors

Summary Functions

TO USE WITH SUMMARISE()

summarise() applies summary functions to columns to create a new table. Summary functions take vectors as input and return single values as output.

COUNTS
    dplyr::n() - number of values/rows
    dplyr::n_distinct() - # of uniques
    sum(!is.na()) - # of non-NA's

LOCATION
    mean() - mean, also mean(!is.na())
    median() - median

LOGICALS
    mean() - Proportion of TRUE's
    sum() - # of TRUE's

POSITION/ORDER
    dplyr::first() - first value
    dplyr::last() - last value
    dplyr::nth() - value in nth location of vector

RANK
    quantile() - nth quantile
    min() - minimum value
    max() - maximum value

SPREAD
    IQR() - Inter-Quartile Range
    mad() - median absolute deviation
    sd() - standard deviation
    var() - variance

Row Names

Tidy data does not use rownames, which store a variable outside of the columns. To work with the rownames, first move them into a column.

rownames_to_column() Move row names into col.
    a <- rownames_to_column(iris, var = "C")

column_to_rownames() Move col into row names.
    column_to_rownames(a, var = "C")

Also has_rownames(), remove_rownames()

Combine Tables

The join examples use two tables, x and y:
    x: A B C        y: A B D
       a t 1           a t 3
       b u 2           b u 2
       c v 3           d w 1

COMBINE VARIABLES

Use bind_cols() to paste tables beside each other as they are.

bind_cols(…) Returns tables placed side by side as a single table. BE SURE THAT ROWS ALIGN.

Use a "Mutating Join" to join one table to columns from another, matching values with the rows that they correspond to. Each join retains a different combination of values from the tables.

left_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …) Join matching values from y to x.

right_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …) Join matching values from x to y.

inner_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …) Join data. Retain only rows with matches.

full_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …) Join data. Retain all values, all rows.

Use by = c("col1", "col2", …) to specify one or more common columns to match on.
    left_join(x, y, by = "A")

Use a named vector, by = c("col1" = "col2"), to match on columns that have different names in each table.
    left_join(x, y, by = c("C" = "D"))

Use suffix to specify the suffix to give to unmatched columns that have the same name in both tables.
    left_join(x, y, by = c("C" = "D"), suffix = c("1", "2"))

COMBINE CASES

Use bind_rows() to paste tables below each other as they are.

bind_rows(…, .id = NULL) Returns tables one on top of the other as a single table. Set .id to a column name to add a column of the original table names.

intersect(x, y, …) Rows that appear in both x and y.

setdiff(x, y, …) Rows that appear in x but not y.

union(x, y, …) Rows that appear in x or y. (Duplicates removed). union_all() retains duplicates.

Use setequal() to test whether two data sets contain the exact same rows (in any order).

EXTRACT ROWS

Use a "Filtering Join" to filter one table against the rows of another.

semi_join(x, y, by = NULL, …) Return rows of x that have a match in y. USEFUL TO SEE WHAT WILL BE JOINED.

anti_join(x, y, by = NULL, …) Return rows of x that do not have a match in y. USEFUL TO SEE WHAT WILL NOT BE JOINED.
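The mutating and filtering joins can be tried on the small x and y tables from the sheet's figures (x has columns A, B, C; y has columns A, B, D). The following is a sketch, assuming dplyr and tibble are installed. Building the tables with tibble() and joining on both shared columns with by = c("A", "B") are choices made here for illustration, as are the result names.

```r
library(dplyr)
library(tibble)

# The two example tables from the figures.
x <- tibble(A = c("a", "b", "c"), B = c("t", "u", "v"), C = 1:3)
y <- tibble(A = c("a", "b", "d"), B = c("t", "u", "w"), D = 3:1)

# Mutating joins add columns from y to x.
left  <- left_join(x, y, by = c("A", "B"))   # keeps all 3 rows of x; D is NA for "c"
inner <- inner_join(x, y, by = c("A", "B"))  # keeps only the matches "a" and "b"
full  <- full_join(x, y, by = c("A", "B"))   # keeps "a", "b", "c" and "d"

# Filtering joins only keep or drop rows of x; no columns are added.
kept    <- semi_join(x, y, by = c("A", "B")) # rows "a" and "b"
dropped <- anti_join(x, y, by = c("A", "B")) # row "c"
```

Running anti_join() before a join is a quick way to see which rows of x would end up with NA values in the columns taken from y.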

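A few of the vector functions listed for mutate() can be compared on one short vector. This is an illustrative sketch assuming dplyr is installed; the vector v, the cut-offs in case_when(), and the result names are made up for the example.

```r
library(dplyr)

v <- c(10, 20, 20, 40)

shifted_back <- lag(v)          # NA 10 20 20
shifted_fwd  <- lead(v)         # 20 20 40 NA
running      <- cumsum(v)       # 10 30 50 90
ranks        <- min_rank(v)     # 1 2 2 4  (ties share the lowest rank)
dense        <- dense_rank(v)   # 1 2 2 3  (no gaps after ties)
in_range     <- between(v, 15, 40)   # FALSE TRUE TRUE TRUE
same         <- near(sqrt(2)^2, 2)   # TRUE, where == would be FALSE

# case_when() checks the conditions in order and uses the first match.
label <- case_when(
  v < 15 ~ "low",
  v < 30 ~ "mid",
  TRUE   ~ "high"
)
label   # "low" "mid" "mid" "high"
```

Every result except same has the same length as v, which is what makes these functions usable inside mutate().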