Functions for data summarization in R?

People are often unsure how to summarize real data easily in R. There are many options, so which one is the right one? I address that question below. Start by picking one function and getting comfortable with it. After that, moving on to the next one becomes much easier.

In this article, I will discuss the primary functions for summarizing data sets. Hopefully this makes the journey smoother than it first appears.

Methods for summarizing data in R

apply()

apply() returns a vector, array, or list of values obtained by applying a function to the rows or columns of an array or matrix. It is the simplest function in this family. However, it is limited to collapsing data along rows or columns.

Usage

> apply(X, MARGIN, FUN, ...)

Arguments

Argument	Description
`X`	an array, including a matrix.
`MARGIN`	a vector giving the subscripts which the function will be applied over. E.g., for a matrix `1` indicates rows, `2` indicates columns, `c(1, 2)` indicates rows and columns.
`FUN`	the function to be applied. In the case of functions like `+`, `%*%`, etc., the function name must be backquoted or quoted.

Example

# Create a matrix
> mat <- matrix(c(1:20), nrow = 5, ncol=4)
> mat
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20

# 2 indicates columns
> apply(mat, 2, mean)
[1]  3  8 13 18

# 1 indicates rows
> apply(mat, 1, mean)
[1]  8.5  9.5 10.5 11.5 12.5
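
FUN is not limited to built-ins such as mean. As a minimal sketch, assuming the same 5 x 4 matrix as above, the following passes an anonymous function, forwards an extra argument to FUN, and uses MARGIN = c(1, 2). The names col_range, row_trimmed and squared are made up for this illustration.

```r
# Same matrix as in the example above
mat <- matrix(1:20, nrow = 5, ncol = 4)

# Anonymous function: spread (max - min) of each column
col_range <- apply(mat, 2, function(v) max(v) - min(v))

# Arguments given after FUN are passed on to it; here trim goes to mean()
row_trimmed <- apply(mat, 1, mean, trim = 0.25)

# MARGIN = c(1, 2) applies FUN to every individual cell
squared <- apply(mat, c(1, 2), function(v) v^2)

print(col_range)
print(row_trimmed)
print(dim(squared))
```

Applying a function cell by cell is rarely needed in practice, because most arithmetic in R is already vectorized (mat^2 gives the same result here). It is shown only to make the meaning of MARGIN concrete.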

lapply()

lapply() applies a function to each element of a list, vector, or data frame and always returns a list of the same length as the input. Each element of the result is the value of FUN applied to the corresponding element of the input.

Usage

> lapply(X, FUN, ...)

Arguments

Argument	Description
`X`	a vector (atomic or list); other objects are coerced with `as.list`.
`FUN`	the function applied to each element of `X`.

The l in lapply() stands for list. The main difference between lapply() and apply() is the return value: the output of lapply() is always a list. lapply() also works on objects other than arrays, such as data frames and lists. Because it iterates over elements rather than over rows or columns, lapply() does not need a MARGIN argument.

A simple example is converting the strings in a character vector to lower case with the tolower function. We use the built-in month.abb vector, which holds the abbreviated month names, each starting with a capital letter.

Example

> month <- month.abb
> month
 [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"

> lower_month <- lapply(month,tolower)
> str(lower_month)
List of 12
 $ : chr "jan"
 $ : chr "feb"
 $ : chr "mar"
 $ : chr "apr"
 $ : chr "may"
 $ : chr "jun"
 $ : chr "jul"
 $ : chr "aug"
 $ : chr "sep"
 $ : chr "oct"
 $ : chr "nov"
 $ : chr "dec"
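
The same pattern works on a list whose elements have different lengths, which is where a list result is most useful. The sketch below assumes a small made-up list called scores; neither the name nor the values come from any dataset.

```r
# A small named list; the values are made up for illustration
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

# One result per element; the element names are preserved
means <- lapply(scores, mean)

# Individual results are extracted with $ or [[ ]]
print(means$b)

# unlist() flattens the list into a named vector when that is more convenient
print(unlist(means))
```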

sapply()

sapply() takes a list, vector, or data frame as input and simplifies the output to a vector or matrix where possible. It does the same job as lapply(), but instead of always returning a list it returns a vector, or a matrix when every call to FUN returns a result of the same length greater than one. If the results cannot be simplified, it falls back to a list.

Usage

> sapply(X, FUN)

Arguments

Argument	Description
`X`	a vector (atomic or list); other objects are coerced with `as.list`.
`FUN`	the function applied to each element of `X`.

We can compute the minimum speed and the minimum stopping distance from the built-in cars dataset, using both lapply() and sapply() to compare their output.

Example

# Load the built-in cars dataset
> dt <- cars
> lmn_cars <- lapply(dt, min)
> smn_cars <- sapply(dt, min)

> lmn_cars
$speed
[1] 4

$dist
[1] 2

> smn_cars
speed  dist
    4     2
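
When FUN returns more than one value per element, sapply() builds a matrix instead of a vector. A short sketch on the same cars data, using range() as the function (the variable names rng and as_list are only illustrative):

```r
dt <- cars

# range() returns c(min, max), so the simplified result is a 2 x 2 matrix
# with one column per input column
rng <- sapply(dt, range)
print(rng)
print(is.matrix(rng))

# simplify = FALSE switches the simplification off and keeps the list
as_list <- sapply(dt, range, simplify = FALSE)
print(class(as_list))
```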

We can summarize the difference between apply(), sapply() and lapply() in the following table:

Function	Arguments	Objective	Input	Output
apply	apply(X, MARGIN, FUN)	Apply a function to the rows or columns or both	Data frame or matrix	vector, list, array
lapply	lapply(X, FUN)	Apply a function to all the elements of the input	List, vector or data frame	list
sapply	sapply(X, FUN)	Apply a function to all the elements of the input	List, vector or data frame	vector or matrix
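
To see the output column of the table in practice, the three functions can be run on the same input. This is a minimal sketch using the cars data frame; the single-letter variable names are placeholders.

```r
dt <- cars

a <- apply(dt, 2, max)   # the data frame is coerced to a matrix first
l <- lapply(dt, max)     # always a list
s <- sapply(dt, max)     # simplified to a named vector

print(class(a))
print(class(l))
print(class(s))
```

For a data frame with only numeric columns, apply() over columns and sapply() give the same numbers. With mixed column types apply() would coerce everything to one type, so lapply() or sapply() is usually the safer choice for column-wise work on data frames.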

tapply()

None of the functions discussed so far can do what SQL does with grouped aggregation, that is, summarize values group by group. tapply() fills that gap. It applies FUN to the subsets of X defined by the levels of INDEX, where X is “an atomic object, typically a vector” and INDEX is “a list of factors, each of same length as X”. The example below makes the usage clear.

Usage

> tapply(X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE)

Arguments

Argument	Description
`X`	an R object for which a `split` method exists. Typically vector-like, allowing subsetting with `[`.
`INDEX`	a `list` of one or more `factor`s, each of same length as `X`. The elements are coerced to factors by `as.factor`.
`FUN`	the function applied to each group (subset) of `X`.

Example

> df <- iris
> tp <- tapply(df$Petal.Length, df$Species, mean)
> tp
    setosa versicolor  virginica
     1.462      4.260      5.552
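
INDEX can hold more than one factor, in which case the result is a table with one dimension per factor. The sketch below uses the built-in mtcars dataset, which is not part of the example above and is brought in here only because it has two convenient grouping columns (cyl, the number of cylinders, and am, the transmission type).

```r
# Mean miles per gallon for every combination of cylinder count and transmission
mpg_by <- tapply(mtcars$mpg,
                 list(cyl = mtcars$cyl, am = mtcars$am),
                 mean)

print(mpg_by)
print(dim(mpg_by))   # one row per cyl level, one column per am level
```

A combination of levels that never occurs in the data would be filled with the value of the default argument, which is NA unless changed.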

by()

Now comes a slightly more involved function. by() is an object-oriented wrapper for tapply() applied to data frames: it splits a data frame by the levels of one or more factors and applies FUN to each subset. Hopefully the example will make this clearer.

Usage

> by(data, INDICES, FUN, ..., simplify = TRUE)

Arguments

Argument	Description
`data`	an R object, normally a data frame, possibly a matrix.
`INDICES`	a factor or a list of factors, each of length `nrow(data)`.
`FUN`	a function to be applied to (usually data-frame) subsets of `data`.
`simplify`	logical; passed on to `tapply` (see its `simplify` argument).

Example

> df <- iris
> mean_col <- by(df[,1:4], df$Species, colMeans)
> mean_col
df$Species: setosa
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
       5.006        3.428        1.462        0.246
------------------------------------------------------------
df$Species: versicolor
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
       5.936        2.770        4.260        1.326
------------------------------------------------------------
df$Species: virginica
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
       6.588        2.974        5.552        2.026
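
Because FUN receives a whole data-frame subset rather than a single vector, by() can compute summaries that need several columns at once, which tapply() alone cannot do. A minimal sketch, assuming we want the correlation between sepal length and petal length within each species (cor_by is an illustrative name):

```r
df <- iris

# FUN gets the rows of one species at a time as a data frame
cor_by <- by(df, df$Species,
             function(sub) cor(sub$Sepal.Length, sub$Petal.Length))

print(cor_by)

# A single group's result can be pulled out by its level name
print(cor_by[["setosa"]])
```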

Conclusion

We have looked at functions that help summarize data in R: apply(), lapply(), sapply(), tapply() and by(), each with its definition, its usage, and an example.

This brings us to the end of this blog. We really appreciate your time.

Hope you liked it.

Do visit our page www.zigya.com/blog for more informative blogs on Data Science.

Keep Reading! Cheers!

Zigya Academy
BEING RELEVANT
