
Functions for data summarization in R?

Zigya Academy

Many people are unsure how to summarize data easily in R, because there are so many options. So which one is the right one? My answer is below: first pick one function and become fluent with it, then move on to the next.

In this article, I will discuss the primary functions for summarizing data sets. Hopefully this makes the journey smoother than it first seems.

Methods for summarizing data in R


The apply() function returns a vector, array, or list of values obtained by applying a function to the rows or columns of an array or matrix. It is the simplest of the functions covered here, but it is limited to collapsing along rows or columns.


> apply(X, MARGIN, FUN, …)


X: an array, including a matrix.
MARGIN: a vector giving the subscripts which the function will be applied over. E.g., for a matrix, 1 indicates rows, 2 indicates columns, and c(1, 2) indicates rows and columns.
FUN: the function to be applied. In the case of functions like +, %*%, etc., the function name must be backquoted or quoted.


# Create a matrix
> mat <- matrix(c(1:20), nrow = 5, ncol=4)
> mat
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20

# 2 indicates columns
> apply(mat, 2, mean)
[1]  3  8 13 18

# 1 indicates rows
> apply(mat, 1, mean)
[1]  8.5  9.5 10.5 11.5 12.5
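
The FUN argument does not have to be a built-in such as mean. As a minimal sketch, reusing the same matrix as above (the variable names col_spread, row_median, and doubled are only illustrative), apply() can also take an anonymous function, pass extra arguments on to FUN, or work cell by cell with MARGIN = c(1, 2):

```r
# Same 5 x 4 matrix as in the example above
mat <- matrix(1:20, nrow = 5, ncol = 4)

# Anonymous function: spread (max - min) of each column
col_spread <- apply(mat, 2, function(col) max(col) - min(col))

# Arguments given after FUN are passed on to it: probs goes to quantile()
row_median <- apply(mat, 1, quantile, probs = 0.5)

# MARGIN = c(1, 2) applies FUN to every cell and keeps the matrix shape
doubled <- apply(mat, c(1, 2), function(v) v * 2)

col_spread
row_median
doubled
```

Because every column of this matrix holds five consecutive integers, each column spread comes out as 4.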


The lapply() function applies a function to each element of a list, vector, or data frame and always returns a list. The result has the same length as the input, and each element is the result of applying FUN to the corresponding element of the input.


> lapply(X, FUN, …)


X: a vector or an object (such as a list or data frame).
FUN: the function applied to each element of X.

The l in lapply() stands for list. The difference between lapply() and apply() lies in what they return: the output of lapply() is always a list. lapply() also works on objects other than arrays, such as data frames and lists.

Unlike apply(), lapply() does not need a MARGIN argument.

A very easy example is converting the elements of a character vector to lower case with the tolower() function. We build a character vector of abbreviated month names, each starting with a capital letter.


> month <- month.abb
> month
 [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"

> lower_month <- lapply(month,tolower)
> str(lower_month)
List of 12
 $ : chr "jan"
 $ : chr "feb"
 $ : chr "mar"
 $ : chr "apr"
 $ : chr "may"
 $ : chr "jun"
 $ : chr "jul"
 $ : chr "aug"
 $ : chr "sep"
 $ : chr "oct"
 $ : chr "nov"
 $ : chr "dec"
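
lapply() is just as handy when the input is already a list whose elements differ in length. The sketch below uses made-up exam scores (the scores list and its values are hypothetical, purely for illustration) and shows that the element names carry over to the result:

```r
# Hypothetical exam scores; names and values are made up for illustration
scores <- list(math = c(80, 90, 100), physics = c(60, 70), chemistry = 75)

# One result per element, returned as a list with the same names
avg_scores <- lapply(scores, mean)

# Anonymous functions work too: count the scores above 70 in each subject
n_high <- lapply(scores, function(s) sum(s > 70))

str(avg_scores)
str(n_high)
```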


The sapply() function takes a list, vector, or data frame as input and returns a vector or matrix. It does the same job as lapply(), but simplifies the result to a vector or matrix where possible instead of returning a list.


> sapply(X, FUN)


X: a vector or an object.
FUN: the function applied to each element of X.

We can compute the minimum speed and the minimum stopping distance in the built-in cars dataset with both lapply() and sapply(), and compare the output.


# Load the built-in cars dataset
> dt <- cars
> lmn_cars <- lapply(dt, min)
> smn_cars <- sapply(dt, min)

> lmn_cars
$speed
[1] 4

$dist
[1] 2

> smn_cars
speed  dist
    4     2
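
When FUN returns more than one value per element, sapply() simplifies to a matrix rather than a vector. A small sketch on the same cars dataset (the names rng_cars and rng_list are only illustrative):

```r
dt <- cars

# range() returns two values (min and max) per column,
# so sapply() simplifies the result to a 2 x 2 matrix
rng_cars <- sapply(dt, range)

# simplify = FALSE switches the simplification off and returns a list,
# just as lapply() would
rng_list <- sapply(dt, range, simplify = FALSE)

rng_cars
str(rng_list)
```

Each column of rng_cars holds the minimum and maximum of the corresponding column of cars.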

We can summarize the difference between apply(), sapply() and lapply() in the following table:

Function  Usage                  Purpose                                            Input                        Output
apply     apply(X, MARGIN, FUN)  Apply a function to the rows, columns, or both     Data frame or matrix         vector, list, or array
lapply    lapply(X, FUN)         Apply a function to all the elements of the input  List, vector, or data frame  list
sapply    sapply(X, FUN)         Apply a function to all the elements of the input  List, vector, or data frame  vector or matrix


None of the functions discussed so far can do what SQL achieves with grouped summaries. tapply() completes the palette for R: it applies a function to subsets of X defined by INDEX, where X is typically a vector and INDEX is a list of one or more factors, each of the same length as X. The example below makes the usage clear.


> tapply(X, INDEX, FUN = NULL, …, default = NA, simplify = TRUE)


X: an R object for which a split method exists. Typically vector-like, allowing subsetting with [.
INDEX: a list of one or more factors, each of the same length as X. The elements are coerced to factors by as.factor.
FUN: the function applied to each group of values of X.


> df <- iris
> tp <- tapply(df$Petal.Length, df$Species, mean)
> tp
    setosa versicolor  virginica
     1.462      4.260      5.552
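
INDEX can also be a list of several factors, which gives one cell per combination of groups, much like grouping by two columns. The following is a minimal sketch on a tiny made-up data frame (sales, region, quarter, and amount are hypothetical names and values chosen so the sums are easy to check by hand):

```r
# Hypothetical sales data, made up for illustration
sales <- data.frame(
  region  = c("N", "N", "S", "S", "N", "S"),
  quarter = c("Q1", "Q2", "Q1", "Q2", "Q1", "Q1"),
  amount  = c(10, 20, 30, 40, 50, 60)
)

# Two grouping factors: rows of the result are regions, columns are quarters
totals <- tapply(sales$amount, list(sales$region, sales$quarter), sum)

# N/Q1 = 10 + 50 = 60, N/Q2 = 20, S/Q1 = 30 + 60 = 90, S/Q2 = 40
totals
```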


Next comes a slightly more involved function. by() is an object-oriented wrapper for tapply() applied to data frames: it splits a data frame by one or more factors and applies a function to each subset. The example below should make this clearer.


> by(data, INDICES, FUN, …, simplify = TRUE)


data: an R object, normally a data frame, possibly a matrix.
INDICES: a factor or a list of factors, each of length nrow(data).
FUN: a function to be applied to (usually data-frame) subsets of data.
simplify: a logical value; see the simplify argument of tapply().


> df <- iris
> mean_col <- by(df[,1:4], df$Species, colMeans)
> mean_col
df$Species: setosa
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
       5.006        3.428        1.462        0.246
df$Species: versicolor
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
       5.936        2.770        4.260        1.326
df$Species: virginica
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
       6.588        2.974        5.552        2.026
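
Since FUN receives each subset as a data frame, it can use several columns at once, which tapply() cannot do. The sketch below is one possible illustration (cor_by_species and mean_mat are names introduced here, and stacking the result with do.call(rbind, ...) is a common idiom rather than something by() requires):

```r
df <- iris

# FUN gets the rows of one species as a data frame,
# so it can combine two columns, here in a correlation
cor_by_species <- by(df, df$Species,
                     function(d) cor(d$Sepal.Length, d$Petal.Length))

# The result of by() behaves like a list, so the per-species vectors of
# column means can be stacked into a matrix with one row per species
mean_col <- by(df[, 1:4], df$Species, colMeans)
mean_mat <- do.call(rbind, mean_col)

cor_by_species
mean_mat
```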


We have now seen the functions that help with summarizing data in R: apply(), lapply(), sapply(), tapply(), and by(), each with its definition, usage, and an example.

This brings us to the end of this blog. We really appreciate your time.

Hope you liked it.

Do visit our page for more informative blogs on Data Science

Keep Reading! Cheers!

Zigya Academy
