People are often unsure how to summarize data easily in R, because there are so many options. So which one is the right one? I address that question below. My advice: pick one function first and become an expert in it. That's how you move on to the next.
In this article, I will discuss the primary methods of summarizing data sets. Hopefully this makes the journey smoother than it first appears.
Methods for summarizing data in R
The apply() function returns a vector, array, or list of values obtained by applying a function to the rows or columns of a matrix or array. It is the simplest of the functions covered here. However, it is limited to collapsing rows or columns.
> apply(X, MARGIN, FUN, …)
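|X||an array, including a matrix.|
|MARGIN||a vector giving the subscripts which the function will be applied over. E.g., for a matrix, 1 indicates rows, 2 indicates columns, and c(1, 2) indicates rows and columns.|
|FUN||the function to be applied. In the case of functions like +, %*%, etc., the function name must be backquoted or quoted.|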
# Create a matrix
> mat <- matrix(c(1:20), nrow = 5, ncol = 4)
> mat
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20
# 2 indicates columns
> apply(mat, 2, mean)
[1]  3  8 13 18
# 1 indicates rows
> apply(mat, 1, mean)
[1]  8.5  9.5 10.5 11.5 12.5
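FUN does not have to be a built-in function. As a minimal sketch, reusing the same 5 x 4 matrix, the snippet below passes anonymous functions to apply() and also tries MARGIN = c(1, 2). The variable names (col_spread, row_totals, doubled) are illustrative only.

```r
# Same 5 x 4 matrix as in the example above
mat <- matrix(1:20, nrow = 5, ncol = 4)

# Column-wise spread (max - min) using an anonymous function
col_spread <- apply(mat, 2, function(col) max(col) - min(col))

# Row totals; gives the same numbers as the built-in rowSums(mat)
row_totals <- apply(mat, 1, sum)

# MARGIN = c(1, 2) visits every cell, so the result keeps the matrix shape
doubled <- apply(mat, c(1, 2), function(cell) cell * 2)

print(col_spread)
print(row_totals)
print(dim(doubled))
```

For simple sums and means, rowSums(), colSums(), rowMeans(), and colMeans() are usually faster; apply() is most useful when the function is one you write yourself.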
lapply() applies a function to each element of a list, vector, or data frame and returns a list of the same length as the input. Each element of the result is the outcome of applying FUN to the corresponding element of the input.
> lapply(X, FUN, …)
|X||A vector or an object|
|FUN||Function applied to each element of X|
The l in lapply() stands for list. The difference between lapply() and apply() lies in the output: lapply() always returns a list. It can also be used on other objects, such as data frames and lists.
Because lapply() iterates over elements rather than over dimensions, it does not need a MARGIN argument.
A very simple example is converting strings to lower case with the tolower function. We start from month.abb, R's built-in vector of abbreviated month names, each of which begins with a capital letter.
> month <- month.abb
> month
[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
> lower_month <- lapply(month, tolower)
> str(lower_month)
List of 12
 $ : chr "jan"
 $ : chr "feb"
 $ : chr "mar"
 $ : chr "apr"
 $ : chr "may"
 $ : chr "jun"
 $ : chr "jul"
 $ : chr "aug"
 $ : chr "sep"
 $ : chr "oct"
 $ : chr "nov"
 $ : chr "dec"
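lapply() is most natural when the input is already a list whose elements differ in length. The sketch below uses a made-up list of scores (the data and the names scores, means, and flat are hypothetical) and shows that unlist() can flatten the result when a plain vector is wanted.

```r
# Hypothetical scores for three groups of different sizes
scores <- list(a = c(70, 80, 90), b = c(55, 65), c = 100)

# lapply() returns a list with one mean per group, keeping the names
means <- lapply(scores, mean)
str(means)

# unlist() collapses the list into a named numeric vector
flat <- unlist(means)
print(flat)
```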
sapply() takes a list, vector, or data frame as input and returns a vector or matrix.
It does the same job as lapply(), but simplifies the result to a vector or matrix where possible instead of returning a list.
> sapply(X, FUN)
|X||A vector or an object.|
|FUN||Function applied to each element of X.|
We can find the minimum speed and the minimum stopping distance in the cars dataset, once with lapply() and once with sapply(), to compare their output.
# Let's load the cars dataset
> dt <- cars
> lmn_cars <- lapply(dt, min)
> smn_cars <- sapply(dt, min)
> lmn_cars
$speed
[1] 4

$dist
[1] 2

> smn_cars
speed  dist
    4     2
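The "vector or matrix" part of the sapply() output depends on what FUN returns. As a rough sketch on the same cars dataset: when FUN returns one value per column the result is a vector, and when it returns two values (as range() does) the result becomes a matrix. The variable names here are illustrative.

```r
dt <- cars

# One value per column: simplified to a named numeric vector
col_means <- sapply(dt, mean)

# Two values per column (min and max): simplified to a 2-row matrix
col_ranges <- sapply(dt, range)

# With simplify = FALSE, sapply() skips simplification and returns a list
as_list <- sapply(dt, min, simplify = FALSE)

print(col_means)
print(col_ranges)
str(as_list)
```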
We can summarize the differences between apply(), lapply(), and sapply() in the following table:
|Function||Arguments||Objective||Input||Output|
|apply||apply(X, MARGIN, FUN)||Apply a function to the rows or columns or both||Data frame or matrix||vector, list, array|
|lapply||lapply(X, FUN)||Apply a function to all the elements of the input||List, vector or data frame||list|
|sapply||sapply(X, FUN)||Apply a function to all the elements of the input||List, vector or data frame||vector or matrix|
None of the functions discussed so far can do what SQL achieves with grouped aggregation. tapply() fills that gap for R: it applies a function to the values of X within each group defined by INDEX. Usage is “tapply(X, INDEX, FUN = NULL, …, simplify = TRUE)”, where X is “an atomic object, typically a vector” and INDEX is “a list of factors, each of same length as X”. Here is an example which will make the usage clear.
> tapply(X, INDEX, FUN = NULL, …, default = NA, simplify = TRUE)
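|X||an R object for which a split method exists, typically a vector.|
|INDEX||a list of factors, each of same length as X.|
|FUN||Function applied to each group of values in X.|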
> df <- iris
> tp <- tapply(df$Petal.Length, df$Species, mean)
> tp
    setosa versicolor  virginica
     1.462      4.260      5.552
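INDEX can hold more than one factor, which is roughly comparable to grouping by two columns in SQL. The following is only a sketch, using the built-in mtcars dataset (not part of the example above) and illustrative variable names: it averages mpg for every combination of cyl and am, then counts rows per species in iris.

```r
# Mean miles-per-gallon for each combination of cylinder count and
# transmission type; the result is a matrix with one row per cyl value
# and one column per am value
mpg_by <- tapply(mtcars$mpg, list(cyl = mtcars$cyl, am = mtcars$am), mean)
print(mpg_by)

# FUN can be any summary function, e.g. length() to count rows per group
n_by_species <- tapply(iris$Petal.Length, iris$Species, length)
print(n_by_species)
```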
Next comes a slightly more involved function. by() is an object-oriented wrapper for tapply() applied to data frames: it splits a data frame by one or more factors and applies a function to each subset. Hopefully the example will make this clearer.
> by(data, INDICES, FUN, …, simplify = TRUE)
|data||an R object, normally a data frame, possibly a matrix.|
|INDICES||a factor or a list of factors, each of length nrow(data).|
|FUN||a function to be applied to (usually data-frame) subsets of data.|
> df <- iris
> mean_col <- by(df[, 1:4], df$Species, colMeans)
> mean_col
df$Species: setosa
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
       5.006        3.428        1.462        0.246
------------------------------------------------------------
df$Species: versicolor
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
       5.936        2.770        4.260        1.326
------------------------------------------------------------
df$Species: virginica
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
       6.588        2.974        5.552        2.026
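The printed by object is easy to read but awkward to compute with. One common approach, sketched here with illustrative variable names, is to pull out a single group by name or to bind all groups into one matrix with do.call(rbind, ...).

```r
df <- iris
mean_col <- by(df[, 1:4], df$Species, colMeans)

# Extract the result for a single group by its factor level
setosa_means <- mean_col[["setosa"]]
print(setosa_means)

# Stack the per-group vectors into a matrix, one row per species
means_tbl <- do.call(rbind, mean_col)
print(means_tbl)
```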
In this article, we looked at functions that help with summarizing data in R: apply(), lapply(), sapply(), tapply(), and by(), each with its definition, its usage, and an example.
This brings us to the end of this blog. We really appreciate your time.
Hope you liked it.
Do visit our page www.zigya.com/blog for more informative blogs on Data Science
Keep Reading! Cheers!