What is the tidyr package in R?

Zigya Academy

tidyr is a package that makes it easy to “tidy” your data. Tidy data is data that’s easy to work with: it’s easy to munge (with dplyr), visualise (with ggplot2 or ggvis) and model (with R’s hundreds of modelling packages). The two most important properties of tidy data are:

  • Each column is a variable.
  • Each row is an observation.

Arranging your data in this way makes it easier to work with because you have a consistent way of referring to variables (as column names) and observations (as row indices). When you use tidy data and tidy tools, you spend less time worrying about how to feed the output from one function into the input of another, and more time answering your questions about the data.
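To make this concrete, here is a minimal sketch of why the tidy layout is convenient. It assumes dplyr is installed; the data frame heart and the result name by_drug are illustrative, with values mirroring the made-up heart-rate data used later in this post.

```r
library(dplyr)

# Illustrative tidy data: one row per observation, one column per variable.
heart <- data.frame(
  name      = c("Wilbur", "Petunia", "Gregory", "Wilbur", "Petunia", "Gregory"),
  drug      = c("a", "a", "a", "b", "b", "b"),
  heartrate = c(67, 80, 64, 56, 90, 50)
)

# Variables are referred to by column name, so the summary reads directly.
by_drug <- heart %>%
  group_by(drug) %>%
  summarise(mean_heartrate = mean(heartrate))

by_drug
```

Because drug and heartrate are ordinary columns, the output of one step can be piped straight into the next without any reshaping in between.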

To tidy messy data, you first identify the variables in your dataset, then use the tools provided by tidyr to move them into columns. tidyr provides three main functions for tidying your messy data: gather(), separate() and spread().

gather()

gather() takes multiple columns and collapses them into key-value pairs, duplicating all other columns as needed. In other words, it makes “wide” data longer.

Syntax:

> gather(data, key = "key", value = "value", ..., na.rm = FALSE, convert = FALSE,
         factor_key = FALSE)

Arguments

Argument      Description
data          The data frame.
key, value    The names of the new key and value columns, as strings or as symbols.
na.rm         If TRUE, rows where the value column is NA are removed from the output.
convert       If TRUE, type.convert() is run automatically on the key column. This is useful if the column names are actually numeric, integer, or logical.
factor_key    If FALSE (the default), the key values are stored as a character vector. If TRUE, they are stored as a factor that preserves the original ordering of the columns.
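The effect of na.rm and factor_key is easiest to see on a tiny example. The sketch below uses a hypothetical data frame wide (the names id, y, x, measure and score are illustrative, not from the post).

```r
library(tidyr)

# Hypothetical wide data with one missing measurement.
wide <- data.frame(
  id = 1:2,
  y  = c(10, NA),
  x  = c(3, 4)
)

# na.rm = TRUE drops the row whose value is NA;
# factor_key = TRUE stores the key as a factor in original column order (y, x).
long <- gather(wide, key = "measure", value = "score", y, x,
               na.rm = TRUE, factor_key = TRUE)

long
levels(long$measure)
```

With the defaults (na.rm = FALSE, factor_key = FALSE) the same call would keep the NA row and return measure as a character vector.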

Other names for gather include melt (reshape2), pivot (spreadsheets), and fold (databases). Here’s an example of how you might use gather() on a made-up dataset. In this experiment we’ve given three people two different drugs and recorded their heart rate:

> library(tidyr)
> library(dplyr)

> messy <- data.frame(
+  name = c("Wilbur", "Petunia", "Gregory"),
+  a = c(67, 80, 64),
+  b = c(56, 90, 50)
+ )

> messy
#>      name  a  b
#> 1  Wilbur 67 56
#> 2 Petunia 80 90
#> 3 Gregory 64 50

We have three variables (name, drug and heartrate), but only name is currently in a column. We use gather() to gather the a and b columns into key-value pairs of drug and heartrate:

> messy %>%
+  gather(drug, heartrate, a:b)
#>      name drug heartrate
#> 1  Wilbur    a        67
#> 2 Petunia    a        80
#> 3 Gregory    a        64
#> 4  Wilbur    b        56
#> 5 Petunia    b        90
#> 6 Gregory    b        50
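spread() is named above as the third main function but is not shown in this post. As a rough sketch, it reverses the gather() step by turning each distinct key back into its own column. The name wide_again is illustrative; note that spread() may return the rows in a different order than the original messy data frame.

```r
library(tidyr)

messy <- data.frame(
  name = c("Wilbur", "Petunia", "Gregory"),
  a = c(67, 80, 64),
  b = c(56, 90, 50)
)

# Wide -> long, as in the gather() example.
long <- gather(messy, drug, heartrate, a:b)

# Long -> wide: each distinct value of `drug` becomes its own column again.
wide_again <- spread(long, drug, heartrate)

wide_again
```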

separate()

separate() turns a single character column into multiple columns. It does not change the number of rows; reshaping long data to a wide format is the job of spread().

Sometimes two variables are clumped together in one column. separate() allows you to tease them apart (extract() works similarly but uses regexp groups instead of a splitting pattern or position). 
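For comparison, here is a small sketch of extract(), which pulls pieces out with regular-expression groups rather than a separator. The data frame df and the column names code, kind and number are hypothetical; the function is called as tidyr::extract() to avoid clashing with a same-named function from magrittr, should that package be attached.

```r
library(tidyr)

# Hypothetical column holding two variables in one string.
df <- data.frame(code = c("set-1", "set-2", "grp-10"))

# Each parenthesised regex group becomes one new column;
# convert = TRUE turns the digit strings into integers.
parts <- tidyr::extract(df, code, into = c("kind", "number"),
                        regex = "([a-z]+)-([0-9]+)", convert = TRUE)

parts
```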

Syntax:

> separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE)

Arguments

Argument    Description
data        A data frame.
col         Column name or position.
into        Names of the new variables to create, as a character vector. Use NA to omit a variable from the output.
sep         The separator between the columns. A character value is interpreted as a regular expression; a numeric value is interpreted as positions to split at.
remove      If TRUE, the input column is removed from the output data frame.
convert     If TRUE, type.convert() is run with as.is = TRUE on the new columns.
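The following sketch illustrates remove, convert and the NA placeholder in into. The data frame df and the column names key, set and number are hypothetical.

```r
library(tidyr)

# Hypothetical column with a name part and a numeric part.
df <- data.frame(key = c("set.1", "set.2", "set.10"))

# remove = FALSE keeps the original `key` column;
# convert = TRUE turns the split-off digits into integers.
kept <- separate(df, key, into = c("set", "number"), sep = "\\.",
                 remove = FALSE, convert = TRUE)

# NA in `into` drops that piece from the output entirely.
number_only <- separate(df, key, into = c(NA, "number"), sep = "\\.")

kept
number_only
```

Without convert = TRUE the new columns stay character, which is why number_only$number is a character vector here.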

Example: 

The long dataset created with gather() is already usable, but its key column (Set in the example below) can be broken down further using separate():

# Create a data frame (sample() is random, so your values will differ)
> tidy_dataframe <- data.frame(
+   S.No = 1:10,
+   set.1 = sample(10),
+   set.2 = sample(10),
+   set.3 = sample(10)
+ )
> head(tidy_dataframe)
  S.No set.1 set.2 set.3
1    1     8     5     8
2    2     1     2     9
3    3     7     7     1
4    4    10     3    10
5    5     4     9     4
6    6     6     6     2

> long <- tidy_dataframe %>%
+ gather(Set, Quantity, set.1:set.3)

Next, we use separate() to split the key into set and number, using a regular expression to describe the character that separates them.

> tidy <- long %>%
+ separate(Set, into=c("set","No."), sep="\\.")
> head(tidy)
   S.No set No. Quantity
1     1 set   1        8
2     2 set   1        1
3     3 set   1        7
4     4 set   1       10
5     5 set   1        4
6     6 set   1        6
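In tidyr 1.0.0 and later, gather() is marked as superseded in favour of pivot_longer(), which can do the gather and separate steps in a single call. The sketch below assumes such a version of tidyr is installed; the fixed seed and the names wide, two_step and one_step are illustrative additions, not part of the original example.

```r
library(tidyr)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible
wide <- data.frame(
  S.No  = 1:10,
  set.1 = sample(10),
  set.2 = sample(10),
  set.3 = sample(10)
)

# Two steps, as in the post: gather(), then separate().
two_step <- separate(gather(wide, Set, Quantity, set.1:set.3),
                     Set, into = c("set", "No."), sep = "\\.")

# One step: names_sep splits the old column names while reshaping.
one_step <- pivot_longer(wide, cols = set.1:set.3,
                         names_to = c("set", "No."), names_sep = "\\.",
                         values_to = "Quantity")

head(one_step)
```

Both results hold the same 30 observations, but pivot_longer() returns a tibble and orders the rows by S.No rather than by set, so the two outputs are not row-for-row identical.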

Conclusion

In this post we saw what the tidyr package is and why tidy data matters. We also saw how to use gather() to make wide data longer and separate() to split one column into several.

This brings us to the end of this blog. We really appreciate your time.

Hope you liked it.

Do visit our page www.zigya.com/blog for more informative blogs on Data Science.

Keep Reading! Cheers!
