tidyr is a new package that makes it easy to “tidy” your data. Tidy data is data that’s easy to work with: it’s easy to munge (with dplyr), visualise (with ggplot2 or ggvis) and model (with R’s hundreds of modelling packages). The two most important properties of tidy data are:
- Each column is a variable.
- Each row is an observation.
Arranging your data in this way makes it easier to work with because you have a consistent way of referring to variables (as column names) and observations (as row indices). When you use tidy data and tidy tools, you spend less time worrying about how to feed the output from one function into the input of another, and more time answering your questions about the data.
To tidy messy data, you first identify the variables in your dataset, then use the tools provided by tidyr to move them into columns. tidyr provides three main functions for tidying your messy data: gather(), separate() and spread(). This post covers the first two.
gather() function: It takes multiple columns and collapses them into key-value pairs, duplicating all other columns as needed. In other words, it makes “wide” data longer.
> gather(data, key = "key", value = "value", ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE)
|Argument|Description|
|---|---|
|data|The data frame.|
|key, value|The names of the new key and value columns, as strings or as symbols.|
|na.rm|If set to TRUE, rows where the value column is NA are removed from the output.|
|convert|If set to TRUE, type.convert() is run automatically on the key column. This is useful if the column names are actually numeric, integer, or logical.|
|factor_key|If FALSE, the default, the key values are stored as a character vector. If TRUE, they are stored as a factor that preserves the original column order.|
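To make the na.rm and factor_key arguments concrete, here is a minimal sketch. The `readings` data frame and its missing value are made up for illustration and are not part of the original example.

```r
library(tidyr)

# Hypothetical data: one heart-rate reading is missing
readings <- data.frame(
  name = c("Wilbur", "Petunia", "Gregory"),
  a = c(67, NA, 64),
  b = c(56, 90, 50)
)

# na.rm = TRUE drops the row whose value is NA;
# factor_key = TRUE stores the key as a factor in the original column order
long <- gather(readings, key = "drug", value = "heartrate", a, b,
               na.rm = TRUE, factor_key = TRUE)

nrow(long)         # 5 rows instead of 6, because Petunia's NA for drug a is dropped
levels(long$drug)  # "a" "b"
```

Keeping the key as a factor can be handy when the original column order carries meaning, for example when plotting.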
Other names for gather include melt (reshape2), pivot (spreadsheets), and fold (databases). Here’s an example of how you might use gather() on a made-up dataset. In this experiment we’ve given three people two different drugs and recorded their heart rate:
> library(tidyr)
> library(dplyr)
> messy <- data.frame(
+   name = c("Wilbur", "Petunia", "Gregory"),
+   a = c(67, 80, 64),
+   b = c(56, 90, 50)
+ )
> messy
#>      name  a  b
#> 1  Wilbur 67 56
#> 2 Petunia 80 90
#> 3 Gregory 64 50
We have three variables (name, drug and heartrate), but only name is currently in a column. We use gather() to gather the a and b columns into key-value pairs of drug and heartrate:
> messy %>%
+   gather(drug, heartrate, a:b)
#>      name drug heartrate
#> 1  Wilbur    a        67
#> 2 Petunia    a        80
#> 3 Gregory    a        64
#> 4  Wilbur    b        56
#> 5 Petunia    b        90
#> 6 Gregory    b        50
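Once the data is in this shape, each variable can be referred to by its column name, which is what makes tidy tools such as dplyr easy to apply. As a rough sketch (the `drug_means` name and the choice of summary are ours, not part of the original experiment), the mean heart rate per drug could be computed like this:

```r
library(tidyr)
library(dplyr)

messy <- data.frame(
  name = c("Wilbur", "Petunia", "Gregory"),
  a = c(67, 80, 64),
  b = c(56, 90, 50)
)

# Gather the two drug columns into key-value pairs
tidy <- messy %>%
  gather(drug, heartrate, a:b)

# One row per drug with its mean heart rate
drug_means <- tidy %>%
  group_by(drug) %>%
  summarise(mean_heartrate = mean(heartrate))

drug_means
```

Doing the same on the wide data would mean handling the a and b columns one by one.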
separate() function: It turns a single character column into multiple columns.
Sometimes two variables are clumped together in one column. separate() allows you to tease them apart (extract() works similarly but uses regexp groups instead of a splitting pattern or position).
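The difference between the two is easiest to see side by side. The sketch below uses a small made-up data frame (`df` and its column names are hypothetical) and splits the same column once with separate() and once with extract():

```r
library(tidyr)

# Hypothetical key column combining a label and a number
df <- data.frame(key = c("set.1", "set.2", "set.3"), value = c(8, 5, 9))

# separate(): describe what sits between the pieces
by_sep <- separate(df, key, into = c("label", "number"), sep = "\\.")

# extract(): describe the pieces themselves, one regex group per new column.
# The tidyr:: prefix avoids clashes with other packages that export extract().
by_regex <- tidyr::extract(df, key, into = c("label", "number"),
                           regex = "([a-z]+)\\.([0-9]+)")

by_sep
by_regex
```

For a simple delimiter like this, separate() is usually the shorter choice; extract() tends to help when no single separator describes the split.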
> separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE)
|Argument|Description|
|---|---|
|data|A data frame.|
|col|Column name or position.|
|into|Names of new variables to create, as a character vector. Use NA to omit the variable in the output.|
|sep|The separator between the columns.|
|remove|If set to TRUE, the input column is removed from the output data frame.|
|convert|If TRUE, type.convert() is run with as.is = TRUE on the new columns.|
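A short sketch of how remove, convert and an NA entry in into change the result. The `long` data frame here is a small made-up stand-in, not the dataset used later in this post:

```r
library(tidyr)

# Hypothetical long data whose key combines a label and a number
long <- data.frame(Set = c("set.1", "set.2", "set.3"), Quantity = c(8, 5, 9))

# Defaults: the Set column is dropped and the new columns are character
default_split <- separate(long, Set, into = c("set", "No."), sep = "\\.")

# remove = FALSE keeps Set; convert = TRUE turns "1", "2", "3" into integers
kept_split <- separate(long, Set, into = c("set", "No."), sep = "\\.",
                       remove = FALSE, convert = TRUE)

# NA in `into` omits that piece from the output
number_only <- separate(long, Set, into = c(NA, "No."), sep = "\\.")

str(default_split)
str(kept_split)
names(number_only)
```

convert = TRUE is worth considering whenever the split-off piece is really a number, since it saves a separate conversion step afterwards.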
The long dataset created by gather() is already usable, but we can break its key column (Set in the example below) down even further using separate().
# Create a data frame (sample() is random, so your values will differ)
> tidy_dataframe = data.frame(
+   S.No = c(1:10),
+   set.1 = sample(10),
+   set.2 = sample(10),
+   set.3 = sample(10))
> head(tidy_dataframe)
  S.No set.1 set.2 set.3
1    1     8     5     8
2    2     1     2     9
3    3     7     7     1
4    4    10     3    10
5    5     4     9     4
6    6     6     6     2
> long <- tidy_dataframe %>%
+   gather(Set, Quantity, set.1:set.3)
Next, we use separate() to split the key into set and number, using a regular expression to describe the character that separates them.
> tidy <- long %>%
+   separate(Set, into=c("set","No."), sep="\\.")
> head(tidy)
  S.No set No. Quantity
1    1 set   1        8
2    2 set   1        1
3    3 set   1        7
4    4 set   1       10
5    5 set   1        4
6    6 set   1        6
In this post we saw what the tidyr package is and why tidy data matters. We also saw how to use gather() to make wide data longer and separate() to split one column into several.
This brings us to the end of this blog. We really appreciate your time.
Hope you liked it.
Do visit our page www.zigya.com/blog for more informative blogs on Data Science
Keep Reading! Cheers!