What is the janitor package in R?

Zigya Academy

We all know how many hours go into cleaning and wrangling data. Sometimes I think my actual job is not “Data Scientist” but “Data Cleaner”.
Data rarely arrives in good shape, so for many people like me, the most appreciated tools are the ones that make cleaning easy.
When I started working in what is now called data science, I used Python a lot for data wrangling, but I now consider R one of the best tools for cleaning and munging, thanks to the many good packages developed for that purpose.

Installing and loading the package

Type install.packages("janitor") in the R console and then press the Enter/Return key.

# to install janitor
> install.packages("janitor")

# to load it into your environment
> library(janitor)

Cleaning data

Favorites of janitor
The janitor package is awesome for data cleaning. Consider learning it if you work with a lot of Excel sheets from other users. Excel sheets may have bad column names (e.g. containing “?” or a mix of upper and lowercase letters), empty rows or columns, and so on. You want your R objects to be clean:

1) for your sanity,
2) for readability of your code, and
3) for ease of coding.

clean_names()

Clean data.frame names with clean_names()

Call this function every time you read data.

It works in a %>% pipeline, and handles problematic variable names, especially those that are so well-preserved by readxl::read_excel() and readr::read_csv().

  • Parses letter cases and separators to a consistent format.
    • Default is to snake_case, but other cases like camelCase are available
  • Handles special characters and spaces, including transliterating characters like œ to oe.
  • Appends numbers to duplicated names
  • Converts “%” to “percent” and “#” to “number” to retain the meaning
  • Spacing (or lack thereof) around numbers is preserved

Cleaning the column names of the built-in iris dataset:

> colnames(iris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

> colnames(clean_names(iris))
[1] "sepal_length" "sepal_width"  "petal_length" "petal_width"  "species"
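
The iris names are already fairly tidy, so the effect is easier to see on messier headers. Below is a minimal sketch, assuming a small made-up CSV (the column names and values are invented for illustration). It is read with base R's read.csv() and check.names = FALSE so that the original headers survive until clean_names() sees them:

```r
library(janitor)

# Hypothetical CSV text with spaces, mixed case, a "%" sign and a duplicated header
csv_text <- "First Name,Total %,REPEAT VALUE,REPEAT VALUE
Ann,10,1,2
Bob,20,3,4"

# check.names = FALSE stops read.csv() from altering the headers itself
raw <- read.csv(text = csv_text, check.names = FALSE)

# clean_names() slots into a pipeline right after the read step
cleaned <- raw %>% clean_names()

names(raw)      # the headers as they appear in the file
names(cleaned)  # snake_case, "%" spelled out, duplicate numbered
```

The second "REPEAT VALUE" column comes out as repeat_value_2, which matches the "appends numbers to duplicated names" rule in the list above.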

tabyl()

tabyl() takes a vector and returns a frequency table, like table(), but with some additional features:

  • It returns a data.frame – for manipulating further, or printing with knitr::kable().
  • Automatically calculates percentages
  • It can (optionally) display NA values
    • When NA values are present, it will calculate an additional column valid_percent
  • It can (optionally) sort on counts
  • It can be called with %>% in a pipeline
  • When called on a factor, it will include missing levels in the result (levels not present in the vector)

> tabyl(iris, Species)
    Species  n   percent
     setosa 50 0.3333333
 versicolor 50 0.3333333
  virginica 50 0.3333333

tabyl() can be called on a piped-in data frame, which allows for fast, flexible exploration of data:

> library(janitor)
> iris %>% tabyl(Species)
    Species  n   percent
     setosa 50 0.3333333
 versicolor 50 0.3333333
  virginica 50 0.3333333
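
The iris data has no missing values, so the NA and factor-level behaviour from the feature list does not show up there. Here is a small sketch on a made-up factor (the vector x and its levels are assumptions for illustration only):

```r
library(janitor)

# Hypothetical factor with one missing value and one unused level ("c")
x <- factor(c("a", "b", "a", NA), levels = c("a", "b", "c"))

# Default: the NA gets its own row and a valid_percent column is added;
# the unused level "c" is still listed
with_na <- tabyl(x)
print(with_na)

# show_na = FALSE drops the NA row, and with it the valid_percent column
without_na <- tabyl(x, show_na = FALSE)
print(without_na)
```

Because both results are plain data frames, they can be filtered, joined or passed to knitr::kable() like any other table.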

remove_empty()


syntax:

> remove_empty(dat, which = c("rows", "cols"), quiet = TRUE)
Argument   Description
dat        the input data.frame or matrix.
which      one of "rows", "cols", or c("rows", "cols"). Where no value of which is provided, defaults to removing both empty rows and empty columns, declaring the behavior with a printed message.
quiet      Should messages be suppressed (TRUE) or printed (FALSE) indicating the summary of empty columns or rows removed?

The remove_empty() function removes rows and/or columns that are entirely empty, i.e. that contain nothing but NA values.

> library(janitor)
> df <- data.frame(col1 = c(NA, NA, NA,  NA, NA),
+                 col2 = c(NA, 2, 3, 4, 5)
+ )

> remove_empty(df, which=c("rows"))
  col1 col2
2   NA    2
3   NA    3
4   NA    4
5   NA    5

> remove_empty(df, which = c("cols"))
  col2
1   NA
2    2
3    3
4    4
5    5
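
Both directions can also be cleaned in one call. The sketch below uses a made-up data frame (df2 and its values are illustrative) and assumes a janitor version that supports the quiet argument from the syntax above:

```r
library(janitor)

# Hypothetical data frame: row 1 and column col1 are entirely NA
df2 <- data.frame(
  col1 = c(NA, NA, NA),
  col2 = c(NA, 2, 3),
  col3 = c(NA, "a", NA)
)

# Drop empty rows and empty columns together;
# quiet = FALSE asks janitor to report what it removed
cleaned2 <- remove_empty(df2, which = c("rows", "cols"), quiet = FALSE)
print(cleaned2)
```

Row 3 is kept even though col3 is NA there, because a row is only dropped when every value in it is missing.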

excel_numeric_to_date()

Ever load data from Excel and see a value like 42223 where a date should be? This function converts those serial numbers to class Date.

> excel_numeric_to_date(41103)
[1] "2012-07-13"
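
To see what the conversion does, it helps to compare it with base R. My understanding is that, for the default (modern) Excel date system, the serial number counts days from 1899-12-30, so as.Date() with that origin should give the same answer. A short sketch under that assumption:

```r
library(janitor)

# The two serial numbers mentioned above
serials <- c(41103, 42223)

# janitor's conversion (vectorised, so it also works on a whole column)
dates <- excel_numeric_to_date(serials)

# Base R equivalent for the default date system:
# whole days counted from the origin 1899-12-30
base_dates <- as.Date(serials, origin = "1899-12-30")

print(dates)
print(dates == base_dates)
```

The janitor function is still the more convenient choice, since it documents the intent and can also handle the older Mac date system through its date_system argument.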

Conclusion

To sum up, we saw what the janitor package in R is and how to install and load it in R or RStudio. We also looked at some of its functions, with an example of each.

This brings us to the end of this blog. We really appreciate your time.

Hope you liked it.

Do visit our page www.zigya.com/blog for more informative blogs on Data Science.

Keep Reading! Cheers!
