We all know how many hours go into cleaning and wrangling data. Sometimes I think my actual job title is not “Data Scientist” but “Data Cleaner”.
Data, as you surely know, is rarely in the best shape, so for many people like me, one of the most appreciated tools is one that makes cleaning easy.
When I started working in what is now called data science, I used Python a lot for data wrangling, but I now consider R the “de facto best tool” for cleaning and munging because of the many good packages developed for that purpose. One of them is janitor.

Installing and loading the package

Type `install.packages("janitor")` in the R console and then press the Enter/Return key.

# to install janitor
> install.packages("janitor")

# to load it into your environment
> library(janitor)

Cleaning data

Favorites of janitor
The janitor package is great for data cleaning. Consider learning it if you work with a lot of Excel sheets from other users. Excel sheets may have bad column names (e.g. containing “?” or a mix of upper- and lowercase letters), empty rows or columns, and so on. You want your R objects to be clean:

1) for your sanity,
2) for readability of your code, and
3) for ease of coding.

clean_names()

Clean data.frame names with `clean_names()`

Call this function every time you read data.

It works in a %>% pipeline and handles problematic variable names, especially those that readxl::read_excel() and readr::read_csv() preserve exactly as they appear in the file.

What clean_names() does:

Parses letter cases and separators to a consistent format.
- Default is to snake_case, but other cases like camelCase are available
Handles special characters and spaces, including transliterating characters like œ to oe.
Appends numbers to duplicated names
Converts “%” to “percent” and “#” to “number” to retain the meaning
Spacing (or lack thereof) around numbers is preserved

Cleaning the column names of the built-in iris dataset:

> colnames(iris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

> colnames(clean_names(iris))
[1] "sepal_length" "sepal_width"  "petal_length" "petal_width"  "species"
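
To see the effect on messier, spreadsheet-style headers, here is a minimal sketch. The CSV text and its column names are made up for illustration, and base R's read.csv() stands in for whichever reader you normally use.

```r
library(janitor)

# Hypothetical CSV export with spaces, symbols, and mixed case in the header.
raw_csv <- "First Name,% Complete,ID#\nAna,0.5,1\nBen,0.9,2"

# check.names = FALSE keeps the header exactly as written in the file.
messy <- read.csv(text = raw_csv, check.names = FALSE)
colnames(messy)

# Clean the names right after reading, as part of a pipeline.
tidy <- messy %>% clean_names()
colnames(tidy)
```

Comparing the two colnames() results should show the spaces and symbols replaced by lowercase, underscore-separated names.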

tabyl()

tabyl() takes a vector and returns a frequency table, like table(), with some additional features:

It returns a data.frame – for manipulating further, or printing with knitr::kable().
Automatically calculates percentages
It can (optionally) display NA values
- When NA values are present, it will calculate an additional column valid_percent
It can (optionally) sort on counts
It can be called with %>% in a pipeline
When called on a factor, it will include missing levels in the result (levels not present in the vector)

> tabyl(clean_names(iris), species)
    species  n   percent
     setosa 50 0.3333333
 versicolor 50 0.3333333
  virginica 50 0.3333333

tabyl() can be called on a piped-in data frame, which allows for fast, flexible exploration of data:

> library(janitor)
> iris %>% clean_names() %>% tabyl(species)
    species  n   percent
     setosa 50 0.3333333
 versicolor 50 0.3333333
  virginica 50 0.3333333
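
The NA handling mentioned in the feature list can be tried on a small vector. This is a sketch only: the `answers` vector is invented for illustration, and `show_na` is the tabyl() argument that toggles the NA row.

```r
library(janitor)

# Hypothetical survey answers with one missing value.
answers <- c("yes", "no", "yes", NA, "yes")

# By default the NA row is shown and a valid_percent column is added.
with_na <- tabyl(answers)
with_na

# show_na = FALSE drops the NA row, so no valid_percent column is needed.
without_na <- tabyl(answers, show_na = FALSE)
without_na
```

Because both results are ordinary data frames, they can be filtered, joined, or printed like any other table.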

remove_empty()

Syntax:

> remove_empty(dat, which = c("rows", "cols"), quiet = TRUE)

Argument	Description
`dat`	the input data.frame or matrix.
`which`	one of “rows”, “cols”, or `c("rows", "cols")`. Where no value of which is provided, defaults to removing both empty rows and empty columns, declaring the behavior with a printed message.
`quiet`	Should messages be suppressed (`TRUE`) or printed (`FALSE`) indicating the summary of empty columns or rows removed?

The remove_empty() function removes columns and/or rows that are entirely empty, that is, made up only of NA values.

> library(janitor)
> df <- data.frame(col1 = c(NA, NA, NA,  NA, NA),
+                 col2 = c(NA, 2, 3, 4, 5)
+ )

> remove_empty(df, which=c("rows"))
  col1 col2
2   NA    2
3   NA    3
4   NA    4
5   NA    5

> remove_empty(df, which = c("cols"))
  col2
1   NA
2    2
3    3
4    4
5    5
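
Both kinds of empty data can also be dropped in one call. A minimal sketch, reusing the same data frame as above and passing both values to `which` explicitly:

```r
library(janitor)

# Same data as the example above: col1 is all NA, and so is row 1.
df <- data.frame(col1 = c(NA, NA, NA, NA, NA),
                 col2 = c(NA, 2, 3, 4, 5))

# Drop empty rows and empty columns in a single call.
cleaned <- remove_empty(df, which = c("rows", "cols"))
cleaned
dim(cleaned)
```

The result is still a data frame even though only one column is left.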

excel_numeric_to_date()

Ever load data from Excel and see a value like 42223 where a date should be? This function converts those serial numbers to class Date.

> excel_numeric_to_date(41103)
[1] "2012-07-13"
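
The function is vectorised, so a whole column of serial numbers can be converted at once. The sketch below uses illustrative values and compares the result with a base R conversion, assuming the 1899-12-30 origin that modern Excel workbooks commonly use.

```r
library(janitor)

# Excel serial numbers as they might arrive after an import (illustrative values).
serials <- c(41103, 42223)

# janitor's conversion, using its default (modern) Excel date system.
from_janitor <- excel_numeric_to_date(serials)

# Base R equivalent, assuming the common 1899-12-30 origin.
from_base <- as.Date(serials, origin = "1899-12-30")

from_janitor
all(from_janitor == from_base)
```

If a workbook was created with the older Mac date system, the `date_system` argument of excel_numeric_to_date() would need to be changed accordingly.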

Conclusion

In this post we saw what the janitor package is, how to install and load it in R or RStudio, and how some of its functions work, each with an example.

This brings us to the end of this blog post. We really appreciate your time.

Hope you liked it.

Do visit our page www.zigya.com/blog for more informative blogs on Data Science.

Keep Reading! Cheers!

Zigya Academy