We all know how many hours go into cleaning and wrangling data. Sometimes I think my actual job title is not “Data Scientist” but “Data Cleaner”.
Data, as you surely know, rarely arrives in good shape, so for many people like me, one of the most appreciated tools is the one that makes cleaning easy.
When I started working in what is now called data science, I used Python a lot for data wrangling, but these days I find R the “de facto best tool” for cleaning and munging, thanks to the many good packages developed for that purpose.
Installing and loading the package
Type install.packages("janitor") in the R console and then press the Enter/Return key.
# to install janitor
> install.packages("janitor")
# to load it into your environment
> library(janitor)
Favorites of janitor
The janitor package is awesome for data cleaning. It is well worth learning if you work with a lot of Excel sheets from other users. Excel sheets may have bad column names (e.g. containing “?” or a mix of upper- and lowercase letters), empty rows or columns, and so on. You want your R objects to be clean:
1) for your sanity,
2) for readability of your code, and
3) for ease of coding.
Clean data.frame names with clean_names()
Call this function every time you read data.
It works in a %>% pipeline and handles problematic variable names, especially those that are so well-preserved by import functions such as readxl::read_excel() and readr::read_csv().
- Parses letter cases and separators to a consistent format.
- The default is snake_case, but other cases like camelCase are available
- Handles special characters and spaces, including transliterating characters like œ to oe
- Appends numbers to duplicated names
- Converts “%” to “percent” and “#” to “number” to retain the meaning
- Spacing (or lack thereof) around numbers is preserved
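To see several of these rules at once, here is a minimal sketch on a made-up data frame. The column names and values are invented for illustration, and check.names = FALSE is only there to stop data.frame() from altering the names before clean_names() sees them.

```r
library(janitor)

# Hypothetical messy headers, as they might arrive from a spreadsheet
messy <- data.frame(
  "First Name" = c("Ann", "Bob"),
  "% Complete" = c(0.5, 0.9),
  "First Name" = c("A.", "B."),   # deliberately duplicated name
  check.names = FALSE
)

# Default behaviour: snake_case, "%" spelled out, duplicates numbered
cleaned <- clean_names(messy)
names(cleaned)

# Same input, asking for camelCase instead of the snake_case default
camel <- clean_names(messy, case = "lower_camel")
names(camel)
```

With the default settings, the three names should come back as first_name, percent_complete and first_name_2.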
Checking that the column names of the iris dataset come out clean:
> colnames(iris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
> colnames(clean_names(iris))
[1] "sepal_length" "sepal_width"  "petal_length" "petal_width"  "species"
tabyl() takes a vector and returns a frequency table, like table(). But it has additional features:
- It returns a data.frame, for manipulating further or printing with knitr::kable()
- Automatically calculates percentages
- It can (optionally) display NA values; when NA values are present, it calculates an additional column, valid_percent
- It can (optionally) sort on counts
- It can be called with %>% in a pipeline
- When called on a factor, it will include missing levels in the result (levels not present in the vector)
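A small sketch of the NA handling and the missing-level behaviour described in this list. The vector and factor are made up for illustration; show_na is the tabyl() argument that controls whether NA values are shown.

```r
library(janitor)

# Made-up vector with one missing value
x <- c("a", "b", "a", NA)

with_na <- tabyl(x)                      # NA gets its own row, plus a valid_percent column
without_na <- tabyl(x, show_na = FALSE)  # NA is dropped before counting

# A factor with a level ("c") that never occurs in the data
f <- factor(c("a", "b", "a"), levels = c("a", "b", "c"))
by_level <- tabyl(f)                     # "c" still appears, with n = 0

with_na
without_na
by_level
```

Because each result is an ordinary data.frame, you can filter it, join it, or pass it on to other functions like any other table.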
> tabyl(clean_names(iris), species)
    species  n   percent
     setosa 50 0.3333333
 versicolor 50 0.3333333
  virginica 50 0.3333333
tabyl() can be called on a piped-in data frame, which allows for fast, flexible exploration of data:
> library(janitor)
> iris %>% clean_names() %>% tabyl(species)
    species  n   percent
     setosa 50 0.3333333
 versicolor 50 0.3333333
  virginica 50 0.3333333
> remove_empty(dat, which = c("rows", "cols"), quiet = TRUE)
|Argument|Description|
|---|---|
|dat|the input data.frame or matrix.|
|which|one of "rows", "cols", or c("rows", "cols").|
|quiet|Should messages be suppressed (TRUE) or printed (FALSE)?|
remove_empty() removes any columns that are entirely empty and any rows that are entirely empty.
> library(janitor)
> df <- data.frame(col1 = c(NA, NA, NA, NA, NA),
+                  col2 = c(NA, 2, 3, 4, 5)
+ )
> remove_empty(df, which = c("rows"))
  col1 col2
2   NA    2
3   NA    3
4   NA    4
5   NA    5
> remove_empty(df, which = c("cols"))
  col2
1   NA
2    2
3    3
4    4
5    5
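The two calls above can also be combined. As a minimal sketch on the same toy data frame, passing both values to which drops the empty row and the empty column in one step:

```r
library(janitor)

# Same toy data frame as above
df <- data.frame(
  col1 = c(NA, NA, NA, NA, NA),
  col2 = c(NA, 2, 3, 4, 5)
)

# Drop empty rows and empty columns in one call
both <- remove_empty(df, which = c("rows", "cols"))
both
dim(both)
```

Only col2 survives, with the four non-missing values 2 to 5.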
Ever load data from Excel and see a value like 42223 where a date should be? excel_numeric_to_date() converts those serial numbers to class Date.
> excel_numeric_to_date(41103)
[1] "2012-07-13"
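The function is vectorised, so a whole column of serial numbers can be converted at once. The sketch below also compares the result with plain base R, assuming the default "modern" Excel date system, where the serial numbers count days from 1899-12-30.

```r
library(janitor)

# Two serial numbers: the one converted above and the 42223 mentioned earlier
serials <- c(41103, 42223)

# janitor's converter, using the default "modern" Excel date system
dates <- excel_numeric_to_date(serials)
dates

# Base R equivalent for the modern system: day 0 is 1899-12-30
base_dates <- as.Date(serials, origin = "1899-12-30")
all(dates == base_dates)
```

So the 42223 from the question above turns out to be 2015-08-07.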
So, we have seen what the janitor package is and how to install and load it in R and RStudio. We also looked at some of its functions, each with an example.
That brings us to the end of this blog. We really appreciate your time.
Hope you liked it.
Do visit our page www.zigya.com/blog for more informative blogs on Data Science
Keep Reading! Cheers!