This article explains how to identify and remove duplicate data in R.
You’ll learn how to use the following R base and dplyr functions:
R base functions
duplicated(): determines which elements of a vector or rows of a data frame are duplicates of elements with smaller subscripts, and returns a logical vector indicating which elements (rows) are duplicates.
unique(): for extracting unique elements.
dplyr functions
distinct(): for keeping only the unique rows of a data frame.
To use dplyr functions, you first need to load the tidyverse packages, which include dplyr:
> library(tidyverse)
We’ll use the R built-in iris data set, which we start by converting into a tibble data frame (tbl_df) for easier data analysis.
> data <- as_tibble(iris)
> data
# A tibble: 150 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# ... with 140 more rows
Find and drop duplicate elements
To find duplicate elements, use the duplicated() function. It returns TRUE for every element that repeats an earlier one:
> x <- c(1, 1, 2, 5, 4, 3, 4, 7, 3)
> duplicated(x)
[1] FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
To extract the duplicate elements:
> x[duplicated(x)]
[1] 1 4 3
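Note that duplicated() flags only the second and later occurrences, so the first copy of each repeated value is not marked. As a small sketch, you can flag every copy of a repeated value by combining a forward scan with a backward one. The fromLast argument is part of base R's duplicated(); the name all_copies is ours, chosen for illustration.

```r
x <- c(1, 1, 2, 5, 4, 3, 4, 7, 3)

# The forward scan flags later copies; the backward scan flags earlier copies.
all_copies <- duplicated(x) | duplicated(x, fromLast = TRUE)

# Every element whose value occurs more than once, in original order
x[all_copies]
# [1] 1 1 4 3 4 3
```

This is handy when you want to inspect all the records involved in a duplication before deciding which one to keep.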
To remove the duplicated elements, index the vector with !duplicated(), where ! is logical negation:
> x[!duplicated(x)]
[1] 1 2 5 4 3 7
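For a plain vector, the base R function unique() should give the same result in a single call. The following is a quick check of that equivalence on the same vector, not a performance comparison:

```r
x <- c(1, 1, 2, 5, 4, 3, 4, 7, 3)

# unique() keeps the first occurrence of each value, in original order
unique(x)
# [1] 1 2 5 4 3 7

# Same result as subsetting with !duplicated()
identical(unique(x), x[!duplicated(x)])
# [1] TRUE
```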
With the same method, you can remove duplicate rows from a data frame based on the values of a single column.
# Remove rows with a duplicated Sepal.Width value from the iris data set
> data[!duplicated(data$Sepal.Width),]
# A tibble: 23 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 4.4 2.9 1.4 0.2 setosa
9 5.4 3.7 1.5 0.2 setosa
10 5.8 4 1.2 0.2 setosa
# ... with 13 more rows
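The same idea extends to more than one column, because duplicated() also accepts a data frame and then compares whole rows. Here is a minimal sketch on the plain built-in iris data frame. The choice of key columns and the names key_cols and deduped are assumptions made for illustration only.

```r
# Plain data frame version of the built-in iris data (no packages needed)
df <- iris

# Columns that together define a "duplicate" (our choice, for illustration)
key_cols <- c("Sepal.Width", "Species")

# duplicated() on a data frame compares the rows of the selected columns,
# so a row is dropped only when the whole combination has appeared before.
deduped <- df[!duplicated(df[, key_cols]), ]

# Number of rows kept: one per distinct Sepal.Width/Species combination
nrow(deduped)
```

All five columns are kept in the result; only the rows are filtered.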
Remove duplicate rows in a data frame
The distinct() function [dplyr package] keeps only the unique/distinct rows of a data frame. If there are duplicate rows, only the first one is preserved. It’s an efficient version of the base R function unique().
> data %>% distinct()
# A tibble: 149 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# ... with 139 more rows
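Before dropping rows, it can help to see which ones are exact repeats. A short base R sketch on the built-in iris data frame follows; the name dup_rows is ours.

```r
# Rows of iris that exactly repeat an earlier row
dup_rows <- iris[duplicated(iris), ]
dup_rows

# Number of rows that de-duplication would drop
sum(duplicated(iris))
```

The count should match the difference between the 150 original rows and the 149 rows returned by distinct() above.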
Remove duplicated rows based on Sepal.Length. The .keep_all = TRUE argument keeps all the other columns in the output:
> data %>% distinct(Sepal.Length, .keep_all = TRUE)
# A tibble: 35 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.4 2.9 1.4 0.2 setosa
8 4.8 3.4 1.6 0.2 setosa
9 4.3 3 1.1 0.1 setosa
10 5.8 4 1.2 0.2 setosa
# ... with 25 more rows
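Two variations on this call are worth trying, shown here as a sketch on the plain iris data frame. The column choices and the names lengths_only and combos are assumptions for illustration. Without .keep_all, distinct() returns only the columns you name; with several columns, a row is dropped only when the whole combination repeats.

```r
library(dplyr)

# Without .keep_all, only the listed column is returned
lengths_only <- iris %>% distinct(Sepal.Length)

# Several columns: rows are compared on the Species/Petal.Width combination,
# and .keep_all = TRUE retains the remaining columns of the first match.
combos <- iris %>% distinct(Species, Petal.Width, .keep_all = TRUE)

dim(lengths_only)
dim(combos)
```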
Conclusion
To summarize, we saw how to find and remove duplicate elements from vectors and data frames using the base R functions duplicated() and unique() and the dplyr function distinct().
This brings us to the end of this blog. We really appreciate your time.
Hope you liked it.
Do visit our page www.zigya.com/blog for more informative blogs on Data Science.
Keep Reading! Cheers!
Zigya Academy