R Series

Finding and Removing Duplicates in R

This article explains how to identify and remove duplicate data in R.

You’ll learn how to use the following R base and dplyr functions:

R base functions

duplicated(): for identifying duplicates. It determines which elements of a vector, or rows of a data frame, duplicate an element with a smaller subscript, and returns a logical vector marking them.
unique(): for extracting unique elements.

dplyr functions

distinct(): for keeping only the unique rows of a data frame.

To use the dplyr functions, first load the tidyverse packages, which include dplyr:

> library(tidyverse)

We’ll use the R built-in iris data set, which we start by converting into a tibble data frame (tbl_df) for easier data analysis.

> data <- as_tibble(iris)
> data
# A tibble: 150 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>
 1          5.1         3.5          1.4         0.2 setosa
 2          4.9         3            1.4         0.2 setosa
 3          4.7         3.2          1.3         0.2 setosa
 4          4.6         3.1          1.5         0.2 setosa
 5          5           3.6          1.4         0.2 setosa
 6          5.4         3.9          1.7         0.4 setosa
 7          4.6         3.4          1.4         0.3 setosa
 8          5           3.4          1.5         0.2 setosa
 9          4.4         2.9          1.4         0.2 setosa
10          4.9         3.1          1.5         0.1 setosa
# ... with 140 more rows

Find and drop duplicate elements

To find duplicate elements, use the duplicated() function:

> x <- c(1, 1, 2, 5, 4, 3, 4, 7, 3)
> duplicated(x)
[1] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE

To extract the duplicate elements, use that result as a logical index:

> x[duplicated(x)]
[1] 1 4 3
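Note that duplicated() flags only the second and later occurrences of a value, not the first. As a minimal sketch of one way to flag every member of a repeated group, the default scan can be combined with a scan from the end using the fromLast argument of duplicated(). The names from_end and all_dups are only illustrative.

```r
x <- c(1, 1, 2, 5, 4, 3, 4, 7, 3)

# Scan from the end, so the LAST occurrence of each value counts as the original
from_end <- duplicated(x, fromLast = TRUE)

# An element belongs to a repeated group if either scan flags it
all_dups <- duplicated(x) | from_end

x[all_dups]
# [1] 1 1 4 3 4 3
```

This form is handy when you want to inspect all copies of a repeated value before deciding which one to keep.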

To remove the duplicated elements, negate the result with !duplicated(), where ! is the logical negation (NOT) operator:

> x[!duplicated(x)]
[1] 1 2 5 4 3 7
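The base function unique(), listed at the start of this article, gives the same result for a vector in a single call. The short sketch below compares the two approaches; the name unique_x is only illustrative.

```r
x <- c(1, 1, 2, 5, 4, 3, 4, 7, 3)

# unique() keeps the first occurrence of each value, in original order
unique_x <- unique(x)
unique_x
# [1] 1 2 5 4 3 7

# Same result as indexing with the negated duplicated() vector
identical(unique_x, x[!duplicated(x)])
# [1] TRUE
```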

The same approach removes duplicate rows from a data frame based on the values of a single column:

# Remove rows with a duplicated Sepal.Width value from the iris data set
> data[!duplicated(data$Sepal.Width),]
# A tibble: 23 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>
 1          5.1         3.5          1.4         0.2 setosa
 2          4.9         3            1.4         0.2 setosa
 3          4.7         3.2          1.3         0.2 setosa
 4          4.6         3.1          1.5         0.2 setosa
 5          5           3.6          1.4         0.2 setosa
 6          5.4         3.9          1.7         0.4 setosa
 7          4.6         3.4          1.4         0.3 setosa
 8          4.4         2.9          1.4         0.2 setosa
 9          5.4         3.7          1.5         0.2 setosa
10          5.8         4            1.2         0.2 setosa
# ... with 13 more rows
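The same base R idea extends to several columns, because duplicated() also accepts a data frame and then compares whole rows. The sketch below works on the built-in iris data frame directly; the choice of key columns and the names key_cols, keep and iris_dedup are assumptions made for illustration.

```r
# Columns that together define a duplicate (an illustrative choice)
key_cols <- c("Sepal.Length", "Sepal.Width")

# On a data frame, duplicated() compares rows across the selected columns
keep <- !duplicated(iris[, key_cols])

# Keep the first row for each Sepal.Length / Sepal.Width combination
iris_dedup <- iris[keep, ]

nrow(iris_dedup)
```

All five columns are retained in iris_dedup; only the rows whose key combination had already appeared are dropped.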

Remove duplicate rows in a data frame

The function distinct() [dplyr package] keeps only the unique/distinct rows of a data frame. If there are duplicate rows, only the first occurrence is preserved. It behaves like the R base function unique() applied to a data frame, but is generally faster.

> data %>% distinct()
# A tibble: 149 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>
 1          5.1         3.5          1.4         0.2 setosa
 2          4.9         3            1.4         0.2 setosa
 3          4.7         3.2          1.3         0.2 setosa
 4          4.6         3.1          1.5         0.2 setosa
 5          5           3.6          1.4         0.2 setosa
 6          5.4         3.9          1.7         0.4 setosa
 7          4.6         3.4          1.4         0.3 setosa
 8          5           3.4          1.5         0.2 setosa
 9          4.4         2.9          1.4         0.2 setosa
10          4.9         3.1          1.5         0.1 setosa
# ... with 139 more rows
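To see which row was dropped, the base functions can be applied to the whole data frame. This is a small sketch using the built-in iris data frame; the names dup_rows and iris_unique are only illustrative.

```r
# Row positions that repeat an earlier row in full
dup_rows <- which(duplicated(iris))
dup_rows
# [1] 143

# Row 143 is an exact copy of row 102
iris[c(102, 143), ]

# Base R equivalent of distinct() on the whole data frame
iris_unique <- unique(iris)
nrow(iris_unique)
# [1] 149
```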

Remove duplicate rows based on Sepal.Length, keeping all other columns with .keep_all = TRUE:

> data %>% distinct(Sepal.Length, .keep_all = TRUE)
# A tibble: 35 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>
 1          5.1         3.5          1.4         0.2 setosa
 2          4.9         3            1.4         0.2 setosa
 3          4.7         3.2          1.3         0.2 setosa
 4          4.6         3.1          1.5         0.2 setosa
 5          5           3.6          1.4         0.2 setosa
 6          5.4         3.9          1.7         0.4 setosa
 7          4.4         2.9          1.4         0.2 setosa
 8          4.8         3.4          1.6         0.2 setosa
 9          4.3         3            1.1         0.1 setosa
10          5.8         4            1.2         0.2 setosa
# ... with 25 more rows
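Two variations on this call are worth knowing. The sketch below assumes the dplyr package is installed and uses the built-in iris data frame; the names lengths_only and by_length_species, and the choice of Species as a second key column, are only illustrative.

```r
library(dplyr)

# Without .keep_all, only the columns named in distinct() are returned
lengths_only <- distinct(iris, Sepal.Length)
ncol(lengths_only)
# [1] 1

# With several columns, a row is dropped only when its combination
# of values has already appeared in an earlier row
by_length_species <- distinct(iris, Sepal.Length, Species, .keep_all = TRUE)
ncol(by_length_species)
# [1] 5
```

Leaving out .keep_all is useful when you only need the list of distinct values; setting it to TRUE keeps the first full row for each combination.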