
Finding and Removing Duplicates in R

Zigya Academy

This article explains how to identify and remove duplicate data in R.

You’ll learn how to use the following R base and dplyr functions:

R base functions

  • duplicated() determines which elements of a vector or data frame are duplicates of elements with smaller subscripts, and returns a logical vector indicating which elements (rows) are duplicates.
  • unique(): for extracting unique elements.

dplyr functions

  • distinct(): for keeping only the unique (distinct) rows of a data frame.

To use the dplyr functions, first load the tidyverse packages, which include dplyr:

> library(tidyverse)

We’ll use the R built-in iris data set, which we start by converting into a tibble data frame (tbl_df) for easier data analysis.

> data <- as_tibble(iris)
> data
# A tibble: 150 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>
 1          5.1         3.5          1.4         0.2 setosa
 2          4.9         3            1.4         0.2 setosa
 3          4.7         3.2          1.3         0.2 setosa
 4          4.6         3.1          1.5         0.2 setosa
 5          5           3.6          1.4         0.2 setosa
 6          5.4         3.9          1.7         0.4 setosa
 7          4.6         3.4          1.4         0.3 setosa
 8          5           3.4          1.5         0.2 setosa
 9          4.4         2.9          1.4         0.2 setosa
10          4.9         3.1          1.5         0.1 setosa
# ... with 140 more rows

Find and drop duplicate elements

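To find duplicate elements, use the duplicated() function. It returns TRUE for every element that repeats an earlier one:

> x <- c(1, 1, 2, 5, 4, 3, 4, 7, 3)
> duplicated(x)
[1] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE

To extract the duplicate elements, index the vector with that result: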

> x[duplicated(x)]
[1] 1 4 3

To remove the duplicated elements, index the vector with !duplicated(x), where ! is the logical negation operator:

> x[!duplicated(x)]
[1] 1 2 5 4 3 7
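
The base function unique() listed at the start gives the same result in one call. The sketch below is a minimal illustration using the same vector x; the variable names keep_first and keep_last are only for this example. It also shows the fromLast argument of duplicated(), which keeps the last occurrence of each value instead of the first.

```r
x <- c(1, 1, 2, 5, 4, 3, 4, 7, 3)

# unique() keeps the first occurrence of each value,
# so it matches x[!duplicated(x)].
keep_first <- unique(x)
keep_first
# [1] 1 2 5 4 3 7
identical(keep_first, x[!duplicated(x)])
# [1] TRUE

# fromLast = TRUE scans from the end of the vector,
# so the LAST occurrence of each value is the one kept.
keep_last <- x[!duplicated(x, fromLast = TRUE)]
keep_last
# [1] 1 2 5 4 7 3
```

Which variant to use depends on whether the earliest or the latest occurrence is the one worth keeping.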

The same approach removes duplicate rows from a data frame based on the values in one column. Only the first row for each distinct value is kept:

# Keep the first row for each distinct Sepal.Width value in the iris data set
> data[!duplicated(data$Sepal.Width),]
# A tibble: 23 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>
 1          5.1         3.5          1.4         0.2 setosa
 2          4.9         3            1.4         0.2 setosa
 3          4.7         3.2          1.3         0.2 setosa
 4          4.6         3.1          1.5         0.2 setosa
 5          5           3.6          1.4         0.2 setosa
 6          5.4         3.9          1.7         0.4 setosa
 7          4.6         3.4          1.4         0.3 setosa
 8          4.4         2.9          1.4         0.2 setosa
 9          5.4         3.7          1.5         0.2 setosa
10          5.8         4            1.2         0.2 setosa
# ... with 13 more rows
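
duplicated() also accepts a data frame, in which case whole rows are compared. As a rough sketch, the same indexing idea can be applied to a combination of columns. The choice of Sepal.Width and Species as key columns is an assumption made only for illustration, and the plain iris data frame is used so the snippet runs in base R.

```r
# Hypothetical key: a row counts as a duplicate when BOTH columns
# match an earlier row.
key_cols <- c("Sepal.Width", "Species")

# duplicated() on a data frame compares complete rows of that data frame.
is_dup <- duplicated(iris[, key_cols])

# Keep the first row for each Sepal.Width / Species combination.
dedup <- iris[!is_dup, ]

nrow(dedup)                      # number of distinct combinations
any(duplicated(dedup[, key_cols]))
# [1] FALSE
```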

Remove duplicate rows in a data frame

The distinct() function [dplyr package] keeps only the unique/distinct rows of a data frame. If rows are duplicated, only the first occurrence is preserved. It is a faster counterpart of the base R function unique().

> data %>% distinct()
# A tibble: 149 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>
 1          5.1         3.5          1.4         0.2 setosa
 2          4.9         3            1.4         0.2 setosa
 3          4.7         3.2          1.3         0.2 setosa
 4          4.6         3.1          1.5         0.2 setosa
 5          5           3.6          1.4         0.2 setosa
 6          5.4         3.9          1.7         0.4 setosa
 7          4.6         3.4          1.4         0.3 setosa
 8          5           3.4          1.5         0.2 setosa
 9          4.4         2.9          1.4         0.2 setosa
10          4.9         3.1          1.5         0.1 setosa
# ... with 139 more rows
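
The drop from 150 to 149 rows can be checked in base R as well. The following sketch, using the plain iris data frame and illustrative variable names, locates the repeated row with duplicated() and removes it with unique().

```r
# Positions of rows that repeat an earlier row in full.
dup_rows <- which(duplicated(iris))
length(dup_rows)
# [1] 1

# Inspect the repeated row itself.
iris[dup_rows, ]

# unique() on a data frame drops repeated rows, much like distinct().
deduped <- unique(iris)
nrow(deduped)
# [1] 149
```

Looking at the repeated rows before dropping them is a cheap sanity check that they are true duplicates and not distinct observations that happen to share values.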

Remove duplicate rows based on Sepal.Length, keeping all other columns with .keep_all = TRUE:

> data %>% distinct(Sepal.Length, .keep_all = TRUE)
# A tibble: 35 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>
 1          5.1         3.5          1.4         0.2 setosa
 2          4.9         3            1.4         0.2 setosa
 3          4.7         3.2          1.3         0.2 setosa
 4          4.6         3.1          1.5         0.2 setosa
 5          5           3.6          1.4         0.2 setosa
 6          5.4         3.9          1.7         0.4 setosa
 7          4.4         2.9          1.4         0.2 setosa
 8          4.8         3.4          1.6         0.2 setosa
 9          4.3         3            1.1         0.1 setosa
10          5.8         4            1.2         0.2 setosa
# ... with 25 more rows
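
To make the role of .keep_all clearer, here is a small sketch comparing the call with and without it, and with two columns instead of one. It assumes dplyr is installed; pairing Sepal.Length with Species is just an illustrative choice.

```r
library(dplyr)

data <- as_tibble(iris)

# Without .keep_all, only the named column is returned.
lengths_only <- data %>% distinct(Sepal.Length)
dim(lengths_only)
# [1] 35  1

# With several columns, a row is dropped only when ALL of them
# match an earlier row; .keep_all = TRUE retains the other columns.
by_length_species <- data %>%
  distinct(Sepal.Length, Species, .keep_all = TRUE)
ncol(by_length_species)
# [1] 5
```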

To summarize, we saw how to find and remove duplicate elements from vectors and data frames using the duplicated(), unique() and distinct() functions.

This brings us to the end of this blog. We really appreciate your time.

Hope you liked it.

Do visit our page for more informative blogs on Data Science

Keep Reading! Cheers!

Zigya Academy
