
Java Remove Duplicates From ArrayList

Duplicate list elements. In ArrayLists we often find duplicate elements. How can we remove these? Many approaches exist, and some are better than others.


For long ArrayLists, a hashing mechanism (like HashSet) that records encountered elements is helpful. This avoids repeatedly scanning the list. A custom method can be used.


A method. The removeDuplicates method accepts an ArrayList of Strings. In it, we loop through that argument list. If our HashSet contains the String, we do nothing.

Add: But if our HashSet does not contain the String, we add it to the HashSet (so it is detected next time).

Result: We also build up our result ArrayList. This collection retains the ordering of the original, but not its duplicates.

Based on:

Java 7

Java program that removes duplicates

import java.util.ArrayList;
import java.util.HashSet;

public class Program {

    static ArrayList<String> removeDuplicates(ArrayList<String> list) {

	// Store unique items in result.
	ArrayList<String> result = new ArrayList<>();

	// Record encountered Strings in HashSet.
	HashSet<String> set = new HashSet<>();

	// Loop over argument list.
	for (String item : list) {

	    // If String is not in set, add it to the list and the set.
	    if (!set.contains(item)) {
		result.add(item);
		set.add(item);
	    }
	}
	return result;
    }

    public static void main(String[] args) {

	ArrayList<String> list = new ArrayList<>();
	list.add("dog");
	list.add("cat");
	list.add("dog");
	list.add("dog");
	list.add("cat");
	list.add("bird");

	// Remove duplicates from ArrayList of Strings.
	ArrayList<String> unique = removeDuplicates(list);
	for (String element : unique) {
	    System.out.println(element);
	}
    }
}

Output

dog
cat
bird

Method notes. The above method could easily be adapted to work on Integers or other types. Try changing the String type to Integer. These generic collections handle all classes.
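One way to sketch this adaptation (not from the original article) is to give removeDuplicates a type parameter, so one method serves Strings, Integers, or any type with sensible equals and hashCode:

```java
import java.util.ArrayList;
import java.util.HashSet;

public class GenericDedupe {

    // Works for any element type with correct equals() and hashCode().
    static <T> ArrayList<T> removeDuplicates(ArrayList<T> list) {
	ArrayList<T> result = new ArrayList<>();
	HashSet<T> seen = new HashSet<>();
	for (T item : list) {
	    // HashSet.add returns false if the element was already present.
	    if (seen.add(item)) {
		result.add(item);
	    }
	}
	return result;
    }

    public static void main(String[] args) {
	ArrayList<Integer> numbers = new ArrayList<>();
	numbers.add(1);
	numbers.add(2);
	numbers.add(1);
	numbers.add(3);

	// Prints [1, 2, 3].
	System.out.println(removeDuplicates(numbers));
    }
}
```

Note that this version uses the boolean return value of HashSet.add instead of a separate contains call, which avoids hashing each element twice.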

Numbers

HashSet constructor. This approach removes all duplicates, but the ordering of the elements is lost. We pass an ArrayList to the HashSet constructor, and then convert back to an ArrayList.

Note: This code is simpler and shorter. It also performs well: building the HashSet takes time proportional to the list size.

Reordering: Using this direct and simple approach causes the elements to be reordered. The HashSet will not retain ordering.

Java program that uses HashSet constructor

import java.util.ArrayList;
import java.util.HashSet;

public class Program {
    public static void main(String[] args) {

	// Create ArrayList with one duplicate element.
	ArrayList<String> values = new ArrayList<>();
	values.add("cat");
	values.add("dog");
	values.add("cat");
	values.add("bird");

	// Create HashSet from ArrayList.
	HashSet<String> set = new HashSet<>(values);

	// Create ArrayList from the set.
	ArrayList<String> result = new ArrayList<>(set);

	// The result.
	System.out.println(result.toString());
    }
}

Output

[cat, bird, dog]
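If ordering matters but the one-line constructor style is still wanted, a LinkedHashSet can stand in for HashSet. This is a variation on the example above, not part of the original article:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;

public class OrderedDedupe {
    public static void main(String[] args) {

	ArrayList<String> values = new ArrayList<>();
	values.add("cat");
	values.add("dog");
	values.add("cat");
	values.add("bird");

	// LinkedHashSet removes duplicates but keeps insertion order.
	ArrayList<String> result = new ArrayList<>(new LinkedHashSet<>(values));

	// Prints [cat, dog, bird].
	System.out.println(result);
    }
}
```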

Has duplicates. With a nested for-loop, we can check an ArrayList (or other list) for duplicate elements. We only search following (not preceding) elements for an exact match.

Tip: If an element and a following element are equal, the ArrayList has a duplicate. Otherwise, no duplicates exist.

Tip 2: This loop-based method is sometimes faster than using a set like HashSet on small lists. But its nested loops are quadratic, so performance is poor on large collections.

Thus: Calling hasDuplicates on small lists where no duplicates are likely is often a good optimization.

Java program that checks for duplicates

import java.util.ArrayList;

public class Program {

    public static boolean hasDuplicates(ArrayList<String> list) {
	for (int i = 0; i < list.size(); i++) {
	    // Loop over all following elements.
	    for (int x = i + 1; x < list.size(); x++) {
		// If two elements are equal, there is a duplicate.
		if (list.get(i).equals(list.get(x))) {
		    return true;
		}
	    }
	}
	// No duplicates found.
	return false;
    }

    public static void main(String[] args) {

	// This ArrayList has a duplicate.
	ArrayList<String> values = new ArrayList<>();
	values.add("green");
	values.add("blue");
	values.add("red");
	values.add("pink");
	values.add("pink");

	// See if duplicates exist.
	if (hasDuplicates(values)) {
	    System.out.println(true);
	}
	// Remove all pink strings.
	values.removeIf(element -> element.equals("pink"));
	// No more duplicates.
	System.out.println(hasDuplicates(values));
    }
}

Output

true
false
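For larger lists, the quadratic test can be replaced with a set-based one. A sketch (this variant is my own, not from the article): HashSet.add returns false when the element is already present, so the first false signals a duplicate.

```java
import java.util.ArrayList;
import java.util.HashSet;

public class DuplicateTest {

    // Linear-time duplicate test using a HashSet.
    static boolean hasDuplicates(ArrayList<String> list) {
	HashSet<String> seen = new HashSet<>();
	for (String item : list) {
	    // add() returns false if the element was already in the set.
	    if (!seen.add(item)) {
		return true;
	    }
	}
	return false;
    }

    public static void main(String[] args) {
	ArrayList<String> values = new ArrayList<>();
	values.add("pink");
	values.add("blue");
	values.add("pink");

	// Prints true.
	System.out.println(hasDuplicates(values));
    }
}
```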

Testing versus removal. Here is an optimization. If an ArrayList rarely contains duplicates, and has few elements, we can avoid modifying it. We can just test for duplicates.

Times: In this benchmark, we find that on a four-element ArrayList, calling hasDuplicates is over ten times faster than removeDuplicates.

Note: This optimization does not always work. If lists usually have duplicates, it will not help performance.

And: If lists have many thousands of elements, the nested for-loops will start to become slow. A hashing approach is better at this scale.

Java program that times duplicate test, duplicate removal

import java.util.*;

public class Program {

    static boolean hasDuplicates(ArrayList<String> list) {
	// See if duplicates exist (see above example).
	for (int i = 0; i < list.size(); i++) {
	    for (int x = i + 1; x < list.size(); x++) {
		if (list.get(i).equals(list.get(x))) {
		    return true;
		}
	    }
	}
	return false;
    }

    static ArrayList<String> removeDuplicates(ArrayList<String> list) {
	// Remove Duplicates: place them in new list (see above example).
	ArrayList<String> result = new ArrayList<>();
	HashSet<String> set = new HashSet<>();
	for (String item : list) {
	    if (!set.contains(item)) {
		result.add(item);
		set.add(item);
	    }
	}
	return result;
    }

    public static void main(String[] args) {

	ArrayList<String> elements = new ArrayList<>();
	elements.add("one");
	elements.add("two");
	elements.add("three");
	elements.add("four");

	long t1 = System.currentTimeMillis();

	// Version 1: test to see if duplicates exist.
	for (int i = 0; i < 10000000; i++) {
	    if (hasDuplicates(elements)) {
		System.out.println(false);
	    }
	}

	long t2 = System.currentTimeMillis();

	// Version 2: process list even if no duplicates.
	for (int i = 0; i < 10000000; i++) {

	    ArrayList<String> copy = removeDuplicates(elements);
	    if (copy == null) {
		System.out.println(false);
	    }
	}

	long t3 = System.currentTimeMillis();

	// ... Benchmark results.
	System.out.println(t2 - t1);
	System.out.println(t3 - t2);
    }
}

Results

 128 ms:    hasDuplicates()
1409 ms:    removeDuplicates()

Performance notes. For small lists, simply looping over the elements already added, and skipping duplicates, is faster. But this quadratic approach would cause problems with large collections.


Another optimization. Another strategy is to not add duplicates in the first place. A custom method could prevent duplicates from being added.
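One way such a method might look (the class and the addIfAbsent name here are hypothetical, for illustration only): pair the ArrayList with a companion HashSet and consult the set before each add.

```java
import java.util.ArrayList;
import java.util.HashSet;

public class NoDuplicatesList {
    private final ArrayList<String> items = new ArrayList<>();
    private final HashSet<String> seen = new HashSet<>();

    // Adds the element only if it was not added before.
    // Returns true if the list changed.
    boolean addIfAbsent(String item) {
	if (seen.add(item)) {
	    items.add(item);
	    return true;
	}
	return false;
    }

    ArrayList<String> items() {
	return items;
    }

    public static void main(String[] args) {
	NoDuplicatesList list = new NoDuplicatesList();
	list.addIfAbsent("dog");
	list.addIfAbsent("cat");
	list.addIfAbsent("dog");

	// Prints [dog, cat].
	System.out.println(list.items());
    }
}
```

The extra HashSet costs memory, but each add stays constant-time and no later cleanup pass is needed.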


Sets versus loops. With set-based methods, we can remove duplicate Strings from an ArrayList. Other approaches, which involve nested loops to test elements, are superior in some cases.


For duplicate removal ("dedupe") there are many possible options. This is a good thing: it means we can select the one that is best for our specific use case, our list and data.
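Although the examples above target Java 7, one more option worth noting on Java 8 and later is the Stream API: distinct() keeps the first occurrence of each element and preserves encounter order. A brief sketch:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class StreamDedupe {
    public static void main(String[] args) {
	List<String> values = Arrays.asList("dog", "cat", "dog", "bird");

	// distinct() removes duplicates, keeping the first occurrence of each.
	ArrayList<String> unique = values.stream()
	    .distinct()
	    .collect(Collectors.toCollection(ArrayList::new));

	// Prints [dog, cat, bird].
	System.out.println(unique);
    }
}
```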