Set: The set, named seen, tracks which elements have already been encountered. Sets contain only unique elements.
So: We append only elements that have not been seen yet to the output list. This removes all duplicates while preserving the original ordering.
Python program that removes duplicates from list
def remove_duplicates(values):
    output = []
    seen = set()
    for value in values:
        # If value has not been encountered yet,
        # ... add it to both list and set.
        if value not in seen:
            output.append(value)
            seen.add(value)
    return output
# Remove duplicates from this list.
values = [5, 5, 1, 1, 2, 3, 4, 4, 5]
result = remove_duplicates(values)
print(result)
Output
[5, 1, 2, 3, 4]
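Dict: As an aside not in the programs above, dict.fromkeys gives an order-preserving one-liner, because dictionary keys are unique and Python 3.7+ dictionaries preserve insertion order. A minimal sketch, assuming Python 3.7 or later:
Python program that removes duplicates with dict.fromkeys (sketch)
# Sketch: dict keys are unique and (in Python 3.7+) keep insertion order.
values = [5, 5, 1, 1, 2, 3, 4, 4, 5]
result = list(dict.fromkeys(values))
print(result)
Output
[5, 1, 2, 3, 4]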
And: A second approach converts the list into a set, and sets must have only unique elements. So this removes duplicates.
Then: We convert the set back into a list. In these two conversions the original ordering is lost: sets are unordered, so the elements can come back in any order. (With small integers the result often looks sorted, but Python does not guarantee this.)
Python program that removes duplicates, reorders
# Our input list.
values = [5, 5, 1, 1, 2, 3, 4, 4, 5]
# Convert to a set and back into a list.
# ... Avoid naming the variable "set", which would shadow the built-in.
unique_set = set(values)
result = list(unique_set)
print(result)
Output
[1, 2, 3, 4, 5]
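Sorted: If a sorted result is actually desired, sorted() can be applied to the set; unlike raw set iteration, its ascending order is guaranteed. A minimal sketch along those lines:
Python program that removes duplicates, sorts (sketch)
# Sketch: sorted() returns a new list in guaranteed ascending order.
values = [5, 5, 1, 1, 2, 3, 4, 4, 5]
result = sorted(set(values))
print(result)
Output
[1, 2, 3, 4, 5]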
Note: The nested-loop approach below could be a performance disaster on extremely large collections. But for small lists, it is effective.
Note 2: For large collections, using a set or dictionary to check for duplicates is a better option.
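Sketch: The set-based check suggested in note 2 can be written as a single pass. This version is a sketch based on that note, not part of the original programs.
Python program that implements has_duplicates with a set (sketch)
def has_duplicates_set(values):
    # Track elements already seen; a repeat means a duplicate.
    seen = set()
    for value in values:
        if value in seen:
            return True
        seen.add(value)
    return False
# Test the sketch.
print(has_duplicates_set([10, 20, 30, 40]))
print(has_duplicates_set([1, 2, 3, 1, 2]))
Output
False
True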
Python program that implements has_duplicates
def has_duplicates(values):
    # For each element, check all following elements for a duplicate.
    for i in range(0, len(values)):
        for x in range(i + 1, len(values)):
            if values[i] == values[x]:
                return True
    return False
# Test the has_duplicates method.
print(has_duplicates([10, 20, 30, 40]))
print(has_duplicates([1, 2, 3, 1, 2]))
print(has_duplicates([40, 30, 20, 40]))
print(has_duplicates(["cat", "dog", "bird", "dog"]))
print(has_duplicates([None, 0, 1, 2]))
Output
False
True
True
True
False
Version 1: This version of the code checks to see if duplicates exist in the list before removing them.
Version 2: Here we just remove duplicates immediately, without checking to see if any duplicates exist.
Result: For a six-element list with no duplicates, using nested for-loops to check was faster than using the set built-in.
However: This test assumes no duplicates are ever found. This is a worthwhile optimization if duplicates are rare and lists are small.
Python program that times has_duplicates method, set
import time

def has_duplicates(values):
    # Same as above example.
    for i in range(0, len(values)):
        for x in range(i + 1, len(values)):
            if values[i] == values[x]:
                return True
    return False

# Contains no duplicates.
elements = [100, 200, 300, 400, 500, 600]
print(time.time())

# Version 1: test before using set.
for i in range(0, 10000000):
    if has_duplicates(elements):
        unique = list(set(elements))
    else:
        unique = elements
print(time.time())

# Version 2: always use set.
for i in range(0, 10000000):
    unique = list(set(elements))
print(time.time())
Output
1420055657.667
1420055657.838 has_duplicates: 0.171 s [PyPy]
1420055658.061 set, list: 0.223 s
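Timeit: For steadier measurements than subtracting time.time() calls, the standard timeit module can repeat a statement and report the total elapsed seconds. A minimal sketch (timings vary by machine, so no output is shown):
Python program that times set conversion with timeit (sketch)
import timeit
elements = [100, 200, 300, 400, 500, 600]
# Time 100,000 set-and-back conversions; prints total seconds.
print(timeit.timeit(lambda: list(set(elements)), number=100000))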