Contains Duplicate — The Warm-Up That Reveals Your Data Structure Instincts [LC 217]

Sanjeev Sharma
16 min read


Problem Statement

Given an integer array nums, return true if any value appears at least twice in the array, and false if every element is distinct.

Example 1:

Input:  nums = [1, 2, 3, 1]
Output: true
Explanation: 1 appears at index 0 and index 3.

Example 2:

Input:  nums = [1, 2, 3, 4]
Output: false
Explanation: All elements are distinct.

Example 3:

Input:  nums = [1, 1, 1, 3, 3, 4, 3, 2, 4, 2]
Output: true

Constraints:

  • 1 <= nums.length <= 10^5
  • -10^9 <= nums[i] <= 10^9

Why This Problem Matters

On the surface, Contains Duplicate looks like a throwaway warmup. One loop, check for repeats — done in 30 seconds, right?

That's exactly the trap.

Experienced interviewers at Amazon, Meta, and Google use LC 217 as a diagnostic probe, not as a skills test in isolation. The problem is simple enough that nearly every candidate can write a solution. What separates strong candidates is how they think about it:

  • Do they jump straight to brute force without pausing?
  • Do they immediately reach for a HashSet — and can they explain why?
  • Do they proactively discuss what happens when the interviewer says: "Now pretend you can't use extra memory"?
  • Do they notice that sorting modifies the input array — and ask whether that matters?

The problem is a data structure reflex test. A senior engineer who has internalized hash-based lookups will reach for a set before finishing reading the problem. An interviewer watching that happen in real time knows they are looking at someone with strong fundamentals.

It is also a launchpad. In a 45-minute interview, a good interviewer will spend only 5-7 minutes on LC 217 and then pivot into LC 219 (duplicate within k distance) or LC 220 (duplicate within value range t). Your ability to extend your LC 217 solution becomes the real interview.

The bottom line: never phone in this problem. Solve it cleanly, articulate the trade-offs clearly, and show that you are already thinking about what comes next.


Three Approaches — Built from Scratch

Approach 1: Brute Force — O(n²) Time, O(1) Space

The most natural first instinct: check every pair of elements.

For each element at index i:
    For each element at index j (where j > i):
        If nums[i] == nums[j]: return True
Return False

Why it works: We exhaustively compare every possible pair. If any two match, we found a duplicate.

Why it fails at scale: For an array of 100,000 elements, this is ~5 billion comparisons. Completely unusable in practice, and an interviewer will expect you to name this limitation immediately.

The one subtle bug beginners make: Starting the inner loop at j = i instead of j = i + 1. That compares an element with itself and produces a false positive for any non-empty array. The inner loop must start at i + 1.

When to mention it: Bring it up yourself as your baseline — "The naive approach is O(n²) with nested loops, O(1) space. I can do better." This shows you can identify suboptimal paths without needing to be told.


Approach 2: Sort First — O(n log n) Time, O(1) Space

A smarter observation: if any duplicate exists, sorting will place it adjacent to its twin.

Sort nums in place
For i from 1 to len(nums)-1:
    If nums[i] == nums[i-1]: return True
Return False

Why it works: After sorting, all equal elements cluster together. A single linear pass through adjacent pairs is sufficient to detect any duplicate.

Time complexity: O(n log n) dominated by the sort. The subsequent scan is O(n) and doesn't change the overall complexity.

Space complexity: O(1) extra if the sort runs in place (Python's list.sort() and JavaScript's Array.prototype.sort() both sort in place). Strictly speaking, Timsort, the algorithm behind both, can use up to O(n) auxiliary space, and recursive sorts need O(log n) stack space — worth mentioning in a discussion.

The critical flaw: it mutates the input.

This is what interviewers want you to notice without being prompted. If nums is passed by reference and the caller depends on the original order — say it is a financial ledger where order corresponds to timestamps — you have silently corrupted the input. In production code, that is a bug that may not surface until much later.

If you want to preserve the original and still use sorting, you must copy the array first: sorted_nums = sorted(nums) in Python or [...nums].sort((a, b) => a - b) in JavaScript (the numeric comparator matters; JavaScript's default sort compares elements as strings). That copy costs O(n) space, which eliminates the space advantage of sorting.
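To make the mutation point concrete, here is a minimal Python sketch contrasting the copying and in-place forms:

```python
nums = [3, 1, 2]

copy_sorted = sorted(nums)   # sorted() returns a new list; the caller's data is safe
print(nums)                  # [3, 1, 2] (original order preserved)
print(copy_sorted)           # [1, 2, 3]

nums.sort()                  # list.sort() reorders the list in place
print(nums)                  # [1, 2, 3] (original order is gone)
```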

This nuance is what separates candidates who have thought carefully about real-world implications from those who just pattern-match to a solution.


Approach 3: HashSet — O(n) Time, O(n) Space

This is the optimal approach and the one every interviewer expects you to lead with (or arrive at quickly).

Core idea: Use a set as a memory of "numbers we have already seen." As we walk through the array, we check this memory before processing each new number. If the number is already in memory, we found a duplicate immediately and can stop.

Initialize an empty set: seen = set()
For each number in nums:
    If number is already in seen: return True
    Add number to seen
Return False

Why a set and not an array or dictionary?

Because sets are backed by a hash table. Lookup (num in seen) and insertion (seen.add(num)) are both O(1) average case — constant time regardless of how many elements are already stored. A plain array would require O(n) lookup, giving us the same O(n²) total as brute force. A dictionary would also work, but it stores values we never read; a set expresses pure membership.

Time complexity: O(n) — we visit each element at most once.

Space complexity: O(n) — in the worst case (all distinct elements), we store every number in the set before returning false.

The honest trade-off: We are buying time with space. We go from O(n log n) time + O(1) space (sorting) to O(n) time + O(n) space (HashSet). Which is better depends on your system's constraints — and articulating this is what the interviewer wants to hear.


The Trade-off Discussion — What Interviewers Actually Want You to Say

When you present both the sorting and HashSet approaches, pause and say something like:

"These are fundamentally different trade-offs. HashSet is faster — O(n) vs O(n log n) — but it costs O(n) extra space. Sorting is more memory-efficient but modifies the array and is slightly slower. In a memory-constrained environment like an embedded system or when processing very large datasets, I would lean toward sorting. In most backend services where memory is cheap and latency matters, HashSet is the right call."

Then ask: "Does the input need to remain unchanged?" This single question shows production-level thinking.

Here are the key points to hit spontaneously (not when prompted):

Question                                Sorting Answer           HashSet Answer
Time complexity?                        O(n log n)               O(n)
Space complexity?                       O(1) in-place            O(n)
Mutates input?                          Yes                      No
Early exit on first duplicate?          No — must sort first     Yes — stops immediately
Works with streams / real-time data?    No — needs full array    Yes — can process incrementally

The streaming data point is a sleeper insight. If you are reading numbers from a network socket and want to detect the first duplicate without buffering the entire stream for sorting — HashSet is the only viable approach.
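As a sketch of that streaming scenario (the function name and sample data are illustrative, not from the original problem):

```python
def first_duplicate(stream):
    """Return the first value seen twice in any iterable, or None.

    Works on generators and sockets alike: nothing is buffered except
    the set of values seen so far, and we stop at the first repeat.
    """
    seen = set()
    for value in stream:
        if value in seen:
            return value
        seen.add(value)
    return None

print(first_duplicate(iter([4, 7, 2, 7, 9])))  # 7, without ever reading the 9
```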


Visual Dry Run — HashSet on [1, 2, 3, 1]

Let us trace through the optimal solution step by step.

nums = [1, 2, 3, 1]
seen = {}

Step 1: num = 1
  Is 1 in seen? No
  Add 1 to seen
  seen = {1}

Step 2: num = 2
  Is 2 in seen? No
  Add 2 to seen
  seen = {1, 2}

Step 3: num = 3
  Is 3 in seen? No
  Add 3 to seen
  seen = {1, 2, 3}

Step 4: num = 1
  Is 1 in seen? YES  <-- duplicate found!
  Return True immediately

We never even finish the loop. This early-exit behavior is an advantage of the HashSet approach over sorting — you can return as soon as you find the first duplicate rather than waiting for the entire sort to complete.

For an array like [1, 1, 2, 3, 4, 5, ..., 100000], this means we return on the second element instead of sorting all 100,000 elements first. In the best case, HashSet is O(1). Sorting is always O(n log n) regardless.


Common Mistakes

1. Comparing an element with itself in brute force. Starting the inner loop at j = i instead of j = i + 1. This causes every non-empty array to return true because nums[0] == nums[0] is always true.

2. Using a dictionary (map) when a set is sufficient. Some candidates build seen = {} and store seen[num] = True. This works but signals weak data structure precision. A set is the right tool when you only need membership testing, not key-value storage. Use the exact right tool.

3. Forgetting that sorting mutates the input. Solving the problem correctly in isolation but missing the real-world implication. In an interview, always ask: "Is it acceptable to modify the input array?" This question alone impresses senior interviewers because it demonstrates systems thinking.

4. Not discussing early exit. Presenting the HashSet solution but describing it as "O(n) in all cases." The best case is O(1) — if the first two elements are duplicates, we return immediately. Always give the tightest bound you can.

5. Reaching for the one-liner without explanation. return len(nums) != len(set(nums)) works in Python, but presenting only this in an interview gives the interviewer nothing to evaluate. The one-liner builds the entire set before comparing sizes — it cannot exit early. Show the explicit loop first, then mention the one-liner as a concise alternative with its trade-off (no early exit).


Solutions

Python

from typing import List

class Solution:

    # Approach 1: Brute Force — O(n^2) time, O(1) space
    # Check every pair. Start inner loop at i+1 to avoid self-comparison.
    def containsDuplicate_brute(self, nums: List[int]) -> bool:
        n = len(nums)
        for i in range(n):
            for j in range(i + 1, n):  # j starts at i+1, NOT i
                if nums[i] == nums[j]:
                    return True
        return False

    # Approach 2: Sort — O(n log n) time, O(n) space (copies to preserve input)
    # After sorting, any duplicate will be adjacent.
    # Note: sorted() creates a copy (safe). nums.sort() mutates in place.
    def containsDuplicate_sort(self, nums: List[int]) -> bool:
        nums_sorted = sorted(nums)          # O(n) space for the copy
        for i in range(1, len(nums_sorted)):
            if nums_sorted[i] == nums_sorted[i - 1]:
                return True
        return False

    # Approach 3: HashSet — O(n) time, O(n) space  <-- OPTIMAL
    # Track each element we have seen. Return True the moment we see a repeat.
    # Early exit: best case O(1) if first two elements are duplicates.
    def containsDuplicate(self, nums: List[int]) -> bool:
        seen = set()
        for num in nums:
            if num in seen:         # O(1) average lookup
                return True
            seen.add(num)           # O(1) average insert
        return False

    # One-liner variant (no early exit — builds full set before comparing)
    # Use this only when brevity is explicitly valued over performance clarity.
    def containsDuplicate_oneliner(self, nums: List[int]) -> bool:
        return len(nums) != len(set(nums))

JavaScript

// Approach 1: Brute Force — O(n^2) time, O(1) space
// Compare every pair. j starts at i+1 to avoid comparing element with itself.
function containsDuplicate_brute(nums) {
    const n = nums.length;
    for (let i = 0; i < n; i++) {
        for (let j = i + 1; j < n; j++) {   // j = i+1, NOT j = i
            if (nums[i] === nums[j]) return true;
        }
    }
    return false;
}

// Approach 2: Sort — O(n log n) time, O(n) space (spread to avoid mutation)
// Sort a copy so we don't silently corrupt the caller's data.
// Duplicates will be adjacent after sorting — one linear scan finds them.
function containsDuplicate_sort(nums) {
    const sorted = [...nums].sort((a, b) => a - b);  // copy, then sort numerically
    for (let i = 1; i < sorted.length; i++) {
        if (sorted[i] === sorted[i - 1]) return true;
    }
    return false;
}

// Approach 3: HashSet — O(n) time, O(n) space  <-- OPTIMAL
// Set provides O(1) has() and add() operations.
// Return true as soon as we encounter a number already in the set.
function containsDuplicate(nums) {
    const seen = new Set();
    for (const num of nums) {
        if (seen.has(num)) return true;   // duplicate found — exit early
        seen.add(num);
    }
    return false;
}

// One-liner: concise but cannot exit early (builds full Set first)
// const containsDuplicate = (nums) => nums.length !== new Set(nums).size;

Complexity Analysis

Approach       Time           Space                             Mutates Input?   Early Exit?
Brute Force    O(n²)          O(1)                              No               Yes
Sort + Scan    O(n log n)     O(1) in-place / O(n) with copy    Yes / No         No
HashSet        O(n) average   O(n)                              No               Yes

Reasoning:

  • Brute Force: Two nested loops each running up to n iterations produce n*(n-1)/2 comparisons — O(n²). No extra data structures means O(1) space.
  • Sort + Scan: Sorting is O(n log n) in the average and worst case for comparison-based sorts (merge sort, TimSort). The subsequent scan is O(n) but doesn't dominate. Space is O(1) for in-place sort, O(n) if you copy first to preserve original.
  • HashSet: Each element is visited once — O(n). Hash table operations (add and contains) are O(1) amortized average case due to hashing. Worst case per operation is O(n) if many hash collisions occur (pathological inputs), but this is extremely rare in practice. Space is O(n) for the set.

Follow-up Questions

When an interviewer finishes with LC 217, they will almost always pivot to one of these two problems. Knowing them cold turns a good interview into a great one.

LC 219 — Contains Duplicate II (within k indices)

Problem: Return true if there exist indices i and j such that nums[i] == nums[j] and abs(i - j) <= k.

What it adds: Not just "is there a duplicate?" but "is there a duplicate within a distance constraint?"

Approach — Sliding Window HashSet:

Maintain a set representing the last k elements seen. As you move forward, if the current element is already in the window, you found a nearby duplicate. If the window exceeds size k, evict the oldest element.

def containsNearbyDuplicate(nums: List[int], k: int) -> bool:
    window = set()
    for i, num in enumerate(nums):
        if num in window:
            return True           # duplicate within k distance
        window.add(num)
        if len(window) > k:
            window.remove(nums[i - k])   # shrink window from the left
    return False

Time: O(n) — each element enters and leaves the window at most once. Space: O(k) — the window holds at most k+1 elements.

Why interviewers love it: It tests whether you can extend a HashSet solution with a sliding window constraint — a pattern that appears constantly in real problems.

LC 220 — Contains Duplicate III (within value range t and index range k)

Problem: Return true if there exist indices i and j such that abs(i - j) <= k and abs(nums[i] - nums[j]) <= t.

What it adds: Now you need both an index window constraint AND a value proximity constraint simultaneously.

Approach 1 — Sliding Window + Sorted Structure: Maintain a sorted window of the last k elements. For each new element x, query the sorted structure for the smallest value >= x - t. If that value exists and is <= x + t, return true. Use Python's SortedList (from sortedcontainers) or Java's TreeSet.

Time: O(n log k). Each insert/query on the sorted structure is O(log k).
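A sketch of this approach using the standard-library bisect module on a plain list; note that list insertion and removal are O(k) here, so a true SortedList (sortedcontainers) or TreeSet is what brings each window operation down to O(log k):

```python
import bisect

def contains_nearby_almost_duplicate(nums, k, t):
    window = []  # values of the last k elements, kept sorted
    for i, num in enumerate(nums):
        pos = bisect.bisect_left(window, num - t)   # smallest value >= num - t
        if pos < len(window) and window[pos] <= num + t:
            return True
        bisect.insort(window, num)                  # O(k) on a list; O(log k) with SortedList
        if len(window) > k:
            window.remove(nums[i - k])              # evict the element leaving the window
    return False

print(contains_nearby_almost_duplicate([1, 2, 3, 1], 3, 0))       # True
print(contains_nearby_almost_duplicate([1, 5, 9, 1, 5, 9], 2, 3)) # False
```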

Approach 2 — Bucket Sort (O(n) time): Assign each number to a bucket of width t+1. Two numbers in the same bucket are guaranteed to differ by at most t. Two numbers in adjacent buckets may also qualify — check those too. Maintain only k buckets (one per sliding window position). This is O(n) time and O(k) space.

Time: O(n). Space: O(k).
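A sketch of the bucketing scheme in Python (the function name is mine; floor division is used for the bucket index so that negative numbers land in consistent buckets):

```python
def contains_nearby_almost_duplicate_buckets(nums, k, t):
    if k < 0 or t < 0:
        return False
    width = t + 1                 # bucket width: same-bucket values differ by at most t
    buckets = {}                  # bucket index -> the value currently in that bucket
    for i, num in enumerate(nums):
        b = num // width          # floor division keeps negatives consistent
        if b in buckets:
            return True           # same bucket: guaranteed within t
        for neighbor in (b - 1, b + 1):
            if neighbor in buckets and abs(num - buckets[neighbor]) <= t:
                return True       # adjacent bucket: may be within t, so verify
        buckets[b] = num
        if i >= k:                # keep only the last k elements' buckets alive
            del buckets[nums[i - k] // width]
    return False
```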

Why interviewers love it: LC 220 is genuinely hard and tests whether you can design a custom bucketing scheme. Most candidates cannot solve it without hints. If you can walk through bucket assignment with a concrete example, you demonstrate advanced problem-solving under pressure.


This Pattern Solves

The HashSet for O(1) membership lookup is one of the most reusable patterns in DSA. Recognizing when to apply it is a career-level skill. Problems where the exact same instinct applies:

  • Two Sum (LC 1): "Have I seen the complement of this number before?" — HashSet/HashMap.
  • Valid Anagram (LC 242): "Do both strings have the same character frequency?" — frequency HashMap.
  • Intersection of Two Arrays (LC 349): "Which elements appear in both arrays?" — HashSet membership.
  • Happy Number (LC 202): "Have I seen this number in my sequence before?" — HashSet cycle detection.
  • Longest Consecutive Sequence (LC 128): "Is this number the start of a sequence?" — HashSet for O(1) neighbor lookups.
  • Group Anagrams (LC 49): "Which words share the same character fingerprint?" — sorted-string HashMap key.
  • Find the Duplicate Number (LC 287): Variant with space constraint — teaches Floyd's cycle detection.

In every case above, the core realization is the same: trade O(n) space for O(1) lookup time to reduce overall complexity from O(n²) to O(n).
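For instance, Two Sum is the same reflex with a HashMap in place of a HashSet, because we also need to remember where each value was seen (a minimal sketch):

```python
def two_sum(nums, target):
    seen = {}                          # value -> index where it was first seen
    for i, num in enumerate(nums):
        complement = target - num
        if complement in seen:         # "have I seen the complement before?"
            return [seen[complement], i]
        seen[num] = i
    return []

print(two_sum([2, 7, 11, 15], 9))     # [0, 1]
```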


Key Takeaway

Contains Duplicate is not an algorithm problem — it is a data structure selection problem. The moment you see "have I seen this value before?", your hand should reach for a set. The real interview value is in the conversation around the trade-off: HashSet buys you O(n) time at the cost of O(n) space; sorting gives you O(1) space at the cost of O(n log n) time and array mutation. Neither is universally better — the right answer depends on your system's constraints, and being able to articulate that clearly is what turns a passing interview into a strong hire signal.


Next: Problem 4 — Maximum Subarray (Kadane's Algorithm)

Written by Sanjeev Sharma
Full Stack Engineer · E-mopro