Find Duplicate File in System [Medium] — Content HashMap

Sanjeev Sharma

Problem Statement

Given a list of directory-info strings in the format "root/d1/d2 f1.txt(content1) f2.txt(content2)", where each string is a directory path followed by the files it contains and their contents, return all groups of duplicate files, i.e. files whose content is identical. Each group lists the full path of every file in it.

Input: ["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)"]
Output: [["root/a/1.txt","root/c/3.txt"]] (2.txt has unique content, so it is not reported)
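
Before the full solutions, it helps to see how a single input string decomposes: `str.split` separates the directory from the file tokens, and splitting each token on `'('` separates the file name from its content.

```python
# Parse one directory-info string from the sample input.
path = "root/a 1.txt(abcd) 2.txt(efgh)"
root, *files = path.split()          # "root/a", ["1.txt(abcd)", "2.txt(efgh)"]
name, content = files[0].split('(')  # "1.txt", "abcd)"
print(root, name, content[:-1])      # root/a 1.txt abcd
```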

Solutions

Python

from collections import defaultdict

def findDuplicate(paths: list[str]) -> list[list[str]]:
    # Map file content -> list of full paths that share it.
    content_map = defaultdict(list)
    for path in paths:
        parts = path.split()
        root = parts[0]            # directory path
        for f in parts[1:]:        # "name.txt(content)" tokens
            name, content = f.split('(')
            content = content[:-1]  # strip trailing ')'
            content_map[content].append(f"{root}/{name}")
    # Only groups with more than one file are duplicates.
    return [v for v in content_map.values() if len(v) > 1]
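
A quick sanity check on the sample input (the function is repeated so the snippet runs on its own):

```python
from collections import defaultdict

def findDuplicate(paths):
    # Group full file paths by their content.
    content_map = defaultdict(list)
    for path in paths:
        root, *files = path.split()
        for f in files:
            name, content = f.split('(')
            content_map[content[:-1]].append(f"{root}/{name}")
    return [v for v in content_map.values() if len(v) > 1]

paths = ["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)"]
print(findDuplicate(paths))  # [['root/a/1.txt', 'root/c/3.txt']]
```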

C++

#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

vector<vector<string>> findDuplicate(vector<string>& paths) {
    // Map file content -> full paths sharing that content.
    unordered_map<string, vector<string>> mp;
    for (auto& p : paths) {
        stringstream ss(p);
        string root, f;
        ss >> root;            // directory path
        while (ss >> f) {      // "name.txt(content)" tokens
            size_t l = f.find('('), r = f.find(')');
            string content = f.substr(l + 1, r - l - 1);
            mp[content].push_back(root + "/" + f.substr(0, l));
        }
    }
    vector<vector<string>> res;
    for (auto& [k, v] : mp)
        if (v.size() > 1) res.push_back(v);  // keep only duplicate groups
    return res;
}

Complexity

  • Time: O(n×k), where n is the total number of files and k is the average content length
  • Space: O(n×k) to store every content string and path in the map
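
The O(n×k) space comes from using full file contents as map keys. A common variant hashes each content first so keys have constant size; the sketch below (with the hypothetical name findDuplicateHashed) uses SHA-256 digests. For real files, equal digests should strictly be confirmed with a byte-for-byte comparison to rule out collisions.

```python
import hashlib
from collections import defaultdict

def findDuplicateHashed(paths):
    # Group full file paths by a fixed-size digest of their content.
    content_map = defaultdict(list)
    for path in paths:
        root, *files = path.split()
        for f in files:
            name, content = f.split('(')
            digest = hashlib.sha256(content[:-1].encode()).hexdigest()
            content_map[digest].append(f"{root}/{name}")
    return [v for v in content_map.values() if len(v) > 1]
```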


Written by Sanjeev Sharma
Full Stack Engineer · E-mopro