We use cookies (including Google cookies) to personalize ads and analyze traffic. By continuing to use our site, you accept our Privacy Policy.

Find Duplicate File in System

Difficulty: Medium


Problem Description

Given a list paths of directory info, including the directory path, and all the files with contents in this directory, return all the duplicate files in the file system in terms of their paths. A group of duplicate files consists of at least two files that have the same content.

Key Insights

  • Each directory info string contains a path followed by one or more files with their respective contents.
  • Files with the same content are considered duplicates, regardless of their directory.
  • The output should group file paths that have the same content.

Space and Time Complexity

Time Complexity: O(N), where N is the total number of files across all directories. We are processing each file exactly once.

Space Complexity: O(M), where M is the number of unique file contents. We need to store file paths by their contents in a hash table.

Solution

To solve this problem, we can use a hash table (dictionary) to map file contents to their corresponding file paths. We will iterate through each directory string, extract the directory path and each file's content, and populate the dictionary. After processing all paths, we will filter out entries in the dictionary that have more than one file path, which indicates duplicate files.

  1. Parse each directory string to separate the directory path from the files and their contents.
  2. For each file, extract the file name and its content.
  3. Store the file path in a dictionary with the content as the key.
  4. Finally, return the values of the dictionary that contain lists of file paths with duplicates.

Code Solutions

def findDuplicate(paths):
    from collections import defaultdict

    content_map = defaultdict(list)

    for path in paths:
        parts = path.split()
        directory = parts[0]

        for file_info in parts[1:]:
            file_name, content = file_info.split('(')
            content = content[:-1]  # Remove the closing parenthesis
            content_map[content].append(f"{directory}/{file_name}")

    return [paths for paths in content_map.values() if len(paths) > 1]
← Back to All Questions