
Drop Duplicate Rows

Difficulty: Easy


Problem Description

Given a DataFrame of customers with potential duplicate rows based on the email column, write a solution to remove these duplicate rows and keep only the first occurrence.


Key Insights

  • The goal is to identify and remove duplicate entries based on the email field while retaining the first occurrence of each email.
  • The DataFrame has three columns: customer_id, name, and email (a small illustrative example follows this list).
  • Using a method that preserves order is essential since we want to keep the first instance of each duplicate.
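A minimal sketch of that structure, using hypothetical customer data (the specific names and emails below are made up for illustration, not taken from the problem statement):

import pandas as pd

# Hypothetical input: rows with index 0 and 1 share the same email,
# so only the first of the two should be kept.
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'name': ['Ella', 'David', 'Zachary'],
    'email': ['emily@example.com', 'emily@example.com', 'zach@example.com'],
})

# Expected result after dropping duplicates on 'email' (first occurrence kept):
#    customer_id     name              email
# 0            1     Ella  emily@example.com
# 2            3  Zachary   zach@example.com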

Space and Time Complexity

Time Complexity: O(n), where n is the number of rows in the DataFrame; each row is processed once, and checking membership in the set of seen emails is O(1) on average.

Space Complexity: O(n) in the worst case, for the set of seen emails (and the retained rows) when all emails are unique.


Solution

To solve the problem, we can use a set to track emails we have already seen as we iterate through the DataFrame. The algorithm follows these steps:

  1. Initialize an empty set to store seen emails.
  2. Iterate through each row in the DataFrame.
  3. For each email, check if it is in the set of seen emails.
  4. If it is not, add it to the results and mark it as seen.
  5. If it is already seen, skip that row.
  6. Return the filtered DataFrame with the first occurrence of each email.

Code Solutions

import pandas as pd

def drop_duplicate_rows(customers: pd.DataFrame) -> pd.DataFrame:
    seen_emails = set()       # emails encountered so far
    unique_customers = []     # rows to keep (first occurrence of each email)

    for _, row in customers.iterrows():
        email = row['email']
        if email not in seen_emails:
            # First time this email appears: keep the row and mark the email as seen.
            unique_customers.append(row)
            seen_emails.add(email)
        # Otherwise the email is a duplicate and the row is skipped.

    # Rebuild a DataFrame from the kept rows; passing the original columns
    # preserves the column labels even when the result is empty.
    return pd.DataFrame(unique_customers, columns=customers.columns)
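The loop above mirrors the steps listed in the Solution section, but pandas also provides a built-in that performs the same operation. A shorter, equivalent sketch using DataFrame.drop_duplicates (keeping the first occurrence per email):

def drop_duplicate_rows(customers: pd.DataFrame) -> pd.DataFrame:
    # keep='first' retains the first row for each email, matching the manual loop above.
    return customers.drop_duplicates(subset='email', keep='first')

Both versions return the same rows; the built-in avoids the Python-level loop and is typically faster on large DataFrames.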