Problem Description
Given a DataFrame of customers with potential duplicate rows based on the email
column, write a solution to remove these duplicate rows and keep only the first occurrence.
Key Insights
- The goal is to identify and remove duplicate entries based on the email field while retaining the first occurrence of each email.
- The DataFrame will have three columns: customer_id, name, and email.
- Using a method that preserves order is essential since we want to keep the first instance of each duplicate.
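One order-preserving option in pandas is the built-in drop_duplicates method, which keeps the first row for each value when keep="first" is passed. The sketch below assumes the DataFrame is named customers; the function name drop_duplicate_emails and the sample rows are made up for illustration.

```python
import pandas as pd


def drop_duplicate_emails(customers: pd.DataFrame) -> pd.DataFrame:
    # subset="email" compares rows on the email column only;
    # keep="first" retains the earliest row for each email, in original order.
    return customers.drop_duplicates(subset="email", keep="first")


# Hypothetical sample data: rows 1 and 3 share the same email.
customers = pd.DataFrame(
    {
        "customer_id": [1, 2, 3, 4],
        "name": ["Ella", "David", "Zachary", "Alice"],
        "email": [
            "emily@example.com",
            "michael@example.com",
            "emily@example.com",
            "alice@example.com",
        ],
    }
)

result = drop_duplicate_emails(customers)
print(result)
```

With this sample, the row for customer_id 3 is dropped because its email already appeared on the row for customer_id 1.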
Space and Time Complexity
Time Complexity: O(n) on average, where n is the number of rows in the DataFrame. Each row is processed once, and each hash-based membership check takes O(1) on average.
Space Complexity: O(n) in the worst case, when every email is unique and all of them must be stored.
Solution
To solve the problem, we can utilize a data structure (like a set) to track seen emails as we iterate through the DataFrame. The algorithm follows these steps:
- Initialize an empty set to store seen emails.
- Iterate through each row in the DataFrame.
- For each row, check whether its email is in the set of seen emails.
- If it is not, add the row to the results and mark the email as seen.
- If it is already seen, skip that row.
- Return the filtered DataFrame with the first occurrence of each email.
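The steps above can be sketched in Python with pandas roughly as follows. This is a minimal illustration rather than a definitive implementation: the function name drop_duplicate_emails_manual and the sample rows are assumptions, and instead of copying rows into a separate result it records a boolean keep/skip flag per row and filters once at the end.

```python
import pandas as pd


def drop_duplicate_emails_manual(customers: pd.DataFrame) -> pd.DataFrame:
    seen = set()  # emails encountered so far
    keep = []     # one flag per row: True for a first occurrence

    for email in customers["email"]:
        if email in seen:
            keep.append(False)  # duplicate email: skip this row
        else:
            seen.add(email)     # first occurrence: mark as seen and keep the row
            keep.append(True)

    # Align the flags with the original index and filter the rows in one step.
    mask = pd.Series(keep, index=customers.index, dtype=bool)
    return customers.loc[mask]


# Hypothetical sample data: rows 1 and 3 share the same email.
customers = pd.DataFrame(
    {
        "customer_id": [1, 2, 3, 4],
        "name": ["Ella", "David", "Zachary", "Alice"],
        "email": [
            "emily@example.com",
            "michael@example.com",
            "emily@example.com",
            "alice@example.com",
        ],
    }
)

result = drop_duplicate_emails_manual(customers)
print(result)
```

In practice, customers.drop_duplicates(subset="email", keep="first") gives the same rows in a single call; the explicit loop is shown only to make the set-based logic visible.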