
Web Crawler

Number: 1271

Difficulty: Medium

Paid? Yes

Companies: Dropbox, Meta


Problem Description

Given a starting URL and an interface HtmlParser that returns all the URLs on a given webpage, design a web crawler. The crawler should start from startUrl, use HtmlParser.getUrls(url) to fetch all links, and then recursively crawl only those URLs that are under the same hostname as startUrl. Make sure to avoid crawling the same URL more than once, and return all unique URLs discovered.
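
The HtmlParser object is supplied by the test environment rather than implemented by you. A rough Python stub of its assumed shape, useful only for reasoning about the call signature locally:

class HtmlParser:
    def getUrls(self, url: str) -> list[str]:
        # Returns all URLs found on the webpage at `url`.
        # The real implementation is provided by the judge; this stub only
        # documents the assumed interface.
        raise NotImplementedError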


Key Insights

  • Treat the set of webpages as a graph: each URL is a node, and there is a directed edge from URL u to URL v whenever v appears in HtmlParser.getUrls(u).
  • Use either breadth-first search (BFS) or depth-first search (DFS) to traverse the webpages.
  • Extract the hostname from a URL to ensure that only URLs on the same hostname are crawled (see the sketch after this list).
  • Maintain a visited set (or equivalent) to avoid processing the same URL multiple times.
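
One possible hostname helper, as a minimal sketch: it uses Python's standard urllib.parse rather than the manual string splitting shown later, and either approach works for this problem's well-formed "http://..." URLs.

from urllib.parse import urlsplit

def get_hostname(url: str) -> str:
    # urlsplit("http://news.example.org/a/b").netloc == "news.example.org"
    return urlsplit(url).netloc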

Space and Time Complexity

Time Complexity: O(N + E), where N is the number of nodes (URLs) and E is the number of edges (links between URLs) encountered during the crawl.
Space Complexity: O(N) for storing the visited URLs and the queue/stack used for traversal.


Solution

The solution uses a BFS (or DFS) approach to traverse the web graph:

  1. Extract the hostname from startUrl (for instance, by splitting on "://" and then taking the part before the first "/" that follows).
  2. Initialize a visited set to keep track of URLs that have been crawled.
  3. Use a queue to implement the BFS. Enqueue the startUrl.
  4. While the queue is not empty, dequeue a URL and retrieve its outgoing links using HtmlParser.getUrls(url).
  5. For each URL obtained:
    • Extract its hostname and compare it with the hostname of startUrl.
    • If it matches and it has not been visited before, add it to the visited set and enqueue it.
  6. Finally, return the set of visited URLs.

This approach guarantees that each URL is processed only once and that only URLs from the same hostname are crawled.
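
The same traversal can also be written recursively as a depth-first search. A minimal DFS sketch, assuming the same HtmlParser interface and a get_hostname helper like the one above:

class Solution:
    def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> list[str]:
        def get_hostname(url: str) -> str:
            return url.split("://")[1].split("/")[0]

        start_hostname = get_hostname(startUrl)
        visited = set()

        def dfs(url: str) -> None:
            visited.add(url)
            for next_url in htmlParser.getUrls(url):
                # Recurse only into unvisited URLs on the same hostname.
                if next_url not in visited and get_hostname(next_url) == start_hostname:
                    dfs(next_url)

        dfs(startUrl)
        return list(visited)

On very deep link chains this recursive version can hit Python's default recursion limit, which is one reason to prefer the iterative BFS shown in the next section.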


Code Solutions

Python Code Solution

from collections import deque

class Solution:
    def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> list[str]:
        # Helper function to extract the hostname from a URL string.
        def get_hostname(url: str) -> str:
            # Strip the protocol prefix (e.g. "http://" or "https://").
            hostname = url.split("://")[1]
            # The hostname is everything before the first "/" (if one exists).
            return hostname.split("/")[0]
        
        # Extract the hostname from the starting URL.
        start_hostname = get_hostname(startUrl)
        
        # Initialize a set to keep track of visited urls and a queue for BFS.
        visited = set([startUrl])
        queue = deque([startUrl])
        
        # Perform BFS traversal of the web links.
        while queue:
            current_url = queue.popleft()
            # Retrieve all URLs found at the current URL.
            for url in htmlParser.getUrls(current_url):
                # Check if URL is under the same hostname and has not been visited.
                if url not in visited and get_hostname(url) == start_hostname:
                    visited.add(url)
                    queue.append(url)
                    
        # Return all visited URLs as a list.
        return list(visited)
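
As a quick local check, the crawler can be exercised with a hypothetical stub parser (FakeParser and its hard-coded link map below are invented purely for illustration; on the judge, HtmlParser is provided for you):

class FakeParser:
    # Hypothetical stand-in for HtmlParser with a hard-coded link graph.
    def __init__(self, links: dict[str, list[str]]):
        self.links = links

    def getUrls(self, url: str) -> list[str]:
        return self.links.get(url, [])

links = {
    "http://news.example.org/": ["http://news.example.org/a", "http://other.example.com/"],
    "http://news.example.org/a": ["http://news.example.org/"],
}
print(Solution().crawl("http://news.example.org/", FakeParser(links)))
# Prints the two news.example.org URLs (in some order); the other.example.com
# link is skipped because its hostname differs from the start URL's.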