
Web Crawler Multithreaded

Number: 1368

Difficulty: Medium

Paid? Yes

Companies: Apple, Rubrik, Databricks, Microsoft, MongoDB, OpenAI, Dropbox, Meta


Problem Description

Implement a multi-threaded web crawler that starts from a given URL (startUrl) and uses the blocking API HtmlParser.getUrls() to retrieve all URLs from a page. The crawler must only crawl URLs that share the same hostname as startUrl and must avoid processing the same URL more than once. The result should include all crawled URLs in any order.
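
For reference, the parser interface is assumed to look roughly like the stub below; the method name getUrls comes from the problem statement, while the type annotations are illustrative.

from typing import List

class HtmlParser:
    def getUrls(self, url: str) -> List[str]:
        # Blocking call: returns all URLs found on the page at `url`.
        ...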


Key Insights

  • Use multiple threads to perform concurrent crawling, which speeds up the process over a single-threaded approach.
  • Maintain a thread-safe set (visited) to ensure no URL is processed twice.
  • Utilize a work queue to manage URLs pending processing.
  • Filter URLs based on the hostname extracted from startUrl (see the hostname sketch after this list).
  • Apply synchronization (locks or concurrent data structures) to avoid race conditions when multiple threads access shared data.
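
A minimal sketch of the hostname filter mentioned above, using the standard library's urllib.parse (the helper name same_hostname is ours):

from urllib.parse import urlparse

def same_hostname(url, start_url):
    # "http://news.yahoo.com/news" -> netloc "news.yahoo.com"
    return urlparse(url).netloc == urlparse(start_url).netloc

# same_hostname("http://news.yahoo.com/us", "http://news.yahoo.com")  -> True
# same_hostname("http://news.google.com", "http://news.yahoo.com")    -> False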

Space and Time Complexity

Time Complexity: O(V + E), where V is the number of same-host URLs crawled and E is the total number of links returned by getUrls, plus constant-factor overhead from threading. Space Complexity: O(V) for the visited set and the work queue.


Solution

The solution initializes a shared work queue with the startUrl and a thread-safe visited set. Each worker thread dequeues a URL, retrieves its linked URLs via the blocking HtmlParser.getUrls() call, and filters the resulting URLs by comparing their hostnames with the start URL's hostname. New URLs that have not yet been visited are added to both the visited set and the work queue. The use of a thread pool or multiple worker threads along with synchronization ensures that the crawler processes URLs concurrently while avoiding race conditions. The approach follows a BFS-like pattern spread across multiple threads.
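
The same BFS-across-threads idea can also be expressed with concurrent.futures instead of hand-managed threads. The sketch below is an alternative under the same assumptions about HtmlParser.getUrls, not the reference solution; because only the main thread touches the visited set, no explicit lock is needed here.

from concurrent import futures
from urllib.parse import urlparse

def crawl_with_pool(startUrl, htmlParser, max_workers=8):
    hostname = urlparse(startUrl).netloc
    visited = {startUrl}
    with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Each future wraps one blocking getUrls call.
        pending = [executor.submit(htmlParser.getUrls, startUrl)]
        while pending:
            done, not_done = futures.wait(pending, return_when=futures.FIRST_COMPLETED)
            pending = list(not_done)
            for future in done:
                for url in future.result():
                    if urlparse(url).netloc == hostname and url not in visited:
                        visited.add(url)
                        pending.append(executor.submit(htmlParser.getUrls, url))
    return list(visited)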


Code Solutions

import threading
from queue import Queue, Empty
from urllib.parse import urlparse

def crawl(startUrl, htmlParser):
    # Extract hostname of the startUrl
    start_host = urlparse(startUrl).netloc
    # Thread-safe set for visited URLs and a lock for safety
    visited = set()
    visited_lock = threading.Lock()
    
    # Queue to process URLs; initialized with startUrl
    url_queue = Queue()
    url_queue.put(startUrl)
    
    def worker():
        while True:
            try:
                # Timeout lets idle workers exit once the queue stays empty
                current_url = url_queue.get(timeout=1)
            except Empty:
                break
            # Retrieve URLs from the current page using htmlParser
            for next_url in htmlParser.getUrls(current_url):
                # Check if next_url shares the same hostname as startUrl
                if urlparse(next_url).netloc == start_host:
                    with visited_lock:
                        if next_url not in visited:
                            visited.add(next_url)
                            url_queue.put(next_url)
            url_queue.task_done()
    
    # Mark startUrl as visited
    with visited_lock:
        visited.add(startUrl)
    
    # Launch worker threads (8 threads used as example)
    threads = []
    for _ in range(8):
        t = threading.Thread(target=worker)
        t.start()
        threads.append(t)
    
    # Wait until all URLs have been processed
    url_queue.join()
    
    # Ensure all threads terminate
    for t in threads:
        t.join()
    
    return list(visited)
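
A quick way to exercise crawl locally is to stub out the parser with a hard-coded link graph; FakeHtmlParser and the example URLs below are made up for illustration only.

class FakeHtmlParser:
    def __init__(self, graph):
        self.graph = graph  # url -> list of urls found on that page

    def getUrls(self, url):
        return self.graph.get(url, [])

graph = {
    "http://news.example.org/": ["http://news.example.org/a", "http://other.example.com/"],
    "http://news.example.org/a": ["http://news.example.org/b"],
}
print(sorted(crawl("http://news.example.org/", FakeHtmlParser(graph))))
# Prints the three news.example.org URLs; http://other.example.com/ is filtered out.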