Problem Description
Implement a multi-threaded web crawler that starts from a given URL (startUrl) and uses the blocking API HtmlParser.getUrls() to retrieve all URLs from a page. The crawler must only crawl URLs that share the same hostname as startUrl and must avoid processing the same URL more than once. The result should include all crawled URLs in any order.
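For concreteness, the sketches in this write-up assume the blocking API has roughly the following shape; this stub is an assumption for illustration, since the problem only specifies the getUrls() call:

```java
import java.util.List;

// Hypothetical stand-in for the judge-provided blocking API; only the getUrls() call is assumed.
interface HtmlParser {
    // Returns every URL found on the page at `url`; the call may block on network I/O.
    List<String> getUrls(String url);
}
```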
Key Insights
- Use multiple threads so that the blocking getUrls() calls overlap; a single-threaded crawler would spend most of its time waiting on one page at a time.
- Maintain a thread-safe set (visited) to ensure no URL is processed twice.
- Utilize a work queue to manage URLs pending processing.
- Filter discovered URLs by the hostname extracted from startUrl, so the crawler never leaves the starting site (a small helper for this is sketched after the list).
- Apply synchronization (locks or concurrent data structures) to avoid race conditions when multiple threads access shared data.
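A minimal sketch of two of these building blocks, assuming the plain http://hostname/path URL format from the problem; the class name CrawlerHelpers, the hostnameOf() helper, and the example URLs are illustrative, not part of the problem's API. The key idiom is that ConcurrentHashMap.newKeySet() yields a thread-safe set whose add() atomically "claims" a URL, so the duplicate check needs no explicit lock.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class CrawlerHelpers {
    // Extracts "example.org" from "http://example.org/some/path".
    // Assumes the http:// scheme used in the problem statement.
    static String hostnameOf(String url) {
        String rest = url.substring("http://".length());
        int slash = rest.indexOf('/');
        return slash == -1 ? rest : rest.substring(0, slash);
    }

    public static void main(String[] args) {
        // Thread-safe set backed by ConcurrentHashMap; add() returns false if the element
        // is already present, which lets a thread atomically check-and-claim a URL.
        Set<String> visited = ConcurrentHashMap.newKeySet();
        String host = hostnameOf("http://example.org/index");

        String candidate = "http://example.org/about";
        if (hostnameOf(candidate).equals(host) && visited.add(candidate)) {
            System.out.println("claimed " + candidate); // only the first thread to add() gets here
        }
    }
}
```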
Space and Time Complexity
Time Complexity: O(N + E), where N is the number of same-host URLs crawled and E is the total number of links returned by getUrls() across all pages; each URL is crawled exactly once and each returned link is checked against the visited set once. Multithreading does not reduce this total work, but it overlaps the blocking getUrls() calls, so wall-clock time drops roughly in proportion to the number of worker threads, minus synchronization overhead. Space Complexity: O(N) for the visited set and the work queue.
Solution
The solution seeds a shared work queue with startUrl and records it in a thread-safe visited set. Each worker thread dequeues a URL, retrieves its linked URLs via the blocking HtmlParser.getUrls() call, and filters the results by comparing their hostnames with the start URL's hostname. URLs that pass the filter and have not been visited are added to both the visited set and the work queue. A thread pool, or a fixed group of worker threads, together with concurrent data structures lets the crawler process many pages at once without race conditions. One point the queue-based design has to handle is termination: a worker cannot stop merely because the queue is momentarily empty, since another thread may still be crawling a page and about to enqueue more URLs; tracking the number of URLs that are queued or in flight is one common way to detect that the crawl is complete. Overall, the approach is a BFS-like traversal spread across multiple threads.
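Below is a sketch of this scheme, assuming the HtmlParser interface shown earlier and a fixed pool of eight worker threads; the thread count, the 50 ms polling interval, and the inFlight counter used to detect termination are implementation choices made for this sketch rather than requirements of the problem.

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Judge-provided in the real problem; repeated here (shape assumed) so the sketch compiles on its own.
interface HtmlParser { List<String> getUrls(String url); }

class Solution {
    public List<String> crawl(String startUrl, HtmlParser htmlParser) {
        String host = hostnameOf(startUrl);
        Set<String> visited = ConcurrentHashMap.newKeySet();       // thread-safe visited set
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();  // shared work queue
        AtomicInteger inFlight = new AtomicInteger(1);              // URLs queued or being crawled

        visited.add(startUrl);
        queue.add(startUrl);

        Thread[] workers = new Thread[8];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                try {
                    while (true) {
                        // Wait briefly for work; if none arrives and nothing is in flight, we are done.
                        String url = queue.poll(50, TimeUnit.MILLISECONDS);
                        if (url == null) {
                            if (inFlight.get() == 0) return;
                            continue;
                        }
                        for (String next : htmlParser.getUrls(url)) {           // blocking call
                            // Same host, and visited.add() atomically claims the URL for this thread.
                            if (hostnameOf(next).equals(host) && visited.add(next)) {
                                inFlight.incrementAndGet();
                                queue.add(next);
                            }
                        }
                        inFlight.decrementAndGet();                              // this URL is fully processed
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return new ArrayList<>(visited);
    }

    // Same hostname helper as in the earlier sketch.
    private static String hostnameOf(String url) {
        String rest = url.substring("http://".length());
        int slash = rest.indexOf('/');
        return slash == -1 ? rest : rest.substring(0, slash);
    }
}
```

The inFlight counter is incremented before a new URL is enqueued and decremented only after the page that produced it has been fully processed, so it can reach zero only when the queue is empty and no worker is mid-crawl; at that point every worker's timed poll() returns null and the threads exit.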