Problem Description
Given a starting URL and an interface HtmlParser that returns all the URLs on a given webpage, design a web crawler. The crawler should start from startUrl, use HtmlParser.getUrls(url) to fetch all links, and then recursively crawl only those URLs that are under the same hostname as startUrl. Make sure to avoid crawling the same URL more than once, and return all unique URLs discovered.
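For reference, here is a minimal sketch of the interface as this kind of problem typically presents it. The class name and method come from the problem statement above; the type hints and the stub body are assumptions for illustration, since the real implementation is supplied by the judging environment:

```python
from typing import List

class HtmlParser:
    def getUrls(self, url: str) -> List[str]:
        """Return every URL found on the page at `url`."""
        ...  # provided by the environment; not implemented here
```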
Key Insights
- Treat the set of webpages as a graph: each URL is a node, and there is an edge from URL u to URL v whenever v appears in HtmlParser.getUrls(u).
- Use either breadth-first search (BFS) or depth-first search (DFS) to traverse the webpages.
- Extract the hostname from each URL to ensure that only URLs on the same hostname as startUrl are crawled (see the helper sketch after this list).
- Maintain a visited set (or equivalent) to avoid processing the same URL multiple times.
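For example, the hostname can be isolated with ordinary string slicing. This is a minimal sketch; `get_hostname` is a hypothetical helper name, and it assumes every URL starts with a scheme such as `http://`:

```python
def get_hostname(url: str) -> str:
    # Drop the scheme ("http://" or "https://"), then keep
    # everything up to the first "/" in what remains.
    rest = url.split("://", 1)[1]
    return rest.split("/", 1)[0]

assert get_hostname("http://news.example.org/a/b") == "news.example.org"
```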
Space and Time Complexity
Time Complexity: O(N + E), where N is the number of URLs (nodes) and E is the number of links (edges) encountered during the crawl.
Space Complexity: O(N) for the visited set and the queue/stack used for traversal.
Solution
The solution uses a BFS (or DFS) approach to traverse the web graph; a runnable sketch follows the list of steps below:
- Extract the hostname from startUrl (for instance, by splitting on "://" and then the first "/" that follows).
- Initialize a visited set to keep track of URLs that have been crawled.
- Use a queue to implement the BFS. Add startUrl to the visited set and enqueue it (marking it visited up front ensures it appears in the result and is never re-enqueued).
- While the queue is not empty, dequeue a URL and retrieve its outgoing links using HtmlParser.getUrls(url).
- For each URL obtained:
  - Extract its hostname and compare it with the hostname of startUrl.
  - If the hostnames match and the URL has not been visited before, add it to the visited set and enqueue it.
- Finally, return the set of visited URLs.
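Putting the steps together, here is a minimal BFS sketch in Python. It assumes the `HtmlParser` interface stubbed above and the hypothetical `get_hostname` helper from the Key Insights section:

```python
from collections import deque
from typing import List

def crawl(startUrl: str, htmlParser: "HtmlParser") -> List[str]:
    hostname = get_hostname(startUrl)
    visited = {startUrl}        # mark startUrl up front so it is returned too
    queue = deque([startUrl])
    while queue:
        url = queue.popleft()
        for next_url in htmlParser.getUrls(url):
            # Crawl only same-hostname URLs that have not been seen yet.
            if next_url not in visited and get_hostname(next_url) == hostname:
                visited.add(next_url)
                queue.append(next_url)
    return list(visited)
```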
This approach guarantees that each URL is processed only once and that only URLs from the same hostname are crawled.
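An equivalent DFS version differs only in traversal order; a recursive sketch under the same assumptions:

```python
from typing import List

def crawl_dfs(startUrl: str, htmlParser: "HtmlParser") -> List[str]:
    hostname = get_hostname(startUrl)
    visited = {startUrl}

    def dfs(url: str) -> None:
        for next_url in htmlParser.getUrls(url):
            if next_url not in visited and get_hostname(next_url) == hostname:
                visited.add(next_url)
                dfs(next_url)   # recurse into newly discovered pages

    dfs(startUrl)
    return list(visited)
```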