CS6200/IS4200: Information Retrieval
Homework 1
Assigned: Wednesday, September 11
Due: Tuesday, September 24, 11:59 p.m.
Focused Crawling
Implement your own web crawler, with the following specifications:
- Be polite and use a delay of at least one second between
requests to the web server.
- Start from the seed document https://en.wikipedia.org/wiki/Karen_Sparck_Jones,
the Wikipedia article on Karen Spärck Jones, an originator of
information retrieval concepts such as inverse document
frequency.
- Only follow links with the prefix http://en.wikipedia.org/wiki/.
In other words, do not follow links to non-English articles or to
non-Wikipedia pages.
- Do not follow links with a colon (:) in the rest of the URL.
This will help filter out Wikipedia help and administration
pages.
- Do not follow links to the main page,
http://en.wikipedia.org/wiki/Main_Page.
- You may use existing libraries to request documents over HTTP,
including libraries that follow the redirects Wikipedia uses for
alternate article titles.
- Otherwise, you should implement your own code to extract
links, keep track of what you've crawled, and decide what to
crawl next.
- Crawl to at most depth 5 from the seed page: the seed page is
depth 1 (not depth 0), the pages it links to are depth 2, and so
on down to depth 5.
- Wikipedia pages have a lot of links, so you should also stop when you
reach 1000 unique URLs.
- Your crawler should take two arguments: the seed page and an
optional "keyphrase" that must be present, in any combination of
upper and lower case, on any page you crawl (after the seed).
Don't worry about tokenization: just match the characters
ignoring case. If the keyphrase is not present, stop crawling.
This is a very simple version of focused crawling, where the
presence or absence of a single feature is used to determine
whether a document is relevant.
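The rules above can be sketched as a small breadth-first crawler. Everything here is one possible reading of the spec, not the required design: the BFS strategy, the interpretation that a page failing the keyphrase test is simply not expanded, and the use of the standard library's html.parser as a stand-in for your own link extraction are all assumptions.

```python
import time
import urllib.request
from collections import deque
from html.parser import HTMLParser

PREFIX = "http://en.wikipedia.org/wiki/"   # required prefix from the spec
MAIN_PAGE = "http://en.wikipedia.org/wiki/Main_Page"
MAX_DEPTH = 5       # seed page counts as depth 1
MAX_URLS = 1000
DELAY = 1.0         # politeness delay in seconds

def should_follow(url):
    """Apply the assignment's link filters: required prefix, no colon
    after the prefix, and not the main page. (The seed itself need not
    pass this test; the filters apply to links you follow.)"""
    if not url.startswith(PREFIX):
        return False
    if ":" in url[len(PREFIX):]:
        return False
    if url == MAIN_PAGE:
        return False
    return True

class LinkExtractor(HTMLParser):
    """Collect href attribute values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def fetch(url):
    """Download a page; urllib follows redirects automatically."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def crawl(seed, keyphrase=None, fetch=fetch):
    """Breadth-first crawl from seed, honoring the depth and size limits.
    The fetch parameter is injectable so the loop can be tested offline."""
    frontier = deque([(seed, 1)])
    seen = {seed}
    crawled = []
    while frontier and len(crawled) < MAX_URLS:
        url, depth = frontier.popleft()
        html = fetch(url)
        # After the seed, skip pages missing the keyphrase (case-insensitive
        # substring match, no tokenization).
        if keyphrase and url != seed and keyphrase.lower() not in html.lower():
            continue
        crawled.append(url)
        if depth < MAX_DEPTH:
            parser = LinkExtractor()
            parser.feed(html)
            for link in parser.links:
                # Wikipedia pages use relative links such as /wiki/Foo.
                if link.startswith("/wiki/"):
                    link = "http://en.wikipedia.org" + link
                if should_follow(link) and link not in seen:
                    seen.add(link)
                    frontier.append((link, depth + 1))
        time.sleep(DELAY)  # be polite: at least one second between requests
    return crawled
```

Injecting fetch also makes the depth and keyphrase logic easy to check against a handful of hand-built pages before running a real crawl.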
Hand in your code and instructions on how to (compile and) run it
in a README file. In addition, hand in two lists of URLs, each with
at most 1000 entries:
- the pages crawled when the crawler is run with no
keyphrase, in other words all Wikipedia pages meeting the
requirements above to a depth of 5 from the starting seed;
and
- the pages crawled when the keyphrase is
‘retrieval’.
Finally, include in your README the percentage of the pages in the
full crawl that were retrieved by the focused crawler for
‘retrieval’. Keep in mind that this will be a
significant overestimate of the prevalence of Wikipedia
articles on information retrieval.
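Once both URL lists exist, the README percentage is a simple set computation. A minimal sketch; the function name is hypothetical, and it assumes each crawl is available as an iterable of URL strings:

```python
def overlap_percentage(full_urls, focused_urls):
    """Percentage of the full crawl's pages that the focused
    crawler also retrieved."""
    full = set(full_urls)
    focused = set(focused_urls)
    return 100.0 * len(full & focused) / len(full)

# e.g., if the focused crawl found 2 of the full crawl's 4 pages,
# the result is 50.0.
```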