CS6200/IS4200: Information Retrieval
Homework 1
Assigned: Wednesday, September 11
Due: Tuesday, September 24, 11:59 p.m.
Focused Crawling
Implement your own web crawler, with the following specifications:
- Be polite and use a delay of at least one second between
requests to the web server.
- Start from the seed document https://en.wikipedia.org/wiki/Karen_Sparck_Jones,
the Wikipedia article on Karen Spärck Jones, an originator of
information retrieval concepts such as inverse document
frequency.
- Only follow links with the prefix http://en.wikipedia.org/wiki/.
In other words, do not follow links to non-English articles or to
non-Wikipedia pages.
- Do not follow links with a colon (:) in the rest of the URL.
This will help filter out Wikipedia help and administration
pages.
- Do not follow links to the main page,
http://en.wikipedia.org/wiki/Main_Page.
- You may use existing libraries to request documents over HTTP,
including libraries that follow the redirects Wikipedia uses for
alternate article titles.
- Otherwise, you should implement your own code to extract
links, keep track of what you've crawled, and decide what to
crawl next.
- Crawl to at most depth 5 from the seed page: the seed page is
depth 1 (not depth 0), the pages it links to are depth 2, and so
on down to depth 5.
- Wikipedia pages have a lot of links, so you should also stop when you
reach 1000 unique URLs.
- Your crawler should take two arguments: the seed page and an
optional "keyphrase" that must be present, in any combination of
upper and lower case, on any page you crawl (after the seed).
Don't worry about tokenization: just match the characters
ignoring case. If the keyphrase is not present, stop crawling.
This is a very simple version of focused crawling, where the
presence or absence of a single feature is used to determine
whether a document is relevant.
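The rules above can be sketched as a small breadth-first crawler. Everything here is one possible reading of the spec, not the required design: the BFS strategy, the interpretation that a page failing the keyphrase test is simply not expanded, and the use of the standard library's html.parser as a stand-in for your own link extraction are all assumptions.

```python
import time
import urllib.request
from collections import deque
from html.parser import HTMLParser

PREFIX = "http://en.wikipedia.org/wiki/"   # required prefix from the spec
MAIN_PAGE = "http://en.wikipedia.org/wiki/Main_Page"
MAX_DEPTH = 5       # seed page counts as depth 1
MAX_URLS = 1000
DELAY = 1.0         # politeness delay in seconds

def should_follow(url):
    """Apply the assignment's link filters: required prefix, no colon
    after the prefix, and not the main page. (The seed itself need not
    pass this test; the filters apply to links you follow.)"""
    if not url.startswith(PREFIX):
        return False
    if ":" in url[len(PREFIX):]:
        return False
    if url == MAIN_PAGE:
        return False
    return True

class LinkExtractor(HTMLParser):
    """Collect href attribute values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def fetch(url):
    """Download a page; urllib follows redirects automatically."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def crawl(seed, keyphrase=None, fetch=fetch):
    """Breadth-first crawl from seed, honoring the depth and size limits.
    The fetch parameter is injectable so the loop can be tested offline."""
    frontier = deque([(seed, 1)])
    seen = {seed}
    crawled = []
    while frontier and len(crawled) < MAX_URLS:
        url, depth = frontier.popleft()
        html = fetch(url)
        # After the seed, skip pages missing the keyphrase (case-insensitive
        # substring match, no tokenization).
        if keyphrase and url != seed and keyphrase.lower() not in html.lower():
            continue
        crawled.append(url)
        if depth < MAX_DEPTH:
            parser = LinkExtractor()
            parser.feed(html)
            for link in parser.links:
                # Wikipedia pages use relative links such as /wiki/Foo.
                if link.startswith("/wiki/"):
                    link = "http://en.wikipedia.org" + link
                if should_follow(link) and link not in seen:
                    seen.add(link)
                    frontier.append((link, depth + 1))
        time.sleep(DELAY)  # be polite: at least one second between requests
    return crawled
```

Injecting fetch also makes the depth and keyphrase logic easy to check against a handful of hand-built pages before running a real crawl.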
Hand in your code and instructions on how to (compile and) run it
in a README file. In addition, hand in two lists of URLs, each with
at most 1000 entries:
- the pages crawled when the crawler is run with no
keyphrase, in other words all Wikipedia pages meeting the
requirements above to a depth of 5 from the starting seed;
and
- the pages crawled when the keyphrase is
‘retrieval’.
Finally, include in your README the percentage of the pages in the
full crawl that were retrieved by the focused crawler for
‘retrieval’. Keep in mind that this will be a
significant overestimate of the prevalence of Wikipedia
articles on information retrieval.
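Once both URL lists exist, the README percentage is a simple set computation. A minimal sketch; the function name is hypothetical, and it assumes each crawl is available as an iterable of URL strings:

```python
def overlap_percentage(full_urls, focused_urls):
    """Percentage of the full crawl's pages that the focused
    crawler also retrieved."""
    full = set(full_urls)
    focused = set(focused_urls)
    return 100.0 * len(full & focused) / len(full)

# e.g., if the focused crawl found 2 of the full crawl's 4 pages,
# the result is 50.0.
```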