IS1320 Spring 2003 -- List of Projects
Professor Futrelle
Version of 25 March 2003
Each project below will be assigned to a group of about three
students. Group sizes will depend on the final class enrollment,
to ensure that all of the projects are assigned.
Here is a list of possible projects.
- #1. Modern search architectures -- Clusters
- Google uses thousands of PCs to return search results from
three billion documents within a fraction of a second.
This performance is well beyond anything alluded to in our
textbook. This project will dig deeper into current search
technology, including page ranking algorithms, indexing
and more. See, for example,
http://www.brics.dk/~gerth/webalg02/
Some programming of the algorithms is required.
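As a starting point for the programming part, here is a minimal sketch of one well-known page ranking idea: PageRank computed by repeated iteration over a tiny link graph. The class name, the method name, the sample graph and the damping value of 0.85 are all assumptions made for illustration. Nothing here describes how any production search engine is actually implemented.

```java
import java.util.Arrays;

// Hypothetical illustration class; not taken from the course materials.
class PageRankSketch {

    // links[i] lists the pages that page i links to.
    static double[] pageRank(int[][] links, double damping, int iterations) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);              // start from a uniform distribution

        for (int iter = 0; iter < iterations; iter++) {
            double[] next = new double[n];
            double danglingMass = 0.0;           // rank held by pages with no out-links
            for (int page = 0; page < n; page++) {
                if (links[page].length == 0) {
                    danglingMass += rank[page];
                } else {
                    double share = rank[page] / links[page].length;
                    for (int target : links[page]) {
                        next[target] += share;   // pass an equal share along each link
                    }
                }
            }
            for (int page = 0; page < n; page++) {
                // Random jump plus damped link mass; dangling mass is spread evenly.
                next[page] = (1.0 - damping) / n
                        + damping * (next[page] + danglingMass / n);
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Assumed sample graph: 0 -> 1,2   1 -> 2   2 -> 0   3 -> 2
        int[][] links = { {1, 2}, {2}, {0}, {2} };
        System.out.println(Arrays.toString(pageRank(links, 0.85, 50)));
    }
}
```

The ranks always sum to 1, and in the sample graph page 2 ends up with the highest rank because three of the four pages link to it.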
- #2. Copyrights and information access
- Some information is freely available and some has to
be paid for. Much of it, free or not, is controlled by
copyrights. These topics are at the center of huge battles
today. This is a fascinating project topic.
Some things to look at include OAI, ePrints and BioMed Central.
No programming needed, but thorough research and good writeups
are a must.
- #3. Structured information -- XML, databases and more
- The representation of information goes far beyond flat text.
Describing, representing, indexing, searching and using structured
information are major activities these days. This project calls
for some programming related to HTML and XML generation and
parsing, using existing tools for XML Schemas, e.g., JAXB.
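To give a feel for the parsing side, here is a minimal sketch that reads a small XML document and pulls out the text of every `title` element. It uses the DOM parser that ships with Java rather than JAXB, purely to keep the example self-contained. The class name, the method name and the sample document are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration class; not taken from the course materials.
class XmlTitlesSketch {

    // Returns the text of every <title> element, in document order.
    static List<String> titles(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("title");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            // Wrap parser and I/O problems so callers need no checked exceptions.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample document.
        String xml = "<library>"
                + "<book><title>First Title</title></book>"
                + "<book><title>Second Title</title></book>"
                + "</library>";
        System.out.println(titles(xml));
    }
}
```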
- #4. Pattern analysis of language
- This will involve pattern matching to determine the structure
of individual words and word groups. Frequency analyses and indexing
strategies are other aspects that you will explore. Requires programming.
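A frequency analysis can start very small. The sketch below uses a regular expression to pick out words and counts how often each occurs. The definition of a "word" (letters, with optional internal apostrophes or hyphens), the class name and the method name are assumptions made for illustration. A real project would need to think harder about what counts as a word.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not taken from the course materials.
class WordFrequencySketch {

    // Assumed word pattern: letters, optionally joined by ' or - (e.g. "well-fed").
    private static final Pattern WORD =
            Pattern.compile("[A-Za-z]+(?:['-][A-Za-z]+)*");

    // Returns lower-cased words mapped to their counts, sorted alphabetically.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            String word = matcher.group().toLowerCase(Locale.ROOT);
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords("The cat saw the well-fed cat."));
    }
}
```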
- #5. Implementing the boolean and vector models
- This will be an exercise in turning the discussions in Chapter 2
and in other sources into software that implements the boolean
and vector models for simple retrieval from a small collection of
flat-text documents. Your work can be validated against the collection
available through the book's website,
http://www.sims.berkeley.edu/~hearst/irbook/cfc.html
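To make the two models concrete, here is a minimal sketch, assuming documents are plain strings held in memory. The boolean part answers an AND query with a linear scan. The vector part scores two texts by cosine similarity over raw term counts, which is a simplification, since weighting schemes such as tf-idf are omitted. All class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration class; not taken from the course materials.
class RetrievalSketch {

    // Lower-cases the text and splits it on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Boolean model: ids of the documents that contain every query term (AND).
    static Set<Integer> booleanAnd(String[] docs, String query) {
        Set<Integer> hits = new TreeSet<>();
        List<String> terms = tokenize(query);
        for (int id = 0; id < docs.length; id++) {
            if (new HashSet<>(tokenize(docs[id])).containsAll(terms)) hits.add(id);
        }
        return hits;
    }

    // Raw term counts for one text.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : tokenize(text)) tf.merge(t, 1, Integer::sum);
        return tf;
    }

    // Euclidean length of a term-count vector.
    static double norm(Map<String, Integer> vector) {
        double sumOfSquares = 0.0;
        for (int count : vector.values()) sumOfSquares += (double) count * count;
        return Math.sqrt(sumOfSquares);
    }

    // Vector model: cosine of the angle between the two term-count vectors.
    static double cosine(String a, String b) {
        Map<String, Integer> va = termFrequencies(a);
        Map<String, Integer> vb = termFrequencies(b);
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : va.entrySet()) {
            dot += e.getValue() * vb.getOrDefault(e.getKey(), 0);
        }
        double na = norm(va);
        double nb = norm(vb);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    public static void main(String[] args) {
        // Assumed toy collection.
        String[] docs = { "the cat sat", "the dog sat", "cat and cat" };
        System.out.println(booleanAnd(docs, "cat sat"));
        for (String doc : docs) {
            System.out.println(doc + " -> " + cosine("cat sat", doc));
        }
    }
}
```

Note the difference in behavior: the boolean query either matches a document or does not, while the cosine score lets you rank partial matches.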
- #6. User interfaces for IR
- This will involve some GUI programming using Java Swing.
You will build a simple system in which a user can enter queries
using your interactive interface.
Your application can access local files or web documents.
Screen shots of your application in action are required.
- #7. Building a web crawler (HTTP client)
- Using existing toolkits in Java, build a web crawler that
downloads documents and saves ones you've specified as relevant.
Learn about, explain and respect robot exclusion statements on sites.
The focus for this project will be on downloading images.
You might use Java2D to deal with the images, once downloaded.
Your goal is to build a simple emulation of the Google image search
system.
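Respecting robot exclusion means checking a site's robots.txt before fetching a page. The sketch below is a deliberately simplified reading of that file: it honors only `Disallow` lines in groups addressed to all agents (`User-agent: *`) and treats each value as a plain path prefix. `Allow` lines, wildcards and agent-specific groups are ignored, and the class and method names are made up for illustration. Treat it as a starting point, not a complete implementation of the exclusion rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; not taken from the course materials.
class RobotsSketch {

    // Collects Disallow path prefixes from groups that apply to all agents.
    static List<String> disallowedPrefixes(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean appliesToAll = false;
        boolean inAgentLines = false;   // consecutive User-agent lines share one group
        for (String rawLine : robotsTxt.split("\\r?\\n")) {
            String line = rawLine;
            int hash = line.indexOf('#');
            if (hash >= 0) line = line.substring(0, hash);   // strip comments
            line = line.trim();
            int colon = line.indexOf(':');
            if (colon < 0) continue;                         // blank or malformed line
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                if (!inAgentLines) appliesToAll = false;     // a new group starts here
                if (value.equals("*")) appliesToAll = true;
                inAgentLines = true;
            } else {
                inAgentLines = false;
                if (field.equals("disallow") && appliesToAll && !value.isEmpty()) {
                    prefixes.add(value);
                }
            }
        }
        return prefixes;
    }

    // A path is allowed unless it starts with one of the disallowed prefixes.
    static boolean isAllowed(String path, List<String> prefixes) {
        for (String prefix : prefixes) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Assumed sample file contents.
        String robots = "User-agent: *\n"
                + "Disallow: /private/\n"
                + "Disallow: /tmp/ # scratch space\n";
        List<String> prefixes = disallowedPrefixes(robots);
        System.out.println(isAllowed("/images/logo.png", prefixes));
        System.out.println(isAllowed("/private/notes.html", prefixes));
    }
}
```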
- #8. Digital libraries versus "real" libraries
- A research project to investigate the state of the art and
the present and expected future convergence of these
two approaches to libraries.
- #9. Music and video downloads
- Since this is a popular subject, only students with a
serious interest in thoroughly researching and documenting
the legal and technical aspects of these problems need apply.
- #10. Text and data "mining"
- This project will build a simple software system that attempts
to extract particular information from documents (which needn't be
from the web). Do research on the TREC competitions.