IS1320 Spring 2003 -- List of Projects

Professor Futrelle

Version of 25 March 2003

The projects below will be assigned to groups of about three students. The exact group size will depend on the final class enrollment, to ensure that all of the projects are assigned.

Here is a list of possible projects.

#1. Modern search architectures -- Clusters: Google uses thousands of PCs to return search results from three billion documents within a fraction of a second. This performance is well beyond anything alluded to in our textbook. This project will dig deeper into current search technology, including page ranking algorithms, indexing and more. See, for example, http://www.brics.dk/~gerth/webalg02/ Some programming of the algorithms is required.
#2. Copyrights and information access: Some information is freely available, while other information has to be paid for. Much of it, free or not, is controlled by copyright. These topics are at the center of huge battles today. This is a fascinating project topic. Some things to look at include OAI, ePrints and BioMed Central. No programming is needed, but thorough research and good write-ups are a must.
#3. Structured information -- XML, databases and more: The representation of information goes far beyond flat text. Describing, representing, indexing, searching and using structured information are major activities these days. This project calls for some programming related to HTML and XML generation and parsing, using existing tools for XML Schemas, e.g., JAXB.
#4. Pattern analysis of language: This will involve pattern matching to determine the structure of individual words and word groups. Frequency analyses and indexing strategies are other aspects that you will explore. Requires programming.
#5. Implementing the boolean and vector models: This will be an exercise in turning the discussions in Chapter 2, and in other sources, into software that implements the boolean and vector models for simple retrieval from a small collection of flat-text documents. Your work can be validated against the collection available through the book's website, http://www.sims.berkeley.edu/~hearst/irbook/cfc.html
#6. User interfaces for IR: This will involve some GUI programming using Java Swing. You will build a simple system in which a user can enter queries using your interactive interface. Your application can access local files or web documents. Screen shots of your application in action are required.
#7. Building a web crawler (HTTP client): Using existing toolkits in Java, build a web crawler that downloads documents and saves the ones that match relevance criteria you specify. Learn about, explain and respect the robot exclusion rules (robots.txt) that sites publish. The focus for this project will be on downloading images. You might use Java2D to deal with the images once they are downloaded. Your goal is a simple emulation of the Google image search system.
#8. Digital libraries versus "real" libraries: A research project to investigate the state of the art and the present and expected future convergence of these two approaches to libraries.
#9. Music and video downloads: Since this is a popular subject, only students with a serious interest in thoroughly researching and documenting the legal and technical aspects of these problems need apply.
#10. Text and data "mining": This project will build a simple software system that attempts to extract particular information from documents (which needn't be from the web). Do research on the TREC competitions.
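
As a rough illustration of what project #5 asks for, here is a minimal sketch of boolean AND retrieval and tf-idf cosine ranking over a tiny in-memory collection. It is only one possible starting point, not a prescribed design: the class name TinyRetrieval, its method names, the simple tokenizer and the tf * log(N/df) weighting are all assumptions made for this example, and it assumes Java 8 or later.

```java
import java.util.*;

// Hypothetical example class; names and weighting scheme are illustrative assumptions.
final class TinyRetrieval {
    // Inverted index: term -> ids of the documents that contain it.
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    // Term frequencies for each document, indexed by document id.
    private final List<Map<String, Integer>> termCounts = new ArrayList<>();

    // Lowercase the text and split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Add a document and return its id (ids are assigned 0, 1, 2, ...).
    public int add(String text) {
        int id = termCounts.size();
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokenize(text)) {
            counts.merge(t, 1, Integer::sum);
            postings.computeIfAbsent(t, k -> new TreeSet<>()).add(id);
        }
        termCounts.add(counts);
        return id;
    }

    // Boolean model: ids of the documents that contain every given term.
    public SortedSet<Integer> booleanAnd(String... terms) {
        TreeSet<Integer> result = null;
        for (String term : terms) {
            TreeSet<Integer> docs = postings.getOrDefault(term.toLowerCase(), new TreeSet<>());
            if (result == null) result = new TreeSet<>(docs);
            else result.retainAll(docs);
        }
        return result == null ? new TreeSet<>() : result;
    }

    // tf-idf weight: tf * ln(N / df); zero for terms that occur in no document.
    private double weight(String term, int tf) {
        TreeSet<Integer> docs = postings.get(term);
        if (docs == null) return 0.0;
        return tf * Math.log((double) termCounts.size() / docs.size());
    }

    // Vector model: cosine similarity between a free-text query and one document.
    public double cosine(String query, int docId) {
        Map<String, Integer> q = new HashMap<>();
        for (String t : tokenize(query)) q.merge(t, 1, Integer::sum);
        Map<String, Integer> d = termCounts.get(docId);
        double dot = 0.0, qNorm = 0.0, dNorm = 0.0;
        for (Map.Entry<String, Integer> e : q.entrySet()) {
            double wq = weight(e.getKey(), e.getValue());
            qNorm += wq * wq;
            Integer tf = d.get(e.getKey());
            if (tf != null) dot += wq * weight(e.getKey(), tf);
        }
        for (Map.Entry<String, Integer> e : d.entrySet()) {
            double wd = weight(e.getKey(), e.getValue());
            dNorm += wd * wd;
        }
        if (qNorm == 0.0 || dNorm == 0.0) return 0.0;
        return dot / Math.sqrt(qNorm * dNorm);
    }

    public static void main(String[] args) {
        TinyRetrieval ir = new TinyRetrieval();
        ir.add("the cat sat on the mat");
        ir.add("the dog chased the cat");
        ir.add("dogs and cats are pets");
        System.out.println("cat AND dog: " + ir.booleanAnd("cat", "dog"));
        for (int id = 0; id < 3; id++) {
            System.out.println("cosine(\"dog\", " + id + ") = " + ir.cosine("dog", id));
        }
    }
}
```

A real project would go further than this sketch, for example by adding OR and NOT operators, stop-word removal and stemming, and by checking the rankings against the test collection mentioned in project #5.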
