W3C HCLSig - Subgroup on Text: Unstructured => Structured (T2S)
Draft of Proposal
Started 26 January 2006 by Bob Futrelle
This is to be edited into shape by
Bob Futrelle (Northeastern U.) and Matt Cockerill (BioMed Central).
As of Fri Jan 27 00:17:17 EST 2006, Matt has not seen this.
(He's only now arriving back in London.)
This report is a proposal for action items and deliverables in the T2S area.
Here are the main points, as developed by the group and presented at the
meeting.
Main points in brief
- OBTAIN: Begin with unstructured and semi-structured text that
is available.
- GENERATE: Starting with the text obtained, generate further
structure, e.g., extract entities such as names of proteins, genes,
and compounds.
- CREATE STRUCTURE: Transform to a strong structure, most
prominently, RDF/OWL.
- AGREEMENT: Agreement is needed on the form and semantics
of the target structures to avoid Babel.
- EXPOSE TO USERS: The results must be given to users in
a readable form that allows queries, retrieval, and edits.
- TOOLS: Powering the steps above.
Deliverables and completion times could be:
- Elaborate and publish this T2S roadmap. Week 2.
- Review existing work on all the above topics.
Collect data, tools, and use cases. Month 2.
- Critique the design and functionality of the collected materials
and systems. Month 4.
- Develop best practices document based on critiques. Month 6.
- Design and implement demonstrations of prototype system(s) and
cases, based on best practices. Month 15.
- Submit to broader user community for initial use and reactions.
Month 24.
Elaboration of main points
- OBTAIN: Begin with unstructured and semi-structured text that
is available. The point here is to identify and describe the various
sources of text to be given structure.
This could include fully flat text (PubMed abstracts), as well as
weakly marked up text (HTML, simple XML).
Corpora include full-length papers, e.g., BioMed Central (Open Access).
- GENERATE: Starting with the text obtained, generate further
structure, e.g., extract entities such as names of proteins, genes,
and compounds. This is a mini-industry today, so it will not
be hard to document. The most extreme structure could be full
parsing of text. Semantic markup is generated by some systems.
Multi-dimensional markup is a flexible representation system allowing
multiple views and multiple levels of analysis.
- CREATE STRUCTURE: Transform to a strong structure, most
prominently, RDF/OWL. This has not been a prominent end target
for natural language text. It will require new thinking.
The goals of such structure creation will be the controlling element here.
We must be clear on what should/can be accomplished and why we would
want to do so.
- AGREEMENT: Agreement is needed on the form and semantics
of the target structures to avoid Babel. This must be based on a broad
view of what exists in the community as well as what the community
might be willing to move to.
- EXPOSE TO USERS: The results must be given to users in
a readable form that allows queries, retrieval, and edits.
Both thin and thick clients must be considered.
- TOOLS: Powering the steps above.
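To make the GENERATE and CREATE STRUCTURE steps concrete, here is a
minimal sketch in Python. It is only an illustration under stated
assumptions, not a proposed design: the lexicon, the example.org
namespace, the "mentions" property, and the function names are all
hypothetical, and real entity extraction systems are far more
sophisticated than exact dictionary lookup. The output is plain
N-Triples, one of the standard RDF serializations.

```python
import re

# Hypothetical lexicon mapping a surface form to an entity type.
# Illustrative only; terms are assumed to be safe to embed in an IRI.
LEXICON = {
    "p53": "Protein",
    "BRCA1": "Gene",
    "aspirin": "Compound",
}

BASE = "http://example.org/t2s/"  # placeholder namespace, not a real vocabulary
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def extract_entities(text):
    """GENERATE: return (term, entity type, start offset) for each lexicon hit.

    Matching is exact and case-sensitive, on word boundaries.
    """
    hits = []
    for term, etype in LEXICON.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text):
            hits.append((term, etype, match.start()))
    # Report hits in the order they occur in the text.
    return sorted(hits, key=lambda hit: hit[2])


def to_ntriples(doc_id, hits):
    """CREATE STRUCTURE: serialize the hits as N-Triples lines.

    Each hit yields two triples: one typing the entity and one stating
    that the document mentions it.
    """
    doc = f"<{BASE}doc/{doc_id}>"
    lines = []
    for term, etype, _start in hits:
        entity = f"<{BASE}entity/{term}>"
        lines.append(f"{entity} <{RDF_TYPE}> <{BASE}{etype}> .")
        lines.append(f"{doc} <{BASE}mentions> {entity} .")
    return lines


if __name__ == "__main__":
    sample = "BRCA1 interacts with p53; aspirin was a control."
    for line in to_ntriples("d1", extract_entities(sample)):
        print(line)
```

Even a toy like this shows why the AGREEMENT point matters: the type
names and the "mentions" property above are invented here, and two
groups inventing them independently would produce incompatible output.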
Elaboration of deliverables and completion times
- Elaborate and publish this T2S roadmap. Week 2.
Even at this early stage, we need to add references to existing
concepts and systems to ground the document.
- Review existing work on all the above topics.
Collect data, tools, and use cases. Month 2.
We have the tools to identify existing work.
The information could be brought together in a Wiki and/or website.
- Critique the design and functionality of the collected materials
and systems. Month 4.
This requires active experimentation that brings together
data, systems, and usage scenarios.
The critiques will form a useful report (a deliverable).
- Develop best practices document based on critiques
(a deliverable). Month 6.
This will have creative components, because we may decide that
none of the existing approaches can meet the goals and challenges
that we feel must be met.
- Design and implement demonstrations of prototype system(s) and
cases, based on best practices (deliverables). Month 15.
The emphasis shifts to the subset of implementors.
It could involve as little as showcasing the best systems we find
or as much as developing prototypes.
Given the short time frame, strategies such as modifying or creating plugins
for existing systems would be all we could reasonably hope to do.
- Submit to broader user community for initial use and reactions.
Month 24. Their reactions and our review of their reactions will
allow us to outline possible future directions after this two-year
endpoint.
Comments
Various communities must be kept informed of this work in all of its
aspects and at all stages. Public exposure, including postings,
Wikis, web sites, talks, and papers can all be used.