This technology is useful for compiling structure-shy programs in various programming and query languages. Transformation of Structure-Shy Programs [PEPM 2007]
Northeastern's role in high-performance XPath comes from Adaptive Programming (a part of the Demeter project) where programs are written as adaptive methods of the form:
schema.traverse(whatToTraverse, whereToGo, whatToDo)where whatToTraverse is an object/document conforming to the schema, whereToGo is a labeled graph specifying where to traverse and whatToDo is an object specifying what kind of processing to perform. A subset of the whereToGo specification has been incoporated in XPath. Northeastern has a Patent on Compiling XPath. The patent does not use the XPath terminology because XPath did not yet exist.
The Northeastern technology covers the following XPath axis: child, descendant, descendant-or-self, parent, ancestor, ancestor-or-self, attribute and self.
Although the Adaptive Programming technology has been mainly used for in-store documents/objects, it is also useful for XML streams. It supports skipping over entire chunks of streams that contain irrelevant information.
The topic of high-performance XPath compilation is clearly being looked at by research and industry. Evidence follows below. So Northeastern's technology is valuable in an IP sense. We would like to turn this interest in high performance XPath into a mutually beneficial research collaboration with the goal to improve the performance of XQuery and any of the many technologies that rely on XPath.
Also available directly from: http://msdn2.microsoft.com/en-us/library/bb308960.aspx#xlinqoverview_topic5b
Microsoft works on: Schema-Aware XML Programming
The above document states: LINQ to XML uses a generic tree type: XElement. Hence, XML trees are essentially processed in an untyped manner. This situation can be improved substantially if there is metadata that can be used to generate Common Language Runtime types that contain the knowledge of how the XML is structured and the appropriate simple types. XML Schema can be leveraged for exactly this purpose.
LINQ to XML supports the Descendents operator and this operator can be implemented efficiently in a schema-aware context using Northeastern's technology.
Oracle Use of Adaptive Programming Compilation Technology
Local Copy: Oracle Use of Adaptive Programming Compilation Technology
The high performance part of the Northeastern technology comes from using the schema to speed the XPath processing as described in the Oracle document.
In the Oracle Technology Network Mark Drake writes:
"...understand that in 10gR2 XML DB you will need to use schema based storage to get high performance, and that it doesn't matter whether you use XPath or XQuery, what matters is whether or not the XPath or XQuery operations get re-written correctly so the database can optimize them."
The Northeastern technology rewrites the XPath operations based on the schema and produces a new XPath expression (called the TraversalGraph) that optimizes the data access in the sense that it never accesses data that is irrelevant to the query based on the static schema information. Oracle uses these kinds of optimizations in both XDB and XDK.
Michael Kay has a product that does schema-aware XPath processing. Currently he only optimizes the most common XPath expressions.
@article{DAFZF03, author = {Yanlei Diao and Mehmet Altinel and Michael J. Franklin and Hao Zhang and Peter Fischer}, title = {Path sharing and predicate evaluation for high-performance XML filtering}, journal = {ACM TODS}, volume = {28}, number = {4}, year = {2003}, pages = {467--516}, publisher = {ACM Press}, }
@article{GKT04, author = {Torsten Grust and Maurice Van Keulen and Jens Teubner}, title = "{Accelerating XPath evaluation in any RDBMS}", journal = {ACM TODS}, volume = {29}, number = {1}, year = {2004}, pages = {91--131}, publisher = {ACM Press}, }This paper focuses on schema-less XML data but says about schema awareness: "lack of schema awareness may also give away possible performance improvements. ... schema information may be useful to discover empty pre/post plane regions at query compile time."
An interesting research topic is to combine index-based processing and schema-awareness.
Recent activity:
http://wiki.eclipse.org/PsychoPathXPathProcessor/UserManual
Karl Lieberherr, March, 2010.