Efficiently Publishing Relational Data as XML Documents


Efficiently Publishing Relational Data as XML Documents
IBM Almaden Research Center
Eugene Shekita
Rimon Barr
Michael Carey
Bruce Lindsay
Hamid Pirahesh
Berthold Reinwald
Jayavel Shanmugasundaram
Univ. Wisconsin/IBM Almaden
Joint work with:
Outline
Why? How? Which? Hence
XML Example
<department name=00urchasing00gt; <emplist> <employee> John </employee> <employee> Mary </employee> </emplist> <projlist> <project> Internet </project> <project> Recycling </project> </projlist> </department>
What is the big deal about XML?
Elegantly models complex, hierarchical/ graph-structured data Domain-specific tags (unlike HTML) Standardized!
Fast emerging as dominant standard for data exchange on the WWW
Why Relational Data?
Most business data stored in relational databases Unlikely to change in the near future Scalability, Reliability, Performance, Tools
Need efficient means to publish relational data as XML documents
Usage Scenario
Existing Database System (RDBMS)
Application/User Query to produce XML Documents
XML Result (processed or displayed in browser)
The Internet
Outline
Why? How? Which? Hence
Example Relational Schema
Department
DeptId
DeptName
10
Purchasing
Project
ProjId
DeptId
ProjName
888
10
Internet
795
10
Recycling
Employee
EmpId
DeptId
EmpName
101
10
John
91
10
Mary
Salary
50K
70K
XML Representation
<department name=00urchasing00gt; <emplist> <employee> John </employee> <employee> Mary </employee> </emplist> <projlist> <project> Internet </project> <project> Recycling </project> </projlist> </department>
Main Issues
Relational data is flat, XML is a tagged graph How do we specify translation from flat model to a graph model? A query language to map from relations to XML How do we transform flat representations to tagged nested representations? Efficient implementation strategies
Outline
Why? How? Language? Mechanism? Which? Hence
SQL: Key Ideas
Sub-queries to specify nesting Scalar functions to specify tags/attributes XML Constructors Aggregate functions to group child elements
Example Relational Schema
Department
DeptId
DeptName
10
Purchasing
Project
ProjId
DeptId
ProjName
888
10
Internet
795
10
Recycling
Employee
EmpId
DeptId
EmpName
101
10
John
91
10
Mary
Salary
50K
70K
SQL: Query to publish XML
Select DEPT(d.name,
<subquery to produce emplist>,
<subquery to produce projlist>
)
From Department d
SQL: XML Constructor
Define XML Constructor DEPT(dname: varchar(20),
emplist: xml,
projlist: xml) As ( <department name={dname}>
<emplist> {emplist} </emplist>
<projlist> {projlist} </projlist>
</department> )
SQL: Query to publish XML
Select DEPT(d.name,
<subquery to produce emplist>,
<subquery to produce projlist>
)
From Department d
SQL: Query to publish XML
Select DEPT(d.name,
(Select XMLAGG(EMP(e.name))
From Employee e
Where e.deptno = d.deptno),
(Select XMLAGG(PROJ(p.name))
From Project p
Where p.deptno = d.deptno)
)
From Department d
Query Result
<department name=00urchasing00gt; <emplist> <employee> John </employee> <employee> Mary </employee> </emplist> <projlist> <project> Internet </project> <project> Recycling </project> </projlist> </department>
(<XML Result>)
Outline
Why? How? Language? Mechanism? Which? Hence
Relations to XML: Issues
Two main differences: Nesting (structuring) Tagging Space of alternatives:
Late Tagging
Early Tagging
Late Structuring
Early Structuring
Inside Engine
Inside Engine
Inside Engine
Outside Engine
Outside Engine
Outside Engine
Stored Procedure Approach
Issue queries for sub-structures and tag them Could be a Stored Procedure
DBMS Engine
Department
Employee
Project
Problem: Too many SQL queries!
(10, Purchasing)
(John) (Mary)
(Internet) (Recycling)
Early Tagging, Early Structuring, Outside Engine
Correlated CLOB Approach
Problem: Correlated execution of sub-queries
Select DEPT(d.name,
(Select XMLAGG(EMP(e.name))
From Employee e
Where e.deptno = d.deptno),
(Select XMLAGG(PROJ(p.name))
From Project p
Where p.deptno = d.deptno)
)
From Department d
Early Tagging, Early Structuring, Inside Engine
De-Correlated CLOB Approach
Compute employee lists associated with all departments Compute project lists associated with all departments Join results above on department id
Early Tagging, Early Structuring, Inside Engine
Problem: CLOBs during query processing
Late Tagging, Late Structuring
XML document content produced without structure (in arbitrary order) Tagger enforces order as final step
Relational Query
Processing
Unstructured content
Tagging
Result XML Document
Redundant Relation Approach
How do we represent nested content as relations?
(10, Purchasing)
(10, Internet) (10, Recycling)
(10, John) (10, Mary)
(Purchasing, John, Internet) (Purchasing, John, Recycling) (Purchasing, Mary, Internet) (Purchasing, Mary, Recycling)
Problem: Large relation due to data redundancy!
Late Tagging, Late Structuring
Outer Union Approach
How do we represent nested content as relations?
Problem: Wide tuples (having many columns)
Department
Employee
Project
Department
Employee
Project
Union
(Purchasing, Internet) (Purchasing, Recycling)
(Purchasing, John) (Purchasing, Mary)
(10, Purchasing)
(Purchasing, null, Internet , 0) (Purchasing, null, Recycling, 0) (Purchasing, John, null , 1) (Purchasing, Mary, null , 1)
Late Tagging, Late Structuring
Hash-based Tagger
Results not structured early In arbitrary order Tagger has to enforce order during tagging Hash-based approach Inside/Outside engine tagger
Late Tagging, Late Structuring
Problem: Requires memory for entire document
Late Tagging, Early Structuring
Structured XML document content produced Tagger just adds tags (constant space)
Relational Query
Processing
Structured content
Tagging
Result XML Document
Sorted Outer Union Approach
A
B
C
D
E
F
G
A B n n E n n
A n C n n F n
A n C n n n G
Late Tagging, Early Structuring
A B n D n n n
Sort By: Aid, Bid, Cid
Problem: Only partial ordering required
Constant Space Tagger
Detects changes in XML document hierarchy Adds appropriate opening/closing tags Inside/outside engine
Late Tagging, Late Structuring
Classification of Alternatives
Late Tagging
Early Tagging
Late
Structuring
Early
Structuring
Inside Engine
Inside Engine
De-Correlated CLOB
Outside Engine
Stored Procedure
Inside Engine
Outside Engine
Sorted Outer Union
(Tagging inside)
Sorted Outer Union
(Tagging outside)
Unsorted Outer Union
(Tagging inside)
Unsorted Outer Union
(Tagging outside)
Outside Engine
Correlated CLOB
Outline
Why? How? Language? Mechanism? Which? Hence
Where Does Time Go?
Performance Evaluation Summary
Late Tagging
Early Tagging
Late
Structuring
Early
Structuring
Inside Engine
Inside Engine
De-Correlated CLOB
Outside Engine
Stored Procedure
Inside Engine
Outside Engine
Sorted Outer Union
(Tagging inside)
Sorted Outer Union
(Tagging outside)
Unsorted Outer Union
(Tagging inside)
Unsorted Outer Union
(Tagging outside)
Outside Engine
Correlated CLOB
00/font>
00/font>
Outline
Why? How? Language? Mechanism? Which? Hence
Conclusion
Publishing XML from relational sources important in Internet SQL-based language specification Implementation Alternatives Inside engine >> Outside engine Unsorted Outer Union : sufficient main memory Sorted Outer Union : otherwise (most stable)
Related Work
SilkRoute (WWW 2000) Oracle00 XML extensions (ICDE 2000) Microsoft00 XDR XPERANTO (VLDB 2000 - demo tomorrow)
Performance Evaluation
Query Depth
Query Fan Out
Database Size
Effect of Query Depth
De-Correlated CLOB Approach
Problem: CLOBs during processing
With EmpStruct (deptname, empinfo) AS ( Select d.deptname, XMLAGG(EMP(employee, e.empname)) From department d left join employee e on d.deptid = e.deptid Group By d.deptname)
With ProjStruct (deptname, projinfo) AS ( Select d.deptname, XMLAGG(PROJ(employee, p.projname)) From department d left join project p on d.deptid = e.deptid Group By d.deptname)
Select DEPT(name, d1.empinfo, d2.projinfo)) From EmpStruct d1 full join ProjStruct d2 on d1.deptname = d2.deptname
Early Tagging, Early Structuring, Inside Engine