search

 Wringing a Table Dry: Using CSVZIP to Compress a Relation to its ...

0 comments

file time: 2008-03-11

file siez:357.5KB

filetype:ppt

Click Here To Download...

>   

IBM Almaden Research Center     

漏 2006 IBM Corporation 

Wringing a Table Dry: Using CSVZIP to Compress a Relation to its Entropy  

Vijayshankar Raman & Garret Swart

 

IBM Almaden Research Center 

漏 2006 IBM Corporation      

Oxide is cheap, so why compress? 

Make better use of memory Increase capacity of in memory database Increase effective cache size of on disk database Make better use of bandwidth I/O and memory bandwidth are expensive to scale ALU operations are cheap and getting cheaper Minimize storage and replication costs  

IBM Almaden Research Center 

漏 2006 IBM Corporation      

Why compress relations? 

Relations are important for structured information Text, video, audio, image compression is more advanced than relational Statistical and structural properties of the relation can be exploited to improve compression Relational data have special access patterns Don00 just 00nflate.00nbsp; Need to run selections, projections and aggregations  

IBM Almaden Research Center 

漏 2006 IBM Corporation      

Our results 

Near optimal compression of relational data Exploits data skew, column correlations and lack of ordering Theory: Compress m i.i.d. tuples to within 4.3 m bits of entropy (but theory doesn00 count dictionaries) Practice: Between 8 and 40x compression Scanning compressed relational data Directly perform projections, equality and range selections, and joins on entropy compressed data Cache efficient dictionary usage Query short circuiting  

IBM Almaden Research Center 

漏 2006 IBM Corporation      

This  Talk 

Raw Data 

Compressed  Data 

Analyze 

Meta Data &  Dictionaries 

Compress 

Query 

Results 

Update 

New  Raw Data 

CSVZIP Flow 

Analyze to determine compression plan Compress to reduce size Execute many queries over compressed data Periodically update data and dictionaries  

IBM Almaden Research Center 

漏 2006 IBM Corporation      

Sources of Redundancy in Relations 

Column Value space much smaller than Domain |C| << |domain(C)| Type specific transformations, dictionaries Skew in value frequency H(C) << lg |C| Entropy encoding (e.g. Huffman codes) Column correlations within a tuple H(C1, C2) << H(C1) + H(C2) Column co-coding Incidental tuple ordering H({T1, T2, 00 Tm}) ~ H(T1,T2, 00, Tm) 00m lg m Sort and delta code Tuple correlations If correlated tuples share common columns, sort first on those columns  

{00pple00 00ear00 00ango00 in CHAR(10) 

90% of fruits are 00pple00/b> 

Mangos are mainly sold in August 

Mango buyers also buy paper towels

 

IBM Almaden Research Center 

漏 2006 IBM Corporation      

Male/John 

Compression Process: Step 1 

Input tuple 

Column 1 

Column 2 

Co-code  transform 

Type specific  transform 

Column  1 & 2 

Column  3.A 

Column  Code 

TupleCode 

Column  Code 

Column 3 

Column  3.B 

Column  Code 

Huffman  Encode 

Dict 

Huffman  Encode 

Dict 

Huffman  Encode 

Dict 

Male/John/Sat 

Sat 

2006 

Male, John, 08/10/06, Mango 

101101011 

001 

01011101 

10110101100101011101 

p = 1/512 

p = 1/8 

p = 1/512 

w35/Mango 

w35 

Male 

John 

08/10/06 

Mango 

1.5% 

Steven 

1.9% 

Thomas 

2.3% 

Richard 

2.4% 

Mark 

2.5% 

William 

3.5% 

John 

3.5% 

Robert 

3.6% 

James 

3.8% 

David 

4.2% 

Michael 

22% 

28% 

17% 

15% 

9% 

5% 

4% 

Female 

12% 

42% 

23% 

6% 

10% 

4% 

3% 

Male 

Sun 

Sat 

Fri 

Thu 

Wed 

Tue 

Mon

 

IBM Almaden Research Center 

漏 2006 IBM Corporation      

Compression Process: Step 2 

First tuple code 

Tuplecode 

00/b> 

Sorted  Tuplecodes 

Previous  Tuplecode 

Delta 

Huffman  Encode 

Delta Code 

Append 

Dict 

Compression  Block 

101101011100001100 

10110101110001011111 

1011010111000011101 

10110101110001011101 

00/font> 

00/font> 

10110101110001011101 

0000000000000000001 

000 

000 

00000000000000000001 

010 

010 

0000000000000000101 

1110 

1110 

00/font> 

Look Ma, no delimiters! 

101101011100010111010000101110

 

IBM Almaden Research Center 

漏 2006 IBM Corporation      

Compression Results 

P1 00P6: Various projections of TPC-H tables   P7: SAP SEOCOMPODF P8: TPC-E Customer  

IBM Almaden Research Center 

漏 2006 IBM Corporation      

Huffman Code Scan operations 

SELECT SUM(price) FROM Sale WHERE week(saleDate) = 23 AND fruit = 00ango00AND year(saleDate) between 1997 AND 2005 Scan this: 1011010110010101110101001 Skip Over first column: Need length Range Compare on 2nd column: year in 1997 to 2005 Equality Compare 3rd column: Week = 23, fruit = Mango Decode 4th column for aggregation Segregated Coding: Faster operations, same compression Assign Huffman Codes in order of length |code(v)| < |code(w)| 00/font> code(v) < code(w) Sort codes within a length |code(v)| = |code(w)| 00/font> (v < w 00/font> code(v) < code(w))  

0101001 

2001 

0101000 

1998 

0101010 

2002 

010011 

2004 

0101011 

2005 

010001 

1999 

010010 

2000 

010000 

1997 

001 

2006 

000 

2003 

Code 

Year

 

IBM Almaden Research Center 

漏 2006 IBM Corporation      

Segregated Coding: Computing Code Length 

One code length 00/b> Constant function #define codeLen(w) 6 Second largest code length << lg L1 cache size 00/b> Use lookup table #define codeLen(w) \    codeTable[x>>26] Otherwise compare input with max code of each length #define codeLen(w) \    (w <= 0b0011111100?3 \   :(w <= 0b0100111100?6 \   :(w <= 0b0101011100?7 00)))  

0101001 

2001 

0101000 

1998 

0101010 

2002 

010011 

2004 

0101011 

2005 

010001 

1999 

010010 

2000 

010000 

1997 

001 

2006 

000 

2003 

Code 

Year

 

IBM Almaden Research Center 

漏 2006 IBM Corporation      

Segregated Coding: Range Query 

100 

200 

300 

98 

180 

220 

322 

87 

111 

190 

232 

256 

278 

298 

switch (codeLen(w)) {

case 3: return w>>28 != 0; 

302 

case 4: return w >= 0b0111000000000000    && w <= 0b1000111111111111; 

case 5: return w >= 0b1011000000000000  && w <= 0b1101111111111111;

333 

Value   code 

000 

001 

010 

0110 

0111 

1000 

1001 

10100 

10101 

10110 

10111 

11000 

11001 

11010 

11011 

11100 

SELECT * WHERE col BETWEEN 112 and 302

 

IBM Almaden Research Center 

漏 2006 IBM Corporation      

Advantages of Segregated Coding 

Find code length quickly No access to dictionary Fast Range query No access to dictionary for constant ranges Cache Locality Because values are sorted by code length, commonly used values are clustered near the beginning of the array The beginning of the array is most likely to be in cache, improving the cache hit ratio  

IBM Almaden Research Center 

漏 2006 IBM Corporation      

Query Short Circuiting 

Reuse predicates and values that depend on unchanged columns Sorting causes many unchanged columns  

101101011100001100 

1011010111000011101 

0000000000000000101 

Previous Tuple: 

Delta Value: 

Next Tuple: 

Common Bits:  

1011010111000011 

Unchanged Columns: 

Gender/  FName 

Reused  predicates: 

Sex = Male  Name = John  Year 002005 

Reduces instructions but adds a branch! 

Year

 

IBM Almaden Research Center 

漏 2006 IBM Corporation      

Selected Prior Work 

Entropy Coding Shannon (1948), Huffman (1952) Arithmetic coding 00Abramson (1963) Pasco, Rissanen (1976) Row or Page Coding Compress each row or page independently.  Decompress on page load or row touch. Compression code is localized.  [Oracle, DB2, IMS] Column-wise coding Each column value gets a fixed length code from a per column dictionary. [Sybase IQ, CStore, MonetDB] Pack multiple short values into 16 bit quantities and decode them as a unit to save CPU [Abadi/Madden/Ferreira]   Delta coding Sort and difference or remove common prefix from adjacent codes [Inverted Indices, B-trees, CStore] Text coding 00zip00style coding using n-grams, Huffman codes, and sliding dictionaries [Ziv, Lempel, Welch, Katz] Order preserving codes Allows range queries at a cost in compression [Hu/Tucker, Antoshenkov/Murray/Lomet, Zandi/Iyer/Langdon] Lossy coding Model based lossy compression: SPARTAN, Vector quantization  

IBM Almaden Research Center 

漏 2006 IBM Corporation      

Work in Progress 

Analysis to find best: Dictionaries that fit in L2 cache size Set of columns to co-code Column ordering for sort Generate code for efficient queries on x86-64, Power5 and Cell Don00 interpret meta-data at run time Utilize architecture features Update Incremental update of dictionaries.  Background merge of new rows. Release of CSVZIP utilities  

IBM Almaden Research Center 

漏 2006 IBM Corporation      

Observations 

Entropy decoding uses less I/O, but more ALU ops than conventional decoding Our technique removes the cache as a problem Have to squeeze every ALU op: Trends in favor Variable length codes makes vectorization and out-of-order execution hard Exploit compression block parallelism instead These techniques can be exploited in a column store  

IBM Almaden Research Center 

漏 2006 IBM Corporation      

Back up

 

IBM Almaden Research Center 

漏 2006 IBM Corporation      

Entropy Encoding on a Column Store 

Don00 build tuple code: Treat tuple as vector of column codes and sort lexicographically Columns early in the sort: Run length encoded deltas Columns in the middle of the sort: Entropy encoded deltas Columns late in the sort: Concatenated column codes Independently break columns into compression blocks Make dictionaries bigger because only using one at a time  

IBM Almaden Research Center 

漏 2006 IBM Corporation      

Entropy: A measure of information content 

Entropy of a random variable R The expected number of bits needed to represent the outcome of R H(R) = 00sub>r 00/sub> domain(R) Pr(R = r) lg (1/ Pr(R = r)) Conditional entropy of R given S The expected number of bits needed to represent the outcome of R given we already know the outcome of S. H(R | S) = 00sub>s 00/sub> domain(S) 00sub>r 00/sub> domain(R) Pr(R = r & S = s)                                                    lg (1/ Pr(R = r & S = s)) 00H(S) If R is a random relation of size n, then R is a multi-set of random variables {T1, 00 Tn} where each random tuple Ti is a cross product of random attributes C1i 00/b> 0000Cki  

IBM Almaden Research Center 

漏 2006 IBM Corporation      

The Entropy of a Relation 

We define a random relation R of size m over D as a random variable whose outcomes are multi-sets of size m where each element is chosen identically and independently from an arbitrary tuple distribution D. The results are dependent on H(D) and thus on the optimal encoding of tuples chosen from D. If we do a good job of co-coding and Huffman coding, then the tuple codes are entropy coded: They are random bit strings whose length depends on the distribution of the column values but whose entropy is equal to their length Lemma 2: The Entropy of random relation R of size m over a distribution D is at least m H(D) 00lg m! Theorem 3: The Algorithm presented compresses a random relation R of size m to within H(R) + 4.3 m bits, if m > 100  

IBM Almaden Research Center 

漏 2006 IBM Corporation      

Proof of Lemma 2 

Let R be a random vector of m tuples i.i.d. over distribution D whose outcomes are sequences of m tuples, t1, 00 tm. Obviously H(R) is m H(D). Consider an augmentation of R that adds an index to each tuple so that ti has the value i appended.  Define R1 as a set consisting of exactly those values.  H(R1) = m H(D) as there is a bijection between R1 and R But the random multi-set R is a projection of the set R1 and there are exactly m! equal probability sets R1 that each project to each outcome of R so H(R1) 00H(R) + lg m! and thus H(R) 00 m H(D) 00lg m!  

IBM Almaden Research Center 

漏 2006 IBM Corporation      

Proof sketch of Theorem 3 

Lemma 1 says: If R is random multi-set of m values over the uniform distribution 1..m and m > 100, then H(delta(sort(R))) < 2.67 m. But we have values from an arbitrary distribution,  so work by cases For values longer than lg m bits, truncate, getting a uniform distribution in the range. For values shorter than lg m bits, append random bits, also getting a uniform distribution.

   download Wringing a Table Dry: Using CSVZIP to Compress a Relation to its ...

Responses to Wringing a Table Dry: Using CSVZIP to Compress a Relation to its ...

It's no comment...

 

Your Name:
Your Email:
Your Talk: