This document contains notes about the design for groupcompress, a replacement
VersionedFiles store for use in pack-based repositories. The goal is to provide
fast, history-bounded text extraction.

Overview
++++++++

The goal: much tighter compression, maintained automatically. Considerations
to weigh:

* The minimum IO to reconstruct a text with no other repository involved.
* The number of index lookups needed to plan a reconstruction.
* The minimum IO to reconstruct a text with another repository's assistance
  (this affects network IO for fetch, which impacts incremental pulls and
  shallow branch operations).

Current approach
================

Each delta is individually compressed against another text, and then entropy
compressed. We index the pointers between these deltas.

Solo reconstruction: plan a readv via the index, read the deltas with forward
IO, and apply each delta. Total IO: sum(deltas) + delta count * index overhead.
Fetch/stacked reconstruction: plan a readv via the index, using local basis
texts where possible. Then readv locally and remotely and apply the deltas.
Total IO is the same as for solo reconstruction.

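For illustration, a minimal sketch of the solo-reconstruction flow under the
current approach, assuming a hypothetical index that maps each key to
(offset, length, parent_key) - parent_key being None for a full text - and a
hypothetical apply_delta helper; neither name is taken from the real code::

  def reconstruct(index, pack_file, key, apply_delta):
      # Walk the index back to a full text: one lookup per delta in the chain.
      chain = []
      while key is not None:
          offset, length, parent = index[key]
          chain.append((offset, length))
          key = parent
      # Read every region in a single forward pass over the pack file.
      regions = {}
      for offset, length in sorted(chain):
          pack_file.seek(offset)
          regions[offset] = pack_file.read(length)
      # Apply the deltas in chain order; the last chain entry is the full text.
      text = regions[chain[-1][0]]
      for offset, _ in reversed(chain[:-1]):
          text = apply_delta(text, regions[offset])
      return text

The total IO is exactly the formula above: the sum of the delta sizes read,
plus one index lookup per element of the chain.
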
Things to keep
==============

A reasonable 'amount read' from remote machines to reconstruct an arbitrary
text: reading 5MB for a 100K plain text is not a good trade-off. Reading (say)
500K is probably acceptable. Reading ~100K is ideal. However, it's likely that
some texts (e.g. NEWS versions) can be stored in nearly no space at all if we
are willing to have unbounded IO. Profiling to set a good heuristic will be
important. Also, allowing users to choose to optimise for a server environment
may make sense: paying more local IO for less compact storage may be useful.

Things to remove
================

Index scatter-gather IO. Doing hundreds or thousands of index lookups is very
expensive, and doing that per file just adds insult to injury.

Partitioned compression amongst files.

Scatter-gather IO when reconstructing texts: linear forward IO is better.

Thoughts
========

Merges combine texts from multiple versions to create a new version. Deltas
add new text to existing files and remove some text from the same. Getting
high compression means reading some base and then a chain of deltas (which
could be a tree) to gain access to the thing that the final delta was made
against, and then that delta. Rather than composing all these deltas, we can
just perform the final diff against the base text and the serialised
individual deltas. If the diff algorithm can reuse out-of-order lines from
previous texts (e.g. storing AB -> BA as pointers rather than as a delete and
an add), then the presence of any previously stored line in a single chain
can be reused. One such diff algorithm is xdelta; other reasonable ones to
consider are plain old zlib or lzma. We could also use bzip2. One advantage
of using a generic compression engine is less Python code. One advantage of
preprocessing line-based deltas is that we reduce the window size needed for
text repeated within lines, and that will help compression by a simple
entropy compressor used as a post-processor.
lzma appears fantastic at compression - 420MB of NEWS files down to 200KB -
so window size appears to be a key determinant of efficiency.

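As a rough illustration of the line-reuse idea, a toy delta that encodes a new
text as copy/insert instructions against every line seen so far in the group.
This is purely a sketch; the real record format and matching algorithm are
still undecided::

  def delta_lines(seen, new_lines):
      """Encode new_lines against lines already seen in this group.

      seen: dict mapping a line to the position it first appeared at.
      Returns a list of ('copy', position) and ('insert', line) instructions,
      updating seen with any lines not stored before.
      """
      instructions = []
      for line in new_lines:
          if line in seen:
              # Reuse a line from any earlier text, regardless of order.
              instructions.append(('copy', seen[line]))
          else:
              seen[line] = len(seen)
              instructions.append(('insert', line))
      return instructions

  seen = {}
  base = delta_lines(seen, ["A\n", "B\n"])     # all inserts
  swapped = delta_lines(seen, ["B\n", "A\n"])  # AB -> BA: two copies, no inserts
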
Delta strategy
++++++++++++++

Very big objects - no delta. I plan to kick this in at 5MB initially, but
once the codebase is up and running, we can tweak this.

Very small objects - no delta? If they are combined with a larger zlib object,
why not? (Answer: because zlib's window is really small.)

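The parenthetical matters in practice: zlib's deflate window is only 32KB, so
a small object can only reuse text emitted within the last 32KB of the stream.
A quick standalone check (not part of any bzr code)::

  import os
  import zlib

  block = os.urandom(1024)                    # incompressible on its own
  near = block + b'\0' * 1024 + block         # second copy inside the 32KB window
  far = block + b'\0' * (64 * 1024) + block   # second copy beyond the window

  # Only the 'near' layout lets zlib back-reference the first copy of block.
  print(len(zlib.compress(near)), len(zlib.compress(far)))
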
Other objects - group by file id (this gives related texts a chance, though
using a file name would be better long term, as e.g. COPYING and COPYING from
different projects could combine). Then sort by reverse topological graph
order (as this places more recent texts at the front of a chain).
Alternatively, group by size, though that should not matter with a large
enough window.
Finally, delta the texts against the current output of the compressor. This is
essentially a somewhat typed form of sliding-window dictionary compression. An
alternative implementation would be to just use zlib, or lzma, or bzip2
directly.

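A minimal sketch of that grouping order, assuming each text is keyed by
(file_id, revision_id), that a per-file ancestry graph is available, and that
some topological sort helper exists; all three are assumptions made for
illustration, not existing APIs::

  def grouped_order(keys, per_file_graph, topo_sort):
      """Yield (file_id, revision_id) keys grouped by file, newest first."""
      # Group the requested keys by file id.
      by_file = {}
      for file_id, revision_id in keys:
          by_file.setdefault(file_id, set()).add(revision_id)
      for file_id in sorted(by_file):
          wanted = by_file[file_id]
          # Reverse topological order places more recent texts at the front
          # of the chain, so newer versions cost less IO to extract.
          for revision_id in reversed(topo_sort(per_file_graph[file_id])):
              if revision_id in wanted:
                  yield (file_id, revision_id)
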
Unfortunately, just using entropy compression forces a lot of data to be output
by the decompressor - e.g. 420MB in the NEWS sample corpus. When we only want
a single 55K text, that's inefficient. (An initial test took several seconds
with lzma.)

The fastest-to-implement approach is probably just 'diff against the output to
date and add the result to the entropy compressor'. This should produce
reasonable results. As delta chain length is not a concern (there is only ever
one delta to apply), we can simply cap the chain when the total read size
becomes unreasonable. Given that older texts are smaller, we probably want
some weighted factor of plaintext size.

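A rough sketch of that loop, assuming a hypothetical
delta_against(previous_texts, text) helper and a flat size cap in place of the
weighted heuristic mentioned above::

  import zlib

  def build_group(texts, delta_against, max_group_bytes=4 * 1024 * 1024):
      """Compress a prefix of texts into one group.

      Returns (compressed_bytes, remaining_texts); the caller begins a new
      group with whatever did not fit under the size cap.
      """
      compressor = zlib.compressobj()
      output_so_far = []   # plaintext the next delta is made against
      total_bytes = 0
      chunks = []
      remaining = []
      for position, text in enumerate(texts):
          if total_bytes > max_group_bytes:
              # Extracting any text means reading the whole group, so cap
              # the group once the accumulated plaintext gets unreasonable.
              remaining = texts[position:]
              break
          delta = delta_against(output_so_far, text)
          chunks.append(compressor.compress(delta))
          output_so_far.append(text)
          total_bytes += len(text)
      chunks.append(compressor.flush())
      return b''.join(chunks), remaining
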
In this approach, a single entropy-compressed region is read as a unit, giving
the lower bound for IO. How much to read is an open question: what byte offset
of compressed data is sufficient to ensure that the delta-stream contents we
need are reconstructable? Flushing, while possible, degrades compression (and
adds overhead - we'd be paying 4 bytes per record guaranteed). Again, tests
will be needed.

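The useful property here is that a zlib stream can be inflated incrementally
and abandoned early. A sketch of stopping the read/inflate loop once enough of
the delta stream has been recovered; how needed_bytes would be known for a
given record is exactly the framing question still to be answered::

  import zlib

  def read_delta_stream(compressed_chunks, needed_bytes):
      """Inflate chunks (e.g. from a readv) only until needed_bytes of the
      delta stream have been recovered, then stop reading."""
      decompressor = zlib.decompressobj()
      produced = []
      total = 0
      for chunk in compressed_chunks:
          out = decompressor.decompress(chunk)
          produced.append(out)
          total += len(out)
          if total >= needed_bytes:
              break
      return b''.join(produced)[:needed_bytes]
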
A nice possibility is to output mpdiff-compatible records, which might enable
some code reuse. This is more work than just diff(current_out, new_text), so
it can wait for the concept to be proven.

Implementation Strategy
+++++++++++++++++++++++

Bring up a VersionedFiles object that implements this, then stuff it into a
repository format. Use zlib as the starting compressor, though bzip2 would
probably also do a good job.