4176.2.1
by John Arbash Meinel
Start writing up the index doc in a ReST discussion. |
1 |
=================== |
2 |
CHK Optimized index |
|
3 |
=================== |
|
4 |
||
5 |
Our current btree style index is nice as a general index, but it is not optimal |
|
4176.2.6
by John Arbash Meinel
A few notes, some updates from ian. |
6 |
for Content-Hash-Key based content. With CHK, the keys themselves are hashes, |
4176.2.1
by John Arbash Meinel
Start writing up the index doc in a ReST discussion. |
7 |
which means they are randomly distributed (similar keys do not refer to |
8 |
similar content), and they do not compress well. However, we can create an |
|
9 |
index which takes advantage of these abilites, rather than suffering from |
|
10 |
them. Even further, there are specific advantages provided by |
|
11 |
``groupcompress``, because of how individual items are clustered together. |
|
12 |
||
13 |
Btree indexes also rely on zlib compression, in order to get their compact |
|
14 |
size, and further has to try hard to fit things into a compressed 4k page. |
|
15 |
When the key is a sha1 hash, we would not expect to get better than 20bytes |
|
16 |
per key, which is the same size as the binary representation of the hash. This |
|
17 |
means we could write an index format that gets approximately the same on-disk |
|
18 |
size, without having the overhead of ``zlib.decompress``. Some thought would |
|
19 |
still need to be put into how to efficiently access these records from remote. |
|
20 |
||
21 |
||
22 |
Required information |
|
23 |
==================== |
|
24 |
For a given groupcompress record, we need to know the offset and length of the |
|
25 |
compressed group in the .pack file, and the start and end of the content inside |
|
26 |
the uncompressed group. The absolute minimum is slightly less, but this is a |
|
4176.2.3
by John Arbash Meinel
next round of comments about the improved chk index. |
27 |
good starting point. The other thing to consider, is that for 1M revisions and |
28 |
1M files, we'll probably have 10-20M CHK pages, so we want to make sure we |
|
29 |
have an index that can scale up efficiently. |
|
4176.2.1
by John Arbash Meinel
Start writing up the index doc in a ReST discussion. |
30 |
|
31 |
1. A compressed sha hash is 20-bytes |
|
32 |
||
33 |
2. Pack files can be > 4GB, we could use an 8-byte (64-bit) pointer, or we |
|
34 |
could store a 5-byte pointer for a cap at 1TB. 8-bytes still seems like |
|
35 |
overkill, even if it is the natural next size up. |
|
36 |
||
37 |
3. An individual group would never be longer than 2^32, but they will often |
|
38 |
be bigger than 2^16. 3 bytes for length (16MB) would be the minimum safe |
|
39 |
length, and may not be safe if we expand groups for large content (like ISOs). |
|
40 |
So probably 4-bytes for group length is necessary. |
|
41 |
||
42 |
4. A given start offset has to fit in the group, so another 4-bytes. |
|
43 |
||
44 |
5. Uncompressed length of record is based on original size, so 4-bytes is |
|
45 |
expected as well. |
|
46 |
||
47 |
6. That leaves us with 20+8+4+4+4 = 40 bytes per record. At the moment, btree |
|
48 |
compression gives us closer to 38.5 bytes per record. We don't have perfect |
|
49 |
compression, but we also don't have >4GB pack files (and if we did, the first |
|
50 |
4GB are all under then 2^32 barrier :). |
|
51 |
||
52 |
If we wanted to go back to the ''minimal'' amount of data that we would need to |
|
53 |
store. |
|
54 |
||
4176.2.3
by John Arbash Meinel
next round of comments about the improved chk index. |
55 |
1. 8 bytes of a sha hash are generally going to be more than enough to fully |
56 |
determine the entry (see `Partial hash`_). We could support some amount of |
|
4176.2.1
by John Arbash Meinel
Start writing up the index doc in a ReST discussion. |
57 |
collision in an index record, in exchange for resolving it inside the |
58 |
content. At least in theory, we don't *have* to record the whole 20-bytes |
|
4176.2.3
by John Arbash Meinel
next round of comments about the improved chk index. |
59 |
for the sha1 hash. (8-bytes gives us less than 1 in 1000 chance of |
60 |
a single collision for 10M nodes in an index) |
|
4176.2.1
by John Arbash Meinel
Start writing up the index doc in a ReST discussion. |
61 |
|
62 |
2. We could record the start and length of each group in a separate location, |
|
63 |
and then have each record reference the group by an 'offset'. This is because |
|
64 |
we expect to have many records in the same group (something like 10k or so, |
|
65 |
though we've fit >64k under some circumstances). At a minimum, we have one |
|
66 |
record per group so we have to store at least one reference anyway. So the |
|
67 |
maximum overhead is just the size and cost of the dereference (and normally |
|
68 |
will be much much better than that.) |
|
69 |
||
4176.2.3
by John Arbash Meinel
next round of comments about the improved chk index. |
70 |
3. If a group reference is an 8-byte start, and a 4-byte length, and we have |
71 |
10M keys, but get at least 1k records per group, then we would have 10k |
|
72 |
groups. So we would need 120kB to record all the group offsets, and then |
|
73 |
each individual record would only need a 2-byte group number, rather than a |
|
4176.2.1
by John Arbash Meinel
Start writing up the index doc in a ReST discussion. |
74 |
12-byte reference. We could be safe with a 4-byte group number, but if |
75 |
each group is ~1MB, 64k groups is 64GB. We can start with 2-byte, but leave |
|
76 |
room in the header info to indicate if we have more than 64k group entries. |
|
77 |
Also, current grouping creates groups of 4MB each, which would make it |
|
78 |
256GB, to create 64k groups. And our current chk pages compress down to |
|
79 |
less than 100 bytes each (average is closer to 40 bytes), which for 256GB |
|
80 |
of raw data, would amount to 2.7 billion CHK records. (This will change if |
|
81 |
we start to use CHK for text records, as they do not compress down as |
|
4176.2.3
by John Arbash Meinel
next round of comments about the improved chk index. |
82 |
small.) Using 100 bytes per 10M chk records, we have 1GB of compressed chk |
83 |
data, split into 4MB groups or 250 total groups. Still << 64k groups. |
|
84 |
Conversions could create 1 chk record at a time, creating a group for each, |
|
85 |
but they would be foolish to not commit a write group after 10k revisions |
|
86 |
(assuming 6 CHK pages each). |
|
4176.2.1
by John Arbash Meinel
Start writing up the index doc in a ReST discussion. |
87 |
|
88 |
4. We want to know the start-and-length of a record in the decompressed |
|
89 |
stream. This could actually be moved into a mini-index inside the group |
|
90 |
itself. Initial testing showed that storing an expanded "key => |
|
91 |
start,offset" consumed a considerable amount of compressed space. (about |
|
92 |
30% of final size was just these internal indices.) However, we could move |
|
93 |
to a pure "record 1 is at location 10-20", and then our external index |
|
94 |
would just have a single 'group entry number'. |
|
95 |
||
96 |
There are other internal forces that would give a natural cap of 64k |
|
97 |
entries per group. So without much loss of generality, we could probably get |
|
98 |
away with a 2-byte 'group entry' number. (which then generates an 8-byte |
|
99 |
offset + endpoint as a header in the group itself.) |
|
100 |
||
101 |
5. So for 1M keys, an ideal chk+group index would be: |
|
102 |
||
4176.2.4
by John Arbash Meinel
The mini-index is the key to shrinking the key width and still being able to 'scale'. |
103 |
a. 6-byte hash prefix |
4176.2.1
by John Arbash Meinel
Start writing up the index doc in a ReST discussion. |
104 |
|
105 |
b. 2-byte group number |
|
106 |
||
107 |
c. 2-byte entry in group number |
|
108 |
||
109 |
d. a separate lookup of 12-byte group number to offset + length |
|
110 |
||
4176.2.4
by John Arbash Meinel
The mini-index is the key to shrinking the key width and still being able to 'scale'. |
111 |
e. a variable width mini-index that splits X bits of the key. (to maintain |
112 |
small keys, low chance of collision, this is *not* redundant with the |
|
113 |
value stored in (a)) This should then dereference into a location in |
|
114 |
the index. This should probably be a 4-byte reference. It is unlikely, |
|
115 |
but possible, to have an index >16MB. With an 10-byte entry, it only |
|
116 |
takes 1.6M chk nodes to do so. At the smallest end, this will probably |
|
117 |
be a 256-way (8-bits) fan out, at the high end it could go up to |
|
118 |
64k-way (16-bits) or maybe even 1M-way (20-bits). (64k-way should |
|
119 |
handle up to 5-16M nodes and still allow a cheap <4k read to find the |
|
120 |
final entry.) |
|
4176.2.3
by John Arbash Meinel
next round of comments about the improved chk index. |
121 |
|
122 |
So the max size for the optimal groupcompress+chk index with 10M entries would be:: |
|
123 |
||
4176.2.4
by John Arbash Meinel
The mini-index is the key to shrinking the key width and still being able to 'scale'. |
124 |
10 * 10M (entries) + 64k * 12 (group) + 64k * 4 (mini index) = 101 MiB |
4176.2.3
by John Arbash Meinel
next round of comments about the improved chk index. |
125 |
|
4176.2.4
by John Arbash Meinel
The mini-index is the key to shrinking the key width and still being able to 'scale'. |
126 |
So 101MiB which breaks down as 100MiB for the actual entries, 0.75MiB for the |
4176.2.1
by John Arbash Meinel
Start writing up the index doc in a ReST discussion. |
127 |
group records, and 0.25MiB for the mini index. |
128 |
||
129 |
1. Looking up a key would involve: |
|
130 |
||
131 |
a. Read ``XX`` bytes to get the header, and various config for the index. |
|
132 |
Such as length of the group records, length of mini index, etc. |
|
133 |
||
134 |
b. Find the offset in the mini index for the first YY bits of the key. Read |
|
135 |
the 4 byte pointer stored at that location (which may already be in the |
|
136 |
first content if we pre-read a minimum size.) |
|
137 |
||
138 |
c. Jump to the location indicated, and read enough bytes to find the |
|
4176.2.3
by John Arbash Meinel
next round of comments about the improved chk index. |
139 |
correct 12-byte record. The mini-index only indicates the start of |
140 |
records that start with the given prefix. A 64k-way index resolves 10MB |
|
141 |
records down to 160 possibilities. So at 12 bytes each, to read all would |
|
142 |
cost 1920 bytes to be read. |
|
4176.2.1
by John Arbash Meinel
Start writing up the index doc in a ReST discussion. |
143 |
|
144 |
d. Determine the offset for the group entry, which is the known ``start of |
|
145 |
groups`` location + 12B*offset number. Read its 12-byte record. |
|
146 |
||
147 |
e. Switch to the .pack file, and read the group header to determine where in |
|
148 |
the stream the given record exists. At this point, you have enough |
|
149 |
information to read the entire group block. For local ops, you could |
|
150 |
only read enough to get the header, and then only read enough to |
|
151 |
decompress just the content you want to get at. |
|
152 |
||
153 |
Using an offset, you also don't need to decode the entire group header. |
|
154 |
If we assume that things are stored in fixed-size records, you can jump |
|
155 |
to exactly the entry that you care about, and read its 8-byte |
|
156 |
(start,length in uncompressed) info. If we wanted more redundancy we |
|
157 |
could store the 20-byte hash, but the content can verify itself. |
|
158 |
||
159 |
f. If the size of these mini headers becomes critical (8 bytes per record |
|
160 |
is 8% overhead for 100 byte records), we could also compress this mini |
|
4176.2.4
by John Arbash Meinel
The mini-index is the key to shrinking the key width and still being able to 'scale'. |
161 |
header. Changing the number of bytes per entry is unlikely to be |
162 |
efficient, because groups standardize on 4MiB wide, which is >>64KiB for |
|
163 |
a 2-byte offset, 3-bytes would be enough as long as we never store an |
|
164 |
ISO as a single entry in the content. Variable width also isn't a big |
|
165 |
win, since base-128 hits 4-bytes at just 2MiB. |
|
166 |
||
4176.2.1
by John Arbash Meinel
Start writing up the index doc in a ReST discussion. |
167 |
For minimum size without compression, we could only store the 4-byte |
168 |
length of each node. Then to compute the offset, you have to sum all |
|
4176.2.4
by John Arbash Meinel
The mini-index is the key to shrinking the key width and still being able to 'scale'. |
169 |
previous nodes. We require <64k nodes in a group, so it is up to 256KiB |
170 |
for this header, but we would lose partial reads. This should still be |
|
171 |
cheap in compiled code (needs tests, as you can't do partial info), and |
|
172 |
would also have the advantage that fixed width would be highly |
|
173 |
compressible itself. (Most nodes are going to have a length that fits |
|
174 |
1-2 bytes.) |
|
175 |
||
176 |
An alternative form would be to use the base-128 encoding. (If the MSB |
|
4176.2.1
by John Arbash Meinel
Start writing up the index doc in a ReST discussion. |
177 |
is set, then the next byte needs to be added to the current value |
178 |
shifted by 7*n bits.) This encodes 4GiB in 5 bytes, but stores 127B in 1 |
|
4176.2.4
by John Arbash Meinel
The mini-index is the key to shrinking the key width and still being able to 'scale'. |
179 |
byte, and 2MiB in 3 bytes. If we only stored 64k entries in a 4 MiB |
180 |
group, the average size can only be 64B, which fits in a single byte |
|
181 |
length, so 64KiB for this header, or only 1.5% overhead. We also don't |
|
182 |
have to compute the offset of *all* nodes, just the ones before the one |
|
183 |
we want, which is the similar to what we have to do to get the actual |
|
184 |
content out. |
|
4176.2.1
by John Arbash Meinel
Start writing up the index doc in a ReST discussion. |
185 |
|
186 |
||
187 |
Partial Hash |
|
188 |
============ |
|
189 |
The size of the index is dominated by the individual entries (the 1M records). |
|
190 |
Saving 1 byte there saves 1MB overall, which is the same as the group entries |
|
191 |
and mini index combined. If we can change the index so that it can handle |
|
192 |
collisions gracefully (have multiple records for a given collision), then we |
|
193 |
can shrink the number of bytes we need overall. Also, if we aren't going to |
|
194 |
put the full 20-bytes into the index, then some form of graceful handling of |
|
195 |
collisions is recommended anyway. |
|
196 |
||
197 |
The current structure does this just fine, in that the mini-index dereferences |
|
198 |
you to a "list" of records that start with that prefix. It is assumed that |
|
199 |
those would be sorted, but we could easily have multiple records. To resolve |
|
200 |
the exact record, you can read both records, and compute the sha1 to decide |
|
201 |
between them. This has performance implications, as you are now decoding 2x |
|
202 |
the records to get at one. |
|
203 |
||
204 |
The chance of ``n`` texts colliding with a hash space of ``H`` is generally |
|
205 |
given as:: |
|
206 |
||
207 |
1 - e ^(-n^2 / 2 H) |
|
208 |
||
209 |
Or if you use ``H = 2^h``, where ``h`` is the number of bits:: |
|
210 |
||
211 |
1 - e ^(-n^2 / 2^(h+1)) |
|
212 |
||
213 |
For 1M keys and 4-bytes (32-bit), the chance of collision is for all intents |
|
214 |
and purposes 100%. Rewriting the equation to give the number of bits (``h``) |
|
215 |
needed versus the number of entries (``n``) and the desired collision rate |
|
216 |
(``epsilon``):: |
|
217 |
||
218 |
h = log_2(-n^2 / ln(1-epsilon)) - 1 |
|
219 |
||
220 |
The denominator ``ln(1-epsilon)`` == ``-epsilon``` for small values (even @0.1 |
|
221 |
== -0.105, and we are assuming we want a much lower chance of collision than |
|
222 |
10%). So we have:: |
|
223 |
||
224 |
h = log_2(n^2/epsilon) - 1 = 2 log_2(n) - log_2(epsilon) - 1 |
|
225 |
||
226 |
Given that ``epsilon`` will often be very small and ``n`` very large, it can |
|
227 |
be more convenient to transform it into ``epsilon = 10^-E`` and ``n = 10^N``, |
|
228 |
which gives us:: |
|
229 |
||
230 |
h = 2 * log_2(10^N) - 2 log_2(10^-E) - 1 |
|
231 |
h = log_2(10) (2N + E) - 1 |
|
232 |
h ~ 3.3 (2N + E) - 1 |
|
233 |
||
234 |
Or if we use number of bytes ``h = 8H``:: |
|
235 |
||
236 |
H ~ 0.4 (2N + E) |
|
237 |
||
238 |
This actually has some nice understanding to be had. For every order of |
|
239 |
magnitude we want to increase the number of keys (at the same chance of |
|
240 |
collision), we need ~1 byte (0.8), for every two orders of magnitude we want |
|
241 |
to reduce the chance of collision we need the same extra bytes. So with 8 |
|
242 |
bytes, you can have 20 orders of magnitude to work with, 10^10 keys, with |
|
243 |
guaranteed collision, or 10 keys with 10^-20 chance of collision. |
|
244 |
||
245 |
Putting this in a different form, we could make ``epsilon == 1/n``. This gives |
|
246 |
us an interesting simplified form:: |
|
247 |
||
248 |
h = log_2(n^3) - 1 = 3 log_2(n) - 1 |
|
249 |
||
250 |
writing ``n`` as ``10^N``, and ``H=8h``:: |
|
251 |
||
252 |
h = 3 N log_2(10) - 1 =~ 10 N - 1 |
|
253 |
H ~ 1.25 N |
|
254 |
||
255 |
So to have a one in a million chance of collision using 1 million keys, you |
|
256 |
need ~59 bits, or slightly more than 7 bytes. For 10 million keys and a one in |
|
257 |
10 million chance of any of them colliding, you can use 9 (8.6) bytes. With 10 |
|
258 |
bytes, we have a one in a 100M chance of getting a collision in 100M keys |
|
259 |
(substituting back, the original equation says the chance of collision is 4e-9 |
|
260 |
for 100M keys when using 10 bytes.) |
|
261 |
||
262 |
Given that the only cost for a collision is reading a second page and ensuring |
|
263 |
the sha hash actually matches we could actually use a fairly "high" collision |
|
264 |
rate. A chance of 1 in 1000 that you will collide in an index with 1M keys is |
|
265 |
certainly acceptible. (note that isn't 1 in 1000 of those keys will be a |
|
266 |
collision, but 1 in 1000 that you will have a *single* collision). Using a |
|
267 |
collision chance of 10^-3, and number of keys 10^6, means we need (12+3)*0.4 = |
|
4176.2.4
by John Arbash Meinel
The mini-index is the key to shrinking the key width and still being able to 'scale'. |
268 |
6 bytes. For 10M keys, you need (14+3)*0.4 = 6.8 aka 7. We get that extra byte |
269 |
from the ``mini-index``. In an index with a lot of keys, you want a bigger |
|
270 |
fan-out up front anyway, which gives you more bytes consumed and extends your |
|
271 |
effective key width. |
|
4176.2.1
by John Arbash Meinel
Start writing up the index doc in a ReST discussion. |
272 |
|
4176.2.2
by John Arbash Meinel
Further understanding shows just how unlikely a collision is, at 8 bytes and 100M keys. |
273 |
Also taking one more look at ``H ~ 0.4 (2N + E)``, you can rearrange and |
274 |
consider that for every order of magnitude more keys you insert, your chance |
|
275 |
for collision goes up by 2 orders of magnitude. But for 100M keys, 8 bytes |
|
4176.2.4
by John Arbash Meinel
The mini-index is the key to shrinking the key width and still being able to 'scale'. |
276 |
gives you a 1 in 10,000 chance of collision, and that is gotten at a 16-bit |
277 |
fan-out (64k-way), but for 100M keys, we would likely want at least 20-bit fan |
|
278 |
out. |
|
4176.2.2
by John Arbash Meinel
Further understanding shows just how unlikely a collision is, at 8 bytes and 100M keys. |
279 |
|
4176.2.3
by John Arbash Meinel
next round of comments about the improved chk index. |
280 |
You can also see this from the original equation with a bit of rearranging:: |
281 |
||
282 |
epsilon = 1 - e^(-n^2 / 2^(h+1)) |
|
283 |
epsilon = 1 - e^(-(2^N)^2 / (2^(h+1))) = 1 - e^(-(2^(2N))(2^-(h+1))) |
|
284 |
= 1 - e^(-(2^(2N - h - 1))) |
|
285 |
||
286 |
Such that you want ``2N - h`` to be a very negative integer, such that |
|
287 |
``2^-X`` is thus very close to zero, and ``1-e^0 = 0``. But you can see that |
|
288 |
if you want to double the number of source texts, you need to quadruple the |
|
289 |
number of bits. |
|
290 |
||
291 |
||
292 |
Scaling Sizes |
|
293 |
============= |
|
294 |
||
295 |
Scaling up |
|
296 |
---------- |
|
297 |
||
298 |
We have said we want to be able to scale to a tree with 1M files and 1M |
|
4176.2.6
by John Arbash Meinel
A few notes, some updates from ian. |
299 |
commits. With a 255-way fan out for chk pages, you need 2 internal nodes, |
4176.2.3
by John Arbash Meinel
next round of comments about the improved chk index. |
300 |
and a leaf node with 16 items. (You maintain 2 internal nodes up until 16.5M |
301 |
nodes, when you get another internal node, and your leaf nodes shrink down to |
|
302 |
1 again.) If we assume every commit averages 10 changes (large, but possible, |
|
303 |
especially with large merges), then you get 1 root + 10*(1 internal + 1 leaf |
|
304 |
node) per commit or 21 nodes per commit. At 1M revisions, that is 21M chk |
|
305 |
nodes. So to support the 1Mx1M project, we really need to consider having up |
|
306 |
to 100M chk nodes. |
|
307 |
||
308 |
Even if you went up to 16M tree nodes, that only bumps us up to 31M chk |
|
309 |
nodes. Though it also scales by number of changes, so if you had a huge churn, |
|
310 |
and had 100 changes per commit and a 16M node tree, you would have 301M chk |
|
311 |
nodes. Note that 8 bytes (64-bits) in the prefix still only gives us a 0.27% |
|
312 |
chance of collision (1 in 370). Or if you had 370 projects of that size, with |
|
313 |
all different content, *one* of them would have a collision in the index. |
|
314 |
||
315 |
We also should consider that you have the ``(parent_id,basename) => file_id`` |
|
316 |
map that takes up its own set of chk pages, but testing seems to indicate that |
|
317 |
it is only about 1/10th that of the ``id_to_entry`` map. (rename,add,delete |
|
318 |
are much less common then content changes.) |
|
319 |
||
320 |
As a point of reference, one of the largest projects today OOo, has only 170k |
|
321 |
revisions, and something less than 100k files (and probably 4-5 changes per |
|
322 |
commit, but their history has very few merges, being a conversion from CVS). |
|
323 |
At 100k files, they are probably just starting to hit 2-internal nodes, so |
|
4176.2.6
by John Arbash Meinel
A few notes, some updates from ian. |
324 |
they would end up with 10 pages per commit (as a fair-but-high estimate), and |
4176.2.3
by John Arbash Meinel
next round of comments about the improved chk index. |
325 |
at 170k revs, that would be 1.7M chk nodes. |
326 |
||
327 |
||
328 |
Scaling down |
|
329 |
------------ |
|
330 |
||
331 |
While it is nice to scale to a 16M files tree with 1M files (100M total |
|
332 |
changes), it is also important to scale efficiently to more *real world* |
|
333 |
scenarios. Most projects will fall into the 255-64k file range, which is where |
|
334 |
you have one internal node and 255 leaf nodes (1-2 chk nodes per commit). And |
|
335 |
a modest number of changes (10 is generally a high figure). At 50k revisions, |
|
336 |
that would give you 50*2*10=500k chk nodes. (Note that all of python has 303k |
|
337 |
chk nodes, all of launchpad has 350k, mysql-5.1 in gc255 rather than gc255big had |
|
338 |
650k chk nodes, [depth=3].) |
|
339 |
||
340 |
So for these trees, scaling to 1M nodes is more than sufficient, and allows us |
|
341 |
to use a 6-byte prefix per record. At a minimum, group records could use a |
|
342 |
4-byte start and 3-byte length, but honestly, they are a tiny fraction of the |
|
343 |
overall index size, and it isn't really worth the implementation cost of being |
|
344 |
flexible here. We can keep a field in the header for the group record layout |
|
345 |
(8, 4) and for now just assert that this size is fixed. |
|
346 |
||
347 |
||
348 |
Other discussion |
|
349 |
================ |
|
350 |
||
351 |
group encoding |
|
352 |
-------------- |
|
353 |
||
354 |
In the above scheme we store the group locations as an 8-byte start, and |
|
355 |
4-byte length. We could theoretically just store a 4-byte length, and then you |
|
356 |
have to read all of the groups and add them up to determine the actual start |
|
357 |
position. The trade off is a direct jump-to-location versus storing 3x the |
|
358 |
data. Given when you have 64k groups you will need only .75MiB to store it, |
|
359 |
versus the 120MB for the actual entries, this seems to be no real overhead. |
|
360 |
Especially when you consider that 10M chk nodes should fit in only 250 groups, |
|
361 |
so total data is actually only 3KiB. Then again, if it was only 1KiB it is |
|
362 |
obvious that you would read the whole thing in one pass. But again, see the |
|
363 |
pathological "conversion creating 1 group per chk page" issue. |
|
364 |
||
365 |
Also, we might want to support more than 64k groups in a given index when we |
|
366 |
get to the point of storing file content in a CHK index. A lot of the analysis |
|
367 |
about the number of groups is based on the 100 byte compression of CHK nodes, |
|
368 |
which would not be true with file-content. We should compress well, I don't |
|
369 |
expect us to compress *that* well. Launchpad shows that the average size of a |
|
370 |
content record is about 500-600 bytes (after you filter out the ~140k that are |
|
371 |
NULL content records). At that size, you expect to get approx 7k records per |
|
372 |
group, down from 40k. Going further, though, you also want to split groups |
|
373 |
earlier, since you end up with better compression. so with 100,000 unique file |
|
374 |
texts, you end up with ~100 groups. With 1M revisions @ 10 changes each, you |
|
375 |
have 10M file texts, and would end up at 10,485 groups. That seems like more |
|
376 |
64k groups is still more than enough head room. You need to fit only 100 |
|
377 |
entries per group, to get down to where you are getting into trouble (and have |
|
378 |
10M file texts.) Something to keep an eye on, but unlikely to be something |
|
379 |
that is strictly a problem. |
|
380 |
||
381 |
Still reasonable to have a record in the header indicating that index entries |
|
382 |
use a 2-byte group entry pointer, and allow it to scale to 3 (we may also find |
|
383 |
a win scaling it down to 1 in the common cases of <250 groups). Note that if |
|
384 |
you have the full 4MB groups, it takes 256 GB of compressed content to fill |
|
385 |
64k records. And our groups are currently scaled that we require at least |
|
386 |
1-2MB before they can be considered 'full'. |
|
387 |
||
388 |
||
389 |
variable length index entries |
|
390 |
----------------------------- |
|
391 |
||
392 |
The above had us store 8-bytes of sha hash, 2 bytes of group number, and |
|
393 |
2 bytes for record-in-group. However, since we have the variable-pointer |
|
394 |
mini-index, we could consider having those values be 'variable length'. So |
|
395 |
when you read the bytes between the previous-and-next record, you have a |
|
396 |
parser that can handle variable width. The main problem is that to encode |
|
397 |
start/stop of record takes some bytes, and at 12-bytes for a record, you don't |
|
398 |
have a lot of space to waste for a "end-of-entry" indicator. The easiest would |
|
399 |
be to store things in base-128 (high bit indicates the next byte also should |
|
400 |
be included). |
|
401 |
||
402 |
||
403 |
storing uncompressed offset + length |
|
404 |
------------------------------------ |
|
405 |
||
406 |
To get the smallest index possible, we store only a 2-byte 'record indicator' |
|
4176.2.6
by John Arbash Meinel
A few notes, some updates from ian. |
407 |
inside the index, and then assume that it can be decoded once we've read the |
4176.2.3
by John Arbash Meinel
next round of comments about the improved chk index. |
408 |
actual group. This is certainly possible, but it represents yet another layer |
409 |
of indirection before you can actually get content. If we went with |
|
410 |
variable-length index entries, we could probably get most of the benefit with |
|
411 |
a variable-width start-of-entry value. The length-of-content is already being |
|
412 |
stored as a base128 integer starting at the second byte of the uncompressed |
|
413 |
data (the first being the record type, fulltext/delta). It complicates some of |
|
414 |
our other processing, since we would then only know how much to decompress to |
|
415 |
get the start of the record. |
|
416 |
||
417 |
Another intriguing possibility would be to store the *end* of the record in |
|
418 |
the index, and then in the data stream store the length and type information |
|
419 |
at the *end* of the record, rather than at the beginning (or possibly at both |
|
420 |
ends). Storing it at the end is a bit unintuitive when you think about reading |
|
421 |
in the data as a stream, and figuring out information (you have to read to the |
|
422 |
end, then seek back) But a given GC block does store the |
|
423 |
length-of-uncompressed-content, which means we can trivially decompress, jump |
|
424 |
to the end, and then walk-backwards for everything else. |
|
425 |
||
426 |
Given that every byte in an index entry costs 10MiB in a 10M index, it is |
|
427 |
worth considering. At 4MiB for a block, base 128 takes 4 bytes to encode the |
|
428 |
last 50% of records (those beyond 2MiB), 3 bytes for everything from 16KiB => |
|
429 |
2MiB. So the expected size is for all intents and purposes, 3.5 bytes. (Just |
|
430 |
due to an unfortunate effect of where the boundary is that you need more |
|
431 |
bytes.) If we capped the data at 2MB, the expected drops to just under 3 |
|
432 |
bytes. Note that a flat 3bytes could decode up to 16MiB, which would be much |
|
433 |
better for our purpose, but wouldn't let us write groups that had a record |
|
434 |
after 16MiB, which doesn't work for the ISO case. Though it works *absolutely* |
|
435 |
fine for the CHK inventory cases (what we have today). |
|
436 |
||
437 |
||
438 |
null content |
|
439 |
------------ |
|
440 |
At the moment, we have a lot of records in our per-file graph that refers to |
|
441 |
empty content. We get one for every symlink and directory, for every time that |
|
442 |
they change. This isn't specifically relevant for CHK pages, but for |
|
443 |
efficiency we could certainly consider setting "group = 0 entry = 0" to mean |
|
444 |
that this is actually a no-content entry. It means the group block itself |
|
445 |
doesn't have to hold a record for it, etc. Alternatively we could use |
|
446 |
"group=FFFF entry = FFFF" to mean the same thing. |
|
447 |
||
448 |
||
4176.2.5
by John Arbash Meinel
There is some concern that we can't return all keys just by reading the index. |
449 |
``VF.keys()`` |
450 |
------------- |
|
451 |
At the moment, some apis expect that you can list the references by reading |
|
452 |
all of the index. We would like to get away from this anyway, as it doesn't |
|
453 |
scale particularly well. However, with this format, we no longer store the |
|
454 |
exact value for the content. The content is self describing, and we *would* be |
|
455 |
storing enough to uniquely decide which node to read. Though that is actually |
|
456 |
contained in just 4-bytes (2-byte group, 2-byte group entry). |
|
457 |
||
458 |
We use ``VF.keys()`` during 'pack' and 'autopack' to avoid asking for content |
|
459 |
we don't have, and to put a counter on the progress bar. For the latter, we |
|
460 |
can just use ``index.key_count()`` for the former, we could just properly |
|
461 |
handle ``AbsentContentFactory``. |
|
462 |
||
4176.2.6
by John Arbash Meinel
A few notes, some updates from ian. |
463 |
|
464 |
More than 64k groups |
|
465 |
-------------------- |
|
466 |
Doing a streaming conversion all at once is still something to consider. As it |
|
467 |
would default to creating all chk pages in separate groups (300-400k easily). |
|
468 |
However, just making the number of group block entries variable, and allowing |
|
469 |
the pointer in each entry to be variable should suffice. At 3 bytes for the |
|
470 |
group pointer, we can refer to 16.7M groups. It does add complexity, but it is |
|
471 |
likely necessary to allow for arbitrary cases. |
|
472 |
||
4176.2.1
by John Arbash Meinel
Start writing up the index doc in a ReST discussion. |
473 |
.. |
474 |
vim: ft=rst tw=78 ai |