===================
CHK Optimized index
===================

Our current btree style index is nice as a general index, but it is not optimal
for Content-Hash-Key based content. With CHK, the keys themselves are hashes,
which means they are randomly distributed (similar keys do not refer to
similar content), and they do not compress well. However, we can create an
index which takes advantage of these abilities, rather than suffering from
them. Even further, there are specific advantages provided by
``groupcompress``, because of how individual items are clustered together.

Btree indexes also rely on zlib compression to get their compact size, and
further have to try hard to fit things into a compressed 4k page. When the key
is a sha1 hash, we would not expect to get better than 20 bytes per key, which
is the same size as the binary representation of the hash. This means we could
write an index format that gets approximately the same on-disk size, without
having the overhead of ``zlib.decompress``. Some thought would still need to
be put into how to efficiently access these records remotely.

Required information
====================
For a given groupcompress record, we need to know the offset and length of the
compressed group in the .pack file, and the start and end of the content inside
the uncompressed group. The absolute minimum is slightly less, but this is a
good starting point. The other thing to consider is that for 1M revisions and
1M files, we'll probably have 10-20M CHK pages, so we want to make sure we
have an index that can scale up efficiently.

1. A compressed sha hash is 20 bytes.

2. Pack files can be > 4GB, so we could use an 8-byte (64-bit) pointer, or we
   could store a 5-byte pointer for a cap at 1TB. 8 bytes still seems like
   overkill, even if it is the natural next size up.

3. An individual group would never be longer than 2^32, but they will often
   be bigger than 2^16. 3 bytes for length (16MB) would be the minimum safe
   length, and may not be safe if we expand groups for large content (like
   ISOs). So probably 4 bytes for group length is necessary.

4. A given start offset has to fit in the group, so another 4 bytes.

5. Uncompressed length of record is based on original size, so 4 bytes is
   expected as well.

6. That leaves us with 20+8+4+4+4 = 40 bytes per record. At the moment, btree
   compression gives us closer to 38.5 bytes per record. We don't have perfect
   compression, but we also don't have >4GB pack files (and if we did, the
   first 4GB are all under the 2^32 barrier :).
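
As a rough sketch of that arithmetic, the 40-byte record could be packed with
Python's ``struct`` module. The layout below is purely illustrative of the
sizes discussed, not bzr's actual on-disk format:

```python
import struct

# Hypothetical fixed-width layout (illustration only, not bzr's format):
#   20-byte sha1 + 8-byte group start + 4-byte group length
#   + 4-byte start offset + 4-byte uncompressed length = 40 bytes
RECORD = struct.Struct(">20sQIII")

def pack_record(sha1, group_start, group_len, start, length):
    return RECORD.pack(sha1, group_start, group_len, start, length)

rec = pack_record(b"\xab" * 20, 5 << 30, 4 << 20, 1024, 300)
print(RECORD.size, len(rec))  # 40 40
```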

If we wanted to go back to the *minimal* amount of data that we would need to
store:
1. 8 bytes of a sha hash are generally going to be more than enough to fully
   determine the entry (see `Partial hash`_). We could support some amount of
   collision in an index record, in exchange for resolving it inside the
   content. At least in theory, we don't *have* to record the whole 20 bytes
   of the sha1 hash. (8 bytes gives us less than a 1 in 1000 chance of a
   single collision for 10M nodes in an index.)

2. We could record the start and length of each group in a separate location,
   and then have each record reference the group by an 'offset'. This is
   because we expect to have many records in the same group (something like
   10k or so, though we've fit >64k under some circumstances). At a minimum,
   we have one record per group, so we have to store at least one reference
   anyway. So the maximum overhead is just the size and cost of the
   dereference (and normally it will be much, much better than that).

3. If a group reference is an 8-byte start and a 4-byte length, and we have
   10M keys but get at least 1k records per group, then we would have 10k
   groups. So we would need 120kB to record all the group offsets, and then
   each individual record would only need a 2-byte group number, rather than
   a 12-byte reference. We could be safe with a 4-byte group number, but if
   each group is ~1MB, 64k groups is 64GB. We can start with 2 bytes, but
   leave room in the header info to indicate if we have more than 64k group
   entries. Also, current grouping creates groups of 4MB each, which would
   make it 256GB to create 64k groups. And our current chk pages compress
   down to less than 100 bytes each (the average is closer to 40 bytes),
   which for 256GB of raw data would amount to 2.7 billion CHK records.
   (This will change if we start to use CHK for text records, as they do not
   compress down as small.) Using 100 bytes per 10M chk records, we have 1GB
   of compressed chk data, split into 4MB groups, or 250 total groups. Still
   << 64k groups. Conversions could create 1 chk record at a time, creating
   a group for each, but they would be foolish to not commit a write group
   after 10k revisions (assuming 6 CHK pages each).

4. We want to know the start-and-length of a record in the decompressed
   stream. This could actually be moved into a mini-index inside the group
   itself. Initial testing showed that storing an expanded "key =>
   start,offset" consumed a considerable amount of compressed space (about
   30% of the final size was just these internal indices). However, we could
   move to a pure "record 1 is at location 10-20", and then our external
   index would just have a single 'group entry number'.

   There are other internal forces that would give a natural cap of 64k
   entries per group. So without much loss of generality, we could probably
   get away with a 2-byte 'group entry' number (which then dereferences to
   an 8-byte offset + endpoint stored as a header in the group itself).

5. So for 1M keys, an ideal chk+group index would be:

    a. 6-byte hash prefix

    b. 2-byte group number

    c. 2-byte entry-in-group number

    d. a separate lookup of 12-byte group number to offset + length

    e. a variable width mini-index that splits X bits of the key. (To
       maintain small keys and a low chance of collision, this is *not*
       redundant with the value stored in (a).) This should then dereference
       into a location in the index, probably as a 4-byte reference. It is
       unlikely, but possible, to have an index >16MB; with a 10-byte entry,
       it only takes 1.6M chk nodes to do so. At the smallest end, this will
       probably be a 256-way (8-bit) fan out; at the high end it could go up
       to 64k-way (16-bit) or maybe even 1M-way (20-bit). (64k-way should
       handle up to 5-16M nodes and still allow a cheap <4k read to find the
       final entry.)
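
The 10-byte per-key entry from (a)-(c) can likewise be sketched with
``struct``; again this packing is hypothetical, just an illustration of the
widths above:

```python
import struct

# Hypothetical packing of the minimal entry: 6-byte hash prefix,
# 2-byte group number, 2-byte entry-in-group number = 10 bytes.
ENTRY = struct.Struct(">6sHH")

def make_entry(sha1, group_number, entry_in_group):
    return ENTRY.pack(sha1[:6], group_number, entry_in_group)

entry = make_entry(bytes(range(20)), 3, 1234)
print(ENTRY.size, entry)  # 10 b'\x00\x01\x02\x03\x04\x05\x00\x03\x04\xd2'
```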

So the max size for the optimal groupcompress+chk index with 10M entries
would be::

  10 * 10M (entries) + 64k * 12 (group) + 64k * 4 (mini index) = 101 MiB

So 101MiB, which breaks down as 100MiB for the actual entries, 0.75MiB for
the group records, and 0.25MiB for the mini index.
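
A quick back-of-the-envelope check of that arithmetic (treating "10M" as
10 * 2^20 entries, which is how the MiB totals above work out):

```python
MiB = 2 ** 20

entries = 10 * MiB               # "10M" keys, treated as 10 * 2^20
entry_bytes = 10 * entries       # 6-byte prefix + 2-byte group + 2-byte entry
group_bytes = 64 * 1024 * 12     # 64k groups * (8-byte start + 4-byte length)
mini_index_bytes = 64 * 1024 * 4 # 64k-way fan out * 4-byte pointer

total = entry_bytes + group_bytes + mini_index_bytes
print(entry_bytes / MiB, group_bytes / MiB, mini_index_bytes / MiB, total / MiB)
# 100.0 0.75 0.25 101.0
```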

1. Looking up a key would involve:

   a. Read ``XX`` bytes to get the header, and various config for the index,
      such as the length of the group records, the length of the mini index,
      etc.

   b. Find the offset in the mini index for the first YY bits of the key.
      Read the 4-byte pointer stored at that location (which may already be
      in the first content if we pre-read a minimum size).

   c. Jump to the location indicated, and read enough bytes to find the
      correct 12-byte record. The mini-index only indicates the start of
      records that start with the given prefix. A 64k-way index resolves 10M
      records down to 160 possibilities. So at 12 bytes each, reading all of
      them would cost 1920 bytes.

   d. Determine the offset for the group entry, which is the known ``start
      of groups`` location + 12B*offset number. Read its 12-byte record.

   e. Switch to the .pack file, and read the group header to determine where
      in the stream the given record exists. At this point, you have enough
      information to read the entire group block. For local ops, you could
      read only enough to get the header, and then only enough to decompress
      just the content you want to get at.

      Using an offset, you also don't need to decode the entire group
      header. If we assume that things are stored in fixed-size records, you
      can jump to exactly the entry that you care about, and read its 8-byte
      (start, length in uncompressed) info. If we wanted more redundancy we
      could store the 20-byte hash, but the content can verify itself.

   f. If the size of these mini headers becomes critical (8 bytes per record
      is 8% overhead for 100 byte records), we could also compress this mini
      header. Changing the number of bytes per entry is unlikely to be
      efficient, because groups standardize on 4MiB wide, which is >>64KiB
      for a 2-byte offset; 3 bytes would be enough as long as we never store
      an ISO as a single entry in the content. Variable width also isn't a
      big win, since base-128 hits 4 bytes at just 2MiB.

      For minimum size without compression, we could store only the 4-byte
      length of each node. Then to compute the offset, you have to sum all
      previous nodes. We require <64k nodes in a group, so it is up to
      256KiB for this header, but we would lose partial reads. This should
      still be cheap in compiled code (needs tests, as you can't do partial
      info), and would also have the advantage that fixed width would be
      highly compressible itself. (Most nodes are going to have a length
      that fits 1-2 bytes.)

      An alternative form would be to use the base-128 encoding. (If the MSB
      is set, then the next byte needs to be added to the current value
      shifted by 7*n bits.) This encodes 4GiB in 5 bytes, but stores 127B in
      1 byte, and 2MiB in 3 bytes. If we only stored 64k entries in a 4MiB
      group, the average size can only be 64B, which fits in a single-byte
      length, so 64KiB for this header, or only 1.5% overhead. We also don't
      have to compute the offset of *all* nodes, just the ones before the
      one we want, which is similar to what we have to do to get the actual
      content out.
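
Steps (b) and (c) above can be sketched in code. Every width and name here is
an assumption drawn from this discussion (16-bit fan out, 4-byte mini-index
pointers, 10-byte entries), not an implemented format:

```python
import struct

# Assumed widths from the discussion: 64k-way (16-bit) fan out, 4-byte
# mini-index pointers, and 10-byte entries (6-byte hash prefix +
# 2-byte group number + 2-byte entry-in-group number).
MINI_PTR = struct.Struct(">I")
ENTRY = struct.Struct(">6sHH")

def lookup(index_bytes, sha1, mini_index_start=0):
    """Resolve a sha1 to (group_number, entry_in_group), or None.

    Assumes the mini index holds fan_out+1 absolute pointers, so each
    bucket is bounded by a consecutive pointer pair.
    """
    prefix = sha1[:6]
    bucket = int.from_bytes(sha1[:2], "big")  # first 16 bits of the key
    # b. read the two pointers bounding this bucket's run of sorted entries
    off = mini_index_start + bucket * MINI_PTR.size
    start, = MINI_PTR.unpack_from(index_bytes, off)
    end, = MINI_PTR.unpack_from(index_bytes, off + MINI_PTR.size)
    # c. scan the handful of entries sharing the 16-bit prefix
    for pos in range(start, end, ENTRY.size):
        entry_prefix, group, entry = ENTRY.unpack_from(index_bytes, pos)
        if entry_prefix == prefix:
            return group, entry
    return None
```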


Partial Hash
============
The size of the index is dominated by the individual entries (the 1M
records). Saving 1 byte there saves 1MB overall, which is the same as the
group entries and mini index combined. If we can change the index so that it
can handle collisions gracefully (have multiple records for a given
collision), then we can shrink the number of bytes we need overall. Also, if
we aren't going to put the full 20 bytes into the index, then some form of
graceful handling of collisions is recommended anyway.

The current structure does this just fine, in that the mini-index
dereferences you to a "list" of records that start with that prefix. It is
assumed that those would be sorted, but we could easily have multiple
records. To resolve the exact record, you can read both records, and compute
the sha1 to decide between them. This has performance implications, as you
are now decoding 2x the records to get at one.

The chance of ``n`` texts colliding with a hash space of ``H`` is generally
given as::

    1 - e^(-n^2 / (2 H))

Or if you use ``H = 2^h``, where ``h`` is the number of bits::

    1 - e^(-n^2 / 2^(h+1))

For 1M keys and 4 bytes (32 bits), the chance of collision is for all
intents and purposes 100%. Rewriting the equation to give the number of bits
(``h``) needed versus the number of entries (``n``) and the desired
collision rate (``epsilon``)::

    h = log_2(-n^2 / ln(1-epsilon)) - 1

The denominator ``ln(1-epsilon)`` == ``-epsilon`` for small values (even
@0.1 == -0.105, and we are assuming we want a much lower chance of collision
than 10%). So we have::

    h = log_2(n^2/epsilon) - 1 = 2 log_2(n) - log_2(epsilon) - 1

Given that ``epsilon`` will often be very small and ``n`` very large, it can
be more convenient to transform it into ``epsilon = 10^-E`` and ``n =
10^N``, which gives us::

    h = 2 log_2(10^N) - log_2(10^-E) - 1
    h = log_2(10) (2N + E) - 1
    h ~ 3.3 (2N + E) - 1

Or in number of bytes ``H``, with ``h = 8H``::

    H ~ 0.4 (2N + E)

This actually has some nice understanding to be had. For every order of
magnitude we want to increase the number of keys (at the same chance of
collision), we need ~1 byte (0.8), and for every two orders of magnitude we
want to reduce the chance of collision, we need the same extra bytes. So
with 8 bytes, you have 20 orders of magnitude to work with: 10^10 keys with
a guaranteed collision, or 10 keys with a 10^-20 chance of collision.

Putting this in a different form, we could make ``epsilon == 1/n``. This
gives us an interesting simplified form::

    h = log_2(n^3) - 1 = 3 log_2(n) - 1

Writing ``n`` as ``10^N``, with ``h = 8H``::

    h = 3 N log_2(10) - 1 =~ 10 N - 1
    H ~ 1.25 N

So to have a one in a million chance of collision using 1 million keys, you
need ~59 bits, or slightly more than 7 bytes. For 10 million keys and a one
in 10 million chance of any of them colliding, you can use 9 (8.6) bytes.
With 10 bytes, we have a one in 100M chance of getting a collision in 100M
keys (substituting back, the original equation says the chance of collision
is 4e-9 for 100M keys when using 10 bytes).
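
Those byte counts can be checked numerically from the rearranged formula; a
small helper (the function name is just illustrative):

```python
import math

def prefix_bits(n, epsilon):
    """Hash-prefix bits needed for ``n`` keys at collision chance ``epsilon``.

    The rearranged birthday bound from above:
    h = log_2(-n^2 / ln(1-epsilon)) - 1
    """
    return math.log2(-(n ** 2) / math.log(1 - epsilon)) - 1

# ~59 bits for a one-in-a-million chance among 1M keys, as stated above
print(round(prefix_bits(10 ** 6, 10 ** -6)))         # 59
# 10M keys with a one-in-10M chance: ~8.6 bytes
print(round(prefix_bits(10 ** 7, 10 ** -7) / 8, 1))  # 8.6
```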

Given that the only cost for a collision is reading a second page and
ensuring the sha hash actually matches, we could use a fairly "high"
collision rate. A chance of 1 in 1000 that you will collide in an index with
1M keys is certainly acceptable. (Note that this isn't a 1 in 1000 chance
that one of those keys will be a collision, but a 1 in 1000 chance that you
will have a *single* collision.) Using a collision chance of 10^-3 and 10^6
keys, we need (12+3)*0.4 = 6 bytes. For 10M keys, you need (14+3)*0.4 = 6.8,
aka 7. We get that extra byte from the ``mini-index``. In an index with a
lot of keys, you want a bigger fan-out up front anyway, which gives you more
bytes consumed and extends your effective key width.

Also, taking one more look at ``H ~ 0.4 (2N + E)``, you can rearrange and
consider that for every order of magnitude more keys you insert, your chance
of collision goes up by 2 orders of magnitude. But for 100M keys, 8 bytes
gives you a 1 in 10,000 chance of collision, and that is obtained with a
16-bit fan-out (64k-way); for 100M keys, we would likely want at least a
20-bit fan out.

You can also see this from the original equation with a bit of rearranging,
writing ``n`` as ``2^N``::

    epsilon = 1 - e^(-n^2 / 2^(h+1))
            = 1 - e^(-(2^N)^2 / (2^(h+1))) = 1 - e^(-(2^(2N))(2^-(h+1)))
            = 1 - e^(-(2^(2N - h - 1)))

You want ``2N - h`` to be a very negative integer, so that ``2^-X`` is very
close to zero and ``1-e^0 = 0``. You can also see that doubling the number
of source texts (adding one to ``N``) requires adding two bits to ``h``
(quadrupling the hash space) to keep the same chance of collision.


Scaling Sizes
=============

Scaling up
----------

We have said we want to be able to scale to a tree with 1M files and 1M
commits. With a 255-way fan out for chk pages, you need 2 internal nodes,
and a leaf node with 16 items. (You maintain 2 internal nodes up until 16.5M
nodes, when you get another internal node, and your leaf nodes shrink down
to 1 again.) If we assume every commit averages 10 changes (large, but
possible, especially with large merges), then you get 1 root + 10*(1
internal + 1 leaf node) per commit, or 21 nodes per commit. At 1M revisions,
that is 21M chk nodes. So to support the 1Mx1M project, we really need to
consider having up to 100M chk nodes.

Even if you went up to 16M tree nodes, that only bumps us up to 31M chk
nodes. Though it also scales by number of changes, so if you had a huge
churn, and had 100 changes per commit and a 16M node tree, you would have
301M chk nodes. Note that 8 bytes (64 bits) in the prefix still only gives
us a 0.27% chance of collision (1 in 370). Or if you had 370 projects of
that size, with all different content, *one* of them would have a collision
in the index.

We also should consider that you have the ``(parent_id,basename) => file_id``
map that takes up its own set of chk pages, but testing seems to indicate
that it is only about 1/10th that of the ``id_to_entry`` map. (Renames,
adds, and deletes are much less common than content changes.)

As a point of reference, one of the largest projects today, OOo, has only
170k revisions, and something less than 100k files (and probably 4-5 changes
per commit, but their history has very few merges, being a conversion from
CVS). At 100k files, they are probably just starting to hit 2-internal
nodes, so they would end up with 10 pages per commit (as a fair-but-high
estimate), and at 170k revs, that would be 1.7M chk nodes.
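
The per-commit node counts above follow a simple pattern; a quick sanity
check under the stated assumptions (255-way fan out, no page sharing between
the changes in one commit):

```python
def chk_nodes_per_commit(changes, depth=3):
    # one root node, plus (depth - 1) rewritten nodes (internal chain + leaf)
    # for each changed item, assuming no sharing between changes
    return 1 + changes * (depth - 1)

print(chk_nodes_per_commit(10))            # 21 (1M-file tree, depth 3)
print(chk_nodes_per_commit(10, depth=4))   # 31 (16M tree nodes)
print(chk_nodes_per_commit(100, depth=4))  # 301 (huge churn)
```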


Scaling down
------------

While it is nice to scale to a 16M file tree with 1M commits (100M total
changes), it is also important to scale efficiently to more *real world*
scenarios. Most projects will fall into the 255-64k file range, which is
where you have one internal node and 255 leaf nodes (1-2 chk nodes per
commit), and a modest number of changes (10 is generally a high figure). At
50k revisions with ~10 changes each, that would give you on the order of
500k chk nodes. (Note that all of python has 303k chk nodes, all of
launchpad has 350k, and mysql-5.1 in gc255 rather than gc255big had 650k chk
nodes [depth=3].)

So for these trees, scaling to 1M nodes is more than sufficient, and allows
us to use a 6-byte prefix per record. At a minimum, group records could use
a 4-byte start and 3-byte length, but honestly, they are a tiny fraction of
the overall index size, and it isn't really worth the implementation cost of
being flexible here. We can keep a field in the header for the group record
layout (8, 4) and for now just assert that this size is fixed.


Other discussion
================

group encoding
--------------

In the above scheme we store the group locations as an 8-byte start and a
4-byte length. We could theoretically just store a 4-byte length, and then
you have to read all of the groups and add them up to determine the actual
start position. The trade off is a direct jump-to-location versus storing 3x
the data. Given that when you have 64k groups you will need only 0.75MiB to
store them, versus the 120MB for the actual entries, this seems to be no
real overhead. Especially when you consider that 10M chk nodes should fit in
only 250 groups, so the total data is actually only 3KiB. Then again, if it
was only 1KiB it is obvious that you would read the whole thing in one pass.
But again, see the pathological "conversion creating 1 group per chk page"
issue.
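
Rebuilding start offsets from lengths alone is just a prefix sum; a minimal
sketch of the length-only alternative described above (group sizes are
hypothetical):

```python
from itertools import accumulate

# If the group table stored only each group's 4-byte length, start offsets
# would be recovered by summing all preceding lengths (a prefix sum).
group_lengths = [4 << 20, 3 << 20, 4 << 20]  # hypothetical group sizes

group_starts = [0] + list(accumulate(group_lengths))[:-1]
print(group_starts)  # [0, 4194304, 7340032]
```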

Also, we might want to support more than 64k groups in a given index when we
get to the point of storing file content in a CHK index. A lot of the
analysis about the number of groups is based on the 100 byte compression of
CHK nodes, which would not be true with file content. File content should
compress well, but I don't expect it to compress *that* well. Launchpad
shows that the average size of a content record is about 500-600 bytes
(after you filter out the ~140k that are NULL content records). At that
size, you expect to get approx 7k records per group, down from 40k. Going
further, though, you also want to split groups earlier, since you end up
with better compression. So with 100,000 unique file texts, you end up with
~100 groups. With 1M revisions @ 10 changes each, you have 10M file texts,
and would end up with ~10,485 groups. So 64k groups is still more than
enough head room. You would need to fit only 100 entries per group (with 10M
file texts) before you start getting into trouble. Something to keep an eye
on, but unlikely to be something that is strictly a problem.

Still, it is reasonable to have a record in the header indicating that index
entries use a 2-byte group entry pointer, and allow it to scale to 3 (we may
also find a win scaling it down to 1 in the common case of <250 groups).
Note that if you have the full 4MB groups, it takes 256GB of compressed
content to fill 64k records. And our groups are currently scaled such that
we require at least 1-2MB before they can be considered 'full'.


variable length index entries
-----------------------------

The above had us store 8 bytes of sha hash, 2 bytes of group number, and
2 bytes for record-in-group. However, since we have the variable-pointer
mini-index, we could consider having those values be 'variable length'. So
when you read the bytes between the previous-and-next record, you have a
parser that can handle variable width. The main problem is that encoding the
start/stop of a record takes some bytes, and at 12 bytes for a record, you
don't have a lot of space to waste on an "end-of-entry" indicator. The
easiest would be to store things in base-128 (the high bit indicates whether
the next byte should also be included).
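
A minimal sketch of that base-128 (varint) scheme. The least-significant-
7-bits-first byte order is an assumption; the text only specifies the MSB
continuation rule:

```python
def encode_base128(value):
    """Encode a non-negative int, 7 bits per byte, least significant first;
    a set high bit means another byte follows."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # MSB set: keep reading
        else:
            out.append(byte)
            return bytes(out)

def decode_base128(data, offset=0):
    """Return (value, new_offset)."""
    value = shift = 0
    while True:
        byte = data[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, offset

# the sizes quoted in the text: 127B in 1 byte, <2MiB in 3, 4GiB in 5
print([len(encode_base128(v)) for v in (127, 2 ** 21 - 1, 2 ** 21, 2 ** 32)])
# [1, 3, 4, 5]
```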


storing uncompressed offset + length
------------------------------------

To get the smallest index possible, we store only a 2-byte 'record
indicator' inside the index, and then assume that it can be decoded once
we've read the actual group. This is certainly possible, but it represents
yet another layer of indirection before you can actually get content. If we
went with variable-length index entries, we could probably get most of the
benefit with a variable-width start-of-entry value. The length-of-content is
already being stored as a base128 integer starting at the second byte of the
uncompressed data (the first being the record type, fulltext/delta). It
complicates some of our other processing, since we would then only know how
much to decompress to get the start of the record.

Another intriguing possibility would be to store the *end* of the record in
the index, and then in the data stream store the length and type information
at the *end* of the record, rather than at the beginning (or possibly at
both ends). Storing it at the end is a bit unintuitive when you think about
reading in the data as a stream and figuring out information (you have to
read to the end, then seek back). But a given GC block does store the
length-of-uncompressed-content, which means we can trivially decompress,
jump to the end, and then walk backwards for everything else.

Given that every byte in an index entry costs 10MiB in a 10M index, it is
worth considering. At 4MiB for a block, base-128 takes 4 bytes to encode the
last 50% of records (those beyond 2MiB), and 3 bytes for everything from
16KiB => 2MiB. So the expected size is, for all intents and purposes, 3.5
bytes. (Just due to an unfortunate effect of where the boundary is that you
need more bytes.) If we capped the data at 2MB, the expected size drops to
just under 3 bytes. Note that a flat 3 bytes could decode up to 16MiB, which
would be much better for our purpose, but wouldn't let us write groups that
had a record after 16MiB, which doesn't work for the ISO case. Though it
works *absolutely* fine for the CHK inventory cases (what we have today).


null content
------------
At the moment, we have a lot of records in our per-file graph that refer to
empty content. We get one for every symlink and directory, every time that
they change. This isn't specifically relevant for CHK pages, but for
efficiency we could certainly consider setting "group = 0, entry = 0" to
mean that this is actually a no-content entry. It means the group block
itself doesn't have to hold a record for it, etc. Alternatively, we could
use "group = FFFF, entry = FFFF" to mean the same thing.


``VF.keys()``
-------------
At the moment, some apis expect that you can list the references by reading
all of the index. We would like to get away from this anyway, as it doesn't
scale particularly well. However, with this format, we no longer store the
exact value for the content. The content is self-describing, and we *would*
be storing enough to uniquely decide which node to read. Though that is
actually contained in just 4 bytes (2-byte group, 2-byte group entry).

We use ``VF.keys()`` during 'pack' and 'autopack' to avoid asking for
content we don't have, and to put a counter on the progress bar. For the
latter, we can just use ``index.key_count()``; for the former, we could
properly handle ``AbsentContentFactory``.


More than 64k groups
--------------------
Doing a streaming conversion all at once is still something to consider, as
it would default to creating all chk pages in separate groups (300-400k
easily). However, just making the number of group block entries variable,
and allowing the pointer in each entry to be variable, should suffice. At
3 bytes for the group pointer, we can refer to 16.7M groups. It does add
complexity, but it is likely necessary to allow for arbitrary cases.

..
  vim: ft=rst tw=78 ai