5
This document outlines how we decide to pre-read extra nodes in the btree
12
Because of the latency involved in making a request, it is often better to make
13
fewer large requests, rather than more small requests, even if some of the
14
extra data will be wasted.
19
Using my connection as an example, I have a max bandwidth of 160kB/s, and a
20
latency of between 100-400ms to London, I'll use 200ms for this example. With
21
this connection, in 200ms you can download 32kB. So if you make 10 requests for
22
4kB of data, you spend 10*.2s = 2s sending the requests, and 4*10/160 = .25s
23
actually downloading the data. If, instead, you made 3 requests for 32kB of
24
data each, you would take 3*.2s = .6s for requests, and 32*3/160 = .6s for
25
downloading the data. So you save 2.25 - 1.2 = 1.05s even though you downloaded
26
32*3-4*10 = 56kB of data that you probably don't need. On the other hand, if
27
you made 1 request for 480kB, you would take .2s for the request, and
28
480/160=3s for the data. So you end up taking 3.2s, because of the wasted
35
This is meant to give a basic feeling for how the btree index is laid out on
36
disk, not give a rigorous discussion. For that look elsewhere[ref?].
38
The basic structure is that we have pages of 4kB. Each page is either a leaf,
39
which holds the final information we are interested in, or is an internal node,
40
which contains a list of references to the next layer of nodes. The layers are
41
structured such that all nodes for the top layer come first, then the nodes for
42
the next layer, linearly in the file.
48
In the simplest example, all the data fits into a single page, the root node.
49
This means the root node is a leaf node.
55
As soon as the data cannot fit in a single node, we create a new internal node,
56
make that the root, and start to create multiple leaf nodes. The root node then
57
contains the keys which divide the leaf pages. (So if leaf node 1 ends with
58
'foo' and leaf node 2 starts with 'foz', the root node would hold the key 'foz'
65
It is possible for enough leaf nodes to be created, that we cannot fit all
66
there references in a single node. In this case, we again split, creating
67
another layer, and setting that as the root. This layer then references the
68
intermediate layer, which references the final leaf nodes.
70
In all cases, the root node is a single page wide. The next layer can have 2-N
77
Empirically, we've found that the number of references that can be stored on a
78
page varies from about 60 to about 180, depending on how much we compress, and
79
how similar the keys are. Internal nodes also achieve approximately the same
80
compression, though they seem to be closer to 80-100 and not as variable. For
81
most of this discussion, we will assume each page holds 100 entries, as that
82
makes the math nice and clean.
84
So the idea is that if you have <100 keys, they will probably all fit on the
85
root page. If you have 100 - 10,000 keys, we will have a 2-layer structure, if
86
you have 10,000 - 1,000,000 keys, you will have a 3-layer structure. 10^6-10^8
93
It is important to be aware of what sort of data requests will be made on these
94
indexes, so that we know how to optimize them. This is still a work in
95
progress, but generally we are searching through ancestry. The final
96
information (in the leaf nodes) is stored in sorted order. Revision ids are
97
generally of the form "prefix:committer@email-timestamp-randomtail".
98
This means that revisions made by the same person around the same time will be
99
clustered, but revisions made by different people at the same time will not be
101
For files, the keys are ``(file-id, revision-id)`` tuples. And file-ids are
102
generally ``basename-timestamp-random-count`` (depending on the converter).
103
This means that all revisions for a given file-id will be grouped together, and
104
that files with similar names will be grouped together. However, files
105
committed in the same revisions will not be grouped together in the index.[1]_
107
.. [1] One interesting possibility would be to change file-ids from being
108
'basename-...', to being 'containing-dirname-filename-...', which would
109
group files in the similarly named directories together.
112
In general, we always start with a request for the root node of the index, as
113
it tells us the final structure of the rest of the index. How many total pages,
114
what pages are internal nodes and what layer, which ones are leaves. Before
115
this point, we do know the *size* of the index, because that is stored in the
119
Thoughts on expansion
120
=====================
122
This is just a bullet list of things to consider when expanding a request.
124
* We generally assume locality of reference. So if we are currently reading
125
page 10, we are more likely to read page 9 or 11 than we are page 20.
127
* However, locality of reference only really holds within a layer. If we are
128
reading the last node in a layer, we are unlikely to read the first node of
129
the next layer. In fact, we are most likely to read the *last* node of the
132
More directly, we are probably equally likely to read any of the nodes in the
133
next layer, which could be referred to by this layer. So if we have a
134
structure of 1 root node, 100 intermediate nodes, and 10,000 leaf nodes.
135
They will have offsets: 0, 1-101, 102-10,102.
137
If we read the root node, we are likely to want any of the 1-101 nodes
138
(because we don't know where the key points). If we are reading node 90, then
139
we are likely to want a node somewhere around 9,100-9,200.
141
* When expanding a request, we are considering that we probably want to read on
142
the order of 10 pages extra. (64kB / 4kB = 16 pages.) It is unlikely that we
143
want to expand the requests by 100.
145
* At the moment, we assume that we don't have an idea of where in the next
146
layer the keys might fall. We *could* use a predictive algorithm assuming
147
homogenous distribution. When reading the root node, we could assume an even
148
distribution from 'a-z', so that a key starting with 'a' would tend to fall
149
in the first few pages of the next layer, while a key starting with 'z' would
150
fall at the end of the next layer.
151
However, this is quite likely to fail in many ways. Specific examples:
153
* Converters tend to use an identical prefix. So all revisions will start
154
with 'xxx:', leading us to think that the keys fall in the last half,
155
when in reality they fall evenly distributed.
157
* When looking in text indexes. In the short term, changes tend to be
158
clustered around a small set of files. Short term changes are unlikely to
159
cross many pages, but it is unclear what happens in the mid-term.
160
Obviously in the long term, changes have happened to all files.
162
A possibility, would be to use this after reading the root node. And then
163
using an algorithm that compares the keys before and after this record, to
164
find what a distribution would be, and estimate the next pages.
166
This is a lot of work for a potentially small benefit, though.
168
* When checking for N keys, we do sequential lookups in each layer. So we look
169
at layer 1 for all N keys, then in layer 2 for all N keys, etc. So our
170
requests will be clustered by layer.
172
* For projects with large history, we are probably more likely to end up with a
173
bi-modal distribution of pack files. Where we have 1 pack file with a large
174
index, and then several pack files with small indexes, several with tiny
175
indexes, but no pack files with medium sized indexes.
176
This is because a command like ``bzr pack`` will combine everything into a
177
single large file. Commands like ``bzr commit`` will create an index with a
178
single new record, though these will be packaged together by autopack.
179
Commands like ``bzr push`` and ``bzr pull`` will create indexes with more
180
records, but these are unlikely to be a significant portion of the history.
181
Consider bzr has 20,000 revisions, a single push/pull is likely to only be
182
100-200 revisions, or 1% of the history.
184
Note that there will always be cases where things are evenly distributed, but
185
we probably shouldn't *optimize* for that case.
187
* 64kB is 16 pages. 16 pages is approximately 1,600 keys.
189
* We are considering an index with 1 million keys to be very large. 10M is
190
probably possible, and maybe 100M, but something like 1 billion keys is
191
unlikely. So a 3-layer index is fairly common (it exists already in bzr), but
192
a 4-layer is going to be quite rare, and we will probably never see a
195
* There are times when the second layer is going to be incompletely filled out.
196
Consider an index with 101 keys. We found that we couldn't fit everything
197
into a single page, so we expanded the btree into a root page and a leaf
198
page, and started a new leaf page. However, the root node only has a single
199
entry. There are 3 pages, but only one of them is "full".
200
This happens again when we get near the 10,000 node barrier. We found we
201
couldn't fit the index in a single page, so we split it into a higher layer,
202
and 1 more sub-layer. So we have 1 root node, 2 layer-2 nodes, and N leaf
203
nodes (layer 3). If we read the first 3 nodes, we will have read all internal
206
It is certainly possible to detect this for the first-split case (when things
207
no-longer fit into just the root node), as there will only be a few nodes
208
total. Is it possible to detect this from only the 'size' information for the
209
second-split case (when the index no longer fits in a single page, but still
210
fits in only a small handful of pages)?
212
This only really works for the root + layer 2. For layers 3+ they will always
213
be too big to read all at once. However, until we've read the root, we don't
214
know the layout, so all we have to go on is the size of the index, though
215
that also gives us the explicit total number of pages.
216
So it doesn't help to read the root page and then decide. However, on the
217
flip side, if we read *before* the split, then we don't gain much, as we are
218
reading pages we aren't likely to be interested in.
222
We have 100 keys, which fits onto 100 pages, with a single root node. At
223
1,100 keys, it would be 101 leaf pages, which would then cause us to need 2
224
index pages, triggering an extra layer. However, this is very sensitive to
225
the number of keys we fit per-page, which depends on the compression.
226
Although, we could consider 2,000 keys. Which would be 200 leaf nodes, and
227
2 intermediate nodes, and a single root node. It is unlikely that we would
228
ever be able to fit 200 references into a single root node.
230
So if we pretend that we split at 1 page, 100 pages, and 10,000 pages. We
231
might be able to say, at 1-5 pages, read all pages, for 5-100 pages, read
232
only the root. At 100 - 500 pages, read 1-5 pages, for 500-10,000 read only
233
the root. At 10,000-50,000 read 1-5 pages again, but above 50,000 read only
234
the root. We could bias this a bit smaller, say at powers of 80, instead of
235
powers of 100, etc. The basic idea is that if we are *close* to a layer
236
split, go ahead and read a small number of extra pages.
238
* The previous discussion applies whenever we have an upper layer that is not
239
completely full. So the pages referenced by the last node from the upper
240
layer will often not have a full 100-way fan out. Probably not worthwhile
243
* Sometimes we will be making a very small request for a very small number of
244
keys, we don't really want to bloat tiny requests. Hopefully we can find a
245
decent heuristic to determine when we will be wanting extra nodes later,
246
versus when we expect to find all we want right now.
252
This is the basic outline of the algorithm.
254
1. If we don't know the size of the index, don't expand as we don't know what
255
is available. (This only really applies to the pack-names file, which is
256
unlikely to ever become larger than 1 page anyway.)
258
2. If a request is already wide enough to be greater than the number of
259
recommended pages, don't bother trying to expand. This only really happens
260
with LocalTransport which recommends a single page.
262
3. Determine what pages have already been read (if any). If the pages left to
263
read can fit in a single request, just request them. This tends to happen on
264
medium sized indexes (ones with low hundreds of revisions), and near the end
265
when we've read most of the whole index already.
267
4. If we haven't read the root node yet, and we can't fit the whole index into
268
a single request, only read the root node. We don't know where the layer
269
boundaries are anyway.
271
5. If we haven't read "tree depth" pages yet, and are only requesting a single
272
new page don't expand. This is meant to handle the 'lookup 1 item in the
273
index' case. In a large pack file, you'll read only a single page at each
274
layer and then be done. When spidering out in a search, this will cause us
275
to take a little bit longer to start expanding, but once we've started we'll
276
be expanding at full velocity. This could be improved by having indexes
277
inform each other that they have already entered the 'search' phase, or by
278
having a hint from above to indicate the same.
280
However, remember the 'bi-modal' distribution. Most indexes will either be
281
very small, or very large. So either we'll read the whole thing quickly, or
282
we'll end up spending a lot of time in the index. Which makes a small number
283
of extra round trips to large indexes a small overhead. For 2-layer nodes,
284
this only 'wastes' one round trip.
286
6. Now we are ready to expand the requests. Expand by looking for more pages
287
next to the ones requested that fit within the current layer. If you run
288
into a cached page, or a layer boundary, search further only in the opposite
289
direction. This gives us proper locality of reference, and also helps
290
because when a search goes in a single direction, we will continue to
291
prefetch pages in that direction.