====================
BTree Index Prefetch
====================

This document outlines how we decide to pre-read extra nodes in the btree
index.


Rationale
=========

Because of the latency involved in making a request, it is often better to make
fewer large requests, rather than more small requests, even if some of the
extra data will be wasted.

Example
-------

Using my connection as an example, I have a max bandwidth of 160kB/s, and a
latency of between 100-400ms to London; I'll use 200ms for this example. With
this connection, in 200ms you can download 32kB. So if you make 10 requests for
4kB of data, you spend 10*.2s = 2s sending the requests, and 4*10/160 = .25s
actually downloading the data. If, instead, you made 3 requests for 32kB of
data each, you would take 3*.2s = .6s for requests, and 32*3/160 = .6s for
downloading the data. So you save 2.25 - 1.2 = 1.05s even though you downloaded
32*3-4*10 = 56kB of data that you probably don't need. On the other hand, if
you made 1 request for 480kB, you would take .2s for the request, and
480/160 = 3s for the data. So you end up taking 3.2s, because of the wasted
440kB.
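
The back-of-the-envelope numbers above can be reproduced with a small helper
(a sketch only; the 160kB/s and 200ms figures are just the example values used
in this section)::

  def transfer_time(num_requests, kb_per_request, bandwidth_kb_s=160.0,
                    latency_s=0.2):
      """Total seconds spent on request latency plus download time."""
      request_time = num_requests * latency_s
      download_time = (num_requests * kb_per_request) / bandwidth_kb_s
      return request_time + download_time

  print(transfer_time(10, 4))    # 2.25s: many small requests
  print(transfer_time(3, 32))    # 1.2s:  a few medium requests
  print(transfer_time(1, 480))   # 3.2s:  one huge request wastes bandwidth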


BTree Structure
===============

This is meant to give a basic feeling for how the btree index is laid out on
disk, not give a rigorous discussion. For that, look elsewhere [ref?].

The basic structure is that we have pages of 4kB. Each page is either a leaf,
which holds the final information we are interested in, or is an internal node,
which contains a list of references to the next layer of nodes. The layers are
structured such that all nodes for the top layer come first, then the nodes for
the next layer, linearly in the file.


Example 1 layer
---------------

In the simplest example, all the data fits into a single page, the root node.
This means the root node is a leaf node.


Example 2 layer
---------------

As soon as the data cannot fit in a single node, we create a new internal node,
make that the root, and start to create multiple leaf nodes. The root node then
contains the keys which divide the leaf pages. (So if leaf node 1 ends with
'foo' and leaf node 2 starts with 'foz', the root node would hold the key 'foz'
at position 0).
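
Finding the right child page for a key is then just a bisection over the
dividing keys held by the internal node. This is only an illustrative sketch
(the helper name and keys are made up; it is not the bzrlib implementation)::

  import bisect

  def leaf_for_key(dividing_keys, key):
      """Return the offset (among the leaves) of the page that may hold key.

      dividing_keys[i] holds the first key of leaf i + 1, as in the 'foz'
      example above.
      """
      # bisect_right counts how many dividing keys are <= key, which is
      # exactly which leaf page to descend into.
      return bisect.bisect_right(dividing_keys, key)

  print(leaf_for_key(['foz'], 'foo'))    # 0 -> leaf node 1
  print(leaf_for_key(['foz'], 'fozzy'))  # 1 -> leaf node 2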


Example 3 layer
---------------

It is possible for enough leaf nodes to be created that we cannot fit all
their references in a single node. In this case, we again split, creating
another layer, and setting that as the root. This layer then references the
intermediate layer, which references the final leaf nodes.

In all cases, the root node is a single page wide. The next layer can have 2-N
nodes.


Current Info
------------

Empirically, we've found that the number of references that can be stored on a
page varies from about 60 to about 180, depending on how much we compress, and
how similar the keys are. Internal nodes also achieve approximately the same
compression, though they seem to be closer to 80-100 and not as variable. For
most of this discussion, we will assume each page holds 100 entries, as that
makes the math nice and clean.

So the idea is that if you have <100 keys, they will probably all fit on the
root page. If you have 100 - 10,000 keys, we will have a 2-layer structure; if
you have 10,000 - 1,000,000 keys, you will have a 3-layer structure. 10^6-10^8
will be 4-layer, etc.
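
Under the 100-entries-per-page assumption, the layer count for a given number
of keys can be estimated with a tiny helper (an illustrative sketch, not
bzrlib code)::

  def estimated_layers(num_keys, entries_per_page=100):
      """Smallest number of layers whose leaves can hold num_keys."""
      layers = 1
      capacity = entries_per_page
      while capacity < num_keys:
          layers += 1
          capacity *= entries_per_page
      return layers

  print(estimated_layers(99))       # 1: everything fits in the root page
  print(estimated_layers(5000))     # 2
  print(estimated_layers(500000))   # 3
  print(estimated_layers(10 ** 7))  # 4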


Data and Request
================

It is important to be aware of what sort of data requests will be made on these
indexes, so that we know how to optimize them. This is still a work in
progress, but generally we are searching through ancestry. The final
information (in the leaf nodes) is stored in sorted order. Revision ids are
generally of the form "prefix:committer@email-timestamp-randomtail".
This means that revisions made by the same person around the same time will be
clustered, but revisions made by different people at the same time will not be
clustered.

For files, the keys are ``(file-id, revision-id)`` tuples, and file-ids are
generally ``basename-timestamp-random-count`` (depending on the converter).
This means that all revisions for a given file-id will be grouped together, and
that files with similar names will be grouped together. However, files
committed in the same revision will not be grouped together in the index. [1]_

.. [1] One interesting possibility would be to change file-ids from being
   'basename-...', to being 'containing-dirname-filename-...', which would
   group files in similarly named directories together.
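
The grouping described above falls straight out of tuple sort order. A toy
illustration (the ids below are invented, purely to show the ordering)::

  keys = [
      ('apple-20080101-aaa', 'joe@example.com-20080301-xxx'),
      ('apple-20080101-aaa', 'jane@example.com-20080305-yyy'),
      ('zebra-20080102-bbb', 'joe@example.com-20080301-xxx'),
  ]
  for key in sorted(keys):
      print(key)
  # The two 'apple-...' entries sort next to each other, while the two
  # entries from revision 'joe@...-xxx' (which touched both files) do not.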


In general, we always start with a request for the root node of the index, as
it tells us the final structure of the rest of the index: how many total pages
there are, which pages are internal nodes (and at what layer), and which ones
are leaves. Before this point, we do know the *size* of the index, because that
is stored in the ``pack-names`` file.
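
Because pages are a fixed 4kB, the index size recorded in ``pack-names``
already tells us the total number of pages before anything has been read (a
trivial helper, shown only as an illustration)::

  PAGE_SIZE = 4096

  def total_pages(index_size_in_bytes):
      """Number of 4kB pages needed to hold an index of the given size."""
      return -(-index_size_in_bytes // PAGE_SIZE)   # ceiling division

  print(total_pages(4096))    # 1: the root page is itself a leaf
  print(total_pages(50000))   # 13: a root page plus a dozen more pages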


Thoughts on expansion
=====================

This is just a bullet list of things to consider when expanding a request.

* We generally assume locality of reference. So if we are currently reading
  page 10, we are more likely to read page 9 or 11 than we are page 20.

* However, locality of reference only really holds within a layer. If we are
  reading the last node in a layer, we are unlikely to read the first node of
  the next layer. In fact, we are most likely to read the *last* node of the
  next layer.

  More directly, we are probably equally likely to read any of the nodes in
  the next layer which could be referred to by this layer. So if we have a
  structure of 1 root node, 100 intermediate nodes, and 10,000 leaf nodes,
  they will have offsets of roughly 0, 1-101, and 102-10,102. (A short sketch
  after this list shows how these layer offsets can be computed.)

  If we read the root node, we are likely to want any of the 1-101 nodes
  (because we don't know where the key points). If we are reading node 90, then
  we are likely to want a node somewhere around 9,100-9,200.

* When expanding a request, we are considering that we probably want to read on
  the order of 10 pages extra. (64kB / 4kB = 16 pages.) It is unlikely that we
  want to expand the requests by 100.

* At the moment, we assume that we don't have an idea of where in the next
  layer the keys might fall. We *could* use a predictive algorithm assuming a
  homogeneous distribution. When reading the root node, we could assume an even
  distribution from 'a-z', so that a key starting with 'a' would tend to fall
  in the first few pages of the next layer, while a key starting with 'z' would
  fall at the end of the next layer.
  However, this is quite likely to fail in many ways. Specific examples:

  * Converters tend to use an identical prefix, so all revisions will start
    with 'xxx:', leading us to think that the keys fall in the last half,
    when in reality they are evenly distributed.

  * When looking in text indexes, short-term changes tend to be clustered
    around a small set of files. Short-term changes are unlikely to cross
    many pages, but it is unclear what happens in the mid-term. Obviously in
    the long term, changes have happened to all files.

  A possibility would be to use this after reading the root node, and then use
  an algorithm that compares the keys before and after this record to estimate
  the distribution and predict the next pages.

  This is a lot of work for a potentially small benefit, though.

* When checking for N keys, we do sequential lookups in each layer. So we look
  at layer 1 for all N keys, then in layer 2 for all N keys, etc. So our
  requests will be clustered by layer.

* For projects with large history, we are probably more likely to end up with a
  bi-modal distribution of pack files, where we have 1 pack file with a large
  index, and then several pack files with small indexes, several with tiny
  indexes, but no pack files with medium sized indexes.
  This is because a command like ``bzr pack`` will combine everything into a
  single large file. Commands like ``bzr commit`` will create an index with a
  single new record, though these will be packaged together by autopack.
  Commands like ``bzr push`` and ``bzr pull`` will create indexes with more
  records, but these are unlikely to be a significant portion of the history.
  Consider that bzr has 20,000 revisions; a single push/pull is likely to only
  be 100-200 revisions, or 1% of the history.

  Note that there will always be cases where things are evenly distributed, but
  we probably shouldn't *optimize* for that case.

* 64kB is 16 pages. 16 pages is approximately 1,600 keys.

* We are considering an index with 1 million keys to be very large. 10M is
  probably possible, and maybe 100M, but something like 1 billion keys is
  unlikely. So a 3-layer index is fairly common (it exists already in bzr), but
  a 4-layer is going to be quite rare, and we will probably never see a
  5-layer.

* There are times when the second layer is going to be incompletely filled out.
  Consider an index with 101 keys. We found that we couldn't fit everything
  into a single page, so we expanded the btree into a root page and a leaf
  page, and started a new leaf page. However, the root node only has a single
  entry. There are 3 pages, but only one of them is "full".
  This happens again when we get near the 10,000 key barrier. We find that we
  can no longer reference all of the leaf pages from a single root page, so we
  split into a higher layer and 1 more sub-layer. So we have 1 root node, 2
  layer-2 nodes, and N leaf nodes (layer 3). If we read the first 3 nodes, we
  will have read all internal nodes.

  It is certainly possible to detect this for the first-split case (when things
  no longer fit into just the root node), as there will only be a few nodes
  total. Is it possible to detect this from only the 'size' information for the
  second-split case (when the index no longer fits in a single page, but still
  fits in only a small handful of pages)?

  This only really works for the root + layer 2. For layers 3+ they will always
  be too big to read all at once. However, until we've read the root, we don't
  know the layout, so all we have to go on is the size of the index, though
  that also gives us the explicit total number of pages. So it doesn't help to
  read the root page and then decide. However, on the flip side, if we read
  *before* the split, then we don't gain much, as we are reading pages we
  aren't likely to be interested in.

  For example:

    We have 10,000 keys, which fit onto 100 leaf pages, with a single root
    node. At 10,100 keys it would be 101 leaf pages, which would then cause us
    to need 2 index pages, triggering an extra layer. However, this is very
    sensitive to the number of keys we fit per page, which depends on the
    compression. Alternatively, we could consider 20,000 keys, which would be
    200 leaf nodes, 2 intermediate nodes, and a single root node. It is
    unlikely that we would ever be able to fit 200 references into a single
    root node.

    So if we pretend that we split at 1 page, 100 pages, and 10,000 pages, we
    might be able to say: at 1-5 pages, read all pages; for 5-100 pages, read
    only the root; at 100-500 pages, read 1-5 pages; for 500-10,000 read only
    the root; at 10,000-50,000 read 1-5 pages again; but above 50,000 read only
    the root. We could bias this a bit smaller, say at powers of 80, instead of
    powers of 100, etc. The basic idea is that if we are *close* to a layer
    split, go ahead and read a small number of extra pages.

* The previous discussion applies whenever we have an upper layer that is not
  completely full. So the pages referenced by the last node from the upper
  layer will often not have a full 100-way fan out. Probably not worthwhile
  very often, though.

* Sometimes we will be making a very small request for a very small number of
  keys; we don't really want to bloat tiny requests. Hopefully we can find a
  decent heuristic to determine when we will be wanting extra nodes later,
  versus when we expect to find all we want right now.
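
As a concrete illustration of the layer offsets mentioned in the
locality-of-reference bullet above, the (start, stop) offsets of each layer can
be derived from the leaf count and an assumed fan out. This is only a sketch
under the 100-entries-per-page assumption, not the bzrlib implementation::

  def layer_offsets(num_leaf_pages, fan_out=100):
      """Return (start, stop) page offsets for each layer, root layer first."""
      # Build layer sizes from the leaves upward: each internal layer needs
      # one reference per node in the layer below it.
      sizes = [num_leaf_pages]
      while sizes[-1] > 1:
          sizes.append(-(-sizes[-1] // fan_out))  # ceiling division
      sizes.reverse()  # root first, matching the linear on-disk layout
      offsets = []
      start = 0
      for size in sizes:
          offsets.append((start, start + size))
          start += size
      return offsets

  # 10,000 leaves with a 100-way fan out: root at offset 0, internal nodes
  # at 1-100, leaves at 101-10,100 (the bullet above rounds these slightly).
  print(layer_offsets(10000))  # [(0, 1), (1, 101), (101, 10101)]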


Algorithm
=========

This is the basic outline of the algorithm.

1. If we don't know the size of the index, don't expand as we don't know what
   is available. (This only really applies to the ``pack-names`` file, which
   is unlikely to ever become larger than 1 page anyway.)

2. If a request is already wide enough to be greater than the number of
   recommended pages, don't bother trying to expand. This only really happens
   with LocalTransport, which recommends a single page.

3. Determine what pages have already been read (if any). If the pages left to
   read can fit in a single request, just request them. This tends to happen on
   medium sized indexes (ones with low hundreds of revisions), and near the end
   when we've read most of the whole index already.

4. If we haven't read the root node yet, and we can't fit the whole index into
   a single request, only read the root node. We don't know where the layer
   boundaries are anyway.

5. If we haven't read "tree depth" pages yet, and are only requesting a single
   new page, don't expand. This is meant to handle the 'lookup 1 item in the
   index' case. In a large pack file, you'll read only a single page at each
   layer and then be done. When spidering out in a search, this will cause us
   to take a little bit longer to start expanding, but once we've started we'll
   be expanding at full velocity. This could be improved by having indexes
   inform each other that they have already entered the 'search' phase, or by
   having a hint from above to indicate the same.

   However, remember the 'bi-modal' distribution. Most indexes will either be
   very small or very large, so either we'll read the whole thing quickly, or
   we'll end up spending a lot of time in the index, which makes a small number
   of extra round trips to large indexes a small overhead. For 2-layer indexes,
   this only 'wastes' one round trip.

6. Now we are ready to expand the requests. Expand by looking for more pages
   next to the ones requested that fit within the current layer. If you run
   into a cached page, or a layer boundary, search further only in the opposite
   direction. This gives us proper locality of reference, and also helps
   because when a search goes in a single direction, we will continue to
   prefetch pages in that direction.
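
Pulling those steps together, a condensed, standalone sketch of the decision
logic might look like the following. The function and parameter names here
(``recommended_pages``, ``layer_ranges``, etc.) are invented for the
illustration; this is not the actual bzrlib code::

  def expand_offsets(offsets, total_pages, recommended_pages,
                     cached_offsets, layer_ranges):
      """Return a sorted list of page offsets worth requesting together.

      offsets: pages we actually need; total_pages: None if the index size
      is unknown; cached_offsets: pages already read; layer_ranges: list of
      (start, stop) offsets per layer, root first.
      """
      if total_pages is None:
          return offsets                      # 1. unknown size: no expansion
      if len(offsets) >= recommended_pages:
          return offsets                      # 2. already wide enough
      uncached = set(range(total_pages)) - set(cached_offsets)
      if len(uncached) <= recommended_pages:
          return sorted(uncached)             # 3. just finish off the index
      if 0 not in cached_offsets:
          return [0]                          # 4. read the root node first
      tree_depth = len(layer_ranges)
      if len(cached_offsets) < tree_depth and len(offsets) == 1:
          return offsets                      # 5. single-lookup fast path
      # 6. Expand sideways within each page's layer, stopping at cached
      #    pages and at layer boundaries.
      expanded = set(offsets)
      for offset in offsets:
          start, stop = next((s, e) for s, e in layer_ranges
                             if s <= offset < e)
          left, right = offset - 1, offset + 1
          go_left = go_right = True
          while (go_left or go_right) and len(expanded) < recommended_pages:
              if go_left:
                  if left < start or left in cached_offsets:
                      go_left = False
                  else:
                      expanded.add(left)
                      left -= 1
              if go_right:
                  if right >= stop or right in cached_offsets:
                      go_right = False
                  else:
                      expanded.add(right)
                      right += 1
      return sorted(expanded)

Note how, with this structure, the first couple of single-page lookups in a
deep index are never expanded (step 5), while a long-running search expands
freely once enough pages are cached.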

..
  vim: ft=rst tw=79 ai