====================
BTree Index Prefetch
====================

This document outlines how we decide to pre-read extra nodes in the btree
index.


Rationale
=========

Because of the latency involved in making a request, it is often better to make
fewer large requests, rather than more small requests, even if some of the
extra data will be wasted.

Example
-------

Using my connection as an example, I have a max bandwidth of 160kB/s, and a
latency of between 100-400ms to London; I'll use 200ms for this example. With
this connection, in 200ms you can download 32kB. So if you make 10 requests for
4kB of data, you spend 10*.2s = 2s sending the requests, and 4*10/160 = .25s
actually downloading the data. If, instead, you made 3 requests for 32kB of
data each, you would take 3*.2s = .6s for requests, and 32*3/160 = .6s for
downloading the data. So you save 2.25 - 1.2 = 1.05s even though you downloaded
32*3-4*10 = 56kB of data that you probably don't need. On the other hand, if
you made 1 request for 480kB, you would take .2s for the request, and
480/160 = 3s for the data. So you end up taking 3.2s, because of the wasted
440kB.
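
The back-of-the-envelope numbers above can be reproduced with a small helper
(a sketch only; the 160kB/s and 200ms figures are just the example values used
in this section)::

  def transfer_time(num_requests, kb_per_request, bandwidth_kb_s=160.0,
                    latency_s=0.2):
      """Total seconds spent on request latency plus download time."""
      request_time = num_requests * latency_s
      download_time = (num_requests * kb_per_request) / bandwidth_kb_s
      return request_time + download_time

  print(transfer_time(10, 4))    # 2.25s: many small requests
  print(transfer_time(3, 32))    # 1.2s:  a few medium requests
  print(transfer_time(1, 480))   # 3.2s:  one huge request wastes bandwidth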


BTree Structure
===============

This is meant to give a basic feeling for how the btree index is laid out on
disk, not give a rigorous discussion. For that, look elsewhere [ref?].

The basic structure is that we have pages of 4kB. Each page is either a leaf,
which holds the final information we are interested in, or is an internal node,
which contains a list of references to the next layer of nodes. The layers are
structured such that all nodes for the top layer come first, then the nodes for
the next layer, linearly in the file.


Example 1 layer
---------------

In the simplest example, all the data fits into a single page, the root node.
This means the root node is a leaf node.


Example 2 layer
---------------

As soon as the data cannot fit in a single node, we create a new internal node,
make that the root, and start to create multiple leaf nodes. The root node then
contains the keys which divide the leaf pages. (So if leaf node 1 ends with
'foo' and leaf node 2 starts with 'foz', the root node would hold the key 'foz'
at position 0).
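
Finding the right child page for a key is then just a bisection over the
dividing keys held by the internal node. This is only an illustrative sketch
(the helper name and keys are made up; it is not the bzrlib implementation)::

  import bisect

  def leaf_for_key(dividing_keys, key):
      """Return the offset (among the leaves) of the page that may hold key.

      dividing_keys[i] holds the first key of leaf i + 1, as in the 'foz'
      example above.
      """
      # bisect_right counts how many dividing keys are <= key, which is
      # exactly which leaf page to descend into.
      return bisect.bisect_right(dividing_keys, key)

  print(leaf_for_key(['foz'], 'foo'))    # 0 -> leaf node 1
  print(leaf_for_key(['foz'], 'fozzy'))  # 1 -> leaf node 2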


Example 3 layer
---------------

It is possible for enough leaf nodes to be created that we cannot fit all
their references in a single node. In this case, we again split, creating
another layer, and setting that as the root. This layer then references the
intermediate layer, which references the final leaf nodes.

In all cases, the root node is a single page wide. The next layer can have 2-N
nodes.


Current Info
------------

Empirically, we've found that the number of references that can be stored on a
page varies from about 60 to about 180, depending on how much we compress, and
how similar the keys are. Internal nodes also achieve approximately the same
compression, though they seem to be closer to 80-100 and not as variable. For
most of this discussion, we will assume each page holds 100 entries, as that
makes the math nice and clean.

So the idea is that if you have <100 keys, they will probably all fit on the
root page. If you have 100 - 10,000 keys, we will have a 2-layer structure; if
you have 10,000 - 1,000,000 keys, you will have a 3-layer structure. 10^6-10^8
will be 4-layer, etc.
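
Under the 100-entries-per-page assumption, the layer count for a given number
of keys can be estimated with a tiny helper (an illustrative sketch, not
bzrlib code)::

  def estimated_layers(num_keys, entries_per_page=100):
      """Smallest number of layers whose leaves can hold num_keys."""
      layers = 1
      capacity = entries_per_page
      while capacity < num_keys:
          layers += 1
          capacity *= entries_per_page
      return layers

  print(estimated_layers(99))       # 1: everything fits in the root page
  print(estimated_layers(5000))     # 2
  print(estimated_layers(500000))   # 3
  print(estimated_layers(10 ** 7))  # 4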


Data and Request
================

It is important to be aware of what sort of data requests will be made on these
indexes, so that we know how to optimize them. This is still a work in
progress, but generally we are searching through ancestry. The final
information (in the leaf nodes) is stored in sorted order. Revision ids are
generally of the form "prefix:committer@email-timestamp-randomtail".
This means that revisions made by the same person around the same time will be
clustered, but revisions made by different people at the same time will not be
clustered.

For files, the keys are ``(file-id, revision-id)`` tuples, and file-ids are
generally ``basename-timestamp-random-count`` (depending on the converter).
This means that all revisions for a given file-id will be grouped together, and
that files with similar names will be grouped together. However, files
committed in the same revision will not be grouped together in the index. [1]_

.. [1] One interesting possibility would be to change file-ids from being
   'basename-...', to being 'containing-dirname-filename-...', which would
   group files in similarly named directories together.
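
The grouping described above falls straight out of tuple sort order. A toy
illustration (the ids below are invented, purely to show the ordering)::

  keys = [
      ('apple-20080101-aaa', 'joe@example.com-20080301-xxx'),
      ('apple-20080101-aaa', 'jane@example.com-20080305-yyy'),
      ('zebra-20080102-bbb', 'joe@example.com-20080301-xxx'),
  ]
  for key in sorted(keys):
      print(key)
  # The two 'apple-...' entries sort next to each other, while the two
  # entries from revision 'joe@...-xxx' (which touched both files) do not.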


In general, we always start with a request for the root node of the index, as
it tells us the final structure of the rest of the index: how many total pages
there are, which pages are internal nodes (and at what layer), and which ones
are leaves. Before this point, we do know the *size* of the index, because that
is stored in the ``pack-names`` file.
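
Because pages are a fixed 4kB, the index size recorded in ``pack-names``
already tells us the total number of pages before anything has been read (a
trivial helper, shown only as an illustration)::

  PAGE_SIZE = 4096

  def total_pages(index_size_in_bytes):
      """Number of 4kB pages needed to hold an index of the given size."""
      return -(-index_size_in_bytes // PAGE_SIZE)   # ceiling division

  print(total_pages(4096))    # 1: the root page is itself a leaf
  print(total_pages(50000))   # 13: a root page plus a dozen more pages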


Thoughts on expansion
=====================

This is just a bullet list of things to consider when expanding a request.

* We generally assume locality of reference. So if we are currently reading
  page 10, we are more likely to read page 9 or 11 than we are page 20.

* However, locality of reference only really holds within a layer. If we are
  reading the last node in a layer, we are unlikely to read the first node of
  the next layer. In fact, we are most likely to read the *last* node of the
  next layer.

  More directly, we are probably equally likely to read any of the nodes in
  the next layer which could be referred to by this layer. So if we have a
  structure of 1 root node, 100 intermediate nodes, and 10,000 leaf nodes,
  they will have offsets of roughly 0, 1-101, and 102-10,102. (A short sketch
  after this list shows how these layer offsets can be computed.)

  If we read the root node, we are likely to want any of the 1-101 nodes
  (because we don't know where the key points). If we are reading node 90, then
  we are likely to want a node somewhere around 9,100-9,200.

* When expanding a request, we are considering that we probably want to read on
  the order of 10 pages extra. (64kB / 4kB = 16 pages.) It is unlikely that we
  want to expand the requests by 100.

* At the moment, we assume that we don't have an idea of where in the next
  layer the keys might fall. We *could* use a predictive algorithm assuming a
  homogeneous distribution. When reading the root node, we could assume an even
  distribution from 'a-z', so that a key starting with 'a' would tend to fall
  in the first few pages of the next layer, while a key starting with 'z' would
  fall at the end of the next layer.
  However, this is quite likely to fail in many ways. Specific examples:

  * Converters tend to use an identical prefix, so all revisions will start
    with 'xxx:', leading us to think that the keys fall in the last half,
    when in reality they are evenly distributed.

  * When looking in text indexes, short-term changes tend to be clustered
    around a small set of files. Short-term changes are unlikely to cross
    many pages, but it is unclear what happens in the mid-term. Obviously in
    the long term, changes have happened to all files.

  A possibility would be to use this after reading the root node, and then use
  an algorithm that compares the keys before and after this record to estimate
  the distribution and predict the next pages.

  This is a lot of work for a potentially small benefit, though.

* When checking for N keys, we do sequential lookups in each layer. So we look
  at layer 1 for all N keys, then in layer 2 for all N keys, etc. So our
  requests will be clustered by layer.

* For projects with large history, we are probably more likely to end up with a
  bi-modal distribution of pack files, where we have 1 pack file with a large
  index, and then several pack files with small indexes, several with tiny
  indexes, but no pack files with medium sized indexes.
  This is because a command like ``bzr pack`` will combine everything into a
  single large file. Commands like ``bzr commit`` will create an index with a
  single new record, though these will be packaged together by autopack.
  Commands like ``bzr push`` and ``bzr pull`` will create indexes with more
  records, but these are unlikely to be a significant portion of the history.
  Consider that bzr has 20,000 revisions; a single push/pull is likely to only
  be 100-200 revisions, or 1% of the history.

  Note that there will always be cases where things are evenly distributed, but
  we probably shouldn't *optimize* for that case.

* 64kB is 16 pages. 16 pages is approximately 1,600 keys.

* We are considering an index with 1 million keys to be very large. 10M is
  probably possible, and maybe 100M, but something like 1 billion keys is
  unlikely. So a 3-layer index is fairly common (it exists already in bzr), but
  a 4-layer is going to be quite rare, and we will probably never see a
  5-layer.

* There are times when the second layer is going to be incompletely filled out.
  Consider an index with 101 keys. We found that we couldn't fit everything
  into a single page, so we expanded the btree into a root page and a leaf
  page, and started a new leaf page. However, the root node only has a single
  entry. There are 3 pages, but only one of them is "full".
  This happens again when we get near the 10,000 key barrier. We find that we
  can no longer reference all of the leaf pages from a single root page, so we
  split into a higher layer and 1 more sub-layer. So we have 1 root node, 2
  layer-2 nodes, and N leaf nodes (layer 3). If we read the first 3 nodes, we
  will have read all internal nodes.

  It is certainly possible to detect this for the first-split case (when things
  no longer fit into just the root node), as there will only be a few nodes
  total. Is it possible to detect this from only the 'size' information for the
  second-split case (when the index no longer fits in a single page, but still
  fits in only a small handful of pages)?

  This only really works for the root + layer 2. For layers 3+ they will always
  be too big to read all at once. However, until we've read the root, we don't
  know the layout, so all we have to go on is the size of the index, though
  that also gives us the explicit total number of pages. So it doesn't help to
  read the root page and then decide. However, on the flip side, if we read
  *before* the split, then we don't gain much, as we are reading pages we
  aren't likely to be interested in.

  For example:

    We have 10,000 keys, which fit onto 100 leaf pages, with a single root
    node. At 10,100 keys it would be 101 leaf pages, which would then cause us
    to need 2 index pages, triggering an extra layer. However, this is very
    sensitive to the number of keys we fit per page, which depends on the
    compression. Alternatively, we could consider 20,000 keys, which would be
    200 leaf nodes, 2 intermediate nodes, and a single root node. It is
    unlikely that we would ever be able to fit 200 references into a single
    root node.

    So if we pretend that we split at 1 page, 100 pages, and 10,000 pages, we
    might be able to say: at 1-5 pages, read all pages; for 5-100 pages, read
    only the root; at 100-500 pages, read 1-5 pages; for 500-10,000 read only
    the root; at 10,000-50,000 read 1-5 pages again; but above 50,000 read only
    the root. We could bias this a bit smaller, say at powers of 80, instead of
    powers of 100, etc. The basic idea is that if we are *close* to a layer
    split, go ahead and read a small number of extra pages.

* The previous discussion applies whenever we have an upper layer that is not
  completely full. So the pages referenced by the last node from the upper
  layer will often not have a full 100-way fan out. Probably not worthwhile
  very often, though.

* Sometimes we will be making a very small request for a very small number of
  keys; we don't really want to bloat tiny requests. Hopefully we can find a
  decent heuristic to determine when we will be wanting extra nodes later,
  versus when we expect to find all we want right now.
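
As a concrete illustration of the layer offsets mentioned in the
locality-of-reference bullet above, the (start, stop) offsets of each layer can
be derived from the leaf count and an assumed fan out. This is only a sketch
under the 100-entries-per-page assumption, not the bzrlib implementation::

  def layer_offsets(num_leaf_pages, fan_out=100):
      """Return (start, stop) page offsets for each layer, root layer first."""
      # Build layer sizes from the leaves upward: each internal layer needs
      # one reference per node in the layer below it.
      sizes = [num_leaf_pages]
      while sizes[-1] > 1:
          sizes.append(-(-sizes[-1] // fan_out))  # ceiling division
      sizes.reverse()  # root first, matching the linear on-disk layout
      offsets = []
      start = 0
      for size in sizes:
          offsets.append((start, start + size))
          start += size
      return offsets

  # 10,000 leaves with a 100-way fan out: root at offset 0, internal nodes
  # at 1-100, leaves at 101-10,100 (the bullet above rounds these slightly).
  print(layer_offsets(10000))  # [(0, 1), (1, 101), (101, 10101)]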


Algorithm
=========

This is the basic outline of the algorithm.

1. If we don't know the size of the index, don't expand as we don't know what
   is available. (This only really applies to the ``pack-names`` file, which
   is unlikely to ever become larger than 1 page anyway.)

2. If a request is already wide enough to be greater than the number of
   recommended pages, don't bother trying to expand. This only really happens
   with LocalTransport, which recommends a single page.

3. Determine what pages have already been read (if any). If the pages left to
   read can fit in a single request, just request them. This tends to happen on
   medium sized indexes (ones with low hundreds of revisions), and near the end
   when we've read most of the whole index already.

4. If we haven't read the root node yet, and we can't fit the whole index into
   a single request, only read the root node. We don't know where the layer
   boundaries are anyway.

5. If we haven't read "tree depth" pages yet, and are only requesting a single
   new page, don't expand. This is meant to handle the 'lookup 1 item in the
   index' case. In a large pack file, you'll read only a single page at each
   layer and then be done. When spidering out in a search, this will cause us
   to take a little bit longer to start expanding, but once we've started we'll
   be expanding at full velocity. This could be improved by having indexes
   inform each other that they have already entered the 'search' phase, or by
   having a hint from above to indicate the same.

   However, remember the 'bi-modal' distribution. Most indexes will either be
   very small or very large, so either we'll read the whole thing quickly, or
   we'll end up spending a lot of time in the index, which makes a small number
   of extra round trips to large indexes a small overhead. For 2-layer indexes,
   this only 'wastes' one round trip.

6. Now we are ready to expand the requests. Expand by looking for more pages
   next to the ones requested that fit within the current layer. If you run
   into a cached page, or a layer boundary, search further only in the opposite
   direction. This gives us proper locality of reference, and also helps
   because when a search goes in a single direction, we will continue to
   prefetch pages in that direction.
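
Pulling those steps together, a condensed, standalone sketch of the decision
logic might look like the following. The function and parameter names here
(``recommended_pages``, ``layer_ranges``, etc.) are invented for the
illustration; this is not the actual bzrlib code::

  def expand_offsets(offsets, total_pages, recommended_pages,
                     cached_offsets, layer_ranges):
      """Return a sorted list of page offsets worth requesting together.

      offsets: pages we actually need; total_pages: None if the index size
      is unknown; cached_offsets: pages already read; layer_ranges: list of
      (start, stop) offsets per layer, root first.
      """
      if total_pages is None:
          return offsets                      # 1. unknown size: no expansion
      if len(offsets) >= recommended_pages:
          return offsets                      # 2. already wide enough
      uncached = set(range(total_pages)) - set(cached_offsets)
      if len(uncached) <= recommended_pages:
          return sorted(uncached)             # 3. just finish off the index
      if 0 not in cached_offsets:
          return [0]                          # 4. read the root node first
      tree_depth = len(layer_ranges)
      if len(cached_offsets) < tree_depth and len(offsets) == 1:
          return offsets                      # 5. single-lookup fast path
      # 6. Expand sideways within each page's layer, stopping at cached
      #    pages and at layer boundaries.
      expanded = set(offsets)
      for offset in offsets:
          start, stop = next((s, e) for s, e in layer_ranges
                             if s <= offset < e)
          left, right = offset - 1, offset + 1
          go_left = go_right = True
          while (go_left or go_right) and len(expanded) < recommended_pages:
              if go_left:
                  if left < start or left in cached_offsets:
                      go_left = False
                  else:
                      expanded.add(left)
                      left -= 1
              if go_right:
                  if right >= stop or right in cached_offsets:
                      go_right = False
                  else:
                      expanded.add(right)
                      right += 1
      return sorted(expanded)

Note how, with this structure, the first couple of single-page lookups in a
deep index are never expanded (step 5), while a long-running search expands
freely once enough pages are cached.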

..
  vim: ft=rst tw=79 ai