3152.1.2
by Robert Collins
Add documentation about the inventory system. |
1 |
=========== |
2 |
Inventories |
|
3 |
=========== |
|
4 |
||
5 |
.. contents:: |
|
6 |
||
7 |
Overview |
|
8 |
======== |
|
9 |
||
10 |
Inventories provide an abstraction for talking about the shape of a tree. |
|
3638.5.1
by Robert Collins
Improve inventory design docs with current planning thoughts. |
11 |
Generally only tree object implementors should be concerned about entire |
12 |
inventory objects and their implementation. Other common exceptions are |
|
13 |
full-tree operations such as 'checkout', 'export' and 'import'. |
|
3152.1.2
by Robert Collins
Add documentation about the inventory system. |
14 |
|
15 |
In memory inventories |
|
16 |
===================== |
|
17 |
||
18 |
In memory inventories are often used in diff and status operations between |
|
19 |
trees. We are working to reduce the number of times this occurs with 'full |
|
20 |
tree' inventory objects, and instead use more custom tailored data structures |
|
21 |
that allow operations on only a small amount of data regardless of the size of |
|
22 |
the tree. |
|
23 |
||
24 |
||
25 |
Serialization |
|
26 |
============= |
|
27 |
||
28 |
There are several variants of serialised tree shape in use by bzr. To date |
|
5538.2.1
by Zearin
Fixed capitalization of XML and HTTP. Fixed by hand and only where appropriate (e.g., left http://some/url lowercase, but capitalized "When making an HTTP request…"). |
29 |
these have been mostly XML-based, though plugins have offered non-XML versions. |
3152.1.2
by Robert Collins
Add documentation about the inventory system. |
30 |
|
31 |
dirstate |
|
32 |
-------- |
|
33 |
||
34 |
The dirstate file in a working tree includes many different tree shapes - one |
|
35 |
for the working tree and one for each parent tree, interleaved to allow |
|
36 |
efficient diff and status operations. |
|
37 |
||
5538.2.1
by Zearin
Fixed capitalization of XML and HTTP. Fixed by hand and only where appropriate (e.g., left http://some/url lowercase, but capitalized "When making an HTTP request…"). |
38 |
XML |
3152.1.2
by Robert Collins
Add documentation about the inventory system. |
39 |
--- |
40 |
||
5538.2.1
by Zearin
Fixed capitalization of XML and HTTP. Fixed by hand and only where appropriate (e.g., left http://some/url lowercase, but capitalized "When making an HTTP request…"). |
41 |
All the XML serialized forms write to and read from a single byte string, whose |
3152.1.2
by Robert Collins
Add documentation about the inventory system. |
42 |
hash is then the inventory validator for the commit object. |
43 |
||
3638.5.1
by Robert Collins
Improve inventory design docs with current planning thoughts. |
44 |
|
45 |
Serialization scaling and future designs |
|
46 |
======================================== |
|
47 |
||
48 |
Overall efficiency and scaling is constrained by the bottom level structure |
|
49 |
that an inventory is stored as. We have a number of goals we want to achieve: |
|
50 |
||
51 |
1. Allow commit to write less than the full tree's data in to the repository |
|
4853.1.1
by Patrick Regan
Removed trailing whitespace from files in doc directory |
52 |
in the general case. |
3638.5.1
by Robert Collins
Improve inventory design docs with current planning thoughts. |
53 |
2. Allow the data that is written to be calculated without examining every |
54 |
versioned path in the tree. |
|
55 |
3. Generate the exact same representation for a given inventory regardless of |
|
56 |
the amount of history available. |
|
57 |
4. Allow in memory deltas to be generated directly from the serialised form |
|
58 |
without upcasting to a full in-memory representation or examining every |
|
59 |
path in the tree. Ideally the work performed will be proportional to the |
|
60 |
amount of changes between the trees being compared. |
|
61 |
5. Allow fetch to determine the file texts that need to be pulled to ensure |
|
62 |
that the entire tree can be reconstructed without having to probe every |
|
63 |
path in the tree. |
|
3638.5.4
by Robert Collins
more review feedback. |
64 |
6. Allow bzr to map paths to file ids without reading the entire serialised |
3638.5.1
by Robert Collins
Improve inventory design docs with current planning thoughts. |
65 |
form. This is something that is used by commands such as merge PATH and |
66 |
diff -r X PATH. |
|
67 |
7. Let bzr map file ids to paths without reading the entire serialised form. |
|
68 |
This is used by commands that are presenting output to the user such as |
|
3638.5.4
by Robert Collins
more review feedback. |
69 |
loggerhead, bzr-search, log FILENAME. |
3638.5.1
by Robert Collins
Improve inventory design docs with current planning thoughts. |
70 |
8. We want a strong validator for inventories which is cheap to generate. |
71 |
Specifically we should be able to create the generator for a new commit |
|
72 |
without processing all the data of the basis commit. |
|
73 |
9. Testaments generation is currently size(tree), we would like to create a |
|
74 |
new testament standard which requires less work so that signed commits |
|
75 |
are not significantly slower than regular commits. |
|
76 |
||
77 |
||
78 |
We have current performance and memory bugs in log -v, merge, commit, diff -r, |
|
3638.5.4
by Robert Collins
more review feedback. |
79 |
loggerhead and status -r which can be addressed by an inventory system |
3638.5.1
by Robert Collins
Improve inventory design docs with current planning thoughts. |
80 |
meeting these goals. |
81 |
||
82 |
Current situation |
|
83 |
----------------- |
|
84 |
||
5538.2.1
by Zearin
Fixed capitalization of XML and HTTP. Fixed by hand and only where appropriate (e.g., left http://some/url lowercase, but capitalized "When making an HTTP request…"). |
85 |
The XML-based implementation we use today layers the inventory as a bytestring |
4853.1.1
by Patrick Regan
Removed trailing whitespace from files in doc directory |
86 |
which is stored under a single key; the bytestring is then compressed as a |
3638.5.1
by Robert Collins
Improve inventory design docs with current planning thoughts. |
87 |
delta against the bytestring of its left hand parent by the knit code. |
88 |
||
89 |
Gap analysis: |
|
90 |
||
91 |
1. Succeeds |
|
5538.2.1
by Zearin
Fixed capitalization of XML and HTTP. Fixed by hand and only where appropriate (e.g., left http://some/url lowercase, but capitalized "When making an HTTP request…"). |
92 |
2. Fails - generating a new XML representation needs full tree data. |
3638.5.1
by Robert Collins
Improve inventory design docs with current planning thoughts. |
93 |
3. Succeeds - the inventory layer accesses the bytestring, which is |
94 |
deterministic |
|
95 |
4. Fails - we have to reconstruct both inventories as trees and then delta |
|
96 |
the resulting in memory objects. |
|
97 |
5. Partial success - the revision field in the inventory can be scanned for |
|
98 |
in both text-delta and full-bytestring form; other revision values than |
|
99 |
those revisions which are being pulled are by definition absent. |
|
100 |
6. Partially succeeds - with appropriate logic a path<->id map can be generated |
|
101 |
just-in-time, but it is complex and still requires reconstructing the |
|
102 |
entire byte-string. |
|
103 |
7. As for 6. |
|
104 |
8. Fails - we have to hash the entire tree in serialised form to generate |
|
105 |
validators. |
|
106 |
9. Fails. |
|
107 |
||
108 |
Long term work |
|
109 |
-------------- |
|
110 |
||
111 |
Some things are likely harder to fix incrementally than others. In particular, |
|
112 |
goal 3 (constant canonical form) is arguably only achieved if we remove all |
|
113 |
derived data such as the last-modified revision from the inventory itself. That |
|
114 |
said, the last-modified appears to be in a higher level than raw serialization. |
|
115 |
So in the medium term we will not alter the contents of inventories, only the |
|
116 |
way that the current contents are mapped to and from disk. |
|
117 |
||
118 |
||
119 |
Layering |
|
120 |
-------- |
|
121 |
||
122 |
We desire clear and clean layers. Each layer should be as simple as we can make |
|
123 |
it to aid in debugging and performance tuning. So where we can choose to either |
|
124 |
write a complex layer and something simple on top of it, or two layers with |
|
125 |
neither being as complex - then we should consider the latter choice better in |
|
126 |
the absence of compelling reasons not to. |
|
127 |
||
128 |
Some key layers we have today and can look at using or tweaking are: |
|
129 |
||
130 |
* Tree objects - the abstract interface bzrlib code works in |
|
131 |
* VersionedFiles - the optionally delta compressing key->bytes storage |
|
132 |
interface. |
|
133 |
* Inventory - the abstract interface that many tree operations are written in. |
|
134 |
||
135 |
These layers are probably sufficient with minor tweaking. We may want to add |
|
136 |
additional modules/implementations of one or more layers, but that doesn't |
|
137 |
really require new layers to be exposed. |
|
138 |
||
139 |
Design elements to achieve the goals in a future inventory implementation |
|
140 |
------------------------------------------------------------------------- |
|
141 |
||
142 |
* Split up the logical document into smaller serialised fragements. For |
|
4853.1.1
by Patrick Regan
Removed trailing whitespace from files in doc directory |
143 |
instance hash buckets or nodes in a tree of some sort. By serialising in |
144 |
smaller units, we can increase the number of smaller units rather than |
|
3638.5.1
by Robert Collins
Improve inventory design docs with current planning thoughts. |
145 |
their size as the tree grows; as long as two similar trees have similar |
146 |
serialised forms, the amount of different content should be quite high. |
|
147 |
||
148 |
* Use fragment identifiers that are independent of revision id, so that |
|
149 |
serialisation of two related trees generates overlap in the keyspace |
|
150 |
for fragments without requiring explicit delta logic. Content Hash Keys |
|
151 |
(e.g. ('sha1:ABCDEF0123456789...',) are useful here because of the ability |
|
152 |
to assign them without reference to history.) |
|
153 |
||
154 |
* Store the fragments in our existing VersionedFiles store. Adding an index |
|
155 |
for them. Have the serialised form be uncompressed utf8, so that delta logic |
|
156 |
in the VersionedFiles layer can be used. We may need to provide some sort |
|
157 |
of hinting mechanism to get good compression - but the trivially available |
|
158 |
zlib compression of knits-with-no-deltas is probably a good start. |
|
159 |
||
160 |
* Item_keys_introduced_by is innately a history-using function; we can |
|
161 |
reproduce the text-key finding logic by doing a tree diff between any tree |
|
3638.5.3
by Robert Collins
Review feedback on hash trie inventories, and describe radix tree inventories, plus some details on hash trie implementation. |
162 |
and an older tree - that will limit the amount of data we need to process |
3638.5.1
by Robert Collins
Improve inventory design docs with current planning thoughts. |
163 |
to something proportional to the difference and the size of each fragment. |
164 |
When checking many versions we can track which fragments we have examined |
|
165 |
and only look at new unique ones as each version is examined in turn. |
|
166 |
||
167 |
* Working tree to arbitrary history revision deltas/comparisons can be scaled |
|
168 |
up by doing a two-step (fixed at two!) delta combining - delta(tree, basis) |
|
4853.1.1
by Patrick Regan
Removed trailing whitespace from files in doc directory |
169 |
and then combine that with delta(basis, arbitrary_revision) using the |
3638.5.1
by Robert Collins
Improve inventory design docs with current planning thoughts. |
170 |
repositories ability to get a delta cheaply. |
171 |
||
172 |
* The key primitives we need seem to be: |
|
173 |
* canonical_form(inventory) -> fragments |
|
174 |
* delta(inventory, inventory) -> inventory_delta |
|
175 |
* apply(inventory_delta, canonical_form) -> fragments |
|
176 |
||
177 |
* Having very many small fragments is likely to cause a high latency |
|
178 |
multiplier unless we are careful. |
|
179 |
||
180 |
* Possible designs to investigate - a hash bucket approach, radix trees, |
|
181 |
B+ trees, directory trees (with splits inside a directory?). |
|
3638.5.2
by Robert Collins
Describe a hash trie based inventory |
182 |
|
183 |
||
184 |
Hash bucket based inventories |
|
185 |
============================= |
|
186 |
||
187 |
Overview |
|
188 |
-------- |
|
189 |
||
190 |
We store two maps - fileid:inventory_entry and path:fileid, in a stable |
|
3638.5.4
by Robert Collins
more review feedback. |
191 |
hash trie, stored in densly packed fragments. We pack keys into the map |
3638.5.2
by Robert Collins
Describe a hash trie based inventory |
192 |
densely up the tree, with a single canonical form for any given tree. This is |
193 |
more stable than simple fixed size buckets, which prevents corner cases where |
|
194 |
the tree size varies right on a bucket size border. (Note that such cases are |
|
195 |
not a fatal flaw - the two forms would both be present in the repository, so |
|
196 |
only a small amount of data would be written at each transition - but a full |
|
197 |
tree reprocess would be needed at each tree operation across the boundary, and |
|
198 |
thats undesirable.) |
|
199 |
||
200 |
Goal satisfaction |
|
201 |
----------------- |
|
202 |
||
203 |
1. Success |
|
204 |
2. Success |
|
205 |
3. Success |
|
206 |
4. Success, though each change will need its parents looked up as well |
|
207 |
so it will be proportional to the changes + the directories above |
|
208 |
the changed path. |
|
209 |
5. Success - looking at the difference against all parents we can determine |
|
210 |
new keys without reference to the repository content will be inserted |
|
211 |
into. |
|
3638.5.3
by Robert Collins
Review feedback on hash trie inventories, and describe radix tree inventories, plus some details on hash trie implementation. |
212 |
6. This probably needs a path->id map, allowing a 2-step lookup. |
3638.5.2
by Robert Collins
Describe a hash trie based inventory |
213 |
7. If we allocate buckets by hashing the id, then this is succeed, though, |
214 |
as per 4 it will need recursive lookups. |
|
215 |
8. Success |
|
216 |
9. Fail - data beyond that currently included in testaments is included |
|
217 |
in the strong validator. |
|
218 |
||
219 |
Issues |
|
220 |
------ |
|
221 |
||
222 |
1. Tuning the fragment size needs doing. |
|
223 |
1. Testing. |
|
224 |
1. Writing code. |
|
225 |
1. Separate root node, or inline into revision? |
|
226 |
1. Cannot do 'ls' efficiently in the current design. |
|
227 |
1. Cannot detect invalid deltas easily. |
|
228 |
1. What about LCA merge of inventories? |
|
229 |
||
230 |
Canonical form |
|
231 |
-------------- |
|
232 |
||
3638.5.6
by Robert Collins
Define CHK and very minor tweaks to the inventory text. |
233 |
There are three fragment types for the canonical form. Each fragment is |
234 |
addressed using a Content Hash Key (CHK) - for instance |
|
235 |
"sha1:12345678901234567890". |
|
3638.5.2
by Robert Collins
Describe a hash trie based inventory |
236 |
|
237 |
root_node: (Perhaps this should be inlined into the revision object). |
|
238 |
HASH_INVENTORY_SIGNATURE |
|
239 |
path_map: CHK to root of path to id map |
|
240 |
content_map: CHK to root of id to entry map |
|
241 |
||
242 |
map_node: INTERNAL_NODE or LEAF_NODE |
|
243 |
INTERNAL_NODE: |
|
244 |
INTERNAL_NODE_SIGNATURE |
|
245 |
hash_prefix: PREFIX |
|
3638.5.3
by Robert Collins
Review feedback on hash trie inventories, and describe radix tree inventories, plus some details on hash trie implementation. |
246 |
prefix_width: INT |
247 |
PREFIX CHK TYPE SIZE |
|
248 |
PREFIX CHK TYPE SIZE ... |
|
3638.5.2
by Robert Collins
Describe a hash trie based inventory |
249 |
|
250 |
(Where TYPE is I for internal or L for leaf). |
|
251 |
||
252 |
leaf_node: |
|
253 |
LEAF_NODE_SIGNATURE |
|
254 |
hash_prefix: PREFIX |
|
255 |
HASH\x00KEY\x00 VALUE |
|
256 |
||
257 |
For path maps, VALUE is:: |
|
258 |
fileid |
|
259 |
||
260 |
For content maps, VALUE:: |
|
261 |
fileid basename kind last-changed kind-specific-details |
|
262 |
||
263 |
||
264 |
The path and content maps are populated simply by serialising every inventory |
|
265 |
entry and inserting them into both the path map and the content map. The maps |
|
4853.1.1
by Patrick Regan
Removed trailing whitespace from files in doc directory |
266 |
start with just a single leaf node with an empty prefix. |
3638.5.2
by Robert Collins
Describe a hash trie based inventory |
267 |
|
268 |
||
269 |
Apply |
|
270 |
----- |
|
271 |
||
272 |
Given an inventory delta - a list of (old_path, new_path, InventoryEntry) |
|
273 |
items, with a None in new_path indicating a delete operation, and recursive |
|
274 |
deletes not being permitted - all entries to be deleted must be explicitly |
|
275 |
listed, we can transform a current inventory directly. We can't trivially |
|
276 |
detect an invalid delta though. |
|
277 |
||
278 |
To perform an application, naively we can just update both maps. For the path |
|
279 |
map we would remove all entries where the paths in the delta do not match, then |
|
280 |
insert those with a new_path again. For the content map we would just remove |
|
281 |
all the fileids in the delta, then insert those with a new_path that is not |
|
282 |
None. |
|
283 |
||
284 |
Delta |
|
285 |
----- |
|
286 |
||
287 |
To generate a delta between two inventories, we first generate a list of |
|
288 |
altered fileids, and then recursively look up their parents to generate their |
|
289 |
old and new file paths. |
|
290 |
||
291 |
To generate the list of altered file ids, we do an entry by entry comparison of |
|
292 |
the full contents of every leaf node that the two inventories do not have in |
|
293 |
common. To do this, we start at the root node, and follow every CHK pointer |
|
3638.5.6
by Robert Collins
Define CHK and very minor tweaks to the inventory text. |
294 |
that is only in one tree. We can then bring in all the values from the leaf |
3638.5.2
by Robert Collins
Describe a hash trie based inventory |
295 |
nodes and do a set difference to get the altered ones, which we would then |
296 |
parse. |
|
3638.5.3
by Robert Collins
Review feedback on hash trie inventories, and describe radix tree inventories, plus some details on hash trie implementation. |
297 |
|
3638.5.6
by Robert Collins
Define CHK and very minor tweaks to the inventory text. |
298 |
|
3638.5.3
by Robert Collins
Review feedback on hash trie inventories, and describe radix tree inventories, plus some details on hash trie implementation. |
299 |
Radix tree based inventories |
300 |
============================ |
|
301 |
||
302 |
Overview |
|
303 |
-------- |
|
304 |
||
305 |
We store two maps - fileid:path and path:inventory_entry. The fileid:path map |
|
306 |
is a hash trie (as file ids have no useful locality of reference). The |
|
307 |
path:inventory_entry map is stored as a regular trie. As for hash tries we |
|
308 |
define a single canonical representation for regular tries similar to that |
|
309 |
defined above for hash tries. |
|
310 |
||
311 |
Goal satisfaction |
|
312 |
----------------- |
|
313 |
||
314 |
1. Success |
|
315 |
2. Success |
|
316 |
3. Success |
|
317 |
4. Success |
|
318 |
5. Success - looking at the difference against all parents we can determine |
|
319 |
new keys without reference to the repository content will be inserted |
|
320 |
into. |
|
321 |
6. Success |
|
322 |
7. Success |
|
323 |
8. Success |
|
324 |
9. Fail - data beyond that currently included in testaments is included |
|
325 |
in the strong validator. |
|
326 |
||
327 |
Issues |
|
328 |
------ |
|
329 |
||
330 |
1. Tuning the fragment size needs doing. |
|
331 |
1. Testing. |
|
332 |
1. Writing code. |
|
333 |
1. Separate root node, or inline into revision? |
|
334 |
1. What about LCA merge of inventories? |
|
335 |
||
336 |
Canonical form |
|
337 |
-------------- |
|
338 |
||
339 |
There are five fragment types for the canonical form: |
|
340 |
||
341 |
The root node, hash trie internal and leaf nodes as previous. |
|
342 |
||
343 |
Then we have two more, the internal and leaf node for the radix tree. |
|
344 |
||
345 |
radix_node: INTERNAL_NODE or LEAF_NODE |
|
346 |
||
347 |
INTERNAL_NODE: |
|
348 |
INTERNAL_NODE_SIGNATURE |
|
349 |
prefix: PREFIX |
|
350 |
suffix CHK TYPE SIZE |
|
351 |
suffix CHK TYPE SIZE ... |
|
352 |
||
353 |
(Where TYPE is I for internal or L for leaf). |
|
354 |
||
355 |
LEAF_NODE: |
|
356 |
LEAF_NODE_SIGNATURE |
|
357 |
prefix: PREFIX |
|
358 |
suffix\x00VALUE |
|
359 |
||
360 |
For the content map we use the same value as for hashtrie inventories. |
|
361 |
||
362 |
||
363 |
Node splitting and joining in the radix tree are managed in the same fashion as |
|
364 |
as for the internal nodes of the hashtries. |
|
365 |
||
366 |
||
367 |
Apply |
|
368 |
----- |
|
369 |
||
370 |
Apply is implemented as for hashtries - we just remove and reinsert the |
|
371 |
fileid:paths map entries, and likewise for the path:entry map. We can however |
|
372 |
cheaply detect invalid deltas where a delete fails to include its children. |
|
373 |
||
374 |
Delta |
|
375 |
----- |
|
376 |
||
377 |
Delta generation is very similar to that with hash tries, except we get the |
|
378 |
path of nodes as part of the lookup process. |
|
379 |
||
380 |
||
381 |
Hash Trie details |
|
382 |
================= |
|
383 |
||
384 |
The canonical form for a hash trie is a tree of internal nodes leading down to |
|
385 |
leaf nodes, with no node exceeding some threshold size, and every node |
|
386 |
containing as much content as it can, but no leaf node containing less than |
|
387 |
its lower size threshold. (In the event that an imbalance in the hash function |
|
388 |
causes a tree where an internal node is needed, but any prefix generates a |
|
389 |
child with less than the lower threshold, the smallest prefix should be taken). |
|
3638.5.4
by Robert Collins
more review feedback. |
390 |
An internal node holds some number of key prefixes, all with the same bit-width. |
3638.5.3
by Robert Collins
Review feedback on hash trie inventories, and describe radix tree inventories, plus some details on hash trie implementation. |
391 |
A leaf node holds the actual values. As trees do not spring fully-formed, the |
392 |
canonical form is defined iteratively - by taking every item in a tree and |
|
393 |
inserting it into a new tree in order you can determine what canonical form |
|
394 |
would look like. As that is an expensive operation, it should only be done |
|
395 |
rarely. |
|
396 |
||
397 |
Updates to a tree that is in canonical form can be done preserving canonical |
|
398 |
form if we can prove that our rules for insertion are order-independent, |
|
399 |
and that our rules for deletion generate the same tree as if we never |
|
400 |
inserted those nodes. |
|
401 |
||
402 |
Our hash tries are balanced vertically but not horizontally. That is, one leg |
|
403 |
of a tree can be arbitrarily deeper than adjacent legs. We require that each |
|
404 |
node along a path within the tree be densely packed, with the densest nodes |
|
405 |
near the top of the tree, and the least dense at the bottom. Except where the |
|
406 |
tree cannot support it, no node is smaller than a minimum_size, and none |
|
407 |
larger than maximum_size. The minimum size constraint is only applied when |
|
408 |
there are enough entries under a prefix to meet that minimum. The maximum |
|
409 |
size constraint is always applied except when a node with a single entry |
|
410 |
is larger than the maximum size. Loosely, the maximum size constraint wins |
|
411 |
over the minimum size constraint, and if the minimum size contraint is to |
|
412 |
be ignored, a deeper prefix can be chosen to pack the containing node more |
|
413 |
densely, as long as no additional minimum sizes checks on child nodes are |
|
414 |
violated. |
|
415 |
||
416 |
Insertion |
|
417 |
--------- |
|
418 |
||
3638.5.7
by Robert Collins
Work around ReST FAIL. |
419 |
#. Hash the entry, and insert the entry in the leaf node with a matching |
3638.5.3
by Robert Collins
Review feedback on hash trie inventories, and describe radix tree inventories, plus some details on hash trie implementation. |
420 |
prefix, creating that node and linking it from the internal node containing |
421 |
that prefix if there is no appropriate leaf node. |
|
3638.5.7
by Robert Collins
Work around ReST FAIL. |
422 |
#. Starting at the highest node altered, for all altered nodes, check if it has |
3638.5.3
by Robert Collins
Review feedback on hash trie inventories, and describe radix tree inventories, plus some details on hash trie implementation. |
423 |
transitioned across either size boundary - 0 < min_size < max_size. If it |
424 |
has not, proceed to update the CHK pointers. |
|
3638.5.7
by Robert Collins
Work around ReST FAIL. |
425 |
#. If it increased above min_size, check the node above to see if it can be |
3638.5.3
by Robert Collins
Review feedback on hash trie inventories, and describe radix tree inventories, plus some details on hash trie implementation. |
426 |
more densely packed. To be below the min_size the node's parent must |
427 |
have hit the max size constraint and been forced to split even though this |
|
428 |
child did not have enough content to support a min_size node - so the prefix |
|
3638.5.4
by Robert Collins
more review feedback. |
429 |
chosen in the parent may be shorter than desirable and we may now be able |
3638.5.3
by Robert Collins
Review feedback on hash trie inventories, and describe radix tree inventories, plus some details on hash trie implementation. |
430 |
to more densely pack the parent by splitting the child nodes more. So if the |
431 |
parent node can support a deeper prefix without hitting max_size, and the |
|
432 |
count of under min_size nodes cannot be reduced, the parent should be given |
|
433 |
a deeper prefix. |
|
3638.5.7
by Robert Collins
Work around ReST FAIL. |
434 |
#. If it increased above max_size, shrink the prefix width used to split out |
3638.5.4
by Robert Collins
more review feedback. |
435 |
new nodes until the node is below max_size (unless the prefix width is |
436 |
already 1 - the minimum). |
|
3638.5.3
by Robert Collins
Review feedback on hash trie inventories, and describe radix tree inventories, plus some details on hash trie implementation. |
437 |
To shrink the prefix of an internal node, create new internal nodes for each |
438 |
new prefix, and populate them with the content of the nodes which were |
|
439 |
formerly linked. (This will normally bubble down due to keeping densely |
|
440 |
packed nodes). |
|
441 |
To shrink the prefix of a leaf node, create an internal node with the same |
|
4853.1.1
by Patrick Regan
Removed trailing whitespace from files in doc directory |
442 |
prefix, then choose a width for the internal node such that the contents |
3638.5.3
by Robert Collins
Review feedback on hash trie inventories, and describe radix tree inventories, plus some details on hash trie implementation. |
443 |
of the leaf all fit into new leaves obeying the min_size and max_size rules. |
444 |
The largest prefix possible should be chosen, to obey the |
|
4853.1.1
by Patrick Regan
Removed trailing whitespace from files in doc directory |
445 |
higher-nodes-are-denser rule. That rule also gives room in leaf nodes for |
3638.5.3
by Robert Collins
Review feedback on hash trie inventories, and describe radix tree inventories, plus some details on hash trie implementation. |
446 |
growth without affecting the parent node packing. |
3638.5.7
by Robert Collins
Work around ReST FAIL. |
447 |
#. Update the CHK pointers - serialise every altered node to generate a CHK, |
3638.5.3
by Robert Collins
Review feedback on hash trie inventories, and describe radix tree inventories, plus some details on hash trie implementation. |
448 |
and update the CHK placeholder in the nodes parent; then reserialise the |
4031.3.1
by Frank Aspell
Fixing various typos |
449 |
parent. CHK pointer propagation can be done lazily when many updates are |
3638.5.3
by Robert Collins
Review feedback on hash trie inventories, and describe radix tree inventories, plus some details on hash trie implementation. |
450 |
expected. |
451 |
||
452 |
Multiple versions of nodes for the same PREFIX and internal prefix width should |
|
453 |
compress well for the same tree. |
|
4205.5.1
by Andrew Bennetts
Initial stab at adapting Robert's journalled_inventory serialisation into inventory_delta serialisation. |
454 |
|
455 |
||
456 |
Inventory deltas |
|
457 |
================ |
|
458 |
||
4205.5.7
by Andrew Bennetts
Fix nits in spelling and naming. |
459 |
An inventory is a serialization of the in-memory inventory delta. To serialize |
4205.5.1
by Andrew Bennetts
Initial stab at adapting Robert's journalled_inventory serialisation into inventory_delta serialisation. |
460 |
an inventory delta, one takes an existing inventory delta and the revision_id |
461 |
of the revision it was created it against and the revision id of the inventory |
|
4205.5.7
by Andrew Bennetts
Fix nits in spelling and naming. |
462 |
which should result by applying the delta to the parent. We then serialize |
4205.5.1
by Andrew Bennetts
Initial stab at adapting Robert's journalled_inventory serialisation into inventory_delta serialisation. |
463 |
every item in the delta in a simple format: |
464 |
||
4205.5.7
by Andrew Bennetts
Fix nits in spelling and naming. |
465 |
'format: bzr inventory delta v1 (1.14)' NL |
466 |
'parent:' SP BASIS_INVENTORY NL |
|
467 |
'version:' SP NULL_OR_REVISION NL |
|
4205.5.1
by Andrew Bennetts
Initial stab at adapting Robert's journalled_inventory serialisation into inventory_delta serialisation. |
468 |
'versioned_root:' SP BOOL NL |
469 |
'tree_references:' SP BOOL NL |
|
470 |
DELTA_LINES |
|
471 |
||
472 |
DELTA_LINES ::= (DELTA_LINE NL)* |
|
4205.5.3
by Andrew Bennetts
Include oldpath in the the serialised delta |
473 |
DELTA_LINE ::= OLDPATH NULL NEWPATH NULL file-id NULL PARENT_ID NULL LAST_MODIFIED NULL CONTENT |
4205.5.1
by Andrew Bennetts
Initial stab at adapting Robert's journalled_inventory serialisation into inventory_delta serialisation. |
474 |
SP ::= ' ' |
475 |
BOOL ::= 'true' | 'false' |
|
476 |
NULL ::= \x00 |
|
4205.5.3
by Andrew Bennetts
Include oldpath in the the serialised delta |
477 |
OLDPATH ::= NONE | PATH |
4205.5.1
by Andrew Bennetts
Initial stab at adapting Robert's journalled_inventory serialisation into inventory_delta serialisation. |
478 |
NEWPATH ::= NONE | PATH |
479 |
NONE ::= 'None' |
|
480 |
PATH ::= path |
|
481 |
PARENT_ID ::= FILE_ID | '' |
|
482 |
CONTENT ::= DELETED_CONTENT | FILE_CONTENT | DIR_CONTENT | TREE_CONTENT | LINK_CONTENT |
|
483 |
DELETED_CONTENT ::= 'deleted' |
|
484 |
FILE_CONTENT ::= 'file' NULL text_size NULL EXEC NULL text_sha1 |
|
485 |
DIR_CONTENT ::= 'dir' |
|
486 |
TREE_CONTENT ::= 'tree' NULL tree-revision |
|
487 |
LINK_CONTENT ::= 'link' NULL link-target |
|
488 |
BASIS_INVENTORY ::= NULL_OR_REVISION |
|
489 |
LAST_MODIFIED ::= NULL_OR_REVISION |
|
490 |
NULL_OR_REVISION ::= 'null:' | REVISION |
|
491 |
REVISION ::= revision-id-in-utf8-no-whitespace |
|
492 |
EXEC ::= '' | 'Y' |
|
493 |
||
4205.5.7
by Andrew Bennetts
Fix nits in spelling and naming. |
494 |
DELTA_LINES is lexicographically sorted. |
4205.5.1
by Andrew Bennetts
Initial stab at adapting Robert's journalled_inventory serialisation into inventory_delta serialisation. |
495 |
|
496 |
Some explanation is in order. When NEWPATH is 'None' a delete has been |
|
4205.5.7
by Andrew Bennetts
Fix nits in spelling and naming. |
497 |
recorded, and because this inventory delta is not attempting to be a reversible |
498 |
delta, the only other valid fields are OLDPATH and 'file-id'. PARENT_ID is '' |
|
4205.5.1
by Andrew Bennetts
Initial stab at adapting Robert's journalled_inventory serialisation into inventory_delta serialisation. |
499 |
when a delete has been recorded or when recording a new root entry. |
500 |
||
4501.1.1
by Robert Collins
Add documentation describing how and why we use inventory deltas, and what can go wrong with them. |
501 |
|
502 |
Delta consistency |
|
503 |
================= |
|
504 |
||
505 |
Inventory deltas and more broadly changes between trees are a significant part |
|
506 |
of bzr's core operations: they are key components in status, diff, commit, |
|
507 |
and merge (although merge uses tree transform, deltas contain the changes that |
|
508 |
are applied to the transform). Our ability to perform a given operation depends |
|
509 |
on us creating consistent deltas between trees. Inconsistent deltas lead to |
|
510 |
errors and bugs, or even just unexpected conflicts. |
|
511 |
||
512 |
An inventory delta is a transform to change an inventory A into another |
|
513 |
inventory B (in patch terms its a perfect patch). Sometimes, for instance in a |
|
514 |
regular commit, inventory B is known at the time we create the delta. Other |
|
515 |
times, B is not known because the user is requesting that some parts of the |
|
516 |
second inventory they have are masked out from consideration. When this happens |
|
517 |
we create a delta that when applied to A creates a B we haven't seen in total |
|
518 |
before. In this situation we need to ensure that B will be internally |
|
519 |
consistent. Deltas are unidirectional, a delta(A, B) creates B from A, but |
|
520 |
cannot be used to create A from B. |
|
521 |
||
522 |
Deltas are expressed as a list of (oldpath, newpath, fileid, entry) tuples. The |
|
523 |
fileid, entry elements are normative; the old and new paths are strong hints |
|
524 |
but not currently guaranteed to be accurate. (This is a shame and something we |
|
525 |
should tighten up). Deltas are required to list all removals explicitly - |
|
526 |
removing the parent of an entry doesn't remove the entry. |
|
527 |
||
528 |
Applying a delta to an inventory consists of: |
|
529 |
- removing all fileids for which entry is None |
|
530 |
- adding or replacing all other fileids |
|
531 |
- detecting consistency errors |
|
532 |
||
533 |
An interesting aspect of delta inconsistencies is when we notice them: |
|
534 |
- Silent errors which our application logic misses |
|
535 |
- Visible errors we catch during application, so bad data isn't stored in |
|
536 |
the system. |
|
537 |
||
538 |
The minimum safe level for our application logic would be to catch all errors |
|
539 |
during application. Making generation never generate inconsistent deltas is |
|
540 |
a seperate but necessary condition for robust code. |
|
541 |
||
542 |
An inconsistent delta is one which: |
|
543 |
- after application to an inventory the inventory is an impossible state. |
|
544 |
- has the same fileid, or oldpath(not-None), or newpath(not-None) multiple |
|
545 |
times. |
|
546 |
- has a fileid field different to the entry.fileid in the same item in the |
|
547 |
delta. |
|
4526.9.10
by Robert Collins
Note that an inconsistent entry in a delta is inconsistent. |
548 |
- has an entry that is in an impossible state (e.g. a directory with a text |
549 |
size) |
|
4501.1.1
by Robert Collins
Add documentation describing how and why we use inventory deltas, and what can go wrong with them. |
550 |
|
551 |
Forms of inventory inconsistency deltas can carry/cause: |
|
552 |
- An entry newly introduced to a path without also removing or relocating any |
|
553 |
existing entry at that path. (Duplicate paths) |
|
554 |
- An entry whose parent id isn't present in the tree. (Missing parent). |
|
555 |
- Having oldpath or newpath not be actual original path or resulting path. |
|
556 |
(Wrong path) |
|
557 |
- An entry whose parent is not a directory. (Under non-directory). |
|
558 |
- An entry that is internally inconsistent. |
|
4526.9.2
by Robert Collins
Handle deltas with new paths not matching the actual path. |
559 |
- An entry that is already present in the tree (Duplicate id) |
4501.1.1
by Robert Collins
Add documentation describing how and why we use inventory deltas, and what can go wrong with them. |
560 |
|
561 |
Known causes of inconsistency: |
|
562 |
- A 'new' entry which the inventory already has - when this is a directory |
|
563 |
even arbitrary file ids under the 'new' entry are more likely to collide on |
|
564 |
paths. |
|
565 |
- Removing a directory without recursively removing its children - causes |
|
566 |
Missing parent. |
|
567 |
- Recording a change to an entry without including all changed entries found |
|
568 |
following its parents up to and includin the root - can cause duplicate |
|
569 |
paths, missing parents, wrong path, under non-directory. |
|
570 |
||
571 |
Avoiding inconsistent deltas |
|
572 |
---------------------------- |
|
573 |
||
574 |
The simplest thing is to never create partial deltas, as it is trivial to |
|
575 |
be consistent when all data is examined every time. However users sometimes |
|
576 |
want to specify a subset of the changes in their tree when they do an operation |
|
577 |
which needs to create a delta - such as commit. |
|
578 |
||
579 |
We have a choice about handling user requests that can generate inconsistent |
|
580 |
deltas. We can alter or interpret the request in such a way that the delta will |
|
581 |
be consistent, but perhaps larger than the user had intended. Or we can |
|
582 |
identify problematic situations and abort, specifying to the user why we have |
|
583 |
aborted and likely things they can do to make their request generate a |
|
584 |
consistent delta. |
|
585 |
||
586 |
Currently we attempt to expand/interpret the request so that the user is not |
|
587 |
required to understand all the internal constraints of the system: if they |
|
4570.2.3
by Robert Collins
Change the way iter_changes treats specific files to prevent InconsistentDeltas. |
588 |
request 'foo/bar' we automatically include foo. This works but can surprise |
589 |
the user sometimes when things they didn't explicitly request are committed. |
|
590 |
||
591 |
Different trees can use different algorithms to expand the request as long as |
|
592 |
they produce consistent deltas. As part of getting a consistent UI we require |
|
4853.1.1
by Patrick Regan
Removed trailing whitespace from files in doc directory |
593 |
that all trees expand the paths requested downwards. Beyond that as long as |
4570.2.3
by Robert Collins
Change the way iter_changes treats specific files to prevent InconsistentDeltas. |
594 |
the delta is consistent it is up to the tree. |
595 |
||
596 |
Given two trees, source and target, and a set of selected file ids to check for |
|
597 |
changes and if changed in a delta between them, we have to expand that set by |
|
598 |
the following rules, to get consistent deltas. The test for consistency is that |
|
599 |
if the resulting delta is applied to source, to create a third tree 'output', |
|
600 |
and the paths in the delta match the paths in source and output, only one file |
|
4853.1.1
by Patrick Regan
Removed trailing whitespace from files in doc directory |
601 |
id is at each path in output, and no file ids are missing parents, then the |
4570.2.3
by Robert Collins
Change the way iter_changes treats specific files to prevent InconsistentDeltas. |
602 |
delta is consistent. |
603 |
||
604 |
Firstly, the parent ids to the root for all of the file ids that have actually |
|
605 |
changed must be considered. Unless they are all examined the paths in the delta |
|
606 |
may be wrong. |
|
607 |
||
608 |
Secondly, when an item included in the delta has a new path which is the same |
|
609 |
as a path in source, the fileid of that path in source must be included. |
|
610 |
Failing to do this leads to multiple ids tryin to share a path in output. |
|
611 |
||
612 |
Thirdly, when an item changes its kind from 'directory' to anything else in the |
|
613 |
delta, all of the direct children of the directory in source must be included. |