2659.2.1
by Martin Pool
Start of directory-fingerprint documentation |
1 |
Directory fingerprints |
2 |
====================== |
|
3 |
||
4 |
.. contents:: :local: |
|
5 |
||
6 |
Introduction |
|
7 |
------------ |
|
8 |
||
9 |
The basic idea is that for a directory in a tree (committed or otherwise), we |
|
10 |
will have a single scalar value. If these values are the same, the contents of |
|
4853.1.1
by Patrick Regan
Removed trailing whitespace from files in doc directory |
11 |
the subtree under that directory are necessarily the same. |
2659.2.1
by Martin Pool
Start of directory-fingerprint documentation |
12 |
|
13 |
This is intended to help with these use cases, by allowing them to quickly skip |
|
14 |
over directories with no relevant changes, and to detect when a directory has |
|
15 |
changed: |
|
16 |
||
17 |
* diff/status (both local trees and historical trees) |
|
18 |
* merge |
|
19 |
* log -v |
|
20 |
* log on a directory |
|
21 |
* commit |
|
22 |
||
23 |
||
24 |
Use-case oriented APIs |
|
25 |
---------------------- |
|
26 |
||
27 |
Most of this will be hidden behind the Tree interface. This should cover |
|
28 |
``log -v``, ``diff``, ``status``, ``merge`` (and implicit merge during |
|
4853.1.1
by Patrick Regan
Removed trailing whitespace from files in doc directory |
29 |
push, pull, update):: |
2659.2.1
by Martin Pool
Start of directory-fingerprint documentation |
30 |
|
31 |
tree.iter_changes(other_tree) |
|
32 |
tree.get_file_lines(file_id) # and get_file, get_file_text |
|
33 |
||
34 |
``commit`` |
|
35 |
~~~~~~~~~~ |
|
36 |
||
37 |
Commit is similar to ``iter_changes``, but different because it needs to |
|
38 |
compare to all the trees. Commit currently needs to compare the working |
|
39 |
tree to all the parent trees, which is needed to update the last_modified |
|
40 |
field and would be unnecessary if we removed that field (for both files |
|
4853.1.1
by Patrick Regan
Removed trailing whitespace from files in doc directory |
41 |
and directories) and did not store per-file graphs. |
2659.2.1
by Martin Pool
Start of directory-fingerprint documentation |
42 |
This would potentially speed up commit after merge. |
43 |
||
44 |
Verbose commit also displays the merged files, which does |
|
45 |
require looking at all parents of files that aren't identical |
|
46 |
to the left-hand parent. |
|
47 |
||
48 |
``log`` |
|
49 |
~~~~~~~ |
|
50 |
||
51 |
Log is interested in two operations: finding the revisions that touched |
|
4853.1.1
by Patrick Regan
Removed trailing whitespace from files in doc directory |
52 |
anything inside a directory, and getting the differences between |
2659.2.1
by Martin Pool
Start of directory-fingerprint documentation |
53 |
consecutive revisions (possibly filtered to a directory):: |
54 |
||
55 |
find_touching_revisions(branch, file_id) # should be on Branch? |
|
56 |
||
57 |
Log shows the revisions that merged a change. At the moment that is not |
|
58 |
included in the per-file graph, and it would also not be visible if the |
|
59 |
directories were hashed. |
|
60 |
||
61 |
||
62 |
||
63 |
||
64 |
Open questions |
|
65 |
-------------- |
|
66 |
||
67 |
* Is this a good idea at all? |
|
68 |
||
69 |
If changing a file changes all its parent directories up to the root it |
|
70 |
will cause more churn on commit. (We currently update the all-in-one |
|
71 |
inventory, but only have to update one line of it.) |
|
72 |
||
73 |
Every time a child changes, we'll get a new node in the per-directory |
|
74 |
graph. This is generally useful: it allows bzr log to do the default |
|
75 |
mode easily, which is to show all changes under that directory. The |
|
76 |
less common operation, ``log --no-recursive`` is still possible by |
|
77 |
looking only at when the directory itself was renamed, added or removed. |
|
78 |
(That is what the directory graph describes in bzr 0.18 and it is rarely |
|
79 |
useful.) |
|
80 |
||
81 |
||
82 |
* Should these be hashes or revision ids or something else? |
|
83 |
||
84 |
Pros of using hashes: hashes are easy to generate by a foreign branch |
|
85 |
plugin (e.g. bzr-svn). They don't need to get recursive last-changed |
|
86 |
from the foreign branch, or to walk back through history. They just |
|
87 |
need the relevant directory state, which any system we support can |
|
88 |
answer. |
|
89 |
||
90 |
Hashes converge: if you modify and then modify back, you get the same |
|
91 |
hash. This is a pro because you can detect that there were ultimately |
|
92 |
no significant changes. And also a con: you cannot use these hashes to form a graph |
|
4853.1.1
by Patrick Regan
Removed trailing whitespace from files in doc directory |
93 |
because they get cycles. |
94 |
||
95 |
||
2659.2.1
by Martin Pool
Start of directory-fingerprint documentation |
96 |
* Are the values unique across the whole tree, or only when comparing |
97 |
different versions of the same object? |
|
98 |
||
99 |
If we use last-changed revisions, then they will be very not unique |
|
100 |
across the whole tree. To look up the contents, you must pass a |
|
101 |
composite key like ``(file_id, last_changed)``. |
|
102 |
||
103 |
If we use hashes they will be same only when the two contain the same |
|
104 |
contents. Since we say that file ids must be unique, this |
|
105 |
means they will match if and only if they are empty. We might relax |
|
106 |
that in future when we introduce path tokens. |
|
107 |
||
108 |
||
109 |
* Is it reasonable to assume hashes won't collide? |
|
4853.1.1
by Patrick Regan
Removed trailing whitespace from files in doc directory |
110 |
|
2659.2.1
by Martin Pool
Start of directory-fingerprint documentation |
111 |
The odds of SHA-1 hashes colliding "accidentally" are vanishingly small. |
112 |
||
113 |
It is possible that a `preimage attack`_ against SHA-1 may be discovered |
|
114 |
in the future. Since we're not proposing in this document to make |
|
115 |
revision-ids be SHA-1, if SHA-1 was obsoleted then we could rewrite the |
|
116 |
contents of revisions but would not need to rename revisions. So the |
|
117 |
impact of such a migration should just be a format upgrade, and a |
|
118 |
recommendation (but not requirement) to re-sign revisions. |
|
119 |
||
120 |
.. _`preimage attack`: http://tools.ietf.org/html/rfc4270 |
|
121 |
||
122 |
||
123 |
* If we use hashes, should it be the hash of the |
|
124 |
representation stored for a directory? |
|
125 |
||
126 |
In other words, should we pun the representation of the directory with |
|
127 |
the form used for validation. |
|
128 |
||
129 |
If there's some data stored that's not in the hash it's problematic. |
|
130 |
The hash in no longer (effectively) uniquely identifies the |
|
131 |
representation. |
|
132 |
||
133 |
It is desirable that we have a hash that covers all data, to guard |
|
134 |
against bugs, transmission errors, or users trying to hand-hack files. |
|
4853.1.1
by Patrick Regan
Removed trailing whitespace from files in doc directory |
135 |
Since we need one hash of everything in the tree, perhaps we should also |
2659.2.1
by Martin Pool
Start of directory-fingerprint documentation |
136 |
use it for the fingerprint. |
137 |
||
138 |
Testaments explicitly separate the form used for hashing/signing from |
|
139 |
the form used for storage. This allows us to change the storage form |
|
140 |
without breaking existing GPG signatures. The downside is that we need |
|
141 |
to do work O(tree) to make a testament, and this slows down signing, |
|
142 |
verifying and generating bundles. It also means that there is some |
|
143 |
stored data which is not protected by the signature: this data is less |
|
144 |
important, but corruption of it would still cause problems. |
|
145 |
We have encountered some specific problems with disagreement between |
|
4853.1.1
by Patrick Regan
Removed trailing whitespace from files in doc directory |
146 |
inventories as to the last-change of files, which is currently unsigned. |
2659.2.1
by Martin Pool
Start of directory-fingerprint documentation |
147 |
These problems can be introduced by ghosts. |
148 |
||
149 |
If we hash the representation, there is still a way to support old |
|
150 |
signatures, assuming that we never discard irreplaceable information. |
|
151 |
The signature should say what format it applies to (similar to |
|
152 |
testaments), and we could transform in memory the tree back to that |
|
153 |
format. |
|
154 |
||
155 |
||
156 |
* Is hashing substantially slower than other possible approaches? |
|
157 |
||
158 |
We already hash all the plain files. Except in unusual cases, the |
|
4853.1.1
by Patrick Regan
Removed trailing whitespace from files in doc directory |
159 |
directory metadata will be substantially smaller: perhaps 200:1 as a |
2659.2.1
by Martin Pool
Start of directory-fingerprint documentation |
160 |
rule of thumb. |
161 |
||
162 |
When building a bzr tree, we spend on the order of 100ms hashing all the |
|
163 |
source lines to validate them (about 13MB of source). |
|
164 |
||
165 |
||
4853.1.1
by Patrick Regan
Removed trailing whitespace from files in doc directory |
166 |
* Can you calculate one from a directory in the working tree? Without a basis? |
2659.2.1
by Martin Pool
Start of directory-fingerprint documentation |
167 |
|
4853.1.1
by Patrick Regan
Removed trailing whitespace from files in doc directory |
168 |
This seems possible with either hashes or revision ids. |
2659.2.1
by Martin Pool
Start of directory-fingerprint documentation |
169 |
|
170 |
Using last_changed means that calculating the fingerprint from a working |
|
171 |
tree necessarily requires reading the inventory for the basis |
|
172 |
revision, so that we know when unchanged files were last changed. With |
|
173 |
hashes we could calculate them using the working tree information alone. |
|
174 |
It's true that we will often then compare that information to the basis |
|
175 |
tree (e.g. for simple ``bzr diff``), but we may only have to compare at |
|
176 |
the top level, and sometimes we're comparing to a |
|
177 |
different tree. This also touches on whether we should store |
|
178 |
``last_modified`` for files, rather than directories. |
|
179 |
||
180 |
For revision ids we need to assign a value to use for uncommitted |
|
181 |
changes, but see below about the problems of this. |
|
182 |
||
183 |
In some ways it would be elegant to say (hypothetical):: |
|
184 |
||
185 |
wt.get_root().get_last_modified() == branch.get_last_revision() |
|
186 |
||
187 |
to know that nothing was changed; but this may not be much better than |
|
188 |
:: |
|
189 |
||
190 |
wt.get_root().get_hash() == |
|
191 |
branch.get_basis().get_root().get_hash() |
|
192 |
||
193 |
||
194 |
* Can you use this to compare (directories from) two working trees? |
|
195 |
||
196 |
If you can generate it from a working tree, you should be able to use it |
|
197 |
to compare them. |
|
198 |
||
199 |
This does rule out for example using ``last_modified=None`` or |
|
200 |
``='current:'`` to mean "changed in the working tree." Even if this is |
|
201 |
not supported there seems some risk that we would get the same |
|
4853.1.1
by Patrick Regan
Removed trailing whitespace from files in doc directory |
202 |
fingerprint for trees that are actually different. |
203 |
||
204 |
We could assign a |
|
2659.2.1
by Martin Pool
Start of directory-fingerprint documentation |
205 |
hypothetical revision id to the tree for uncommitted files. In that |
206 |
case there is some risk that the not-yet-committed id would become |
|
207 |
visible or committed. |
|
208 |
||
4853.1.1
by Patrick Regan
Removed trailing whitespace from files in doc directory |
209 |
|
2659.2.1
by Martin Pool
Start of directory-fingerprint documentation |
210 |
* Can we use an "approximate basis"? |
211 |
||
212 |
When using radix trees, you may need context beyond the specific |
|
4853.1.1
by Patrick Regan
Removed trailing whitespace from files in doc directory |
213 |
directory being compared. |
214 |
||
215 |
||
216 |
* Can you get the fingerprint of parents directories with only selected file ids |
|
2659.2.1
by Martin Pool
Start of directory-fingerprint documentation |
217 |
taken from the working tree? |
218 |
||
219 |
With hashes, we'd want to carry through the unselected files and |
|
4853.1.1
by Patrick Regan
Removed trailing whitespace from files in doc directory |
220 |
directories from the values they had in the parent revision. |
221 |
||
222 |
||
223 |
* Are unbalanced trees a significant problem? Trees can be unbalanced by having |
|
224 |
many directories (deep or wide), or many files per directory. |
|
225 |
||
2659.2.1
by Martin Pool
Start of directory-fingerprint documentation |
226 |
For small trees like bzr, 744 of 874 are in the bzrlib subtree. In |
227 |
general, larger trees are more balanced, because humans, editors and |
|
228 |
other tools have trouble managing very unbalanced trees. But there are |
|
229 |
exceptions: Aaron has one tree with 20,000 generated but versioned |
|
230 |
entries in one directory. |
|
231 |
||
232 |
||
4853.1.1
by Patrick Regan
Removed trailing whitespace from files in doc directory |
233 |
* Should we use a radix tree approach where fingerprints are calculated on a synthetic |
2659.2.1
by Martin Pool
Start of directory-fingerprint documentation |
234 |
tree that is by definition balanced, even when the actual tree is unbalanced? |
235 |
||
236 |
||
237 |
* What are the specific advantages of using recursive-last-modified rather than |
|
238 |
hashes? |
|
239 |
||
240 |
It may be a smaller step change. |
|
241 |
||
242 |
It's a bidirectional link: given a directory text identifier ``(file_id, |
|
243 |
last_changed)`` you can look up the revision that last changed it. |
|
244 |
||
245 |
From the preceding, even without the per-file graph you can skip through |
|
246 |
the history of this file: go to the last-changed revision, look at all |
|
247 |
its parents and repeat. |
|
248 |
||
249 |
||
250 |
* Is it a smaller change to use recursive-last-modified on directories? |
|
251 |
||
252 |
Probably yes: |
|
253 |
||
254 |
1. We can just put it into the current inventory format without changing |
|
255 |
anything else. |
|
256 |
||
257 |
By contrast to use a hash we'd have to either split up the inventory |
|
258 |
as stored, or change the sort order for the inventory, or synthesize |
|
259 |
per-directory inventories in memory for hashing. |
|
260 |
||
261 |
However, xml is somewhat redundant and slow to parse/generate; and |
|
262 |
reading the whole thing before comparing some sections is only a |
|
263 |
partial win. It may be a smaller change but we'd be preserving |
|
264 |
things we want to change. |
|
265 |
||
266 |
1. At present we rarely hash storage representations, only file texts. |
|
267 |
This is not a large technical change, but it is a conceptual change. |
|
268 |
This has some consequences for how we can upgrade it in future: all |
|
269 |
the changed directories need to be rewritten up to the revision level. |
|
270 |
||
4853.1.1
by Patrick Regan
Removed trailing whitespace from files in doc directory |
271 |
1. If we address directories by hash we need hash-addressed |
2659.2.1
by Martin Pool
Start of directory-fingerprint documentation |
272 |
storage. |
273 |
||
4853.1.1
by Patrick Regan
Removed trailing whitespace from files in doc directory |
274 |
1. If we address directories by hash then for consistency we'd probably |
2659.2.1
by Martin Pool
Start of directory-fingerprint documentation |
275 |
(not necessarily) want to address file texts by hash. |
276 |
||
277 |
1. The per-file graph can't be indexed by hash because they can converge, so we |
|
278 |
need to either rework or dispose of the per-file graph. |
|
279 |
||
280 |
||
281 |
* Any possibilities for avoiding hashes recurring? |
|
282 |
||
283 |
1. Hash along with an identification of the parents (as in hg). Then you |
|
284 |
can't convert a tree without all its basis trees, and there is still |
|
285 |
convergence when the same merge is done by two people, and you can't |
|
286 |
create it directly from the working tree. |
|
287 |
||
288 |
1. Include last-modified revision id in the hash. |
|
289 |
||
290 |
1. Index by ``(revision, hash)`` or vice versa. |
|
291 |
||
292 |
1. Store a per-file graph and allow it to have repeated keys. The graph |
|
293 |
would tell you about all the parent texts ever seen; you would need |
|
294 |
to use revision graph information to resolve ambiguities. |
|
295 |
||
296 |
||
297 |
* What are the specific disadvantages of using recursive-last-modified rather than |
|
298 |
hashes? |
|
299 |
||
300 |
To calculate the last-changed revision, given the last-changed |
|
301 |
information of the contained files, you need to look at the revision |
|
302 |
graph. They're not enough because you need to know the relations |
|
303 |
between the mentioned revisions. In a merge it's possible the correct |
|
304 |
directory last-modified will not be the same as that of any of the files |
|
305 |
within it. This can also happen when a file is removed (deleted or |
|
306 |
renamed) from a directory. |
|
307 |
||
308 |
||
309 |
* Should we split up storage of the inventories? |
|
310 |
||
311 |
This is not quite the same but connected. |
|
312 |
||
313 |
||
314 |
* How does this relate to per-file/per-directory hashes? |
|
315 |
||
316 |
If the version of a file or directory is identified by a hash, we can't |
|
317 |
use that to point into a per-file graph. We can have a graph indexed by |
|
318 |
``(file_id, hash, revision_id)``. The last-modified could be stored as |
|
4853.1.1
by Patrick Regan
Removed trailing whitespace from files in doc directory |
319 |
part of this graph. |
320 |
||
2659.2.1
by Martin Pool
Start of directory-fingerprint documentation |
321 |
The graph would no longer be core data; it could be always present but |
322 |
might be rebuilt. Treating it as non-core data may make some changes |
|
323 |
like shallow branches easier? |
|
324 |
||
325 |
||
326 |
* How do you ask a tree for a given text? |
|
327 |
||
328 |
Right now we say :: |
|
329 |
||
330 |
revision_tree.get_file_lines(file_id) |
|
331 |
||
332 |
so the choice of storage is hidden behind the revision tree: it could be |
|
333 |
accessed by ``(file_id, last_changed)`` or by hash or otherwise. |
|
334 |
||
335 |
At the moment the Repository exports a friend api to RevisionTree, |
|
336 |
currently usually talking in VersionedFiles. |
|
337 |
||
338 |
We probably wouldn't want Repository to expose a ``get_text_for_sha1()`` |
|
339 |
interface because that would be very difficult to support on old |
|
340 |
repositories or on foreign branches. |
|
341 |
||
342 |
||
343 |
Conclusions |
|
344 |
----------- |
|
345 |
||
346 |
||
347 |
Design changes |
|
348 |
-------------- |
|
349 |
||
350 |
||
351 |
||
352 |
||
353 |
API changes |
|
354 |
----------- |
|
355 |
||
356 |
||
4853.1.1
by Patrick Regan
Removed trailing whitespace from files in doc directory |
357 |
.. |
2659.2.1
by Martin Pool
Start of directory-fingerprint documentation |
358 |
vim: filetype=rst textwidth=78 expandtab spelllang=en spell |
359 |