9
9
The basic idea is that for a directory in a tree (committed or otherwise), we
10
10
will have a single scalar value. If these values are the same, the contents of
11
the subtree under that directory are necessarily the same.
11
the subtree under that directory are necessarily the same.
13
13
This is intended to help with these use cases, by allowing them to quickly skip
14
14
over directories with no relevant changes, and to detect when a directory has
27
27
Most of this will be hidden behind the Tree interface. This should cover
28
28
``log -v``, ``diff``, ``status``, ``merge`` (and implicit merge during
31
31
tree.iter_changes(other_tree)
32
32
tree.get_file_lines(file_id) # and get_file, get_file_text
38
38
compare to all the trees. Commit currently needs to compare the working
39
39
tree to all the parent trees, which is needed to update the last_modified
40
40
field and would be unnecessary if we removed that field (for both files
41
and directories) and did not store per-file graphs.
41
and directories) and did not store per-file graphs.
42
42
This would potentially speed up commit after merge.
44
44
Verbose commit also displays the merged files, which does
51
51
Log is interested in two operations: finding the revisions that touched
52
anything inside a directory, and getting the differences between
52
anything inside a directory, and getting the differences between
53
53
consecutive revisions (possibly filtered to a directory)::
55
55
find_touching_revisions(branch, file_id) # should be on Branch?
90
90
Hashes converge: if you modify and then modify back, you get the same
91
91
hash. This is a pro because you can detect that there were ultimately
92
92
no significant changes. And also a con: you cannot use these hashes to form a graph
93
because they get cycles.
93
because they get cycles.
96
96
* Are the values unique across the whole tree, or only when comparing
97
97
different versions of the same object?
133
133
It is desirable that we have a hash that covers all data, to guard
134
134
against bugs, transmission errors, or users trying to hand-hack files.
135
Since we need one hash of everything in the tree, perhaps we should also
135
Since we need one hash of everything in the tree, perhaps we should also
136
136
use it for the fingerprint.
138
138
Testaments explicitly separate the form used for hashing/signing from
143
143
stored data which is not protected by the signature: this data is less
144
144
important, but corruption of it would still cause problems.
145
145
We have encountered some specific problems with disagreement between
146
inventories as to the last-change of files, which is currently unsigned.
146
inventories as to the last-change of files, which is currently unsigned.
147
147
These problems can be introduced by ghosts.
149
149
If we hash the representation, there is still a way to support old
156
156
* Is hashing substantially slower than other possible approaches?
158
158
We already hash all the plain files. Except in unusual cases, the
159
directory metadata will be substantially smaller: perhaps 200:1 as a
159
directory metadata will be substantially smaller: perhaps 200:1 as a
162
162
When building a bzr tree, we spend on the order of 100ms hashing all the
163
163
source lines to validate them (about 13MB of source).
166
* Can you calculate one from a directory in the working tree? Without a basis?
166
* Can you calculate one from a directory in the working tree? Without a basis?
168
This seems possible with either hashes or revision ids.
168
This seems possible with either hashes or revision ids.
170
170
Using last_changed means that calculating the fingerprint from a working
171
171
tree necessarily requires reading the inventory for the basis
199
199
This does rule out for example using ``last_modified=None`` or
200
200
``='current:'`` to mean "changed in the working tree." Even if this is
201
201
not supported there seems some risk that we would get the same
202
fingerprint for trees that are actually different.
202
fingerprint for trees that are actually different.
205
205
hypothetical revision id to the tree for uncommitted files. In that
206
206
case there is some risk that the not-yet-committed id would become
207
207
visible or committed.
210
210
* Can we use an "approximate basis"?
212
212
When using radix trees, you may need context beyond the specific
213
directory being compared.
216
* Can you get the fingerprint of parents directories with only selected file ids
213
directory being compared.
216
* Can you get the fingerprint of parents directories with only selected file ids
217
217
taken from the working tree?
219
219
With hashes, we'd want to carry through the unselected files and
220
directories from the values they had in the parent revision.
223
* Are unbalanced trees a significant problem? Trees can be unbalanced by having
224
many directories (deep or wide), or many files per directory.
220
directories from the values they had in the parent revision.
223
* Are unbalanced trees a significant problem? Trees can be unbalanced by having
224
many directories (deep or wide), or many files per directory.
226
226
For small trees like bzr, 744 of 874 are in the bzrlib subtree. In
227
227
general, larger trees are more balanced, because humans, editors and
228
228
other tools have trouble managing very unbalanced trees. But there are
230
230
entries in one directory.
233
* Should we use a radix tree approach where fingerprints are calculated on a synthetic
233
* Should we use a radix tree approach where fingerprints are calculated on a synthetic
234
234
tree that is by definition balanced, even when the actual tree is unbalanced?
258
258
as stored, or change the sort order for the inventory, or synthesize
259
259
per-directory inventories in memory for hashing.
261
However, XML is somewhat redundant and slow to parse/generate; and
261
However, xml is somewhat redundant and slow to parse/generate; and
262
262
reading the whole thing before comparing some sections is only a
263
263
partial win. It may be a smaller change but we'd be preserving
264
264
things we want to change.
268
268
This has some consequences for how we can upgrade it in future: all
269
269
the changed directories need to be rewritten up to the revision level.
271
1. If we address directories by hash we need hash-addressed
271
1. If we address directories by hash we need hash-addressed
274
1. If we address directories by hash then for consistency we'd probably
274
1. If we address directories by hash then for consistency we'd probably
275
275
(not necessarily) want to address file texts by hash.
277
277
1. The per-file graph can't be indexed by hash because they can converge, so we
316
316
If the version of a file or directory is identified by a hash, we can't
317
317
use that to point into a per-file graph. We can have a graph indexed by
318
318
``(file_id, hash, revision_id)``. The last-modified could be stored as
321
321
The graph would no longer be core data; it could be always present but
322
322
might be rebuilt. Treating it as non-core data may make some changes
323
323
like shallow branches easier?