~bzr-pqm/bzr/bzr.dev

Viewing changes to doc/developers/directory-fingerprints.txt

Committer: Andrew Bennetts
Date: 2009-12-03 05:57:41 UTC
mfrom: (4857 +trunk)
mto: This revision was merged to the branch mainline in revision 4869.
Revision ID: andrew.bennetts@canonical.com-20091203055741-vmmg0fmjgjw2pwvu

Merge lp:bzr.

files added:
bzrlib/tests/per_foreign_vcs/test_repository.py

files removed:
bzrlib/textui.py

files modified:
NEWS

bzrlib/_btree_serializer_py.py

bzrlib/_known_graph_py.py

bzrlib/_known_graph_pyx.pyx

bzrlib/_static_tuple_py.py

bzrlib/branch.py

bzrlib/btree_index.py

bzrlib/builtins.py

bzrlib/bundle/__init__.py

bzrlib/bzrdir.py

bzrlib/commands.py

bzrlib/config.py

bzrlib/conflicts.py

bzrlib/export/zip_exporter.py

bzrlib/fetch.py

bzrlib/foreign.py

bzrlib/graph.py

bzrlib/groupcompress.py

bzrlib/help_topics/__init__.py

bzrlib/help_topics/en/conflicts.txt

bzrlib/index.py

bzrlib/knit.py

bzrlib/lockdir.py

bzrlib/log.py

bzrlib/merge.py

bzrlib/merge_directive.py

bzrlib/osutils.py

bzrlib/push.py

bzrlib/repository.py

bzrlib/revision.py

bzrlib/shelf_ui.py

bzrlib/static_tuple.py

bzrlib/tests/__init__.py

bzrlib/tests/blackbox/test_export.py

bzrlib/tests/blackbox/test_ls.py

bzrlib/tests/blackbox/test_merge.py

bzrlib/tests/blackbox/test_push.py

bzrlib/tests/blackbox/test_send.py

bzrlib/tests/blackbox/test_serve.py

bzrlib/tests/http_server.py

bzrlib/tests/per_bzrdir/test_bzrdir.py

bzrlib/tests/per_foreign_vcs/__init__.py

bzrlib/tests/per_intertree/__init__.py

bzrlib/tests/per_workingtree/test_content_filters.py

bzrlib/tests/ssl_certs/create_ssls.py

bzrlib/tests/ssl_certs/server.crt

bzrlib/tests/ssl_certs/server.csr

bzrlib/tests/ssl_certs/server_with_pass.key

bzrlib/tests/ssl_certs/server_without_pass.key

bzrlib/tests/test__known_graph.py

bzrlib/tests/test__static_tuple.py

bzrlib/tests/test_btree_index.py

bzrlib/tests/test_graph.py

bzrlib/tests/test_index.py

bzrlib/tests/test_osutils.py

bzrlib/tests/test_urlutils.py

bzrlib/trace.py

bzrlib/transform.py

bzrlib/tree.py

bzrlib/urlutils.py

bzrlib/util/_bencode_py.py

bzrlib/version.py

bzrlib/workingtree.py

bzrlib/workingtree_4.py

doc/default.css

doc/developers/HACKING.txt

doc/developers/add.txt

doc/developers/api-versioning.txt

doc/developers/apport.txt

doc/developers/authentication-ring.txt

doc/developers/bug-handling.txt

doc/developers/bundles.txt

doc/developers/case-insensitive-file-systems.txt

doc/developers/colocated-branches.txt

doc/developers/commit.txt

doc/developers/container-format.txt

doc/developers/content-filtering.txt

doc/developers/cycle.txt

doc/developers/development-repo.txt

doc/developers/diff.txt

doc/developers/directory-fingerprints.txt

doc/developers/ec2.txt

doc/developers/improved_chk_index.txt

doc/developers/incremental-push-pull.txt

doc/developers/index-plain.txt

doc/developers/index.txt

doc/developers/inventory.txt

doc/developers/last-modified.txt

doc/developers/network-protocol.txt

doc/developers/overview.txt

doc/developers/performance-use-case-analysis.txt

doc/developers/planned-change-integration.txt

doc/developers/planned-performance-changes.txt

doc/developers/plans.txt

doc/developers/plugin-api.txt

doc/developers/ppa.txt

doc/developers/process.txt

doc/developers/profiling.txt

doc/developers/releasing.txt

doc/developers/repository-stream.txt

doc/developers/repository.txt

doc/developers/revert.txt

doc/developers/specifications.txt

doc/developers/status.txt

doc/developers/testing.txt

doc/developers/tortoise-strategy.txt

doc/developers/update.txt

doc/developers/win32_build_setup.txt

doc/en/mini-tutorial/index.txt

doc/en/tutorials/centralized_workflow.txt

doc/en/tutorials/tutorial.txt

doc/en/tutorials/using_bazaar_with_launchpad.txt

doc/en/user-guide/adv_merging.txt

doc/en/user-guide/branching_a_project.txt

doc/en/user-guide/configuring_bazaar.txt

doc/en/user-guide/controlling_registration.txt

doc/en/user-guide/distributed_intro.txt

doc/en/user-guide/http_smart_server.txt

doc/en/user-guide/index-plain.txt

doc/en/user-guide/index.txt

doc/en/user-guide/introducing_bazaar.txt

doc/en/user-guide/plugins.txt

doc/en/user-guide/publishing_a_branch.txt

doc/en/user-guide/recording_changes.txt

doc/en/user-guide/resolving_conflicts.txt

doc/en/user-guide/reviewing_changes.txt

doc/en/user-guide/sending_changes.txt

doc/en/user-guide/server.txt

doc/en/user-guide/setting_up_email.txt

doc/en/user-guide/shared_repository_layouts.txt

doc/en/user-guide/shelving_changes.txt

doc/en/user-guide/specifying_revisions.txt

doc/en/user-guide/stacked.txt

doc/en/user-guide/version_info.txt

doc/en/user-guide/web_browsing.txt

doc/en/user-guide/zen.txt

doc/es/index.txt

doc/es/mini-tutorial/index.txt

doc/es/user-guide/index-plain.txt

doc/es/user-guide/index.txt

doc/es/user-guide/resolving_conflicts.txt

doc/es/user-guide/version_info.txt

doc/index.es.txt

doc/index.ru.txt

doc/ja/tutorials/using_bazaar_with_launchpad.txt

doc/ja/upgrade-guide/data_migration.txt

doc/ja/user-guide/entering_commands.txt

doc/ja/user-guide/http_smart_server.txt

doc/ja/user-guide/introducing_bazaar.txt

doc/ja/user-guide/setting_up_email.txt

doc/ja/user-guide/version_info.txt

doc/ja/user-reference/index.txt

doc/ru/index.txt

doc/ru/mini-tutorial/index.txt

doc/ru/tutorials/centralized_workflow.txt

doc/ru/tutorials/tutorial.txt

doc/ru/tutorials/using_bazaar_with_launchpad.txt

doc/ru/user-guide/branching_a_project.txt

doc/ru/user-guide/index-plain.txt

doc/ru/user-guide/index.txt

doc/ru/user-guide/introducing_bazaar.txt

doc/ru/user-guide/specifying_revisions.txt

doc/ru/user-guide/zen.txt

doc/developers/directory-fingerprints.txt

The basic idea is that for a directory in a tree (committed or otherwise), we
will have a single scalar value. If these values are the same, the contents of
the subtree under that directory are necessarily the same.
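
One way to picture such a scalar value is a recursive hash over a directory's
entries. The following is only a minimal sketch under assumptions that are not
in this document: the tree is modelled as a nested dict (a hypothetical layout,
not the bzrlib Tree API), SHA-1 is the hash, and only names, kinds and file
content are covered.

```python
import hashlib


def fingerprint(tree):
    """Return a hex SHA-1 fingerprint for a nested-dict tree.

    `tree` maps entry names to either bytes (file content) or another
    dict (a subdirectory).  This layout is an assumption made for the
    example, not how bzrlib stores trees.
    """
    h = hashlib.sha1()
    # Sort names so the result does not depend on dict ordering.
    for name in sorted(tree):
        entry = tree[name]
        if isinstance(entry, dict):
            kind, child = b"dir", fingerprint(entry)
        else:
            kind, child = b"file", hashlib.sha1(entry).hexdigest()
        h.update(b"%s\0%s\0%s\n"
                 % (kind, name.encode("utf-8"), child.encode("ascii")))
    return h.hexdigest()
```

With this shape, two directories whose fingerprints match can be skipped
without descending into them, and a change anywhere below a directory changes
the fingerprint of every ancestor.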

This is intended to help with these use cases, by allowing them to quickly skip
over directories with no relevant changes, and to detect when a directory has

Most of this will be hidden behind the Tree interface. This should cover
``log -v``, ``diff``, ``status``, ``merge`` (and implicit merge during
push, pull, update)::

  tree.iter_changes(other_tree)
  tree.get_file_lines(file_id)   # and get_file, get_file_text

compare to all the trees. Commit currently needs to compare the working
tree to all the parent trees, which is needed to update the last_modified
field and would be unnecessary if we removed that field (for both files
and directories) and did not store per-file graphs.
This would potentially speed up commit after merge.

Verbose commit also displays the merged files, which does

~~~~~~~

Log is interested in two operations: finding the revisions that touched
anything inside a directory, and getting the differences between
consecutive revisions (possibly filtered to a directory)::

  find_touching_revisions(branch, file_id) # should be on Branch?

Hashes converge: if you modify and then modify back, you get the same
hash. This is a pro because you can detect that there were ultimately
no significant changes. It is also a con: you cannot use these hashes to
form a graph, because the graph would contain cycles.
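
The convergence described above can be shown in a few lines. This is an
illustrative sketch only; ``text_hash`` and the sample contents are made up
for the example.

```python
import hashlib


def text_hash(lines):
    """Hash a file's content, given as a list of byte strings."""
    return hashlib.sha1(b"".join(lines)).hexdigest()


original = [b"a\n", b"b\n"]
modified = [b"a\n", b"B\n"]
reverted = [b"a\n", b"b\n"]   # same content as `original`

history = [text_hash(v) for v in (original, modified, reverted)]

# The first and third versions hash identically, so the hash alone
# cannot tell them apart or order them.
converged = history[0] == history[2]

# A graph keyed only by hash would revisit a node it already contains.
has_cycle = len(set(history)) < len(history)
```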

* Are the values unique across the whole tree, or only when comparing
  different versions of the same object?


* Is it reasonable to assume hashes won't collide?

  The odds of SHA-1 hashes colliding "accidentally" are vanishingly small.

  It is possible that a `preimage attack`_ against SHA-1 may be discovered

It is desirable that we have a hash that covers all data, to guard
against bugs, transmission errors, or users trying to hand-hack files.
Since we need one hash of everything in the tree, perhaps we should also
use it for the fingerprint.

Testaments explicitly separate the form used for hashing/signing from

stored data which is not protected by the signature: this data is less
important, but corruption of it would still cause problems.
We have encountered some specific problems with disagreement between
inventories as to the last-change of files, which is currently unsigned.
These problems can be introduced by ghosts.

If we hash the representation, there is still a way to support old

* Is hashing substantially slower than other possible approaches?

  We already hash all the plain files. Except in unusual cases, the
  directory metadata will be substantially smaller: perhaps 200:1 as a
  rule of thumb.

  When building a bzr tree, we spend on the order of 100ms hashing all the
  source lines to validate them (about 13MB of source).
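
To get a rough feel for these numbers on a given machine, raw hashing speed
can be measured directly. This is a sketch only: ``time_sha1`` and
``make_payload`` are names invented here, the payload is synthetic filler
rather than real source, and the timing will vary with hardware and Python
version.

```python
import hashlib
import time


def make_payload(size):
    """Build roughly `size` bytes of filler text (never more than `size`)."""
    line = b"x = 1  # filler source line\n"
    return line * (size // len(line))


def time_sha1(data, chunk_size=64 * 1024):
    """Return (hex digest, elapsed seconds) for hashing `data` in chunks."""
    start = time.perf_counter()
    h = hashlib.sha1()
    for offset in range(0, len(data), chunk_size):
        h.update(data[offset:offset + chunk_size])
    return h.hexdigest(), time.perf_counter() - start


if __name__ == "__main__":
    # About 13MB, matching the figure quoted above.
    digest, elapsed = time_sha1(make_payload(13 * 1024 * 1024))
    print("sha1 %s in %.1f ms" % (digest, elapsed * 1000))
```

If directory metadata really is around 1/200th the size of file content,
hashing it would add a correspondingly small fraction to this time.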

* Can you calculate one from a directory in the working tree? Without a basis?

  This seems possible with either hashes or revision ids.

  Using last_changed means that calculating the fingerprint from a working
  tree necessarily requires reading the inventory for the basis

This does rule out for example using ``last_modified=None`` or
``='current:'`` to mean "changed in the working tree." Even if this is
not supported there seems some risk that we would get the same
fingerprint for trees that are actually different.

We could assign a
hypothetical revision id to the tree for uncommitted files. In that
case there is some risk that the not-yet-committed id would become
visible or committed.

* Can we use an "approximate basis"?

  When using radix trees, you may need context beyond the specific
  directory being compared.


* Can you get the fingerprint of parent directories with only selected file ids
  taken from the working tree?

  With hashes, we'd want to carry through the unselected files and
  directories from the values they had in the parent revision.
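
The carry-through idea for a partial commit might look like the following.
This is a sketch under assumptions: a directory is modelled as a flat dict of
child name to child hash, and both function names are hypothetical.

```python
import hashlib


def directory_fingerprint(child_hashes):
    """Hash a directory given a dict of child name -> child hash."""
    h = hashlib.sha1()
    for name in sorted(child_hashes):
        h.update(("%s\0%s\n" % (name, child_hashes[name])).encode("utf-8"))
    return h.hexdigest()


def partial_commit_fingerprint(parent_hashes, working_hashes, selected):
    """Fingerprint a directory where only `selected` names are taken
    from the working tree; everything else keeps its parent value."""
    merged = dict(parent_hashes)
    for name in selected:
        if name in working_hashes:
            merged[name] = working_hashes[name]
        else:
            # A selected entry that was deleted in the working tree.
            merged.pop(name, None)
    return directory_fingerprint(merged)
```

Unselected children contribute exactly the hash they had in the parent
revision, so their subtrees never need to be read from the working tree.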

* Are unbalanced trees a significant problem? Trees can be unbalanced by having
  many directories (deep or wide), or many files per directory.

  For small trees like bzr, 744 of 874 are in the bzrlib subtree. In
  general, larger trees are more balanced, because humans, editors and
  other tools have trouble managing very unbalanced trees. But there are

  entries in one directory.

* Should we use a radix tree approach where fingerprints are calculated on a synthetic
  tree that is by definition balanced, even when the actual tree is unbalanced?


This has some consequences for how we can upgrade it in future: all
the changed directories need to be rewritten up to the revision level.

1. If we address directories by hash we need hash-addressed
   storage.

1. If we address directories by hash then for consistency we'd probably
   (not necessarily) want to address file texts by hash.

1. The per-file graph can't be indexed by hash because they can converge, so we

If the version of a file or directory is identified by a hash, we can't
use that to point into a per-file graph. We can have a graph indexed by
``(file_id, hash, revision_id)``. The last-modified could be stored as
part of this graph.
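
A graph keyed this way stays acyclic even when content converges, because the
revision id distinguishes otherwise identical hashes. The following is a
minimal in-memory sketch; ``NodeKey``, ``PerFileGraph`` and its methods are
hypothetical names, not existing bzrlib classes.

```python
from collections import namedtuple

# The three-part key suggested above.
NodeKey = namedtuple("NodeKey", "file_id hash revision_id")


class PerFileGraph:
    """Map each (file_id, hash, revision_id) key to its parent keys."""

    def __init__(self):
        self._parents = {}

    def add(self, key, parent_keys=()):
        self._parents[key] = tuple(parent_keys)

    def parents(self, key):
        return self._parents[key]

    def revisions_for(self, file_id, text_hash):
        """Revisions that introduced this hash for this file.

        There can be several if the content was changed and changed back.
        """
        return sorted(k.revision_id for k in self._parents
                      if k.file_id == file_id and k.hash == text_hash)
```

For example, a file modified in ``rev-2`` and reverted in ``rev-3`` yields
two distinct keys with the same hash, one for ``rev-1`` and one for
``rev-3``, and the revision id in each key serves as its last-modified value.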

The graph would no longer be core data; it could be always present but
might be rebuilt. Treating it as non-core data may make some changes
like shallow branches easier?

-----------



vim: filetype=rst textwidth=78 expandtab spelllang=en spell