~bzr-pqm/bzr/bzr.dev

« back to all changes in this revision

Viewing changes to doc/developers/directory-fingerprints.txt

Committer: Aaron Bentley
Date: 2007-08-20 13:07:12 UTC
mfrom: (2732 +trunk)
mto: This revision was merged to the branch mainline in revision 2733.
Revision ID: abentley@panoramicfeedback.com-20070820130712-buopmg528zcgwyxc

Merge bzr.dev

files added:
bzrlib/tests/blackbox/test_bundle_info.py

bzrlib/tests/repository_implementations/test_fetch.py

doc/developers/directory-fingerprints.txt

doc/developers/last-modified.txt

doc/developers/revision-properties.txt

files removed:
doc/developers/scratch.txt

files modified:
INSTALL

Makefile

NEWS

bzrlib/__init__.py

bzrlib/branch.py

bzrlib/builtins.py

bzrlib/bundle/bundle_data.py

bzrlib/bundle/commands.py

bzrlib/bundle/serializer/v4.py

bzrlib/bzrdir.py

bzrlib/commands.py

bzrlib/commit.py

bzrlib/debug.py

bzrlib/delta.py

bzrlib/diff.py

bzrlib/errors.py

bzrlib/fetch.py

bzrlib/info.py

bzrlib/knit.py

bzrlib/lockable_files.py

bzrlib/log.py

bzrlib/merge.py

bzrlib/merge_directive.py

bzrlib/multiparent.py

bzrlib/mutabletree.py

bzrlib/osutils.py

bzrlib/patches.py

bzrlib/remote.py

bzrlib/repofmt/weaverepo.py

bzrlib/repository.py

bzrlib/revision.py

bzrlib/status.py

bzrlib/store/__init__.py

bzrlib/store/revision/knit.py

bzrlib/store/versioned/__init__.py

bzrlib/tests/__init__.py

bzrlib/tests/blackbox/__init__.py

bzrlib/tests/blackbox/test_added.py

bzrlib/tests/blackbox/test_aliases.py

bzrlib/tests/blackbox/test_ancestry.py

bzrlib/tests/blackbox/test_bound_branches.py

bzrlib/tests/blackbox/test_branch.py

bzrlib/tests/blackbox/test_cat.py

bzrlib/tests/blackbox/test_checkout.py

bzrlib/tests/blackbox/test_commit.py

bzrlib/tests/blackbox/test_debug.py

bzrlib/tests/blackbox/test_diff.py

bzrlib/tests/blackbox/test_find_merge_base.py

bzrlib/tests/blackbox/test_info.py

bzrlib/tests/blackbox/test_locale.py

bzrlib/tests/blackbox/test_pull.py

bzrlib/tests/blackbox/test_selftest.py

bzrlib/tests/blackbox/test_send.py

bzrlib/tests/blackbox/test_status.py

bzrlib/tests/blackbox/test_update.py

bzrlib/tests/branch_implementations/test_branch.py

bzrlib/tests/branch_implementations/test_hooks.py

bzrlib/tests/branch_implementations/test_revision_history.py

bzrlib/tests/repository_implementations/__init__.py

bzrlib/tests/repository_implementations/test_repository.py

bzrlib/tests/test_branch.py

bzrlib/tests/test_bundle.py

bzrlib/tests/test_bzrdir.py

bzrlib/tests/test_commit.py

bzrlib/tests/test_escaped_store.py

bzrlib/tests/test_fetch.py

bzrlib/tests/test_lockdir.py

bzrlib/tests/test_log.py

bzrlib/tests/test_multiparent.py

bzrlib/tests/test_patches.py

bzrlib/tests/test_revision.py

bzrlib/tests/test_selftest.py

bzrlib/tests/test_sftp_transport.py

bzrlib/tests/test_source.py

bzrlib/tests/test_store.py

bzrlib/tests/test_trace.py

bzrlib/tests/test_transport.py

bzrlib/tests/test_transport_implementations.py

bzrlib/tests/test_versionedfile.py

bzrlib/tests/test_weave.py

bzrlib/tests/workingtree_implementations/test_is_ignored.py

bzrlib/trace.py

bzrlib/transport/__init__.py

bzrlib/transport/decorator.py

bzrlib/transport/ftp.py

bzrlib/transport/http/_pycurl.py

bzrlib/transport/http/_urllib.py

bzrlib/transport/http/wsgi.py

bzrlib/transport/local.py

bzrlib/transport/sftp.py

bzrlib/transport/ssh.py

bzrlib/ui/__init__.py

bzrlib/ui/text.py

bzrlib/versionedfile.py

bzrlib/weave.py

bzrlib/workingtree.py

doc/developers/performance-contributing.txt

doc/developers/performance-roadmap.txt

Show diffs side-by-side

added added

removed removed

doc/developers/directory-fingerprints.txt

Directory fingerprints

======================

.. contents:: :local:

Introduction

------------

The basic idea is that for a directory in a tree (committed or otherwise), we

will have a single scalar value. If these values are the same, the contents of

the subtree under that directory are necessarily the same.

This is intended to help with these use cases, by allowing them to quickly skip

over directories with no relevant changes, and to detect when a directory has

changed:

* diff/status (both local trees and historical trees)

* merge

* log -v

* log on a directory

* commit

Use-case oriented APIs

----------------------

Most of this will be hidden behind the Tree interface. This should cover

``log -v``, ``diff``, ``status``, ``merge`` (and implicit merge during

push, pull, update)::

tree.iter_changes(other_tree)

tree.get_file_lines(file_id) # and get_file, get_file_text

``commit``

~~~~~~~~~~

Commit is similar to ``iter_changes``, but different because it needs to

compare to all the trees. Commit currently needs to compare the working

tree to all the parent trees, which is needed to update the last_modified

field and would be unnecessary if we removed that field (for both files

and directories) and did not store per-file graphs.

This would potentially speed up commit after merge.

Verbose commit also displays the merged files, which does

require looking at all parents of files that aren't identical

to the left-hand parent.

``log``

~~~~~~~

Log is interested in two operations: finding the revisions that touched

anything inside a directory, and getting the differences between

consecutive revisions (possibly filtered to a directory)::

find_touching_revisions(branch, file_id) # should be on Branch?

Log shows the revisions that merged a change. At the moment that is not

included in the per-file graph, and it would also not be visible if the

directories were hashed.

Open questions

--------------

* Is this a good idea at all?

If changing a file changes all its parent directories up to the root it

will cause more churn on commit. (We currently update the all-in-one

inventory, but only have to update one line of it.)

Every time a child changes, we'll get a new node in the per-directory

graph. This is generally useful: it allows bzr log to do the default

mode easily, which is to show all changes under that directory. The

less common operation, ``log --no-recursive`` is still possible by

looking only at when the directory itself was renamed, added or removed.

(That is what the directory graph describes in bzr 0.18 and it is rarely

useful.)

* Should these be hashes or revision ids or something else?

Pros of using hashes: hashes are easy to generate by a foreign branch

plugin (e.g. bzr-svn). They don't need to get recursive last-changed

from the foreign branch, or to walk back through history. They just

need the relevant directory state, which any system we support can

answer.

Hashes converge: if you modify and then modify back, you get the same

hash. This is a pro because you can detect that there were ultimately

no significant changes. And also a con: you cannot use these hashes to form a graph

because they get cycles.

* Are the values unique across the whole tree, or only when comparing

different versions of the same object?

If we use last-changed revisions, then they will be very not unique

100

across the whole tree. To look up the contents, you must pass a

101

composite key like ``(file_id, last_changed)``.

102

103

If we use hashes they will be same only when the two contain the same

104

contents. Since we say that file ids must be unique, this

105

means they will match if and only if they are empty. We might relax

106

that in future when we introduce path tokens.

107

108

109

* Is it reasonable to assume hashes won't collide?

110

111

The odds of SHA-1 hashes colliding "accidentally" are vanishingly small.

112

113

It is possible that a `preimage attack`_ against SHA-1 may be discovered

114

in the future. Since we're not proposing in this document to make

115

revision-ids be SHA-1, if SHA-1 was obsoleted then we could rewrite the

116

contents of revisions but would not need to rename revisions. So the

117

impact of such a migration should just be a format upgrade, and a

118

recommendation (but not requirement) to re-sign revisions.

119

120

.. _`preimage attack`: http://tools.ietf.org/html/rfc4270

121

122

123

* If we use hashes, should it be the hash of the

124

representation stored for a directory?

125

126

In other words, should we pun the representation of the directory with

127

the form used for validation.

128

129

If there's some data stored that's not in the hash it's problematic.

130

The hash in no longer (effectively) uniquely identifies the

131

representation.

132

133

It is desirable that we have a hash that covers all data, to guard

134

against bugs, transmission errors, or users trying to hand-hack files.

135

Since we need one hash of everything in the tree, perhaps we should also

136

use it for the fingerprint.

137

138

Testaments explicitly separate the form used for hashing/signing from

139

the form used for storage. This allows us to change the storage form

140

without breaking existing GPG signatures. The downside is that we need

141

to do work O(tree) to make a testament, and this slows down signing,

142

verifying and generating bundles. It also means that there is some

143

stored data which is not protected by the signature: this data is less

144

important, but corruption of it would still cause problems.

145

We have encountered some specific problems with disagreement between

146

inventories as to the last-change of files, which is currently unsigned.

147

These problems can be introduced by ghosts.

148

149

If we hash the representation, there is still a way to support old

150

signatures, assuming that we never discard irreplaceable information.

151

The signature should say what format it applies to (similar to

152

testaments), and we could transform in memory the tree back to that

153

format.

154

155

156

* Is hashing substantially slower than other possible approaches?

157

158

We already hash all the plain files. Except in unusual cases, the

159

directory metadata will be substantially smaller: perhaps 200:1 as a

160

rule of thumb.

161

162

When building a bzr tree, we spend on the order of 100ms hashing all the

163

source lines to validate them (about 13MB of source).

164

165

166

* Can you calculate one from a directory in the working tree? Without a basis?

167

168

This seems possible with either hashes or revision ids.

169

170

Using last_changed means that calculating the fingerprint from a working

171

tree necessarily requires reading the inventory for the basis

172

revision, so that we know when unchanged files were last changed. With

173

hashes we could calculate them using the working tree information alone.

174

It's true that we will often then compare that information to the basis

175

tree (e.g. for simple ``bzr diff``), but we may only have to compare at

176

the top level, and sometimes we're comparing to a

177

different tree. This also touches on whether we should store

178

``last_modified`` for files, rather than directories.

179

180

For revision ids we need to assign a value to use for uncommitted

181

changes, but see below about the problems of this.

182

183

In some ways it would be elegant to say (hypothetical)::

184

185

wt.get_root().get_last_modified() == branch.get_last_revision()

186

187

to know that nothing was changed; but this may not be much better than

188

189

190

wt.get_root().get_hash() ==

191

branch.get_basis().get_root().get_hash()

192

193

194

* Can you use this to compare (directories from) two working trees?

195

196

If you can generate it from a working tree, you should be able to use it

197

to compare them.

198

199

This does rule out for example using ``last_modified=None`` or

200

``='current:'`` to mean "changed in the working tree." Even if this is

201

not supported there seems some risk that we would get the same

202

fingerprint for trees that are actually different.

203

204

We could assign a

205

hypothetical revision id to the tree for uncommitted files. In that

206

case there is some risk that the not-yet-committed id would become

207

visible or committed.

208

209

210

* Can we use an "approximate basis"?

211

212

When using radix trees, you may need context beyond the specific

213

directory being compared.

214

215

216

* Can you get the fingerprint of parents directories with only selected file ids

217

taken from the working tree?

218

219

With hashes, we'd want to carry through the unselected files and

220

directories from the values they had in the parent revision.

221

222

223

* Are unbalanced trees a significant problem? Trees can be unbalanced by having

224

many directories (deep or wide), or many files per directory.

225

226

For small trees like bzr, 744 of 874 are in the bzrlib subtree. In

227

general, larger trees are more balanced, because humans, editors and

228

other tools have trouble managing very unbalanced trees. But there are

229

exceptions: Aaron has one tree with 20,000 generated but versioned

230

entries in one directory.

231

232

233

* Should we use a radix tree approach where fingerprints are calculated on a synthetic

234

tree that is by definition balanced, even when the actual tree is unbalanced?

235

236

237

* What are the specific advantages of using recursive-last-modified rather than

238

hashes?

239

240

It may be a smaller step change.

241

242

It's a bidirectional link: given a directory text identifier ``(file_id,

243

last_changed)`` you can look up the revision that last changed it.

244

245

From the preceding, even without the per-file graph you can skip through

246

the history of this file: go to the last-changed revision, look at all

247

its parents and repeat.

248

249

250

* Is it a smaller change to use recursive-last-modified on directories?

251

252

Probably yes:

253

254

1. We can just put it into the current inventory format without changing

255

anything else.

256

257

By contrast to use a hash we'd have to either split up the inventory

258

as stored, or change the sort order for the inventory, or synthesize

259

per-directory inventories in memory for hashing.

260

261

However, xml is somewhat redundant and slow to parse/generate; and

262

reading the whole thing before comparing some sections is only a

263

partial win. It may be a smaller change but we'd be preserving

264

things we want to change.

265

266

1. At present we rarely hash storage representations, only file texts.

267

This is not a large technical change, but it is a conceptual change.

268

This has some consequences for how we can upgrade it in future: all

269

the changed directories need to be rewritten up to the revision level.

270

271

1. If we address directories by hash we need hash-addressed

272

storage.

273

274

1. If we address directories by hash then for consistency we'd probably

275

(not necessarily) want to address file texts by hash.

276

277

1. The per-file graph can't be indexed by hash because they can converge, so we

278

need to either rework or dispose of the per-file graph.

279

280

281

* Any possibilities for avoiding hashes recurring?

282

283

1. Hash along with an identification of the parents (as in hg). Then you

284

can't convert a tree without all its basis trees, and there is still

285

convergence when the same merge is done by two people, and you can't

286

create it directly from the working tree.

287

288

1. Include last-modified revision id in the hash.

289

290

1. Index by ``(revision, hash)`` or vice versa.

291

292

1. Store a per-file graph and allow it to have repeated keys. The graph

293

would tell you about all the parent texts ever seen; you would need

294

to use revision graph information to resolve ambiguities.

295

296

297

* What are the specific disadvantages of using recursive-last-modified rather than

298

hashes?

299

300

To calculate the last-changed revision, given the last-changed

301

information of the contained files, you need to look at the revision

302

graph. They're not enough because you need to know the relations

303

between the mentioned revisions. In a merge it's possible the correct

304

directory last-modified will not be the same as that of any of the files

305

within it. This can also happen when a file is removed (deleted or

306

renamed) from a directory.

307

308

309

* Should we split up storage of the inventories?

310

311

This is not quite the same but connected.

312

313

314

* How does this relate to per-file/per-directory hashes?

315

316

If the version of a file or directory is identified by a hash, we can't

317

use that to point into a per-file graph. We can have a graph indexed by

318

``(file_id, hash, revision_id)``. The last-modified could be stored as

319

part of this graph.

320

321

The graph would no longer be core data; it could be always present but

322

might be rebuilt. Treating it as non-core data may make some changes

323

like shallow branches easier?

324

325

326

* How do you ask a tree for a given text?

327

328

Right now we say ::

329

330

revision_tree.get_file_lines(file_id)

331

332

so the choice of storage is hidden behind the revision tree: it could be

333

accessed by ``(file_id, last_changed)`` or by hash or otherwise.

334

335

At the moment the Repository exports a friend api to RevisionTree,

336

currently usually talking in VersionedFiles.

337

338

We probably wouldn't want Repository to expose a ``get_text_for_sha1()``

339

interface because that would be very difficult to support on old

340

repositories or on foreign branches.

341

342

343

Conclusions

344

-----------

345

346

347

Design changes

348

--------------

349

350

351

352

353

API changes

354

-----------

355

356

357

358

vim: filetype=rst textwidth=78 expandtab spelllang=en spell

359

Older »