~bzr-pqm/bzr/bzr.dev

Viewing changes to doc/developers/commit.txt

Committer: Canonical.com Patch Queue Manager
Date: 2007-04-17 00:59:30 UTC
mfrom: (1551.15.4 Aaron's mergeable stuff)
Revision ID: pqm@pqm.ubuntu.com-20070417005930-rofskshyjsfzrahh

Fix ftp transport with servers that don't support atomic rename

files added:
build-api

bzrlib/tests/test_doc_generate.py

files removed:
bzrlib/_knit_load_data_c.pyx

bzrlib/_knit_load_data_py.py

bzrlib/api.py

bzrlib/benchmarks/bench_knit.py

bzrlib/branchbuilder.py

bzrlib/breakin.py

bzrlib/bugtracker.py

bzrlib/counted_lock.py

bzrlib/graph.py

bzrlib/pack.py

bzrlib/remote.py

bzrlib/smart/branch.py

bzrlib/smart/bzrdir.py

bzrlib/smart/repository.py

bzrlib/smtp_connection.py

bzrlib/tests/blackbox/test_breakin.py

bzrlib/tests/blackbox/test_lsprof.py

bzrlib/tests/blackbox/test_pack.py

bzrlib/tests/branch_implementations/test_get_revision_id_to_revno_map.py

bzrlib/tests/branch_implementations/test_revision_id_to_revno.py

bzrlib/tests/branch_implementations/test_sprout.py

bzrlib/tests/repository_implementations/test_pack.py

bzrlib/tests/test_branchbuilder.py

bzrlib/tests/test_bugtracker.py

bzrlib/tests/test_counted_lock.py

bzrlib/tests/test_graph.py

bzrlib/tests/test_help.py

bzrlib/tests/test_hooks.py

bzrlib/tests/test_info.py

bzrlib/tests/test_lsprof.py

bzrlib/tests/test_pack.py

bzrlib/tests/test_remote.py

bzrlib/tests/test_smart.py

bzrlib/tests/test_smtp_connection.py

bzrlib/tests/workingtree_implementations/test_remove.py

bzrlib/tests/workingtree_implementations/test_uncommit.py

bzrlib/transport/brokenrename.py

bzrlib/util/tests

bzrlib/util/tests/__init__.py

bzrlib/util/tests/test_bencode.py

doc/bug_trackers.txt

doc/developers

doc/developers/add.txt

doc/developers/annotate.txt

doc/developers/api-versioning.txt

doc/developers/bundle-creation.txt

doc/developers/bundles.txt

doc/developers/commit.txt

doc/developers/container-format.txt

doc/developers/diff.txt

doc/developers/dirstate.txt

doc/developers/gc.txt

doc/developers/incremental-push-pull.txt

doc/developers/index.txt

doc/developers/initial-push-pull.txt

doc/developers/merge-scaling.txt

doc/developers/performance-contributing.txt

doc/developers/performance-roadmap-rationale.txt

doc/developers/performance-roadmap.txt

doc/developers/performance-use-case-analysis.txt

doc/developers/performance.dot

doc/developers/planned-change-integration.txt

doc/developers/planned-performance-changes.txt

doc/developers/profiling.txt

doc/developers/revert.txt

doc/developers/scratch.txt

doc/developers/status.txt

doc/developers/uncommit.txt

doc/shared_repository_layouts.txt

man1

tools/bzr_epydoc

tools/bzr_epydoc_uid.py

files renamed:
doc/developers/HACKING => HACKING

bzrlib/deprecated_graph.py => bzrlib/graph.py

bzrlib/tests/test_deprecated_graph.py => bzrlib/tests/test_graph.py

files modified:
.bzrignore

Makefile

NEWS

README

bzrlib/__init__.py

bzrlib/add.py

bzrlib/annotate.py

bzrlib/benchmarks/__init__.py

bzrlib/benchmarks/bench_add.py

bzrlib/benchmarks/bench_bench.py

bzrlib/benchmarks/bench_cache_utf8.py

bzrlib/benchmarks/bench_checkout.py

bzrlib/benchmarks/bench_commit.py

bzrlib/benchmarks/bench_inventory.py

bzrlib/benchmarks/bench_log.py

bzrlib/benchmarks/bench_osutils.py

bzrlib/benchmarks/bench_rocks.py

bzrlib/benchmarks/bench_sftp.py

bzrlib/benchmarks/bench_startup.py

bzrlib/benchmarks/bench_status.py

bzrlib/benchmarks/bench_transform.py

bzrlib/benchmarks/bench_workingtree.py

bzrlib/benchmarks/bench_xml.py

bzrlib/benchmarks/tree_creator/kernel_like.py

bzrlib/branch.py

bzrlib/builtins.py

bzrlib/bundle/bundle_data.py

bzrlib/bundle/commands.py

bzrlib/bundle/serializer/__init__.py

bzrlib/bundle/serializer/v08.py

bzrlib/bzrdir.py

bzrlib/cmd_version_info.py

bzrlib/commands.py

bzrlib/commit.py

bzrlib/config.py

bzrlib/conflicts.py

bzrlib/debug.py

bzrlib/delta.py

bzrlib/dirstate.py

bzrlib/errors.py

bzrlib/fetch.py

bzrlib/generate_ids.py

bzrlib/help.py

bzrlib/help_topics.py

bzrlib/hooks.py

bzrlib/info.py

bzrlib/inventory.py

bzrlib/knit.py

bzrlib/lazy_import.py

bzrlib/lock.py

bzrlib/lockdir.py

bzrlib/log.py

bzrlib/lsprof.py

bzrlib/memorytree.py

bzrlib/merge.py

bzrlib/merge_directive.py

bzrlib/missing.py

bzrlib/msgeditor.py

bzrlib/mutabletree.py

bzrlib/option.py

bzrlib/osutils.py

bzrlib/plugin.py

bzrlib/plugins/launchpad/__init__.py

bzrlib/plugins/launchpad/test_register.py

bzrlib/progress.py

bzrlib/repofmt/knitrepo.py

bzrlib/repository.py

bzrlib/revision.py

bzrlib/revisionspec.py

bzrlib/sign_my_commits.py

bzrlib/smart/__init__.py

bzrlib/smart/medium.py

bzrlib/smart/protocol.py

bzrlib/smart/request.py

bzrlib/smart/server.py

bzrlib/smart/vfs.py

bzrlib/status.py

bzrlib/store/revision/__init__.py

bzrlib/store/revision/knit.py

bzrlib/store/revision/text.py

bzrlib/strace.py

bzrlib/symbol_versioning.py

bzrlib/tag.py

bzrlib/tests/HTTPTestUtil.py

bzrlib/tests/HttpServer.py

bzrlib/tests/TestUtil.py

bzrlib/tests/__init__.py

bzrlib/tests/blackbox/__init__.py

bzrlib/tests/blackbox/test_add.py

bzrlib/tests/blackbox/test_added.py

bzrlib/tests/blackbox/test_aliases.py

bzrlib/tests/blackbox/test_ancestry.py

bzrlib/tests/blackbox/test_annotate.py

bzrlib/tests/blackbox/test_bound_branches.py

bzrlib/tests/blackbox/test_branch.py

bzrlib/tests/blackbox/test_break_lock.py

bzrlib/tests/blackbox/test_bundle.py

bzrlib/tests/blackbox/test_cat.py

bzrlib/tests/blackbox/test_cat_revision.py

bzrlib/tests/blackbox/test_checkout.py

bzrlib/tests/blackbox/test_command_encoding.py

bzrlib/tests/blackbox/test_commit.py

bzrlib/tests/blackbox/test_conflicts.py

bzrlib/tests/blackbox/test_debug.py

bzrlib/tests/blackbox/test_diff.py

bzrlib/tests/blackbox/test_exceptions.py

bzrlib/tests/blackbox/test_export.py

bzrlib/tests/blackbox/test_find_merge_base.py

bzrlib/tests/blackbox/test_help.py

bzrlib/tests/blackbox/test_ignore.py

bzrlib/tests/blackbox/test_info.py

bzrlib/tests/blackbox/test_init.py

bzrlib/tests/blackbox/test_inventory.py

bzrlib/tests/blackbox/test_join.py

bzrlib/tests/blackbox/test_log.py

bzrlib/tests/blackbox/test_logformats.py

bzrlib/tests/blackbox/test_ls.py

bzrlib/tests/blackbox/test_merge.py

bzrlib/tests/blackbox/test_merge_directive.py

bzrlib/tests/blackbox/test_missing.py

bzrlib/tests/blackbox/test_mv.py

bzrlib/tests/blackbox/test_nick.py

bzrlib/tests/blackbox/test_non_ascii.py

bzrlib/tests/blackbox/test_outside_wt.py

bzrlib/tests/blackbox/test_pull.py

bzrlib/tests/blackbox/test_push.py

bzrlib/tests/blackbox/test_re_sign.py

bzrlib/tests/blackbox/test_reconcile.py

bzrlib/tests/blackbox/test_remerge.py

bzrlib/tests/blackbox/test_remove.py

bzrlib/tests/blackbox/test_remove_tree.py

bzrlib/tests/blackbox/test_revert.py

bzrlib/tests/blackbox/test_revision_history.py

bzrlib/tests/blackbox/test_revision_info.py

bzrlib/tests/blackbox/test_revno.py

bzrlib/tests/blackbox/test_selftest.py

bzrlib/tests/blackbox/test_serve.py

bzrlib/tests/blackbox/test_shared_repository.py

bzrlib/tests/blackbox/test_sign_my_commits.py

bzrlib/tests/blackbox/test_split.py

bzrlib/tests/blackbox/test_status.py

bzrlib/tests/blackbox/test_tags.py

bzrlib/tests/blackbox/test_testament.py

bzrlib/tests/blackbox/test_too_much.py

bzrlib/tests/blackbox/test_uncommit.py

bzrlib/tests/blackbox/test_update.py

bzrlib/tests/blackbox/test_upgrade.py

bzrlib/tests/blackbox/test_version.py

bzrlib/tests/blackbox/test_version_info.py

bzrlib/tests/blackbox/test_versioning.py

bzrlib/tests/blackbox/test_whoami.py

bzrlib/tests/branch_implementations/__init__.py

bzrlib/tests/branch_implementations/test_bound_sftp.py

bzrlib/tests/branch_implementations/test_branch.py

bzrlib/tests/branch_implementations/test_create_checkout.py

bzrlib/tests/branch_implementations/test_locking.py

bzrlib/tests/branch_implementations/test_parent.py

bzrlib/tests/branch_implementations/test_permissions.py

bzrlib/tests/branch_implementations/test_pull.py

bzrlib/tests/branch_implementations/test_push.py

bzrlib/tests/branch_implementations/test_tags.py

bzrlib/tests/branch_implementations/test_uncommit.py

bzrlib/tests/branch_implementations/test_update.py

bzrlib/tests/bzrdir_implementations/__init__.py

bzrlib/tests/bzrdir_implementations/test_bzrdir.py

bzrlib/tests/interrepository_implementations/__init__.py

bzrlib/tests/intertree_implementations/__init__.py

bzrlib/tests/intertree_implementations/test_compare.py

bzrlib/tests/interversionedfile_implementations/__init__.py

bzrlib/tests/repository_implementations/__init__.py

bzrlib/tests/repository_implementations/test_repository.py

bzrlib/tests/revisionstore_implementations/__init__.py

bzrlib/tests/revisionstore_implementations/test_all.py

bzrlib/tests/test_ancestry.py

bzrlib/tests/test_annotate.py

bzrlib/tests/test_api.py

bzrlib/tests/test_bad_files.py

bzrlib/tests/test_branch.py

bzrlib/tests/test_bundle.py

bzrlib/tests/test_commands.py

bzrlib/tests/test_commit.py

bzrlib/tests/test_config.py

bzrlib/tests/test_dirstate.py

bzrlib/tests/test_errors.py

bzrlib/tests/test_http.py

bzrlib/tests/test_knit.py

bzrlib/tests/test_lazy_import.py

bzrlib/tests/test_lockable_files.py

bzrlib/tests/test_lockdir.py

bzrlib/tests/test_log.py

bzrlib/tests/test_merge.py

bzrlib/tests/test_merge_core.py

bzrlib/tests/test_merge_directive.py

bzrlib/tests/test_missing.py

bzrlib/tests/test_msgeditor.py

bzrlib/tests/test_options.py

bzrlib/tests/test_osutils.py

bzrlib/tests/test_plugins.py

bzrlib/tests/test_progress.py

bzrlib/tests/test_read_bundle.py

bzrlib/tests/test_repository.py

bzrlib/tests/test_revert.py

bzrlib/tests/test_revision.py

bzrlib/tests/test_selftest.py

bzrlib/tests/test_sftp_transport.py

bzrlib/tests/test_smart_add.py

bzrlib/tests/test_smart_transport.py

bzrlib/tests/test_source.py

bzrlib/tests/test_strace.py

bzrlib/tests/test_timestamp.py

bzrlib/tests/test_transform.py

bzrlib/tests/test_transport.py

bzrlib/tests/test_transport_implementations.py

bzrlib/tests/test_treebuilder.py

bzrlib/tests/test_tsort.py

bzrlib/tests/test_ui.py

bzrlib/tests/test_urlutils.py

bzrlib/tests/test_versionedfile.py

bzrlib/tests/test_workingtree_4.py

bzrlib/tests/test_wsgi.py

bzrlib/tests/tree_implementations/__init__.py

bzrlib/tests/workingtree_implementations/__init__.py

bzrlib/tests/workingtree_implementations/test_commit.py

bzrlib/tests/workingtree_implementations/test_merge_from_branch.py

bzrlib/tests/workingtree_implementations/test_move.py

bzrlib/tests/workingtree_implementations/test_parents.py

bzrlib/tests/workingtree_implementations/test_smart_add.py

bzrlib/tests/workingtree_implementations/test_workingtree.py

bzrlib/timestamp.py

bzrlib/trace.py

bzrlib/transform.py

bzrlib/transport/__init__.py

bzrlib/transport/ftp.py

bzrlib/transport/http/__init__.py

bzrlib/transport/http/_pycurl.py

bzrlib/transport/http/_urllib.py

bzrlib/transport/http/_urllib2_wrappers.py

bzrlib/transport/http/response.py

bzrlib/transport/http/wsgi.py

bzrlib/transport/local.py

bzrlib/transport/memory.py

bzrlib/transport/readonly.py

bzrlib/transport/remote.py

bzrlib/transport/sftp.py

bzrlib/tsort.py

bzrlib/ui/__init__.py

bzrlib/uncommit.py

bzrlib/urlutils.py

bzrlib/util/bencode.py

bzrlib/version.py

bzrlib/versionedfile.py

bzrlib/weave.py

bzrlib/weave_commands.py

bzrlib/win32utils.py

bzrlib/workingtree.py

bzrlib/workingtree_4.py

bzrlib/xml5.py

contrib/bash/bzr.simple

doc/README.1st

doc/centralized_workflow.txt

doc/configuration.txt

doc/default.css

doc/http_smart_server.txt

doc/index.txt

doc/plugins.txt

doc/server.txt

doc/tutorial.txt

setup.py *

tools/doc_generate/autodoc_man.py

tools/doc_generate/autodoc_rstx.py

tools/win32/bzr.iss.cog

doc/developers/commit.txt

Commit Performance Notes
========================

.. contents:: :local:

Changes to commit
-----------------

We want to improve the commit code in two phases.

Phase one is to have a better separation from the format-specific logic,
the user interface, and the general process of committing.

Phase two is to have better interfaces by which a good workingtree format
can efficiently pass data to a good storage format. If we get phase one
right, it will be relatively easy and non-disruptive to bring this in.

Commit: The Minimum Work Required
---------------------------------

Here is a description of the minimum work that commit must do. We
want to make sure that our design doesn't cost too much more than this
minimum. I am trying to do this without making too many assumptions
about the underlying storage, but am assuming that the ui and basic
architecture (wt, branch, repo) stays about the same.

The basic purpose of commit is to:

1. create and store a new revision based on the contents of the working tree
2. make this the new basis revision for the working tree

We can do a selected commit of only some files or subtrees.

The best performance we could hope for is:

- stat each versioned selected working file once
- read from the workingtree and write into the repository any new file texts
- in general, do work proportional to the size of the shape (eg
  inventory) of the old and new selected trees, and to the total size of
  the modified files

In more detail:

1.0 - Store new file texts: if a versioned file contains a new text
there is no avoiding storing it. To determine which ones have changed
we must go over the workingtree and at least stat each file. If the
file is modified since it was last hashed, it must be read in.
Ideally we would read it only once, and either notice that it has not
changed, or store it at that point.

On the other hand we want new code to be able to handle files that are
larger than will fit in memory. We may then need to read each file up
to two times: once to determine if there is a new text and calculate
its hash, and again to store it.
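
The two-read approach for large files can be sketched roughly as follows.
This is only an illustration under stated assumptions: the "repository" is a
plain dict, and ``sha1_of_file``, ``store_if_new`` and the ``known_sha1``
parameter are hypothetical names, not bzrlib API.

```python
import hashlib
import os
import tempfile

CHUNK_SIZE = 64 * 1024  # read in bounded chunks so memory use stays flat


def sha1_of_file(path, chunk_size=CHUNK_SIZE):
    """First read: hash the file without holding it all in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_if_new(path, known_sha1, store):
    """Hash the file; only if the text is new, read it again to store it.

    `store` is a hypothetical stand-in for a repository: a dict mapping
    sha1 -> list of chunks. Returns (sha1, was_stored).
    """
    new_sha1 = sha1_of_file(path)
    if new_sha1 == known_sha1:
        return new_sha1, False  # unchanged: one read, nothing stored
    chunks = []
    with open(path, "rb") as f:  # second read: copy the text in
        for chunk in iter(lambda: f.read(CHUNK_SIZE), b""):
            chunks.append(chunk)
    store[new_sha1] = chunks
    return new_sha1, True


# Usage: "commit" the same file twice; the second time nothing is stored.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "hello.txt")
    with open(path, "wb") as f:
        f.write(b"hello\n")
    store = {}
    first = store_if_new(path, None, store)
    second = store_if_new(path, first[0], store)
```

A file that changes between the two reads would be stored under a stale hash
in this sketch; a real implementation would need to re-hash while storing.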

1.1 - Store a tree-shape description (ie inventory or similar.) This
describes the non-file objects, and provides a reference from the
Revision to the texts within it.

1.2 - Generate and store a new revision object.

1.3 - Do delta-compression on the stored objects. (git notably does
not do this at commit time, deferring this entirely until later.)
This requires finding the appropriate basis for each modified file: in
the current scheme we get the file id, last-revision from the
dirstate, look into the knit for that text, extract that text in
total, generate a delta, then store that into the knit. Most delta
operations are O(n**2) to O(n**3) in the size of the modified files.

1.4 - Cache annotation information for the changes: at the moment this
is done as part of the delta storage. There are some flaws in that
approach, such as that it is not updated when ghosts are filled, and
the annotation can't be re-run with new diff parameters.

2.1 - Make the new revision the basis for the tree, and clear the list
of parents. Strictly this is all that's logically necessary, unless
the working tree format requires more work.

The dirstate format does require more work, because it caches the
parent tree data for each file within the working tree data. In
practice this means that every commit rewrites the entire dirstate
file - we could try to avoid rewriting the whole file but this may be
difficult because variable-length data (the last-changed revision id)
is inserted into many rows.

The current dirstate design then seems to mean that any commit of a
single file imposes a cost proportional to the size of the current
workingtree. Maybe there are other benefits that outweigh this.
Alternatively if it was fast enough for operations to always look at
the original storage of the parent trees we could do without the
cache.

2.2 - Record the observed file hashes into the workingtree control
files. For the files that we just committed, we have the information
to store a valid hash cache entry: we know their stat information and
the sha1 of the file contents. This is not strictly necessary to the
speed of commit, but it will be useful later in avoiding reading those
files, and the only cost of doing it now is writing it out.

In fact there are some user interface niceties that complicate this:

3 - Before starting the commit proper, we prompt for a commit message
and in that commit message editor we show a list of the files that
will be committed: basically the output of bzr status. This is
basically the same as the list of changes we detect while storing the
commit, but because the user will sometimes change the tree after
opening the commit editor and expect the final state to be committed I
think we do have to look for changes twice. Since it takes the user a
while to enter a message this is not a big problem as long as both the
status summary and the commit are individually fast.

4 - As the commit proceeds (or after?) we show another status-like
summary. Just printing the names of modified files as they're stored
would be easy. Recording deleted and renamed files or directories is
more work: this can only be done by reference to the primary parent
tree and requires it be read in. Worse, reporting renames requires
searching by id across the entire parent tree. Possibly full
reporting should be a default-off verbose option because it does
require more work beyond the commit itself.

5 - Bazaar currently allows for missing files to be automatically
marked as removed at the time of commit. Leaving aside the ui
consequences, this means that we have to update the working inventory
to mark these files as removed. Since as discussed above we always
have to rewrite the dirstate on commit this is not substantial, though
we should make sure we do this in one pass, not two. I have
previously proposed to make this behaviour a non-default option.

We may need to run hooks or generate signatures during commit, but
they don't seem to have substantial performance consequences.

If one wanted to optimize solely for the speed of commit I think
hash-addressed file-per-text storage like in git (or bzr 0.1) is very
good. Remarkably, it does not need to read the inventory for the
previous revision. For each versioned file, we just need to get its
hash, either by reading the file or validating its stat data. If that
hash is not already in the repository, the file is just copied in and
compressed. As directories are traversed, they're turned into texts
and stored as well, and then finally the revision is too. This does
depend on later doing some delta compression of these texts.
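
A minimal sketch of hash-addressed file-per-text storage might look like the
following. It is an illustration only: it does not reproduce git's or bzr
0.1's actual object formats, the directory listing format is made up, and
``store_text``, ``load_text`` and ``commit_directory`` are hypothetical names.

```python
import hashlib
import os
import tempfile
import zlib


def store_text(repo_dir, data):
    """Store `data` in a file named by its sha1; return (sha1, was_new)."""
    sha1 = hashlib.sha1(data).hexdigest()
    target = os.path.join(repo_dir, sha1)
    if os.path.exists(target):
        return sha1, False  # already present: nothing to read or write
    tmp_path = target + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(zlib.compress(data))
    os.replace(tmp_path, target)  # publish only a completely written file
    return sha1, True


def load_text(repo_dir, sha1):
    """Read back and decompress the text stored under `sha1`."""
    with open(os.path.join(repo_dir, sha1), "rb") as f:
        return zlib.decompress(f.read())


def commit_directory(repo_dir, tree_dir):
    """Store every file below tree_dir, then a text describing the directory.

    Whole files are read into memory here purely for brevity.
    """
    entries = []
    for name in sorted(os.listdir(tree_dir)):
        path = os.path.join(tree_dir, name)
        if os.path.isdir(path):
            entries.append(("dir", name, commit_directory(repo_dir, path)))
        else:
            with open(path, "rb") as f:
                sha1, _ = store_text(repo_dir, f.read())
            entries.append(("file", name, sha1))
    listing = "".join("%s %s %s\n" % e for e in entries).encode("utf-8")
    sha1, _ = store_text(repo_dir, listing)
    return sha1


# Usage: committing an unchanged tree twice adds no new objects.
with tempfile.TemporaryDirectory() as tmp:
    repo = os.path.join(tmp, "repo")
    tree = os.path.join(tmp, "tree")
    os.makedirs(repo)
    os.makedirs(os.path.join(tree, "sub"))
    with open(os.path.join(tree, "a.txt"), "wb") as f:
        f.write(b"one\n")
    with open(os.path.join(tree, "sub", "b.txt"), "wb") as f:
        f.write(b"two\n")
    root1 = commit_directory(repo, tree)
    objects_after_first = len(os.listdir(repo))
    root2 = commit_directory(repo, tree)
    objects_after_second = len(os.listdir(repo))
    stored_a = load_text(repo, hashlib.sha1(b"one\n").hexdigest())
```

Note that nothing in this sketch consults a previous revision: whether a text
needs storing is decided entirely by whether its hash already exists.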

Variations on this are possible. Rather than writing a single file
into the repository for each text, we could fold them into a single
collation or pack file. That would create a smaller number of files
in the repository, but looking up a single text would require looking
into their indexes rather than just asking the filesystem.

Rather than using hashes we can use file-id/rev-id pairs as at
present, which has several consequences pro and con.

Commit vs Status
----------------

At first glance, commit simply stores the changes status reports. In fact,
this isn't technically correct: commit considers some files modified that
status does not. The notes below were put together by John Arbash Meinel
and Aaron Bentley in May 2007 to explain the finer details of commit to
Ian Clatworthy. They are recorded here as they are likely to be useful to
others new to Bazaar ...

1) **Unknown files have a different effect.** With --no-strict (the default)
   they have no effect and can be completely ignored. With --strict they
   should cause the commit to abort (so you don't forget to add the two new
   test files that you just created).

2) **Multiple parents.** 'status' always compares 2 trees, typically the
   last-committed tree and the current working tree. 'commit' will compare
   more trees if there has been a merge.

   a) The "last modified" property for files.
      A file may be marked as changed since the last commit, but that
      change may have come in from the merge, and the change could have
      happened several commits back. There are several edge cases to be
      handled here, like if both branches modified the same file, or if
      just one branch modified it.

   b) The trickier case is when a file appears unmodified since last
      commit, but it was modified versus one of the merged branches. I
      believe there are a few ways this can happen, like if a merged
      branch changes a file and then reverts it back (you still update
      the 'last modified' field).
      In general, if both sides disagree on the 'last-modified' flag,
      then you need to generate a new entry pointing 'last-modified' at
      this revision (because you are resolving the differences between
      the 2 parents).

3) **Automatic deletion of 'missing' files.** This is a point that we go
   back and forth on. I think the basic idea is that 'bzr commit' by
   default should abort if it finds a 'missing' file (in case that file was
   renamed rather than deleted), but 'bzr commit --auto' can add unknown
   files and remove missing files automatically.

4) **sha1 for newly added files.** status doesn't really need this: it should
   only care that the file is not present in base, but is present now. In
   some ways commit doesn't care either, since it needs to read and sha the
   file itself anyway.

5) **Nested trees.** status doesn't recurse into nested trees, but commit does.
   This is just because not all of the nested-trees work has been merged yet.

   A tree-reference is considered modified if the subtree has been
   committed since the last containing-tree commit. But commit needs to
   recurse into every subtree, to ensure that a commit is done if the
   subtree has changed since its last commit. _iter_changes only reports
   on tree-references that are modified, so it can't be used for doing
   subtree commits.
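
The 'last-modified' rule from point 2 above can be sketched as a small
function. This is a deliberately simplified illustration: ``Entry`` and
``last_modified_for_commit`` are hypothetical names, and the real rules for
the per-file graph are likely to involve more cases than the three handled
here.

```python
from collections import namedtuple

# Hypothetical, simplified per-file record: the sha1 of the text and the
# revision in which the file was last modified.
Entry = namedtuple("Entry", "sha1 last_modified")


def last_modified_for_commit(working_sha1, parent_entries, new_revision_id):
    """Pick the 'last modified' revision to record for one file.

    parent_entries holds the file's Entry from each parent tree; parents
    that do not contain the file are simply left out.
    """
    if not parent_entries:
        return new_revision_id  # newly added file
    candidates = {p.last_modified for p in parent_entries}
    if len(candidates) > 1:
        # The parents disagree, so this commit resolves the difference,
        # even if the text is identical in every parent (case 2b).
        return new_revision_id
    (only,) = candidates
    if any(p.sha1 != working_sha1 for p in parent_entries):
        return new_revision_id  # the text changed in the working tree
    return only  # unchanged: keep pointing at the older revision
```

For example, a file whose text is the same in both parents but whose
'last modified' differs (a change that was later reverted on one side) still
gets a new entry, which is exactly the case status would not report.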

Avoiding Work: Smarter Change Detection
---------------------------------------

Commit currently walks through every file building an inventory. Here is
Aaron's brain dump on a better way ...

_iter_changes won't tell us about tree references that haven't changed,
even if those subtrees have changed. (Unless we ask for unchanged
files, which we don't want to do, of course.)

There is an iter_references method, but using it looks just as expensive
as calling kind().

I did some work on updating commit to use iter_changes, but found for
multi-parent trees, I had to fall back to the slow inventory comparison
approach.

Really, I think we need a call akin to iter_changes that handles
multiple parents, and knows to emit entries when InventoryEntry.revision
is all that's changed.

Avoiding Work: Better Layering
------------------------------

For each file, commit is currently doing more work than it should. Here is
John's take on a better way ...

Note that "_iter_changes" *does* have to touch every path on disk, but
it just can do it in a more efficient manner. (It doesn't have to create
an InventoryEntry for all the ones that haven't changed).

I agree with Aaron that we need something a little different than
_iter_changes, both because of handling multiple parents and because we
don't want it to actually read the files if we have a stat-cache miss.

Specifically, the commit code *has* to read the files because it is
going to add the text to the repository, and we want it to compute the
sha1 at *that* time, so we are guaranteed to have the valid sha (rather
than just whatever the last cached one was). So we want the code to
return 'None' if it doesn't have an up-to-date sha1, rather than reading
the file and computing it, just before it returns it to the parent.
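
The "return None rather than read the file" behaviour could look something
like this sketch. The cache layout (a dict keyed by path holding size, mtime
and sha1) and the names ``CacheEntry`` and ``cached_sha1_or_none`` are
assumptions made for illustration; they are not the dirstate format.

```python
import hashlib
import os
import tempfile
from collections import namedtuple

# Hypothetical stat-cache record for one file.
CacheEntry = namedtuple("CacheEntry", "size mtime_ns sha1")


def cached_sha1_or_none(path, cache):
    """Return the cached sha1 only if the stat data still matches.

    On a stat-cache miss this returns None instead of reading the file,
    leaving commit to hash the text at the moment it stores it.
    """
    entry = cache.get(path)
    if entry is None:
        return None
    st = os.stat(path)
    if (st.st_size, st.st_mtime_ns) != (entry.size, entry.mtime_ns):
        return None
    return entry.sha1


# Usage: a matching entry is trusted; a file whose size changed is not.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "f.txt")
    with open(path, "wb") as f:
        f.write(b"abc")
    st = os.stat(path)
    cache = {
        path: CacheEntry(st.st_size, st.st_mtime_ns,
                         hashlib.sha1(b"abc").hexdigest())
    }
    fresh = cached_sha1_or_none(path, cache)
    with open(path, "wb") as f:
        f.write(b"abcdef")  # the size changes, so the entry is now stale
    stale = cached_sha1_or_none(path, cache)
    unknown = cached_sha1_or_none(os.path.join(tmp, "other"), cache)
```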

The commit code (0.16) should really be restructured. Its layering is
pretty wrong.

Specifically, calling "kind()" requires a stat of the file. But we have
to do a stat to get the size/whether the record is up-to-date, etc. So
we really need to have a "create_an_up_to_date_inventory()" function.
But because we are accessing every object on disk, we want to be working
in tuples rather than Inventory objects. And because DirState already
has the parent records next to the current working inventory, it can do
all the work to do really fast comparison and throw-away of unimportant
records.

The way I made "bzr status" fast is by moving the 'ignore this record'
ability as deep into the stack as I could get. Status has the property
that you don't care about most of the records, just like commit. So the
sooner you can stop evaluating the 99% that you don't care about, the
less work you do.

Avoiding work: avoiding reading parent data
-------------------------------------------

We would like to avoid the work of reading any data about the parent
revisions. We should at least try to avoid reading anything from the
repository; we can also consider whether it is possible or useful to hold
less parent information in the working tree.

When a commit of selected files is requested, the committed snapshot is a
composite of some directories from the parent revision and some from the
working tree. In this case it is logically necessary to have the parent
inventory information.

If file last-change information or per-file graph information is stored
then it must be available from the parent trees.

If the Branch's storage method does delta compression at commit time it
may need to retrieve file or inventory texts from the repository.

It is desirable to avoid roundtrips to the Repository during commit,
particularly because it may be remote. If the WorkingTree can determine
by itself that a text was in the parent and therefore should be in the
Repository that avoids one roundtrip per file.

There is a possibility here that the parent revision is not stored, or not
correctly stored, in the repository the tree is being committed into, and
so the committed tree would not be reconstructable. We could check that
the parent revision is present in the inventory and rely on the invariant
that if a revision is present, everything to reconstruct it will be
present too.

Code structure
--------------

Caller starts a commit

>>> Branch.commit(from_tree, options)

This creates a CommitBuilder object matched to the Branch, Repository and
Tree. It can vary depending on model differences or by knowledge of what
is efficient with the Repository and Tree. Model differences might
include whether no-text-change merges need to be reported, and whether the

The basic CommitBuilder.commit structure can be

1. Ask the branch if it is ready to commit (up to date with master if
   any.)

2. Ask the tree if it is ready to commit to the branch (up to date with
   branch?), no conflicts, etc

3. Commit changed files; prototype implementation:

   a. Ask the working tree for all committable files; for each it should
      return the per-file parents, stat information, kind, etc.

   b. Ask the repository to store the new file text; the repository should
      return the stored sha1 and new revision id.

4. Commit changed inventory

5. Commit revision object
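
The five steps above can be sketched as a skeleton. Everything here is a
stand-in invented for illustration (``MemoryRepository``,
``SketchCommitBuilder``, ``CommitRefused``, and the shape of the arguments);
it is not the bzrlib CommitBuilder interface, and the readiness checks are
reduced to booleans supplied by the caller.

```python
import hashlib


class CommitRefused(Exception):
    """Raised when the branch or tree is not ready to commit."""


class MemoryRepository:
    """Hypothetical in-memory stand-in for a repository."""

    def __init__(self):
        self.texts = {}
        self.inventories = {}
        self.revisions = {}

    def store_text(self, file_id, revision_id, text):
        """Store one file text; return (sha1, revision id of the text)."""
        self.texts[(file_id, revision_id)] = text
        return hashlib.sha1(text).hexdigest(), revision_id


class SketchCommitBuilder:
    def __init__(self, repository, revision_id, parents):
        self.repository = repository
        self.revision_id = revision_id
        self.parents = list(parents)

    def commit(self, branch_ready, tree_ready, committable_files, message):
        # 1. Ask the branch if it is ready to commit.
        if not branch_ready:
            raise CommitRefused("branch is not up to date with its master")
        # 2. Ask the tree if it is ready to commit to the branch.
        if not tree_ready:
            raise CommitRefused("tree is out of date or has conflicts")
        # 3. Commit changed files: (a) the tree supplies the files,
        #    (b) the repository stores each text and returns its sha1.
        inventory = {}
        for file_id, path, text in committable_files:
            sha1, text_revision = self.repository.store_text(
                file_id, self.revision_id, text)
            inventory[file_id] = (path, sha1, text_revision)
        # 4. Commit changed inventory.
        self.repository.inventories[self.revision_id] = inventory
        # 5. Commit revision object.
        self.repository.revisions[self.revision_id] = {
            "parents": self.parents,
            "message": message,
        }
        return self.revision_id
```

Because the revision object is written last, a commit refused or interrupted
before step 5 leaves only unreferenced texts behind in this sketch.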

Complications of commit
-----------------------

Bazaar (as of 0.17) does not support selective-file commit of a merge;
this could be done if we decide how it should be recorded - is this to be
stored as an overall merge revision; as preliminary non-merge revisions;
or will the per-file graph diverge from the revision graph?

There are several checks that may cause the commit to be refused, which
may be activated or deactivated by options.

* presence of conflicts in the tree

* presence of unknown files

* the working tree basis is up to date with the branch tip

* the local branch is up to date with the master branch, if there
  is one and --local is not specified

* an empty commit message is given

* a hook flags an error

* a "pointless" commit, with no inventory changes

Most of these require walking the tree and can be easily done while
recording the tree shape. This does require that it be possible to abort
the commit after the tree changes have been recorded. It could be ok to
either leave the unreachable partly-committed records in the repository,
or to roll back.

Other complications:

* when automatically adding new files or deleting missing files during
  commit, they must be noted during commit and written into the working
  tree at some point

* refuse "pointless" commits with no file changes - should be easy by
  just refusing to do the final step of storing a new overall inventory
  and revision object

* heuristic detection of renames between add and delete (out of scope for
  this change)

* pushing changes to a master branch if any

* running hooks, pre and post commit

* prompting for a commit message if necessary, including a list of the
  changes that have already been observed

* if there are tree references and recursing into them is enabled, then
  do so

Commit needs to protect against duplicated file ids

Updates that need to be made in the working tree, either on conclusion
of commit or during the scan, include

* Changes made to the tree shape, including automatic adds, renames or
  deletes

* For trees (eg dirstate) that cache parent inventories, the old parent
  information must be removed and the new one inserted

* The tree hashcache information should be updated to reflect the stat
  value at which the file was the same as the committed version, and the
  content hash it was observed to have. This needs to be done carefully to
  prevent inconsistencies if the file is modified during or shortly after
  the commit. Perhaps it would work to read the mtime of the file before we
  read its text to commit.
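
The idea in the last point, taking the stat value before reading the text,
might be sketched like this. The second stat after reading is an extra
precaution that is not in the notes above, and ``read_for_commit`` is a
hypothetical name. A change made within the filesystem's timestamp
granularity that keeps the same size could still go unnoticed.

```python
import hashlib
import os
import tempfile


def read_for_commit(path):
    """Read a file's text; return (text, sha1, stat_fingerprint).

    The fingerprint is the (size, mtime_ns) observed *before* the read.
    It is None if the file appears to have changed while being read, in
    which case no hashcache entry should be recorded for it.
    """
    before = os.stat(path)
    with open(path, "rb") as f:
        text = f.read()
    after = os.stat(path)
    fingerprint = (before.st_size, before.st_mtime_ns)
    sha1 = hashlib.sha1(text).hexdigest()
    changed = (
        fingerprint != (after.st_size, after.st_mtime_ns)
        or len(text) != before.st_size
    )
    return text, sha1, (None if changed else fingerprint)


# Usage: an undisturbed file yields a fingerprint safe to cache.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "f.txt")
    with open(path, "wb") as f:
        f.write(b"content\n")
    expected = os.stat(path)
    text, sha1, fingerprint = read_for_commit(path)
```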

417

418

419

Interface stack

420

---------------

421

422

The commit api is invoked by the command interface, and copies information

423

from the tree into the branch and its repository, possibly updating the

424

WorkingTree afterwards.

The command interface passes:

* a commit message (from an option, if any),
* or an indication that it should be read interactively from the ui object;
* a list of files to commit
* an option for a dry-run commit
* verbose option, or callback to indicate
* timestamp, timezone, committer, chosen revision id
* config (for what?)
* option for local-only commit on a bound branch
* option for strict commits (fail if there are unknown or missing files)
* option to allow "pointless" commits (with no tree changes)

(This is rather a lot of options to pass individually and just for code
tidiness maybe some of them should be combined into objects.)
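
One way the options might be grouped is sketched below. ``CommitRequest``
and all of its field names are hypothetical, not an existing bzrlib class;
the ``config`` item is left out because its purpose is still an open
question in the list above.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class CommitRequest:
    """Hypothetical bundle of the command-level options; not a bzrlib class."""
    message: Optional[str] = None        # None means: ask the ui object
    specific_files: List[str] = field(default_factory=list)  # empty = all
    dry_run: bool = False
    reporter: Optional[Callable[[str], None]] = None  # verbose callback
    timestamp: Optional[float] = None
    timezone: Optional[int] = None
    committer: Optional[str] = None
    revision_id: Optional[str] = None
    local: bool = False                  # local-only commit on a bound branch
    strict: bool = False                 # fail on unknown or missing files
    allow_pointless: bool = True         # permit commits with no tree changes

    def needs_interactive_message(self) -> bool:
        """True when no message was supplied as an option."""
        return self.message is None
```

A caller would then build a single object, for example
``CommitRequest(specific_files=["README"], strict=True)``, and pass it down
the stack instead of a long argument list.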

>>> Branch.commit(from_tree, message, files_to_commit, ...)

There will be different implementations of this for different Branch
classes, whether for foreign branches or Bazaar repositories using
different storage methods.

Most of the commit should occur during a single lockstep iteration across
the workingtree and parent trees. The WorkingTree interface needs to
provide methods that give commit all it needs. Some of these methods
(such as answering the file's last change revision) may be deprecated in
newer working trees and there we have a choice of either calculating the
value from the data that is present, or refusing to support commit to
newer repositories.

For a dirstate tree the iteration of changes from the parent can easily be
done within its own iter_changes.

Dirstate inventories may be most easily updated in a single operation at
the end; however it may be best to accumulate data as we proceed through
the tree rather than revisiting it at the end.

Showing a progress bar for commit may not be necessary if we report files
as they are committed. Alternatively we could transiently show a progress
bar for each directory that's scanned, even if no changes are observed.

This needs to collect a list of added/changed/removed files, each of which
must have its text stored (if any) and containing directory updated. This
can be done by calling Tree._iter_changes on the source tree, asking for
changes

In the 0.17 model the commit operation needs to know the per-file parents
and per-file last-changed revision.

(In this and other operations we must avoid having multiple layers walk
over the tree separately. For example, it is no good to have the Command
layer walk the tree to generate a list of all file ids to commit, because
the tree will also be walked later. The layers that do need to operate
per-file should probably be bound together in a per-dirblock iterator,
rather than each iterating independently.)

Branch->Tree interface
----------------------

The Branch commit code needs to ask the Tree what should be committed, in
terms of changes from the parent revisions. If the Tree holds all the
necessary parent tree information itself it can do it single-handed;
otherwise it may need to ask the Repository for parent information.

This should be a streaming interface, probably like iter_changes returning
information per directory block.

The interface should not return a block for directories that are
recursively unchanged.
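
A per-directory-block stream of this kind could be sketched as a generator.
The example below is only an illustration: the trees are stood in for by
plain dicts mapping a ``/``-separated path to a content hash, and
``iter_changed_blocks`` is a hypothetical name, not the real
``iter_changes``. It emits a block only for directories that directly
contain a change; reporting the ancestors of changed directories is left
out.

```python
import itertools
import posixpath


def _classify(path, parent, working):
    """Return 'added', 'removed', 'modified', or None if unchanged."""
    if path not in parent:
        return "added"
    if path not in working:
        return "removed"
    if parent[path] != working[path]:
        return "modified"
    return None


def iter_changed_blocks(parent, working):
    """Yield (directory, [(name, change), ...]) one directory at a time.

    Directories in which no file differs are skipped entirely.
    """
    # Sorting by (dirname, basename) keeps each directory's files adjacent.
    paths = sorted(set(parent) | set(working), key=posixpath.split)
    for dirname, group in itertools.groupby(paths, key=posixpath.dirname):
        changes = []
        for path in group:
            change = _classify(path, parent, working)
            if change is not None:
                changes.append((posixpath.basename(path), change))
        if changes:
            yield dirname, changes
```

Because each block is yielded as soon as its directory has been examined,
the consumer can store texts for one directory before the next is scanned.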

The tree's idea of what is possibly changed may be more conservative than
that of the branch. For example the tree may report on merges of files
where the text is identical to the parents: this must be recorded for
Bazaar branches that record per-file ancestry but is not necessary for all
branches. If the tree is responsible for determining when directories
have been recursively modified then it will report on all the parents of
such files. There are several implementation options:

1. Return all files and directories the branch might want to commit, even
   if the branch ends up taking no action on them.

2. When starting the iteration, the branch can specify what type of change
   is considered interesting.

Since these types of changes are probably (??) rare compared to files that
are either completely unmodified or substantially modified, the first may
be the best and simplest option.

The branch needs to build an inventory to commit, which must include
unchanged files within changed directories. This should be returned from
the working tree too. Repositories that store per-directory inventories
will want to build and store these from the lowest directories up.
For 0.17 format repositories with an all-in-one inventory it may be
easiest to accumulate inventory entries in arbitrary order into an
in-memory Inventory and then serialize it.

It ought to be possible to commit any Tree into a Branch, without
requiring a WorkingTree; the commit code should cope if the tree is not
interested in updating hashcache information or does not have a
``last_revision``.

Information from the tree to repository
---------------------------------------

The main things the tree needs to tell the Branch about are:

* A file is modified from its parent revision (in text, permissions,
  other), and so its text may need to be stored.

  Files should also be reported if they have more than one unique parent
  revision, for repositories that store per-file graphs or last-change
  revisions. Perhaps this behaviour should be optional.

  **XXX:** are renames/deletions reported here too?

* The complete contents of a modified directory, so that its inventory
  text may be stored. This should be done after all the contained files
  and directories have been reported. If there are unmodified files,
  or unselected files carried through from

  XXX: Actually perhaps not grouped by directory, but rather grouped
  appropriately for the shape of inventory storage in the repository.

  In a zoomed-in checkout the workingtree may not have all the shape data
  for the entire tree.

* A file is missing -- could cause either automatic removal or an aborted
  commit.

* Any unknown files -- can cause automatic addition, abortion of a strict
  commit, or just reporting.

Information from the repository to the tree
-------------------------------------------

After the commit the tree needs to be updated to the new revision. Some
information which was accumulated during the commit must be made available
to the workingtree. It's probably reasonable to hold it all in memory and
allow the workingtree to get it in whatever order it wants.

* A list of modified entries, and for each one:

  * The stat values observed when the file was first read.

  * The hash of the committed file text.

  * The file's last-change revision, if appropriate.

  This should include any entries automatically added or removed.

This might be construed as an enhanced version of ``set_parent_trees``.
We can avoid a stat on each file by using the value that was observed when
it was first read.
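
The information handed back might be held in memory as a list of small
records. The sketch below is an assumption about shape only:
``CommittedEntry``, ``apply_to_hashcache`` and the ``{path: (stat_key,
sha1)}`` cache layout are hypothetical names, not bzrlib APIs. It shows the
cache being refreshed from the values observed during commit, with no
further stat calls.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple

StatKey = Tuple[int, int]  # hypothetical (size, mtime_ns) pair


@dataclass(frozen=True)
class CommittedEntry:
    """Hypothetical per-file record handed back to the working tree."""
    path: str
    stat_key: Optional[StatKey]         # seen at first read; None if removed
    sha1: Optional[str]                 # hash of committed text; None if removed
    last_changed: Optional[str] = None  # last-change revision, if tracked


def apply_to_hashcache(cache: Dict[str, Tuple[StatKey, str]],
                       entries: Iterable[CommittedEntry]) -> None:
    """Update an in-memory cache from commit results without re-statting."""
    for e in entries:
        if e.stat_key is None or e.sha1 is None:
            cache.pop(e.path, None)     # entry was removed by the commit
        else:
            cache[e.path] = (e.stat_key, e.sha1)
```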

Selective commit
----------------

For a partial commit the directory contents may need to contain a mix of
entries from the working tree and parent trees. This code probably
shouldn't live in a specific tree implementation; maybe there should be a
general filter that selects paths from one tree into another?
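
Such a filter could be as small as the following sketch, which assumes the
two trees can be viewed as plain ``{path: entry}`` mappings;
``select_entries`` is a hypothetical helper, not part of bzrlib. Paths at or
below a selected path are taken from the working tree (so a selected file
that has been deleted there disappears), and everything else is carried
through from the parent.

```python
def select_entries(working, parent, selected):
    """Build the entries to commit for a partial commit.

    working and parent are {path: entry} mappings; selected is a list of
    '/'-separated file or directory paths chosen by the user.
    """
    def is_selected(path):
        return any(path == s or path.startswith(s.rstrip("/") + "/")
                   for s in selected)

    # Unselected paths keep whatever the parent tree recorded.
    result = {p: e for p, e in parent.items() if not is_selected(p)}
    # Selected paths reflect the working tree, including adds and deletes.
    result.update((p, e) for p, e in working.items() if is_selected(p))
    return result
```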

However, the tree walking code does probably need to know about selected
paths to avoid examining unselected files or directories.

We never refuse selective file commits (except for merges).

Common commit code
------------------

What is common to all commit implementations, regardless of workingtree or
repository format?

* Prompting for a commit message?
* Strictness/conflict checks?
* Auto add/remove?

How should this be separated?

Order of traversal
------------------

For current and contemplated Bazaar storage formats, we can only finally
commit a directory after its contained files and directories have been
committed.

The dirstate workingtree format naturally iterates by directory in order
by path, yielding directories before their contents. This may also be the
most efficient order in which to stat and read the files.

One option would be to construe the interface as a visitor which reports
when files are detected to be changed, and also when directories are
finished.
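
The visitor option can be illustrated with a toy traversal. Everything in
the sketch is hypothetical (the nested-dict tree, ``RecordingVisitor``, and
the ``walk`` function); it only shows the event order: changed files are
reported as they are met, and each directory is reported as finished after
everything inside it.

```python
class RecordingVisitor:
    """Hypothetical visitor that just records the events it receives."""

    def __init__(self):
        self.events = []

    def file_changed(self, path):
        self.events.append(("file", path))

    def directory_finished(self, path):
        self.events.append(("dir", path))


def walk(tree, changed, visitor, prefix=""):
    """Visit a nested-dict tree in path order.

    tree maps a name to None (a file) or to another dict (a directory);
    changed is the set of file paths considered modified.
    """
    for name in sorted(tree):
        path = prefix + name
        child = tree[name]
        if child is None:
            if path in changed:
                visitor.file_changed(path)
        else:
            walk(child, changed, visitor, path + "/")
            # Only now is it safe to commit this directory's inventory.
            visitor.directory_finished(path)
    if prefix == "":
        visitor.directory_finished("")  # the tree root finishes last
```

A repository that stores per-directory inventories could write each
directory on ``directory_finished``, which gives the lowest-directories-first
order the storage formats need.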

Open question: per-file graphs
------------------------------

**XXX:** If we want to retain explicitly stored per-file graphs, it would
seem that we do need to record per-file parents. We have not yet finally
settled that we do want to remove them or treat them as a cache. This API
stack is still ok whether we do or not, but the internals of it may
change.
