~bzr-pqm/bzr/bzr.dev

« back to all changes in this revision

Viewing changes to doc/developers/performance-commit.txt

Committer: John Arbash Meinel
Date: 2007-06-07 22:31:44 UTC
mfrom: (2517 +trunk)
mto: This revision was merged to the branch mainline in revision 2518.
Revision ID: john@arbash-meinel.com-20070607223144-u4oljlajcvq6by2n

[merge] bzr.dev 2517

files added:
bzrlib/branchbuilder.py

bzrlib/tests/test_branchbuilder.py

doc/developers/add.txt

doc/developers/annotate.txt

doc/developers/bundle-creation.txt

doc/developers/gc.txt

doc/developers/initial-push-pull.txt

doc/developers/merge-scaling.txt

doc/developers/performance-commit.txt

doc/developers/performance.dot

doc/developers/planned-performance-changes.txt

doc/developers/revert.txt

files modified:
Makefile

NEWS

bzrlib/builtins.py

bzrlib/commit.py

bzrlib/dirstate.py

bzrlib/errors.py

bzrlib/help_topics.py

bzrlib/log.py

bzrlib/merge.py

bzrlib/missing.py

bzrlib/osutils.py

bzrlib/tests/HTTPTestUtil.py

bzrlib/tests/__init__.py

bzrlib/tests/blackbox/test_help.py

bzrlib/tests/blackbox/test_log.py

bzrlib/tests/blackbox/test_revision_info.py

bzrlib/tests/test_ancestry.py

bzrlib/tests/test_commit.py

bzrlib/tests/test_dirstate.py

bzrlib/tests/test_http.py

bzrlib/tests/test_log.py

bzrlib/tests/test_merge.py

bzrlib/tests/test_missing.py

bzrlib/tests/test_revert.py

bzrlib/tests/test_selftest.py

bzrlib/tests/test_transform.py

bzrlib/tests/test_treebuilder.py

bzrlib/tests/workingtree_implementations/test_remove.py

bzrlib/tests/workingtree_implementations/test_workingtree.py

bzrlib/transform.py

bzrlib/transport/http/__init__.py

bzrlib/transport/http/_pycurl.py

bzrlib/ui/__init__.py

bzrlib/version.py

bzrlib/workingtree.py

bzrlib/workingtree_4.py

contrib/bash/bzr.simple

doc/developers/HACKING

doc/developers/incremental-push-pull.txt

doc/developers/performance-roadmap-rationale.txt

doc/developers/performance-roadmap.txt

doc/developers/performance-use-case-analysis.txt

doc/tutorial.txt

setup.py

Show diffs side-by-side

added added

removed removed

doc/developers/performance-commit.txt

Commit Performance Notes

========================

.. contents::

Commit: The Minimum Work Required

---------------------------------

This is Martin Pool's email to the mailing list on the minimum work that

commit needs to do. Be sure to check the mailing list archives

(https://lists.ubuntu.com/archives/bazaar/2007q2/025791.html)

for follow-up comments from Robert and others ...

Here is a description of the minimum work that commit must do. We

want to make sure that our design doesn't cost too much more than this

minimum. I am trying to do this without making too many assumptions

about the underlying storage, but am assuming that the ui and basic

architecture (wt, branch, repo) stays about the same.

The basic purpose of commit is to:

1. create and store a new revision based on the contents of the working tree

2. make this the new basis revision for the working tree

We can do a selected commit of only some files or subtrees.

The best performance we could hope for is:

- stat each versioned selected working file once

- read from the workingtree and write into the repository any new file texts

- in general, do work proportional to the size of the shape (eg

inventory) of the old and new selected trees, and to the total size of

the modified files

In more detail:

1.0 - Store new file texts: if a versioned file contains a new text

there is no avoiding storing it. To determine which ones have changed

we must go over the workingtree and at least stat each file. If the

file is modified since it was last hashed, it must be read in.

Ideally we would read it only once, and either notice that it has not

changed, or store it at that point.

On the other hand we want new code to be able to handle files that are

larger than will fit in memory. We may then need to read each file up

to two times: once to determine if there is a new text and calculate

its hash, and again to store it.

1.1 - Store a tree-shape description (ie inventory or similar.) This

describes the non-file objects, and provides a reference from the

Revision to the texts within it.

1.2 - Generate and store a new revision object.

1.3 - Do delta-compression on the stored objects. (git notably does

not do this at commit time, deferring this entirely until later.)

This requires finding the appropriate basis for each modified file: in

the current scheme we get the file id, last-revision from the

dirstate, look into the knit for that text, extract that text in

total, generate a delta, then store that into the knit. Most delta

operations are O(n2) to O(n3) in the size of the modified files.

1.4 - Cache annotation information for the changes: at the moment this

is done as part of the delta storage. There are some flaws in that

approach, such as that it is not updated when ghosts are filled, and

the annotation can't be re-run with new diff parameters.

2.1 - Make the new revision the basis for the tree, and clear the list

of parents. Strictly this is all that's logically necessary, unless

the working tree format requires more work.

The dirstate format does require more work, because it caches the

parent tree data for each file within the working tree data. In

practice this means that every commit rewrites the entire dirstate

file - we could try to avoid rewriting the whole file but this may be

difficult because variable-length data (the last-changed revision id)

is inserted into many rows.

The current dirstate design then seems to mean that any commit of a

single file imposes a cost proportional to the size of the current

workingtree. Maybe there are other benefits that outweigh this.

Alternatively if it was fast enough for operations to always look at

the original storage of the parent trees we could do without the

cache.

2.2 - Record the observed file hashes into the workingtree control

files. For the files that we just committed, we have the information

to store a valid hash cache entry: we know their stat information and

the sha1 of the file contents. This is not strictly necessary to the

speed of commit, but it will be useful later in avoiding reading those

files, and the only cost of doing it now is writing it out.

In fact there are some user interface niceties that complicate this:

3 - Before starting the commit proper, we prompt for a commit message

and in that commit message editor we show a list of the files that

will be committed: basically the output of bzr status. This is

basically the same as the list of changes we detect while storing the

commit, but because the user will sometimes change the tree after

opening the commit editor and expect the final state to be committed I

100

think we do have to look for changes twice. Since it takes the user a

101

while to enter a message this is not a big problem as long as both the

102

status summary and the commit are individually fast.

103

104

4 - As the commit proceeds (or after?) we show another status-like

105

summary. Just printing the names of modified files as they're stored

106

would be easy. Recording deleted and renamed files or directories is

107

more work: this can only be done by reference to the primary parent

108

tree and requires it be read in. Worse, reporting renames requires

109

searching by id across the entire parent tree. Possibly full

110

reporting should be a default-off verbose option because it does

111

require more work beyond the commit itself.

112

113

5 - Bazaar currently allows for missing files to be automatically

114

marked as removed at the time of commit. Leaving aside the ui

115

consequences, this means that we have to update the working inventory

116

to mark these files as removed. Since as discussed above we always

117

have to rewrite the dirstate on commit this is not substantial, though

118

we should make sure we do this in one pass, not two. I have

119

previously proposed to make this behaviour a non-default option.

120

121

We may need to run hooks or generate signatures during commit, but

122

they don't seem to have substantial performance consequences.

123

124

If one wanted to optimize solely for the speed of commit I think

125

hash-addressed file-per-text storage like in git (or bzr 0.1) is very

126

good. Remarkably, it does not need to read the inventory for the

127

previous revision. For each versioned file, we just need to get its

128

hash, either by reading the file or validating its stat data. If that

129

hash is not already in the repository, the file is just copied in and

130

compressed. As directories are traversed, they're turned into texts

131

and stored as well, and then finally the revision is too. This does

132

depend on later doing some delta compression of these texts.

133

134

Variations on this are possible. Rather than writing a single file

135

into the repository for each text, we could fold them into a single

136

collation or pack file. That would create a smaller number of files

137

in the repository, but looking up a single text would require looking

138

into their indexes rather than just asking the filesystem.

139

140

Rather than using hashes we can use file-id/rev-id pairs as at

141

present, which has several consequences pro and con.

142

143

144

Commit vs Status

145

----------------

146

147

At first glance, commit simply stores the changes status reports. In fact,

148

this isn't technically correct: commit considers some files modified that

149

status does not. The notes below were put together by John Arbash Meinel

150

and Aaron Bentley in May 2007 to explain the finer details of commit to

151

Ian Clatworthy. They are recorded here as they are likely to be useful to

152

others new to Bazaar ...

153

154

1) **Unknown files have a different effect.** With --no-strict (the default)

155

they have no effect and can be completely ignored. With --strict they

156

should cause the commit to abort (so you don't forget to add the two new

157

test files that you just created).

158

159

2) **Multiple parents.** 'status' always compares 2 trees, typically the

160

last-committed tree and the current working tree. 'commit' will compare

161

more trees if there has been a merge.

162

163

a) The "last modified" property for files.

164

A file may be marked as changed since the last commit, but that

165

change may have come in from the merge, and the change could have

166

happened several commits back. There are several edge cases to be

167

handled here, like if both branches modified the same file, or if

168

just one branch modified it.

169

170

b) The trickier case is when a file appears unmodified since last

171

commit, but it was modified versus one of the merged branches. I

172

believe there are a few ways this can happen, like if a merged

173

branch changes a file and then reverts it back (you still update

174

the 'last modified' field).

175

In general, if both sides disagree on the 'last-modified' flag,

176

then you need to generate a new entry pointing 'last-modified' at

177

this revision (because you are resolving the differences between

178

the 2 parents).

179

180

3) **Automatic deletion of 'missing' files.** This is a point that we go

181

back and forth on. I think the basic idea is that 'bzr commit' by

182

default should abort if it finds a 'missing' file (in case that file was

183

renamed rather than deleted), but 'bzr commit --auto' can add unknown

184

files and remove missing files automatically.

185

186

4) **sha1 for newly added files.** status doesn't really need this: it should

187

only care that the file is not present in base, but is present now. In

188

some ways commit doesn't care either, since it needs to read and sha the

189

file itself anyway.

190

191

5) **Nested trees.** status doesn't recurse into nested trees, but commit does.

192

This is just because not all of the nested-trees work has been merged yet.

193

194

A tree-reference is considered modified if the subtree has been

195

committed since the last containing-tree commit. But commit needs to

196

recurse into every subtree, to ensure that a commit is done if the

197

subtree has changed since its last commit. _iter_changes only reports

198

on tree-references that are modified, so it can't be used for doing

199

subtree commits.

200

201

202

Avoiding Work: Smarter Change Detection

203

---------------------------------------

204

205

Commit currently walks through every file building an inventory. Here is

206

Aaron's brain dump on a better way ...

207

208

_iter_changes won't tell us about tree references that haven't changed,

209

even if those subtrees have changed. (Unless we ask for unchanged

210

files, which we don't want to do, of course.)

211

212

There is an iter_references method, but using it looks just as expensive

213

as calling kind().

214

215

I did some work on updating commit to use iter_changes, but found for

216

multi-parent trees, I had to fall back to the slow inventory comparison

217

approach.

218

219

Really, I think we need a call akin to iter_changes that handles

220

multiple parents, and knows to emit entries when InventoryEntry.revision

221

is all that's changed.

222

223

224

Avoiding Work: Better Layering

225

------------------------------

226

227

For each file, commit is currently doing more work than it should. Here is

228

John's take on a better way ...

229

230

Note that "_iter_changes" *does* have to touch every path on disk, but

231

it just can do it in a more efficient manner. (It doesn't have to create

232

an InventoryEntry for all the ones that haven't changed).

233

234

I agree with Aaron that we need something a little different than

235

_iter_changes. Both because of handling multiple parents, as well as we

236

don't want it to actually read the files if we have a stat-cache miss.

237

238

Specifically, the commit code *has* to read the files because it is

239

going to add the text to the repository, and we want it to compute the

240

sha1 at *that* time, so we are guaranteed to have the valid sha (rather

241

than just whatever the last cached one was). So we want the code to

242

return 'None' if it doesn't have an up-to-date sha1, rather than reading

243

the file and computing it, just before it returns it to the parent.

244

245

The commit code (0.16) should really be restructured. It's layering is

246

pretty wrong.

247

248

Specifically, calling "kind()" requires a stat of the file. But we have

249

to do a stat to get the size/whether the record is up-to-date, etc. So

250

we really need to have a "create_an_up_to_date_inventory()" function.

251

But because we are accessing every object on disk, we want to be working

252

in tuples rather than Inventory objects. And because DirState already

253

has the parent records next to the current working inventory, it can do

254

all the work to do really fast comparison and throw-away of unimportant

255

records.

256

257

The way I made "bzr status" fast is by moving the 'ignore this record'

258

ability as deep into the stack as I could get. Status has the property

259

that you don't care about most of the records, just like commit. So the

260

sooner you can stop evaluating the 99% that you don't care about, the

261

less work you do.

262

Older »