~bzr-pqm/bzr/bzr.dev : revision 2513.1.1

1

Commit

2

======

3

4

The basic purpose of commit is to

5

1 - create and store a new revision based on the contents of the working tree

6

2 - make this the new basis revision for the working tree

7

8

We can do a selected commit of only some files or subtrees.

9

10

Minimum work

11

------------

12

13

The best performance we could hope for is:

14

- stat each versioned selected working file once

15

- read from the workingtree and write into the repository any new file texts

16

- in general, do work proportional to the size of the shape (eg

17

inventory) of the old and new selected trees, and to the total size of

18

the modified files

19

20

In more detail:

21

22

1.0 - Store new file texts: if a versioned file contains a new text

23

there is no avoiding storing it. To determine which ones have changed

24

we must go over the workingtree and at least stat each file. If the

25

file is modified since it was last hashed, it must be read in.

26

Ideally we would read it only once, and either notice that it has not

27

changed, or store it at that point.

28

29

On the other hand we want new code to be able to handle files that are

30

larger than will fit in memory. We may then need to read each file up

31

to two times: once to determine if there is a new text and calculate

32

its hash, and again to store it.

33

34

1.1 - Store a tree-shape description (ie inventory or similar.) This

35

describes the non-file objects, and provides a reference from the

36

Revision to the texts within it.

37

38

1.2 - Generate and store a new revision object.

39

40

1.3 - Do delta-compression on the stored objects. (git notably does

41

not do this at commit time, deferring this entirely until later.)

42

This requires finding the appropriate basis for each modified file: in

43

the current scheme we get the file id, last-revision from the

44

dirstate, look into the knit for that text, extract that text in

45

total, generate a delta, then store that into the knit. Most delta

46

operations are O(n^2) to O(n^3) in the size of the modified files.

47

48

1.4 - Cache annotation information for the changes: at the moment this

49

is done as part of the delta storage. There are some flaws in that

50

approach, such as that it is not updated when ghosts are filled, and

51

the annotation can't be re-run with new diff parameters.

52

53

2.1 - Make the new revision the basis for the tree, and clear the list

54

of parents. Strictly this is all that's logically necessary, unless

55

the working tree format requires more work.

56

57

The dirstate format does require more work, because it caches the

58

parent tree data for each file within the working tree data. In

59

practice this means that every commit rewrites the entire dirstate

60

file - we could try to avoid rewriting the whole file but this may be

61

difficult because variable-length data (the last-changed revision id)

62

is inserted into many rows.

63

64

The current dirstate design then seems to mean that any commit of a

65

single file imposes a cost proportional to the size of the current

66

workingtree. Maybe there are other benefits that outweigh this.

67

Alternatively if it was fast enough for operations to always look at

68

the original storage of the parent trees we could do without the

69

cache.

70

71

2.2 - Record the observed file hashes into the workingtree control

72

files. For the files that we just committed, we have the information

73

to store a valid hash cache entry: we know their stat information and

74

the sha1 of the file contents. This is not strictly necessary to the

75

speed of commit, but it will be useful later in avoiding reading those

76

files, and the only cost of doing it now is writing it out.

77

78

In fact there are some user interface niceties that complicate this:

79

80

3 - Before starting the commit proper, we prompt for a commit message

81

and in that commit message editor we show a list of the files that

82

will be committed: basically the output of bzr status. This is

83

basically the same as the list of changes we detect while storing the

84

commit, but because the user will sometimes change the tree after

85

opening the commit editor and expect the final state to be committed I

86

think we do have to look for changes twice. Since it takes the user a

87

while to enter a message this is not a big problem as long as both the

88

status summary and the commit are individually fast.

89

90

4 - As the commit proceeds (or after?) we show another status-like

91

summary. Just printing the names of modified files as they're stored

92

would be easy. Recording deleted and renamed files or directories is

93

more work: this can only be done by reference to the primary parent

94

tree and requires it be read in. Worse, reporting renames requires

95

searching by id across the entire parent tree. Possibly full

96

reporting should be a default-off verbose option because it does

97

require more work beyond the commit itself.

98

99

5 - Bazaar currently allows for missing files to be automatically

100

marked as removed at the time of commit. Leaving aside the ui

101

consequences, this means that we have to update the working inventory

102

to mark these files as removed. Since as discussed above we always

103

have to rewrite the dirstate on commit this is not substantial, though

104

we should make sure we do this in one pass, not two. I have

105

previously proposed to make this behaviour a non-default option.

106

107

We may need to run hooks or generate signatures during commit, but

108

they don't seem to have substantial performance consequences.

109

110

If one wanted to optimize solely for the speed of commit I think

111

hash-addressed file-per-text storage like in git (or bzr 0.1) is very

112

good. Remarkably, it does not need to read the inventory for the

113

previous revision. For each versioned file, we just need to get its

114

hash, either by reading the file or validating its stat data. If that

115

hash is not already in the repository, the file is just copied in and

116

compressed. As directories are traversed, they're turned into texts

117

and stored as well, and then finally the revision is too. This does

118

depend on later doing some delta compression of these texts.

119

120

Variations on this are possible. Rather than writing a single file

121

into the repository for each text, we could fold them into a single

122

collation or pack file. That would create a smaller number of files

123

in the repository, but looking up a single text would require looking

124

into their indexes rather than just asking the filesystem.

125

126

Rather than using hashes we can use file-id/rev-id pairs as at

127

present, which has several consequences pro and con.

128

129

Interface stack

130

---------------

131

132

The commit api is invoked by the command interface, and copies information

133

from the tree into the branch and its repository, possibly updating the

134

WorkingTree afterwards.

135

136

The command interface passes:

137

138

* a commit message (from an option, if any),

139

* or an indication that it should be read interactively from the ui object;

140

* a list of files to commit

141

* an option for a dry-run commit

142

* verbose option, or callback to indicate

143

* timestamp, timezone, committer, chosen revision id

144

* config (for what?)

145

* option for local-only commit on a bound branch

146

* option for strict commits (fail if there are unknown or missing files)

147

* option to allow "pointless" commits (with no tree changes)

148

149

>>> Branch.commit(from_tree, message, files_to_commit)

150

151

There will be different implementations of this for different Branch

152

classes, whether for foreign branches or Bazaar repositories using

153

different storage methods.

154

155

Most of the commit should occur during a single lockstep iteration across

156

the workingtree and parent trees. The WorkingTree interface needs to

157

provide methods that give commit all it needs. Some of these methods

158

(such as answering the file's last change revision) may be deprecated in

159

newer working trees and there we have a choice of either calculating the

160

value from the data that is present, or refusing to support commit to

161

newer repositories.

162

163

For a dirstate tree the iteration of changes from the parent can easily be

164

done within its own iter_changes.

165

166

XXX: We currently don't support selective-file commit of a merge; this

167

could be done if we decide how it should be recorded - is this to be

168

stored as an overall merge revision; as a preliminary non-merge revisions;

169

or will the per-file graph diverge from the revision graph.

170

171

Other things commit needs to do:

172

173

* check if there are any conflicts in the tree - if so, commit cannot

174

continue

175

176

* check if there are any unknown files, if --strict or automatic add is

177

turned on

178

179

* check the working tree basis version is up to date with the branch tip

180

181

* when automatically adding new files or deleting missing files during

182

commit, they must be noted during commit and written into the working

183

tree at some point

184

185

* refuse "pointless" commits with no file changes - should be easy by

186

just refusing to do the final step of storing a new overall inventory

187

and revision object

188

189

* heuristic detection of renames between add and delete (out of scope for

190

this change)

191

192

* pushing changes to a master branch if any

193

194

* running hooks, pre and post commit

195

196

* prompting for a commit message if necessary, including a list of the

197

changes that have already been observed

198

199

* if there are tree references and recursing into them is enabled, then

200

do so

201

202

Updates that need to be made in the working tree, either on conclusion

203

of commit or during the scan, include

204

205

* Changes made to the tree shape, including automatic adds, renames or

206

deletes

207

208

* For trees (eg dirstate) that cache parent inventories, the old parent

209

information must be removed and the new one inserted

210

211

* The tree hashcache information should be updated to reflect the stat

212

value at which the file was the same as the committed version. This

213

needs to be done carefully to prevent inconsistencies if the file is

214

modified during or shortly after the commit. Perhaps it would work to

215

read the mtime of the file before we read its text to commit.

216

217

Dirstate inventories may be most easily updated in a single operation at

218

the end; however it may be best to accumulate data as we proceed through

219

the tree rather than revisiting it at the end.

220

221

Showing a progress bar for commit may not be necessary if we report files

222

as they are committed. Alternatively we could transiently show a progress

223

bar for each directory that's scanned, even if no changes are observed.

224

225

This needs to collect a list of added/changed/removed files, each of which

226

must have its text stored (if any) and containing directory updated. This

227

can be done by calling Tree._iter_changes on the source tree, asking for

228

changes

229

230

In the 0.17 model the commit operation needs to know the per-file parents

231

and per-file last-changed revision.

232

233

XXX: If we want to retain explicitly stored per-file graphs, it would seem

234

that we do need to record per-file parents. We have not yet finally

235

settled that we do want to remove them or treat them as a cache. This api

236

stack is still ok whether we do or not, but the internals of it may

237

change.

238

239

(In this and other operations we must avoid having multiple layers walk

240

over the tree separately. For example, it is no good to have the Command

241

layer walk the tree to generate a list of all file ids to commit, because

242

the tree will also be walked later. The layers that do need to operate

243

per-file should probably be bound together in a per-dirblock iterator,

244

rather than each iterating independently.)