~bzr-pqm/bzr/bzr.dev

« back to all changes in this revision

Viewing changes to doc/revfile.txt

Committer: Martin Pool
Date: 2005-05-06 03:20:15 UTC
Revision ID: mbp@sourcefrog.net-20050506032014-decf4918803147d2

- split out notes on storing annotations in revfiles

files added:
doc/revfile-annotation.txt

files modified:
doc/index.txt

doc/revfile.txt

Show diffs side-by-side

added added

removed removed

doc/revfile.txt

128

129

130

131

Extension to store annotations

132

==============================

133

134

We might extend the revfile format in a future version to also store

135

annotations. *This is not implemented yet.*

136

137

In previous versions, the index file identified texts by their

138

SHA-1 digest. This was unsatisfying for two reasons. Firstly it

139

assumes that SHA-1 will not collide, which is not an assumption we

140

wish to make in long-lived files. Secondly for annotations we need

141

to be able to map from file versions back to a revision.

142

143

Texts are identified by the name of the revfile and a UUID

144

corresponding to the first revision in which they were first

145

introduced. This means that given a text we can identify which

146

revision it belongs to, and annotations can use the index within the

147

revfile to identify where a region was first introduced.

148

149

We cannot identify texts by the integer revision number, because

150

that would limit us to only referring to a file in a particular

151

branch.

152

153

I'd like to just use the revision-id, but those are variable-length

154

strings, and I'd like the revfile index to be fixed-length and

155

relatively short. UUIDs can be encoded in binary as only 16 bytes.

156

Perhaps we should just use UUIDs for revisions and be done?

157

158

159

160

Annotations

161

-----------

162

163

Annotations indicate which revision of a file first inserted a line

164

(or region of bytes).

165

166

Given a string, we can write annotations on it like so: a sequence of

167

*(index, length)* pairs, giving the *index* of the revision which

168

introduced the next run of *length* bytes. The sum of the lengths

169

must equal the length of the string. For text files the regions will

170

typically fall on line breaks. This can be transformed in memory to

171

other structures, such as a list of *(index, content)* pairs.

172

173

When a line was inserted from a merge revision then the annotation for

174

that line should still be the source in the merged branch, rather than

175

just being the revision in which the merge took place.

176

177

They can cheaply be calculated when inserting a new text, but are

178

expensive to calculate after the fact because that requires searching

179

back through all previous text and all texts which were merged in. It

180

therefore seems sensible to calculate them once and store them.

181

182

To do this we need two operators which update an existing annotated

183

file:

184

185

A. Given an annotated file and a working text, update the annotation to

186

mark regions inserted in the working file as new in this revision.

187

188

B. Given two annotated files, merge them to produce an annotated

189

result. When there are conflicts, both texts should be included

190

and annotated.

191

192

These may be repeated: after a merge there may be another merge, or

193

there may be manual fixups or conflict resolutions.

194

195

So what we require is given a diff or a diff3 between two files, map

196

the regions of bytes changed into corresponding updates to the origin

197

annotations.

198

199

Annotations can also be delta-compressed; we only need to add new

200

annotation data when there is a text insertion.

201

202

(It is possible in a merge to have a change of annotation when

203

there is no text change, though this seems unlikely. This can

204

still be represented as a "pointless" delta, plus an update to the

205

annotations.)

206

207

Index file

208

----------

209

210

In a proposed (not implemented) storage with annotations, the index

211

file is a series of fixed-length records::

212

213

byte[16] UUID of revision

214

byte[20] SHA-1 of expanded text (as binary, not hex)

215

uint32 flags: 1=zlib compressed

216

uint32 sequence number this is based on, or -1 for full text

217

uint32 offset in text file of start

218

uint32 length of compressed delta in text file

219

uint32[3] reserved

220

221

Total 64 bytes.

222

223

The header is also 64 bytes, for tidyness and easy calculation. For

224

this format the header must be ``bzr revfile v2\n`` padded with

225

``\xff`` to 64 bytes.

226

227

The first record after the header is index 0. A record's base index

228

must be less than its own index.

229

230

The SHA-1 is redundant with the inventory but stored just as a check

231

on the compression methods and so that the file can be validated

232

without reference to any other information.

233

234

Each byte in the text file should be included by at most one delta.

235

236

237

Deltas

238

------

239

240

In a proposed (not implemented) storage with annotations, deltas to

241

the text are stored as a series of variable-length records::

242

243

uint32 idx

244

uint32 m

245

uint32 n

246

uint32 l

247

byte[l] new

248

249

This describes a change originally introduced in the revision

250

described by *idx* in the index.

251

252

This indicates that the region [m:n] of the input file should be

253

replaced by the text *new*. If m==n this is a pure insertion of l

254

bytes. If l==0 this is a pure deletion of (n-m) bytes.

255

256

257

258

259

131

260

132

Open issues

261

133

===========

273

145

be fixed by creating the fixed repository as a separate branch, into

274

146

which only the preserved revisions are exported.

275

147

276

* Should annotations also indicate where text was deleted?

277

278

* This design calls for only one annotation per line, which seems

279

standard. However, this is lacking in at least two cases:

280

281

- Lines which originate in the same way in more than one revision,

282

through being independently introduced. In this case we would

283

apparently have to make an arbitrary choice; I suppose branches

284

could prefer to assume lines originated in their own history.

285

286

- It might be useful to directly indicate which mergers included

287

which lines. We do have that information in the revision history

288

though, so there seems no need to store it for every line.

289

290

148

* Should we also store full-texts as a transitional step?

291

292

* Storing the annotations with the text is reasonably simple and

293

compact, but means that we always need to process the annotation

294

structure even when we only want the text. In particular it means

295

that full-texts cannot just simply be copied out but rather composed

296

from chunks. That seems inefficient since it is probably common to

297

only want the text.

298

Older »