~bzr-pqm/bzr/bzr.dev

Viewing changes to doc/revfile.txt

Committer: Martin Pool
Date: 2005-05-05 06:38:18 UTC
Revision ID: mbp@sourcefrog.net-20050505063818-3eb3260343878325

- do upload CHANGELOG to web server, even though it's autogenerated

files added:
bzrlib/cache.py

bzrlib/tests.py

files removed:
HACKING

Makefile

bzr-man.py

bzrlib/atomicfile.py

bzrlib/builtins.py

bzrlib/changeset.py

bzrlib/commit.py

bzrlib/delta.py

bzrlib/externalcommand.py

bzrlib/fetch.py

bzrlib/graph.py

bzrlib/hashcache.py

bzrlib/intset.py

bzrlib/lock.py

bzrlib/log.py

bzrlib/merge.py

bzrlib/merge3.py

bzrlib/merge_core.py

bzrlib/meta_store.py

bzrlib/missing.py

bzrlib/msgeditor.py

bzrlib/patch.py

bzrlib/plugin.py

bzrlib/plugins

bzrlib/plugins/__init__.py

bzrlib/progress.py

bzrlib/selftest

bzrlib/selftest/TestUtil.py

bzrlib/selftest/__init__.py

bzrlib/selftest/blackbox.py

bzrlib/selftest/plugins.py

bzrlib/selftest/test_merge_core.py

bzrlib/selftest/test_parent.py

bzrlib/selftest/test_smart_add.py

bzrlib/selftest/test_xml.py

bzrlib/selftest/testbranch.py

bzrlib/selftest/testdiff.py

bzrlib/selftest/testfetch.py

bzrlib/selftest/testgraph.py

bzrlib/selftest/testhashcache.py

bzrlib/selftest/testinv.py

bzrlib/selftest/testlog.py

bzrlib/selftest/testmerge.py

bzrlib/selftest/testmerge3.py

bzrlib/selftest/testrevision.py

bzrlib/selftest/testrevisionnamespaces.py

bzrlib/selftest/teststatus.py

bzrlib/selftest/teststore.py

bzrlib/selftest/versioning.py

bzrlib/selftest/whitebox.py

bzrlib/shellcomplete.py

bzrlib/ui.py

bzrlib/upgrade.py

bzrlib/util

bzrlib/util/__init__.py

bzrlib/util/effbot

bzrlib/util/effbot/__init__.py

bzrlib/util/effbot/org

bzrlib/util/effbot/org/__init__.py

bzrlib/util/effbot/org/gzip_consumer.py

bzrlib/util/effbot/org/http_client.py

bzrlib/util/effbot/org/http_manager.py

bzrlib/weave.py

bzrlib/weavefile.py

bzrlib/workingtree.py

contrib/bash/bzr

contrib/create_bzr_rollup.py

contrib/emacs

contrib/emacs/bzr-mode.el

contrib/fortune

contrib/pwclient.full

contrib/pwk

contrib/upload-bzr.dev

doc/revfile-annotation.txt

doc/split-join-files.txt

notes/inventory-v2-sample.xml

notes/inventory-v2.rnc

notes/revfile.txt

notes/schemas.xml

patches

patches/annotate3.patch

patches/annotate4.patch

patches/cache-remote-revisions.diff

patches/find-touching-from-seq.diff

patches/meta-data-in-inventory.patch

patches/ndiff.patch

patches/pending-merge.patch

patches/plugins-no-plugins.patch

patches/progress.diff

patches/symlink-support.patch

testsweet.py

tools

tools/convertfile.py

tools/convertinv.py

tools/history2revfiles.py

tools/history2weaves.py

tools/http_client.py

tools/testweave.py

tools/weavebench.py

tools/weavemerge.sh

tutorial.txt

files renamed:
contrib/bash/bzr.simple => contrib/bash/bzr

bzrlib/util/elementtree/ => elementtree/

bzrlib/util/urlgrabber/ => urlgrabber/

files modified:
.bzrignore

NEWS

README

TODO

build-api

bzrlib/__init__.py

bzrlib/add.py

bzrlib/branch.py

bzrlib/check.py

bzrlib/commands.py

bzrlib/diff.py

bzrlib/errors.py

bzrlib/help.py

bzrlib/info.py

bzrlib/inventory.py

bzrlib/mdiff.py

bzrlib/newinventory.py

bzrlib/osutils.py

bzrlib/remotebranch.py

bzrlib/revfile.py

bzrlib/revision.py

bzrlib/status.py

bzrlib/store.py

bzrlib/textui.py

bzrlib/trace.py

bzrlib/tree.py

bzrlib/xml.py

contrib/add-bzr-to-baz

contrib/zsh/_bzr

doc/formats.txt

doc/index.txt

doc/revfile.txt

doc/tagging.txt

doc/todo-from-arch.txt

setup.py

testbzr

urlgrabber/keepalive.py

doc/revfile.txt

The unit for compressed storage in bzr is a *revfile*, whose design
was suggested by Matt Mackall.

This document describes version 1 of the file, and has some notes on
what might be done in version 2.

Requirements
============

* Storage of files of at least a few hundred MB.

* Lossless in useful ways: we can extract a series of texts and write
  them back out without losing any information.

Design
======

the data file is much longer and only the relevant bits of it,
identified by the index file, need to be read.

This design is similar to that of Netscape `mail summary files`_, in
that there is a small index which can always be read into memory and
that quickly identifies where to look in the main file. They differ
in many other ways though, most particularly that the index is not
just a cache but holds precious data in its own right.

.. _`mail summary files`: http://www.jwz.org/doc/mailsum.html

In previous versions, the index file identified texts by their
SHA-1 digest. This was unsatisfying for two reasons. Firstly it
assumes that SHA-1 will not collide, which is not an assumption we
wish to make in long-lived files. Secondly for annotations we need
to be able to map from file versions back to a revision.

Texts are identified by the name of the revfile and a UUID
corresponding to the revision in which they were first
introduced. This means that given a text we can identify which
revision it belongs to, and annotations can use the index within the
revfile to identify where a region was first introduced.

We cannot identify texts by the integer revision number, because
that would limit us to only referring to a file in a particular
branch.

I'd like to just use the revision-id, but those are variable-length
strings, and I'd like the revfile index to be fixed-length and
relatively short. UUIDs can be encoded in binary as only 16 bytes.
Perhaps we should just use UUIDs for revisions and be done?
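
The size argument can be checked with a small sketch. It assumes
Python's standard ``uuid`` module and a randomly generated UUID
standing in for a revision; how bzr would actually assign UUIDs is
not specified here. The revision-id string is the one shown for this
revision and is used only for a length comparison.

```python
import uuid

# Variable-length revision-id (taken from this revision's metadata).
revision_id = "mbp@sourcefrog.net-20050505063818-3eb3260343878325"

# Assumption: a random UUID stands in for the revision's UUID.
rev_uuid = uuid.uuid4()
packed = rev_uuid.bytes            # fixed-length binary encoding

# The binary form is always 16 bytes, whatever the revision-id length.
print(len(packed), len(revision_id.encode("ascii")))

# The binary form round-trips back to the same UUID.
restored = uuid.UUID(bytes=packed)
```

A fixed 16-byte field keeps every index record the same size, which
is what allows a record to be located by simple multiplication.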

This is meant to scale to hold 100,000 revisions of a single file, by
which time the index file will be ~4.8MB and a bit big to read

Some of the reserved fields could be used to implement a (semi?)
balanced tree indexed by SHA1 so we can much more efficiently find the
index associated with a particular hash. For 100,000 revs we would be
able to find it in about 17 random reads, which is not too bad. On
the other hand that would compromise the append-only indexing, and
100,000 revs is a fairly extreme case.

This performs pretty well except when trying to calculate deltas of
really large files. For that the main thing would be to plug in

though perhaps that's too perverse?

Identifying texts
-----------------

In the current version, texts are identified by their SHA-1.

Skip-deltas

too many deltas to reproduce a particular file.

Annotations
-----------

Annotations indicate which revision of a file first inserted a line
(or region of bytes).

Given a string, we can write annotations on it like so: a sequence of
*(index, length)* pairs, giving the *index* of the revision which
introduced the next run of *length* bytes. The sum of the lengths
must equal the length of the string. For text files the regions will
typically fall on line breaks. This can be transformed in memory to
other structures, such as a list of *(index, content)* pairs.
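
The transformation from *(index, length)* runs to *(index, content)*
pairs can be sketched as follows. The function name
``expand_annotations`` and the sample data are hypothetical, chosen
only to illustrate the idea.

```python
def expand_annotations(text, runs):
    """Turn (index, length) runs into (index, content) pairs.

    `text` is the annotated byte string; `runs` is a list of
    (revision index, run length) tuples covering it from start to end.
    """
    # The run lengths must account for every byte of the text.
    if sum(length for _, length in runs) != len(text):
        raise ValueError("run lengths must sum to the length of the text")
    pairs = []
    pos = 0
    for index, length in runs:
        pairs.append((index, text[pos:pos + length]))
        pos += length
    return pairs


# Made-up example: lines 1 and 3 come from revision 0, line 2 from revision 2.
text = b"first\nsecond\nthird\n"
runs = [(0, 6), (2, 7), (0, 6)]
pairs = expand_annotations(text, runs)
# pairs == [(0, b"first\n"), (2, b"second\n"), (0, b"third\n")]
```

Storing only lengths keeps the on-disk form compact; the content
slices are recovered from the text itself when needed.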

When a line was inserted from a merge revision then the annotation for
that line should still be the source in the merged branch, rather than
just being the revision in which the merge took place.

They can cheaply be calculated when inserting a new text, but are
expensive to calculate after the fact because that requires searching
back through all previous text and all texts which were merged in. It
therefore seems sensible to calculate them once and store them.

To do this we need two operators which update an existing annotated
file:

A. Given an annotated file and a working text, update the annotation to
   mark regions inserted in the working file as new in this revision.

B. Given two annotated files, merge them to produce an annotated
   result. When there are conflicts, both texts should be included
   and annotated.

These may be repeated: after a merge there may be another merge, or
there may be manual fixups or conflict resolutions.

So what we require is, given a diff or a diff3 between two files, to
map the regions of bytes changed into corresponding updates to the
origin annotations.
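
Operator A can be sketched with a line-based diff. This is only an
illustration under stated assumptions: it uses Python's standard
``difflib.SequenceMatcher`` rather than whatever diff bzr would use,
it works on whole lines rather than byte regions, and the function
name ``annotate_new_text`` is hypothetical.

```python
from difflib import SequenceMatcher


def annotate_new_text(old_lines, old_origins, new_lines, new_index):
    """Return the origin index for each line of new_lines.

    Unchanged lines keep the origin they had in the old text; lines that
    were inserted or replaced are attributed to new_index.
    """
    if len(old_lines) != len(old_origins):
        raise ValueError("need exactly one origin per old line")
    new_origins = []
    matcher = SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Carry the existing annotations across unchanged.
            new_origins.extend(old_origins[i1:i2])
        else:
            # 'insert' and 'replace' contribute j2 - j1 new lines;
            # 'delete' contributes none.
            new_origins.extend([new_index] * (j2 - j1))
    return new_origins


# Made-up example: "b" is rewritten to "x" and "d" is appended in revision 2.
old_lines = ["a\n", "b\n", "c\n"]
old_origins = [0, 0, 1]
new_lines = ["a\n", "x\n", "c\n", "d\n"]
result = annotate_new_text(old_lines, old_origins, new_lines, 2)
# result == [0, 2, 1, 2]
```

Operator B would need a three-way comparison and is not covered by
this sketch.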

Annotations can also be delta-compressed; we only need to add new
annotation data when there is a text insertion.

(It is possible in a merge to have a change of annotation when
there is no text change, though this seems unlikely. This can
still be represented as a "pointless" delta, plus an update to the
annotations.)

Tools
-----

Format
======

Index file
----------

The index file is a series of fixed-length records::

    byte[16]    UUID of revision
    byte[20]    SHA-1 of expanded text (as binary, not hex)
    uint32      flags: 1=zlib compressed
    uint32      sequence number this is based on, or -1 for full text
    uint32      offset in text file of start
    uint32      length of compressed delta in text file
    uint32[3]   reserved

Total 64 bytes.

The header is also 64 bytes, for tidiness and easy calculation. For
this format the header must be ``bzr revfile v2\n`` padded with
``\xff`` to 64 bytes.

The first record after the header is index 0. A record's base index
must be less than its own index.
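
The record layout above can be expressed with Python's ``struct``
module. This is a sketch, not the bzr implementation: the byte order
(big-endian here) and the use of ``0xFFFFFFFF`` to encode "-1" in an
unsigned field are assumptions the text does not settle, and the
helper names are hypothetical.

```python
import hashlib
import struct
import uuid

# Assumption: big-endian, unpadded.  16 + 20 + 4*4 + 3*4 = 64 bytes.
RECORD = struct.Struct(">16s20sIIII3I")
HEADER = b"bzr revfile v2\n".ljust(64, b"\xff")
NO_BASE = 0xFFFFFFFF    # assumed encoding of "-1 for full text"
FLAG_ZLIB = 1


def pack_record(rev_uuid, text, flags, base, offset, length):
    """Build one index record; the three reserved words are zeroed."""
    sha = hashlib.sha1(text).digest()       # binary digest, not hex
    return RECORD.pack(rev_uuid.bytes, sha, flags, base, offset, length,
                       0, 0, 0)


def record_offset(index):
    """Byte offset of record `index` within the index file."""
    return len(HEADER) + index * RECORD.size


# Made-up record: a zlib-compressed full text stored at offset 0.
rec = pack_record(uuid.uuid4(), b"hello\n", FLAG_ZLIB, NO_BASE, 0, 14)
fields = RECORD.unpack(rec)
index_size = record_offset(100000)          # header plus 100,000 records
```

Because header and records are all 64 bytes, record *i* starts at
byte ``64 * (i + 1)``. With 64-byte records, 100,000 revisions come
to about 6.4MB of index; the ~4.8MB figure quoted earlier corresponds
to 48 bytes per record.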

The SHA-1 is redundant with the inventory but stored just as a check
on the compression methods and so that the file can be validated
without reference to any other information.

Each byte in the text file should be included by at most one delta.

Deltas
------

Deltas to the text are stored as a series of variable-length records::

    uint32      idx
    uint32      m
    uint32      n
    uint32      l
    byte[l]     new

This describes a change originally introduced in the revision
described by *idx* in the index.

This indicates that the region [m:n] of the input file should be
replaced by the text *new*. If m==n this is a pure insertion of l
bytes. If l==0 this is a pure deletion of (n-m) bytes.
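
Applying such a delta can be sketched as below. Several things here
are assumptions rather than part of the format description: the
big-endian byte order, that every *[m:n]* refers to offsets in the
original input, and that records appear in ascending, non-overlapping
order. The function names are hypothetical.

```python
import struct

# Assumption: big-endian fields in the order idx, m, n, l.
HEAD = struct.Struct(">IIII")


def parse_delta(data):
    """Yield (idx, m, n, new) for each record in a delta byte string."""
    pos = 0
    while pos < len(data):
        idx, m, n, l = HEAD.unpack_from(data, pos)
        pos += HEAD.size
        yield idx, m, n, data[pos:pos + l]
        pos += l


def apply_delta(base, data):
    """Replace each region [m:n] of `base` with the record's new text."""
    out = []
    last = 0
    for idx, m, n, new in parse_delta(data):
        if not (last <= m <= n <= len(base)):
            raise ValueError("regions must be ordered and within the input")
        out.append(base[last:m])    # untouched bytes before the region
        out.append(new)             # replacement (empty for a deletion)
        last = n
    out.append(base[last:])
    return b"".join(out)


# Made-up example: replace "hello" and insert "!" before the newline.
base = b"hello world\n"
delta = (HEAD.pack(1, 0, 5, 7) + b"goodbye" +
         HEAD.pack(1, 11, 11, 1) + b"!")
patched = apply_delta(base, delta)
# patched == b"goodbye world!\n"
```

The *idx* field is not needed to rebuild the text; it carries the
annotation, which is why it is yielded but unused in ``apply_delta``.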

Open issues
===========

be fixed by creating the fixed repository as a separate branch, into
which only the preserved revisions are exported.

* Should annotations also indicate where text was deleted?

* This design calls for only one annotation per line, which seems
  standard. However, this is lacking in at least two cases:

  - Lines which originate in the same way in more than one revision,
    through being independently introduced. In this case we would
    apparently have to make an arbitrary choice; I suppose branches
    could prefer to assume lines originated in their own history.

  - It might be useful to directly indicate which mergers included
    which lines. We do have that information in the revision history
    though, so there seems no need to store it for every line.

* Should we also store full-texts as a transitional step?

* Storing the annotations with the text is reasonably simple and
  compact, but means that we always need to process the annotation
  structure even when we only want the text. In particular it means
  that full-texts cannot simply be copied out but must rather be
  composed from chunks. That seems inefficient since it is probably
  common to only want the text.
