5
5
The unit for compressed storage in bzr is a *revfile*, whose design
6
6
was suggested by Matt Mackall.
8
This document describes version 1 of the file, and has some notes on
9
what might be done in version 2.
80
74
the data file is much longer and only the relevant bits of it,
81
75
identified by the index file, need to be read.
83
This design is similar to that of Netscape `mail summary files`_, in
84
that there is a small index which can always be read into memory and
85
that quickly identifies where to look in the main file. They differ
86
in many other ways though, most particularly that the index is not
87
just a cache but holds precious data in its own right.
89
.. _`mail summary files`: http://www.jwz.org/doc/mailsum.html
77
In previous versions, the index file identified texts by their
78
SHA-1 digest. This was unsatisfying for two reasons. Firstly it
79
assumes that SHA-1 will not collide, which is not an assumption we
80
wish to make in long-lived files. Secondly for annotations we need
81
to be able to map from file versions back to a revision.
83
Texts are identified by the name of the revfile and a UUID
84
corresponding to the first revision in which they were first
85
introduced. This means that given a text we can identify which
86
revision it belongs to, and annotations can use the index within the
87
revfile to identify where a region was first introduced.
89
We cannot identify texts by the integer revision number, because
90
that would limit us to only referring to a file in a particular
93
I'd like to just use the revision-id, but those are variable-length
94
strings, and I'd like the revfile index to be fixed-length and
95
relatively short. UUIDs can be encoded in binary as only 16 bytes.
96
Perhaps we should just use UUIDs for revisions and be done?
91
98
This is meant to scale to hold 100,000 revisions of a single file, by
92
99
which time the index file will be ~4.8MB and a bit big to read
95
102
Some of the reserved fields could be used to implement a (semi?)
96
103
balanced tree indexed by SHA1 so we can much more efficiently find the
97
104
index associated with a particular hash. For 100,000 revs we would be
98
able to find it in about 17 random reads, which is not too bad. On
99
the other hand that would compromise the append-only indexing, and
100
100,000 revs is a fairly extreme case.
105
able to find it in about 17 random reads, which is not too bad.
102
107
This performs pretty well except when trying to calculate deltas of
103
108
really large files. For that the main thing would be to plug in
120
121
too many deltas to reproduce a particular file.
127
Annotations indicate which revision of a file first inserted a line
128
(or region of bytes).
130
Given a string, we can write annotations on it like so: a sequence of
131
*(index, length)* pairs, giving the *index* of the revision which
132
introduced the next run of *length* bytes. The sum of the lengths
133
must equal the length of the string. For text files the regions will
134
typically fall on line breaks. This can be transformed in memory to
135
other structures, such as a list of *(index, content)* pairs.
137
When a line was inserted from a merge revision then the annotation for
138
that line should still be the source in the merged branch, rather than
139
just being the revision in which the merge took place.
141
They can cheaply be calculated when inserting a new text, but are
142
expensive to calculate after the fact because that requires searching
143
back through all previous text and all texts which were merged in. It
144
therefore seems sensible to calculate them once and store them.
146
To do this we need two operators which update an existing annotated
149
A. Given an annotated file and a working text, update the annotation to
150
mark regions inserted in the working file as new in this revision.
152
B. Given two annotated files, merge them to produce an annotated
153
result. When there are conflicts, both texts should be included
156
These may be repeated: after a merge there may be another merge, or
157
there may be manual fixups or conflict resolutions.
159
So what we require is given a diff or a diff3 between two files, map
160
the regions of bytes changed into corresponding updates to the origin
163
Annotations can also be delta-compressed; we only need to add new
164
annotation data when there is a text insertion.
166
(It is possible in a merge to have a change of annotation when
167
there is no text change, though this seems unlikely. This can
168
still be represented as a "pointless" delta, plus an update to the
187
The index file is a series of fixed-length records::
189
byte[16] UUID of revision
190
byte[20] SHA-1 of expanded text (as binary, not hex)
191
uint32 flags: 1=zlib compressed
192
uint32 sequence number this is based on, or -1 for full text
193
uint32 offset in text file of start
194
uint32 length of compressed delta in text file
199
The header is also 64 bytes, for tidyness and easy calculation. For
200
this format the header must be ``bzr revfile v2\n`` padded with
201
``\xff`` to 64 bytes.
203
The first record after the header is index 0. A record's base index
204
must be less than its own index.
206
The SHA-1 is redundant with the inventory but stored just as a check
207
on the compression methods and so that the file can be validated
208
without reference to any other information.
210
Each byte in the text file should be included by at most one delta.
216
Deltas to the text are stored as a series of variable-length records::
224
This describes a change originally introduced in the revision
225
described by *idx* in the index.
227
This indicates that the region [m:n] of the input file should be
228
replaced by the text *new*. If m==n this is a pure insertion of l
229
bytes. If l==0 this is a pure deletion of (n-m) bytes.
145
246
be fixed by creating the fixed repository as a separate branch, into
146
247
which only the preserved revisions are exported.
249
* Should annotations also indicate where text was deleted?
251
* This design calls for only one annotation per line, which seems
252
standard. However, this is lacking in at least two cases:
254
- Lines which originate in the same way in more than one revision,
255
through being independently introduced. In this case we would
256
apparently have to make an arbitrary choice; I suppose branches
257
could prefer to assume lines originated in their own history.
259
- It might be useful to directly indicate which mergers included
260
which lines. We do have that information in the revision history
261
though, so there seems no need to store it for every line.
148
263
* Should we also store full-texts as a transitional step?
265
* Storing the annotations with the text is reasonably simple and
266
compact, but means that we always need to process the annotation
267
structure even when we only want the text. In particular it means
268
that full-texts cannot just simply be copied out but rather composed
269
from chunks. That seems inefficient since it is probably common to