131
Extension to store annotations
132
==============================
134
We might extend the revfile format in a future version to also store
135
annotations. *This is not implemented yet.*
137
In previous versions, the index file identified texts by their
138
SHA-1 digest. This was unsatisfying for two reasons. Firstly it
139
assumes that SHA-1 will not collide, which is not an assumption we
140
wish to make in long-lived files. Secondly for annotations we need
141
to be able to map from file versions back to a revision.
143
Texts are identified by the name of the revfile and a UUID
144
corresponding to the first revision in which they were first
145
introduced. This means that given a text we can identify which
146
revision it belongs to, and annotations can use the index within the
147
revfile to identify where a region was first introduced.
149
We cannot identify texts by the integer revision number, because
150
that would limit us to only referring to a file in a particular
153
I'd like to just use the revision-id, but those are variable-length
154
strings, and I'd like the revfile index to be fixed-length and
155
relatively short. UUIDs can be encoded in binary as only 16 bytes.
156
Perhaps we should just use UUIDs for revisions and be done?
163
Annotations indicate which revision of a file first inserted a line
164
(or region of bytes).
166
Given a string, we can write annotations on it like so: a sequence of
167
*(index, length)* pairs, giving the *index* of the revision which
168
introduced the next run of *length* bytes. The sum of the lengths
169
must equal the length of the string. For text files the regions will
170
typically fall on line breaks. This can be transformed in memory to
171
other structures, such as a list of *(index, content)* pairs.
173
When a line was inserted from a merge revision then the annotation for
174
that line should still be the source in the merged branch, rather than
175
just being the revision in which the merge took place.
177
They can cheaply be calculated when inserting a new text, but are
178
expensive to calculate after the fact because that requires searching
179
back through all previous text and all texts which were merged in. It
180
therefore seems sensible to calculate them once and store them.
182
To do this we need two operators which update an existing annotated
185
A. Given an annotated file and a working text, update the annotation to
186
mark regions inserted in the working file as new in this revision.
188
B. Given two annotated files, merge them to produce an annotated
189
result. When there are conflicts, both texts should be included
192
These may be repeated: after a merge there may be another merge, or
193
there may be manual fixups or conflict resolutions.
195
So what we require is given a diff or a diff3 between two files, map
196
the regions of bytes changed into corresponding updates to the origin
199
Annotations can also be delta-compressed; we only need to add new
200
annotation data when there is a text insertion.
202
(It is possible in a merge to have a change of annotation when
203
there is no text change, though this seems unlikely. This can
204
still be represented as a "pointless" delta, plus an update to the
210
In a proposed (not implemented) storage with annotations, the index
211
file is a series of fixed-length records::
213
byte[16] UUID of revision
214
byte[20] SHA-1 of expanded text (as binary, not hex)
215
uint32 flags: 1=zlib compressed
216
uint32 sequence number this is based on, or -1 for full text
217
uint32 offset in text file of start
218
uint32 length of compressed delta in text file
223
The header is also 64 bytes, for tidyness and easy calculation. For
224
this format the header must be ``bzr revfile v2\n`` padded with
225
``\xff`` to 64 bytes.
227
The first record after the header is index 0. A record's base index
228
must be less than its own index.
230
The SHA-1 is redundant with the inventory but stored just as a check
231
on the compression methods and so that the file can be validated
232
without reference to any other information.
234
Each byte in the text file should be included by at most one delta.
240
In a proposed (not implemented) storage with annotations, deltas to
241
the text are stored as a series of variable-length records::
249
This describes a change originally introduced in the revision
250
described by *idx* in the index.
252
This indicates that the region [m:n] of the input file should be
253
replaced by the text *new*. If m==n this is a pure insertion of l
254
bytes. If l==0 this is a pure deletion of (n-m) bytes.
273
145
be fixed by creating the fixed repository as a separate branch, into
274
146
which only the preserved revisions are exported.
276
* Should annotations also indicate where text was deleted?
278
* This design calls for only one annotation per line, which seems
279
standard. However, this is lacking in at least two cases:
281
- Lines which originate in the same way in more than one revision,
282
through being independently introduced. In this case we would
283
apparently have to make an arbitrary choice; I suppose branches
284
could prefer to assume lines originated in their own history.
286
- It might be useful to directly indicate which mergers included
287
which lines. We do have that information in the revision history
288
though, so there seems no need to store it for every line.
290
148
* Should we also store full-texts as a transitional step?
292
* Storing the annotations with the text is reasonably simple and
293
compact, but means that we always need to process the annotation
294
structure even when we only want the text. In particular it means
295
that full-texts cannot just simply be copied out but rather composed
296
from chunks. That seems inefficient since it is probably common to