391
by Martin Pool
- split out notes on storing annotations in revfiles |
1 |
============================== |
2 |
Extension to store annotations |
|
3 |
============================== |
|
4 |
||
5 |
We might extend the revfile format in a future version to also store |
|
6 |
annotations. *This is not implemented yet.* |
|
7 |
||
8 |
In previous versions, the index file identified texts by their |
|
9 |
SHA-1 digest. This was unsatisfying for two reasons. Firstly it |
|
10 |
assumes that SHA-1 will not collide, which is not an assumption we |
|
11 |
wish to make in long-lived files. Secondly for annotations we need |
|
12 |
to be able to map from file versions back to a revision. |
|
13 |
||
14 |
Texts are identified by the name of the revfile and a UUID |
|
15 |
corresponding to the first revision in which they were first |
|
16 |
introduced. This means that given a text we can identify which |
|
17 |
revision it belongs to, and annotations can use the index within the |
|
18 |
revfile to identify where a region was first introduced. |
|
19 |
||
20 |
We cannot identify texts by the integer revision number, because |
|
21 |
that would limit us to only referring to a file in a particular |
|
22 |
branch. |
|
23 |
||
24 |
I'd like to just use the revision-id, but those are variable-length |
|
25 |
strings, and I'd like the revfile index to be fixed-length and |
|
26 |
relatively short. UUIDs can be encoded in binary as only 16 bytes. |
|
27 |
Perhaps we should just use UUIDs for revisions and be done? |
|
28 |
||
29 |
||
30 |
||
31 |
Annotations |
|
32 |
----------- |
|
33 |
||
34 |
Annotations indicate which revision of a file first inserted a line |
|
35 |
(or region of bytes). |
|
36 |
||
37 |
Given a string, we can write annotations on it like so: a sequence of |
|
38 |
*(index, length)* pairs, giving the *index* of the revision which |
|
39 |
introduced the next run of *length* bytes. The sum of the lengths |
|
40 |
must equal the length of the string. For text files the regions will |
|
41 |
typically fall on line breaks. This can be transformed in memory to |
|
42 |
other structures, such as a list of *(index, content)* pairs. |
|
43 |
||
44 |
When a line was inserted from a merge revision then the annotation for |
|
45 |
that line should still be the source in the merged branch, rather than |
|
46 |
just being the revision in which the merge took place. |
|
47 |
||
48 |
They can cheaply be calculated when inserting a new text, but are |
|
49 |
expensive to calculate after the fact because that requires searching |
|
50 |
back through all previous text and all texts which were merged in. It |
|
51 |
therefore seems sensible to calculate them once and store them. |
|
52 |
||
53 |
To do this we need two operators which update an existing annotated |
|
54 |
file: |
|
55 |
||
56 |
A. Given an annotated file and a working text, update the annotation to |
|
57 |
mark regions inserted in the working file as new in this revision. |
|
58 |
||
59 |
B. Given two annotated files, merge them to produce an annotated |
|
60 |
result. When there are conflicts, both texts should be included |
|
61 |
and annotated. |
|
62 |
||
63 |
These may be repeated: after a merge there may be another merge, or |
|
64 |
there may be manual fixups or conflict resolutions. |
|
65 |
||
66 |
So what we require is given a diff or a diff3 between two files, map |
|
67 |
the regions of bytes changed into corresponding updates to the origin |
|
68 |
annotations. |
|
69 |
||
70 |
Annotations can also be delta-compressed; we only need to add new |
|
71 |
annotation data when there is a text insertion. |
|
72 |
||
73 |
(It is possible in a merge to have a change of annotation when |
|
74 |
there is no text change, though this seems unlikely. This can |
|
75 |
still be represented as a "pointless" delta, plus an update to the |
|
76 |
annotations.) |
|
77 |
||
78 |
Index file |
|
79 |
---------- |
|
80 |
||
81 |
In a proposed (not implemented) storage with annotations, the index |
|
82 |
file is a series of fixed-length records:: |
|
83 |
||
84 |
byte[16] UUID of revision |
|
85 |
byte[20] SHA-1 of expanded text (as binary, not hex) |
|
86 |
uint32 flags: 1=zlib compressed |
|
87 |
uint32 sequence number this is based on, or -1 for full text |
|
88 |
uint32 offset in text file of start |
|
89 |
uint32 length of compressed delta in text file |
|
90 |
uint32[3] reserved |
|
91 |
||
92 |
Total 64 bytes. |
|
93 |
||
94 |
The header is also 64 bytes, for tidyness and easy calculation. For |
|
95 |
this format the header must be ``bzr revfile v2\n`` padded with |
|
96 |
``\xff`` to 64 bytes. |
|
97 |
||
98 |
The first record after the header is index 0. A record's base index |
|
99 |
must be less than its own index. |
|
100 |
||
101 |
The SHA-1 is redundant with the inventory but stored just as a check |
|
102 |
on the compression methods and so that the file can be validated |
|
103 |
without reference to any other information. |
|
104 |
||
105 |
Each byte in the text file should be included by at most one delta. |
|
106 |
||
107 |
||
108 |
Deltas |
|
109 |
------ |
|
110 |
||
111 |
In a proposed (not implemented) storage with annotations, deltas to |
|
112 |
the text are stored as a series of variable-length records:: |
|
113 |
||
114 |
uint32 idx |
|
115 |
uint32 m |
|
116 |
uint32 n |
|
117 |
uint32 l |
|
118 |
byte[l] new |
|
119 |
||
120 |
This describes a change originally introduced in the revision |
|
121 |
described by *idx* in the index. |
|
122 |
||
123 |
This indicates that the region [m:n] of the input file should be |
|
124 |
replaced by the text *new*. If m==n this is a pure insertion of l |
|
125 |
bytes. If l==0 this is a pure deletion of (n-m) bytes. |
|
126 |
||
127 |
||
128 |
||
129 |
||
130 |
||
131 |
Open issues |
|
132 |
----------- |
|
133 |
||
134 |
||
135 |
* Storing the annotations with the text is reasonably simple and |
|
136 |
compact, but means that we always need to process the annotation |
|
137 |
structure even when we only want the text. In particular it means |
|
138 |
that full-texts cannot just simply be copied out but rather composed |
|
139 |
from chunks. That seems inefficient since it is probably common to |
|
140 |
only want the text. |
|
141 |
||
142 |
* Should annotations also indicate where text was deleted? |
|
143 |
||
144 |
* This design calls for only one annotation per line, which seems |
|
145 |
standard. However, this is lacking in at least two cases: |
|
146 |
||
147 |
- Lines which originate in the same way in more than one revision, |
|
148 |
through being independently introduced. In this case we would |
|
149 |
apparently have to make an arbitrary choice; I suppose branches |
|
150 |
could prefer to assume lines originated in their own history. |
|
151 |
||
152 |
- It might be useful to directly indicate which mergers included |
|
153 |
which lines. We do have that information in the revision history |
|
154 |
though, so there seems no need to store it for every line. |
|
155 |