321
by Martin Pool
doc: revfile storage and related things |
1 |
******** |
2 |
Revfiles |
|
3 |
******** |
|
4 |
||
5 |
The unit for compressed storage in bzr is a *revfile*, whose design |
|
6 |
was suggested by Matt Mackall. |
|
7 |
||
8 |
||
9 |
Requirements |
|
10 |
============ |
|
11 |
||
12 |
Compressed storage is a tradeoff between several goals: |
|
13 |
||
14 |
* Reasonably compact storage of long histories. |
|
15 |
||
16 |
* Robustness and simplicity. |
|
17 |
||
18 |
* Fast extraction of versions and addition of new versions (preferably |
|
19 |
without rewriting the whole file, or reading the whole history.) |
|
20 |
||
21 |
* Fast and precise annotations. |
|
22 |
||
23 |
* Storage of files of at least a few hundred MB. |
|
24 |
||
25 |
||
26 |
Design |
|
27 |
====== |
|
28 |
||
29 |
revfiles store the history of a single logical file, which is |
|
30 |
identified in bzr by its file-id. In this sense they are similar to |
|
31 |
an RCS or CVS ``,v`` file or an SCCS sfile. |
|
32 |
||
33 |
Each state of the file is called a *text*. |
|
34 |
||
35 |
Renaming, adding and deleting this file is handled at a higher level |
|
36 |
by the inventory system, and is outside the scope of the revfile. The |
|
37 |
revfile name is typically based on the file id which is itself |
|
38 |
typically based on the name the file had when it was first added. But |
|
39 |
this is purely cosmetic. |
|
40 |
||
41 |
For example a file now called ``frob.c`` may have the id |
|
42 |
``frobber.c-12873`` because it was originally called |
|
43 |
``frobber.c``. Its texts are kept in the revfile |
|
44 |
``.bzr/revfiles/frobber.c-12873.revs``. |
|
45 |
||
46 |
When the file is deleted from the inventory the revfile does not |
|
47 |
change. It's just not used in reproducing trees from that point |
|
48 |
onwards. |
|
49 |
||
50 |
The revfile does not record the date when the text was added, a commit |
|
51 |
message, properties, or any other metadata. That is handled in the |
|
52 |
higher-level revision history. |
|
53 |
||
54 |
Inventories and other metadata files that vary from one version to the |
|
55 |
next can themselves be stored in revfiles. |
|
56 |
||
57 |
revfiles store files as simple byte streams, with no consideration of |
|
58 |
translating character sets, line endings, or keywords. Those are also |
|
59 |
handled at a higher level. However, the revfile may make use of |
|
60 |
knowledge that a file is line-based in generating a diff. |
|
61 |
||
62 |
(The Python builtin difflib is too slow when generating a purely |
|
63 |
byte-by-byte delta so we always make a line-by-line diff; when this |
|
64 |
is fixed it may be feasible to use line-by-line diffs for all |
|
65 |
files.) |
|
66 |
||
67 |
Files whose text does not change from one revision to the next are |
|
68 |
stored as just a single text in the revfile. This can happen even if |
|
69 |
the file was renamed or other properties were changed in the |
|
325
by Martin Pool
- more revfile design notes |
70 |
inventory. |
71 |
||
72 |
The revfile is held on disk as two files: an *index* and a *data* |
|
73 |
file. The index file is short and always read completely into memory; |
|
74 |
the data file is much longer and only the relevant bits of it, |
|
75 |
identified by the index file, need to be read. |
|
76 |
||
77 |
In previous versions, the index file identified texts by their |
|
78 |
SHA-1 digest. This was unsatisfying for two reasons. Firstly it |
|
79 |
assumes that SHA-1 will not collide, which is not an assumption we |
|
80 |
wish to make in long-lived files. Secondly for annotations we need |
|
81 |
to be able to map from file versions back to a revision. |
|
82 |
||
83 |
Texts are identified by the name of the revfile and a UUID |
|
84 |
corresponding to the first revision in which they were first |
|
85 |
introduced. This means that given a text we can identify which |
|
86 |
revision it belongs to, and annotations can use the index within the |
|
87 |
revfile to identify where a region was first introduced. |
|
88 |
||
89 |
We cannot identify texts by the integer revision number, because |
|
90 |
that would limit us to only referring to a file in a particular |
|
91 |
branch. |
|
92 |
||
93 |
I'd like to just use the revision-id, but those are variable-length |
|
94 |
strings, and I'd like the revfile index to be fixed-length and |
|
95 |
relatively short. UUIDs can be encoded in binary as only 16 bytes. |
|
96 |
Perhaps we should just use UUIDs for revisions and be done? |
|
97 |
||
98 |
This is meant to scale to hold 100,000 revisions of a single file, by |
|
99 |
which time the index file will be ~4.8MB and a bit big to read |
|
100 |
sequentially. |
|
101 |
||
102 |
Some of the reserved fields could be used to implement a (semi?) |
|
103 |
balanced tree indexed by SHA1 so we can much more efficiently find the |
|
104 |
index associated with a particular hash. For 100,000 revs we would be |
|
105 |
able to find it in about 17 random reads, which is not too bad. |
|
106 |
||
107 |
This performs pretty well except when trying to calculate deltas of |
|
108 |
really large files. For that the main thing would be to plug in |
|
109 |
something faster than difflib, which is after all pure Python. |
|
110 |
Another approach is to just store the gzipped full text of big files, |
|
111 |
though perhaps that's too perverse? |
|
112 |
||
113 |
||
321
by Martin Pool
doc: revfile storage and related things |
114 |
|
115 |
||
116 |
Skip-deltas |
|
117 |
----------- |
|
118 |
||
119 |
Because the basis of a delta does not need to be the text's logical |
|
325
by Martin Pool
- more revfile design notes |
120 |
predecessor, we can adjust the deltas to avoid ever needing to apply |
121 |
too many deltas to reproduce a particular file. |
|
321
by Martin Pool
doc: revfile storage and related things |
122 |
|
123 |
||
124 |
Annotations |
|
125 |
----------- |
|
126 |
||
325
by Martin Pool
- more revfile design notes |
127 |
Annotations indicate which revision of a file first inserted a line |
128 |
(or region of bytes). |
|
129 |
||
130 |
Given a string, we can write annotations on it like so: a sequence of |
|
131 |
*(index, length)* pairs, giving the *index* of the revision which |
|
132 |
introduced the next run of *length* bytes. The sum of the lengths |
|
133 |
must equal the length of the string. For text files the regions will |
|
134 |
typically fall on line breaks. This can be transformed in memory to |
|
135 |
other structures, such as a list of *(index, content)* pairs. |
|
136 |
||
137 |
When a line was inserted from a merge revision then the annotation for |
|
138 |
that line should still be the source in the merged branch, rather than |
|
139 |
just being the revision in which the merge took place. |
|
140 |
||
141 |
They can cheaply be calculated when inserting a new text, but are |
|
142 |
expensive to calculate after the fact because that requires searching |
|
143 |
back through all previous text and all texts which were merged in. It |
|
144 |
therefore seems sensible to calculate them once and store them. |
|
145 |
||
146 |
To do this we need two operators which update an existing annotated |
|
147 |
file: |
|
148 |
||
149 |
A. Given an annotated file and a working text, update the annotation to |
|
150 |
mark regions inserted in the working file as new in this revision. |
|
151 |
||
152 |
B. Given two annotated files, merge them to produce an annotated |
|
153 |
result. When there are conflicts, both texts should be included |
|
154 |
and annotated. |
|
155 |
||
156 |
These may be repeated: after a merge there may be another merge, or |
|
157 |
there may be manual fixups or conflict resolutions. |
|
158 |
||
159 |
So what we require is given a diff or a diff3 between two files, map |
|
160 |
the regions of bytes changed into corresponding updates to the origin |
|
161 |
annotations. |
|
321
by Martin Pool
doc: revfile storage and related things |
162 |
|
328
by Martin Pool
- more documentation of revfile+annotation |
163 |
Annotations can also be delta-compressed; we only need to add new |
164 |
annotation data when there is a text insertion. |
|
165 |
||
166 |
(It is possible in a merge to have a change of annotation when |
|
167 |
there is no text change, though this seems unlikely. This can |
|
168 |
still be represented as a "pointless" delta, plus an update to the |
|
169 |
annotations.) |
|
170 |
||
171 |
||
172 |
||
173 |
Tools |
|
174 |
----- |
|
175 |
||
176 |
The revfile module can be invoked as a program to give low-level |
|
177 |
access for data recovery, debugging, etc. |
|
178 |
||
179 |
||
180 |
||
181 |
Format |
|
182 |
====== |
|
183 |
||
184 |
Index file |
|
185 |
---------- |
|
186 |
||
187 |
The index file is a series of fixed-length records:: |
|
188 |
||
189 |
byte[16] UUID of revision |
|
190 |
byte[20] SHA-1 of expanded text (as binary, not hex) |
|
191 |
uint32 flags: 1=zlib compressed |
|
192 |
uint32 sequence number this is based on, or -1 for full text |
|
193 |
uint32 offset in text file of start |
|
194 |
uint32 length of compressed delta in text file |
|
195 |
uint32[3] reserved |
|
196 |
||
197 |
Total 64 bytes. |
|
198 |
||
199 |
The header is also 64 bytes, for tidyness and easy calculation. For |
|
200 |
this format the header must be ``bzr revfile v2\n`` padded with |
|
201 |
``\xff`` to 64 bytes. |
|
202 |
||
203 |
The first record after the header is index 0. A record's base index |
|
204 |
must be less than its own index. |
|
205 |
||
206 |
The SHA-1 is redundant with the inventory but stored just as a check |
|
207 |
on the compression methods and so that the file can be validated |
|
208 |
without reference to any other information. |
|
209 |
||
210 |
Each byte in the text file should be included by at most one delta. |
|
211 |
||
212 |
||
213 |
Deltas |
|
214 |
------ |
|
215 |
||
216 |
Deltas to the text are stored as a series of variable-length records:: |
|
217 |
||
218 |
uint32 idx |
|
219 |
uint32 m |
|
220 |
uint32 n |
|
221 |
uint32 l |
|
222 |
byte[l] new |
|
223 |
||
224 |
This describes a change originally introduced in the revision |
|
225 |
described by *idx* in the index. |
|
226 |
||
227 |
This indicates that the region [m:n] of the input file should be |
|
228 |
replaced by the text *new*. If m==n this is a pure insertion of l |
|
229 |
bytes. If l==0 this is a pure deletion of (n-m) bytes. |
|
230 |
||
231 |
||
321
by Martin Pool
doc: revfile storage and related things |
232 |
|
233 |
Open issues |
|
234 |
=========== |
|
235 |
||
236 |
* revfiles use unsigned 32-bit integers both in diffs and the index. |
|
237 |
This should be more than enough for any reasonable source file but |
|
238 |
perhaps not enough for large binaries that are frequently committed. |
|
239 |
||
240 |
Perhaps for those files there should be an option to continue to use |
|
241 |
the text-store. There is unlikely to be any benefit in holding |
|
242 |
deltas between them, and deltas will anyhow be hard to calculate. |
|
243 |
||
244 |
* The append-only design does not allow for destroying committed data, |
|
245 |
as when confidential information is accidentally added. That could |
|
246 |
be fixed by creating the fixed repository as a separate branch, into |
|
247 |
which only the preserved revisions are exported. |
|
325
by Martin Pool
- more revfile design notes |
248 |
|
249 |
* Should annotations also indicate where text was deleted? |
|
250 |
||
251 |
* This design calls for only one annotation per line, which seems |
|
252 |
standard. However, this is lacking in at least two cases: |
|
253 |
||
254 |
- Lines which originate in the same way in more than one revision, |
|
255 |
through being independently introduced. In this case we would |
|
256 |
apparently have to make an arbitrary choice; I suppose branches |
|
257 |
could prefer to assume lines originated in their own history. |
|
258 |
||
259 |
- It might be useful to directly indicate which mergers included |
|
260 |
which lines. We do have that information in the revision history |
|
261 |
though, so there seems no need to store it for every line. |
|
328
by Martin Pool
- more documentation of revfile+annotation |
262 |
|
263 |
* Should we also store full-texts as a transitional step? |
|
264 |
||
265 |
* Storing the annotations with the text is reasonably simple and |
|
266 |
compact, but means that we always need to process the annotation |
|
267 |
structure even when we only want the text. In particular it means |
|
268 |
that full-texts cannot just simply be copied out but rather composed |
|
269 |
from chunks. That seems inefficient since it is probably common to |
|
270 |
only want the text. |
|
329
by Martin Pool
- refactor command functions into command classes |
271 |