1185.1.29
by Robert Collins
merge merge tweaks from aaron, which includes latest .dev |
1 |
******** |
2 |
Revfiles |
|
3 |
******** |
|
4 |
||
5 |
The unit for compressed storage in bzr is a *revfile*, whose design |
|
6 |
was suggested by Matt Mackall. |
|
7 |
||
8 |
This document describes version 1 of the file, and has some notes on |
|
9 |
what might be done in version 2. |
|
10 |
||
11 |
||
12 |
Requirements |
|
13 |
============ |
|
14 |
||
15 |
Compressed storage is a tradeoff between several goals: |
|
16 |
||
17 |
* Reasonably compact storage of long histories. |
|
18 |
||
19 |
* Robustness and simplicity. |
|
20 |
||
21 |
* Fast extraction of versions and addition of new versions (preferably |
|
22 |
without rewriting the whole file, or reading the whole history.) |
|
23 |
||
24 |
* Fast and precise annotations. |
|
25 |
||
26 |
* Storage of files of at least a few hundred MB. |
|
27 |
||
28 |
* Lossless in useful ways: we can extract a series of texts and write |
|
29 |
them back out without losing any information. |
|
30 |
||
31 |
||
32 |
Design |
|
33 |
====== |
|
34 |
||
35 |
revfiles store the history of a single logical file, which is |
|
36 |
identified in bzr by its file-id. In this sense they are similar to |
|
37 |
an RCS or CVS ``,v`` file or an SCCS sfile. |
|
38 |
||
39 |
Each state of the file is called a *text*. |
|
40 |
||
41 |
Renaming, adding and deleting this file is handled at a higher level |
|
42 |
by the inventory system, and is outside the scope of the revfile. The |
|
43 |
revfile name is typically based on the file id which is itself |
|
44 |
typically based on the name the file had when it was first added. But |
|
45 |
this is purely cosmetic. |
|
46 |
||
47 |
For example a file now called ``frob.c`` may have the id |
|
48 |
``frobber.c-12873`` because it was originally called |
|
49 |
``frobber.c``. Its texts are kept in the revfile |
|
50 |
``.bzr/revfiles/frobber.c-12873.revs``. |
|
51 |
||
52 |
When the file is deleted from the inventory the revfile does not |
|
53 |
change. It's just not used in reproducing trees from that point |
|
54 |
onwards. |
|
55 |
||
56 |
The revfile does not record the date when the text was added, a commit |
|
57 |
message, properties, or any other metadata. That is handled in the |
|
58 |
higher-level revision history. |
|
59 |
||
60 |
Inventories and other metadata files that vary from one version to the |
|
61 |
next can themselves be stored in revfiles. |
|
62 |
||
63 |
revfiles store files as simple byte streams, with no consideration of |
|
64 |
translating character sets, line endings, or keywords. Those are also |
|
65 |
handled at a higher level. However, the revfile may make use of |
|
66 |
knowledge that a file is line-based in generating a diff. |
|
67 |
||
68 |
(The Python builtin difflib is too slow when generating a purely |
|
69 |
byte-by-byte delta so we always make a line-by-line diff; when this |
|
70 |
is fixed it may be feasible to use line-by-line diffs for all |
|
71 |
files.) |
|
72 |
||
73 |
Files whose text does not change from one revision to the next are |
|
74 |
stored as just a single text in the revfile. This can happen even if |
|
75 |
the file was renamed or other properties were changed in the |
|
76 |
inventory. |
|
77 |
||
78 |
The revfile is held on disk as two files: an *index* and a *data* |
|
79 |
file. The index file is short and always read completely into memory; |
|
80 |
the data file is much longer and only the relevant bits of it, |
|
81 |
identified by the index file, need to be read. |
|
82 |
||
83 |
This design is similar to that of Netscape `mail summary files`_, in |
|
84 |
that there is a small index which can always be read into memory and |
|
85 |
that quickly identifies where to look in the main file. They differ |
|
86 |
in many other ways though, most particularly that the index is not |
|
87 |
just a cache but holds precious data in its own right. |
|
88 |
||
89 |
.. _`mail summary files`: http://www.jwz.org/doc/mailsum.html |
|
90 |
||
91 |
This is meant to scale to hold 100,000 revisions of a single file, by |
|
92 |
which time the index file will be ~4.8MB and a bit big to read |
|
93 |
sequentially. |
|
94 |
||
95 |
Some of the reserved fields could be used to implement a (semi?) |
|
96 |
balanced tree indexed by SHA1 so we can much more efficiently find the |
|
97 |
index associated with a particular hash. For 100,000 revs we would be |
|
98 |
able to find it in about 17 random reads, which is not too bad. On |
|
99 |
the other hand that would compromise the append-only indexing, and |
|
100 |
100,000 revs is a fairly extreme case. |
|
101 |
||
102 |
This performs pretty well except when trying to calculate deltas of |
|
103 |
really large files. For that the main thing would be to plug in |
|
104 |
something faster than difflib, which is after all pure Python. |
|
105 |
Another approach is to just store the gzipped full text of big files, |
|
106 |
though perhaps that's too perverse? |
|
107 |
||
108 |
||
109 |
Identifying texts |
|
110 |
----------------- |
|
111 |
||
112 |
In the current version, texts are identified by their SHA-1. |
|
113 |
||
114 |
||
115 |
Skip-deltas |
|
116 |
----------- |
|
117 |
||
118 |
Because the basis of a delta does not need to be the text's logical |
|
119 |
predecessor, we can adjust the deltas to avoid ever needing to apply |
|
120 |
too many deltas to reproduce a particular file. |
|
121 |
||
122 |
||
123 |
Tools |
|
124 |
----- |
|
125 |
||
126 |
The revfile module can be invoked as a program to give low-level |
|
127 |
access for data recovery, debugging, etc. |
|
128 |
||
129 |
||
130 |
||
131 |
||
132 |
Open issues |
|
133 |
=========== |
|
134 |
||
135 |
* revfiles use unsigned 32-bit integers both in diffs and the index. |
|
136 |
This should be more than enough for any reasonable source file but |
|
137 |
perhaps not enough for large binaries that are frequently committed. |
|
138 |
||
139 |
Perhaps for those files there should be an option to continue to use |
|
140 |
the text-store. There is unlikely to be any benefit in holding |
|
141 |
deltas between them, and deltas will anyhow be hard to calculate. |
|
142 |
||
143 |
* The append-only design does not allow for destroying committed data, |
|
144 |
as when confidential information is accidentally added. That could |
|
145 |
be fixed by creating the fixed repository as a separate branch, into |
|
146 |
which only the preserved revisions are exported. |
|
147 |
||
148 |
* Should we also store full-texts as a transitional step? |