5
5
The unit for compressed storage in bzr is a *revfile*, whose design
6
6
was suggested by Matt Mackall.
8
This document describes version 1 of the file, and has some notes on
9
what might be done in version 2.
74
80
the data file is much longer and only the relevant bits of it,
75
81
identified by the index file, need to be read.
77
In previous versions, the index file identified texts by their
78
SHA-1 digest. This was unsatisfying for two reasons. Firstly it
79
assumes that SHA-1 will not collide, which is not an assumption we
80
wish to make in long-lived files. Secondly for annotations we need
81
to be able to map from file versions back to a revision.
83
Texts are identified by the name of the revfile and a UUID
84
corresponding to the first revision in which they were first
85
introduced. This means that given a text we can identify which
86
revision it belongs to, and annotations can use the index within the
87
revfile to identify where a region was first introduced.
89
We cannot identify texts by the integer revision number, because
90
that would limit us to only referring to a file in a particular
93
I'd like to just use the revision-id, but those are variable-length
94
strings, and I'd like the revfile index to be fixed-length and
95
relatively short. UUIDs can be encoded in binary as only 16 bytes.
96
Perhaps we should just use UUIDs for revisions and be done?
83
This design is similar to that of Netscape `mail summary files`_, in
84
that there is a small index which can always be read into memory and
85
that quickly identifies where to look in the main file. They differ
86
in many other ways though, most particularly that the index is not
87
just a cache but holds precious data in its own right.
89
.. _`mail summary files`: http://www.jwz.org/doc/mailsum.html
98
91
This is meant to scale to hold 100,000 revisions of a single file, by
99
92
which time the index file will be ~4.8MB and a bit big to read
102
95
Some of the reserved fields could be used to implement a (semi?)
103
96
balanced tree indexed by SHA1 so we can much more efficiently find the
104
97
index associated with a particular hash. For 100,000 revs we would be
105
able to find it in about 17 random reads, which is not too bad.
98
able to find it in about 17 random reads, which is not too bad. On
99
the other hand that would compromise the append-only indexing, and
100
100,000 revs is a fairly extreme case.
107
102
This performs pretty well except when trying to calculate deltas of
108
103
really large files. For that the main thing would be to plug in
121
120
too many deltas to reproduce a particular file.
126
The revfile module can be invoked as a program to give low-level
127
access for data recovery, debugging, etc.
131
Extension to store annotations
132
==============================
134
We might extend the revfile format in a future version to also store
135
annotations. *This is not implemented yet.*
137
In previous versions, the index file identified texts by their
138
SHA-1 digest. This was unsatisfying for two reasons. Firstly it
139
assumes that SHA-1 will not collide, which is not an assumption we
140
wish to make in long-lived files. Secondly for annotations we need
141
to be able to map from file versions back to a revision.
143
Texts are identified by the name of the revfile and a UUID
144
corresponding to the first revision in which they were first
145
introduced. This means that given a text we can identify which
146
revision it belongs to, and annotations can use the index within the
147
revfile to identify where a region was first introduced.
149
We cannot identify texts by the integer revision number, because
150
that would limit us to only referring to a file in a particular
153
I'd like to just use the revision-id, but those are variable-length
154
strings, and I'd like the revfile index to be fixed-length and
155
relatively short. UUIDs can be encoded in binary as only 16 bytes.
156
Perhaps we should just use UUIDs for revisions and be done?
168
204
still be represented as a "pointless" delta, plus an update to the
176
The revfile module can be invoked as a program to give low-level
177
access for data recovery, debugging, etc.
187
The index file is a series of fixed-length records::
210
In a proposed (not implemented) storage with annotations, the index
211
file is a series of fixed-length records::
189
213
byte[16] UUID of revision
190
214
byte[20] SHA-1 of expanded text (as binary, not hex)