~bzr-pqm/bzr/bzr.dev : revision 390

5

The unit for compressed storage in bzr is a *revfile*, whose design

6

was suggested by Matt Mackall.

7

8

This document describes version 1 of the file, and has some notes on

9

what might be done in version 2.

10

8

11

9

12

Requirements

10

13

============

22

25

23

26

* Storage of files of at least a few hundred MB.

24

27

28

* Lossless in useful ways: we can extract a series of texts and write

29

them back out without losing any information.

30

25

31

26

32

Design

27

33

======

74

80

the data file is much longer and only the relevant bits of it,

75

81

identified by the index file, need to be read.

76

82

77

In previous versions, the index file identified texts by their

78

SHA-1 digest. This was unsatisfying for two reasons. Firstly it

79

assumes that SHA-1 will not collide, which is not an assumption we

80

wish to make in long-lived files. Secondly for annotations we need

81

to be able to map from file versions back to a revision.

82

83

Texts are identified by the name of the revfile and a UUID

84

corresponding to the first revision in which they were first

85

introduced. This means that given a text we can identify which

86

revision it belongs to, and annotations can use the index within the

87

revfile to identify where a region was first introduced.

88

89

We cannot identify texts by the integer revision number, because

90

that would limit us to only referring to a file in a particular

91

branch.

92

93

I'd like to just use the revision-id, but those are variable-length

94

strings, and I'd like the revfile index to be fixed-length and

95

relatively short. UUIDs can be encoded in binary as only 16 bytes.

96

Perhaps we should just use UUIDs for revisions and be done?

83

This design is similar to that of Netscape `mail summary files`_, in

84

that there is a small index which can always be read into memory and

85

that quickly identifies where to look in the main file. They differ

86

in many other ways though, most particularly that the index is not

87

just a cache but holds precious data in its own right.

88

89

.. _`mail summary files`: http://www.jwz.org/doc/mailsum.html

97

90

98

91

This is meant to scale to hold 100,000 revisions of a single file, by

99

92

which time the index file will be ~4.8MB and a bit big to read

102

95

Some of the reserved fields could be used to implement a (semi?)

103

96

balanced tree indexed by SHA1 so we can much more efficiently find the

104

97

index associated with a particular hash. For 100,000 revs we would be

105

able to find it in about 17 random reads, which is not too bad.

98

able to find it in about 17 random reads, which is not too bad. On

99

the other hand that would compromise the append-only indexing, and

100

100,000 revs is a fairly extreme case.

106

101

107

102

This performs pretty well except when trying to calculate deltas of

108

103

really large files. For that the main thing would be to plug in

111

106

though perhaps that's too perverse?

112

107

113

108

109

Identifying texts

110

-----------------

111

112

In the current version, texts are identified by their SHA-1.

114

113

115

114

116

115

Skip-deltas

121

120

too many deltas to reproduce a particular file.

122

121

123

122

123

Tools

124

-----

125

126

The revfile module can be invoked as a program to give low-level

127

access for data recovery, debugging, etc.

128

129

130

131

Extension to store annotations

132

==============================

133

134

We might extend the revfile format in a future version to also store

135

annotations. *This is not implemented yet.*

136

137

In previous versions, the index file identified texts by their

138

SHA-1 digest. This was unsatisfying for two reasons. Firstly it

139

assumes that SHA-1 will not collide, which is not an assumption we

140

wish to make in long-lived files. Secondly for annotations we need

141

to be able to map from file versions back to a revision.

142

143

Texts are identified by the name of the revfile and a UUID

144

corresponding to the first revision in which they were first

145

introduced. This means that given a text we can identify which

146

revision it belongs to, and annotations can use the index within the

147

revfile to identify where a region was first introduced.

148

149

We cannot identify texts by the integer revision number, because

150

that would limit us to only referring to a file in a particular

151

branch.

152

153

I'd like to just use the revision-id, but those are variable-length

154

strings, and I'd like the revfile index to be fixed-length and

155

relatively short. UUIDs can be encoded in binary as only 16 bytes.

156

Perhaps we should just use UUIDs for revisions and be done?

157

158

159

124

160

Annotations

125

161

-----------

126

162

168

204

still be represented as a "pointless" delta, plus an update to the

169

205

annotations.)

170

206

171

172

173

Tools

174

-----

175

176

The revfile module can be invoked as a program to give low-level

177

access for data recovery, debugging, etc.

178

179

180

181

Format

182

======

183

184

207

Index file

185

208

----------

186

209

187

The index file is a series of fixed-length records::

210

In a proposed (not implemented) storage with annotations, the index

211

file is a series of fixed-length records::

188

212

189

213

byte[16] UUID of revision

190

214

byte[20] SHA-1 of expanded text (as binary, not hex)

213

237

Deltas

214

238

------

215

239

216

Deltas to the text are stored as a series of variable-length records::

240

In a proposed (not implemented) storage with annotations, deltas to

241

the text are stored as a series of variable-length records::

217

242

218

243

uint32 idx

219

244

uint32 m

230

255

231

256

232

257

258

259

233

260

Open issues

234

261

===========

235

262