~bzr-pqm/bzr/bzr.dev : contents of doc/revfile.txt at revision 363

~bzr-pqm/bzr/bzr.dev : (revision 363)

********
Revfiles
********

The unit for compressed storage in bzr is a *revfile*, whose design
was suggested by Matt Mackall.


Requirements
============

Compressed storage is a tradeoff between several goals:

* Reasonably compact storage of long histories.

* Robustness and simplicity.

* Fast extraction of versions and addition of new versions (preferably
  without rewriting the whole file, or reading the whole history.)

* Fast and precise annotations.

* Storage of files of at least a few hundred MB.


Design
======

revfiles store the history of a single logical file, which is
identified in bzr by its file-id.  In this sense they are similar to
an RCS or CVS ``,v`` file or an SCCS sfile.

Each state of the file is called a *text*. 

Renaming, adding and deleting this file is handled at a higher level
by the inventory system, and is outside the scope of the revfile.  The
revfile name is typically based on the file id which is itself
typically based on the name the file had when it was first added.  But
this is purely cosmetic.

    For example a file now called ``frob.c`` may have the id
    ``frobber.c-12873`` because it was originally called
    ``frobber.c``.  Its texts are kept in the revfile
    ``.bzr/revfiles/frobber.c-12873.revs``.

When the file is deleted from the inventory the revfile does not
change.  It's just not used in reproducing trees from that point
onwards.

The revfile does not record the date when the text was added, a commit
message, properties, or any other metadata.  That is handled in the
higher-level revision history.

Inventories and other metadata files that vary from one version to the
next can themselves be stored in revfiles.

revfiles store files as simple byte streams, with no consideration of
translating character sets, line endings, or keywords.  Those are also
handled at a higher level.  However, the revfile may make use of
knowledge that a file is line-based in generating a diff.  

   (The Python builtin difflib is too slow when generating a purely
   byte-by-byte delta so we always make a line-by-line diff; when this
   is fixed it may be feasible to use line-by-line diffs for all
   files.)

Files whose text does not change from one revision to the next are
stored as just a single text in the revfile.  This can happen even if
the file was renamed or other properties were changed in the
inventory.

The revfile is held on disk as two files: an *index* and a *data*
file.  The index file is short and always read completely into memory;
the data file is much longer and only the relevant bits of it,
identified by the index file, need to be read.

  In previous versions, the  index file identified texts by their
  SHA-1 digest.  This was unsatisfying for two reasons.  Firstly it
  assumes that SHA-1 will not collide, which is not an assumption we
  wish to make in long-lived files.  Secondly for annotations we need
  to be able to map from file versions back to a revision.

Texts are identified by the name of the revfile and a UUID
corresponding to the first revision in which they were first
introduced.  This means that given a text we can identify which
revision it belongs to, and annotations can use the index within the
revfile to identify where a region was first introduced.

  We cannot identify texts by the integer revision number, because
  that would limit us to only referring to a file in a particular
  branch.

  I'd like to just use the revision-id, but those are variable-length
  strings, and I'd like the revfile index to be fixed-length and
  relatively short.  UUIDs can be encoded in binary as only 16 bytes.
  Perhaps we should just use UUIDs for revisions and be done?

This is meant to scale to hold 100,000 revisions of a single file, by
which time the index file will be ~4.8MB and a bit big to read
sequentially.

Some of the reserved fields could be used to implement a (semi?)
balanced tree indexed by SHA1 so we can much more efficiently find the
index associated with a particular hash.  For 100,000 revs we would be
able to find it in about 17 random reads, which is not too bad.

This performs pretty well except when trying to calculate deltas of
really large files.  For that the main thing would be to plug in
something faster than difflib, which is after all pure Python.
Another approach is to just store the gzipped full text of big files,
though perhaps that's too perverse?




Skip-deltas
-----------

Because the basis of a delta does not need to be the text's logical
predecessor, we can adjust the deltas to avoid ever needing to apply
too many deltas to reproduce a particular file.  


Annotations
-----------

Annotations indicate which revision of a file first inserted a line
(or region of bytes).

Given a string, we can write annotations on it like so: a sequence of
*(index, length)* pairs, giving the *index* of the revision which
introduced the next run of *length* bytes.  The sum of the lengths
must equal the length of the string.  For text files the regions will
typically fall on line breaks.  This can be transformed in memory to
other structures, such as a list of *(index, content)* pairs.

When a line was inserted from a merge revision then the annotation for
that line should still be the source in the merged branch, rather than
just being the revision in which the merge took place.

They can cheaply be calculated when inserting a new text, but are
expensive to calculate after the fact because that requires searching
back through all previous text and all texts which were merged in.  It
therefore seems sensible to calculate them once and store them.

To do this we need two operators which update an existing annotated
file:

A. Given an annotated file and a working text, update the annotation to
   mark regions inserted in the working file as new in this revision.

B. Given two annotated files, merge them to produce an annotated
   result.    When there are conflicts, both texts should be included
   and annotated.

These may be repeated: after a merge there may be another merge, or
there may be manual fixups or conflict resolutions.

So what we require is given a diff or a diff3 between two files, map
the regions of bytes changed into corresponding updates to the origin
annotations.

Annotations can also be delta-compressed; we only need to add new
annotation data when there is a text insertion.

    (It is possible in a merge to have a change of annotation when
    there is no text change, though this seems unlikely.  This can
    still be represented as a "pointless" delta, plus an update to the
    annotations.)



Tools
-----

The revfile module can be invoked as a program to give low-level
access for data recovery, debugging, etc.



Format
======

Index file
----------

The index file is a series of fixed-length records::

  byte[16]     UUID of revision
  byte[20]     SHA-1 of expanded text (as binary, not hex)
  uint32       flags: 1=zlib compressed
  uint32       sequence number this is based on, or -1 for full text
  uint32       offset in text file of start
  uint32       length of compressed delta in text file
  uint32[3]    reserved

Total 64 bytes.

The header is also 64 bytes, for tidyness and easy calculation.  For
this format the header must be ``bzr revfile v2\n`` padded with
``\xff`` to 64 bytes.

The first record after the header is index 0.  A record's base index
must be less than its own index.

The SHA-1 is redundant with the inventory but stored just as a check
on the compression methods and so that the file can be validated
without reference to any other information.

Each byte in the text file should be included by at most one delta.


Deltas
------

Deltas to the text are stored as a series of variable-length records::

  uint32        idx
  uint32        m
  uint32        n
  uint32        l
  byte[l]       new

This describes a change originally introduced in the revision
described by *idx* in the index.

This indicates that the region [m:n] of the input file should be
replaced by the text *new*.  If m==n this is a pure insertion of l
bytes.  If l==0 this is a pure deletion of (n-m) bytes.



Open issues
===========

* revfiles use unsigned 32-bit integers both in diffs and the index.
  This should be more than enough for any reasonable source file but
  perhaps not enough for large binaries that are frequently committed.

  Perhaps for those files there should be an option to continue to use
  the text-store.  There is unlikely to be any benefit in holding
  deltas between them, and deltas will anyhow be hard to calculate. 

* The append-only design does not allow for destroying committed data,
  as when confidential information is accidentally added.  That could
  be fixed by creating the fixed repository as a separate branch, into
  which only the preserved revisions are exported.

* Should annotations also indicate where text was deleted?

* This design calls for only one annotation per line, which seems
  standard.  However, this is lacking in at least two cases:

  - Lines which originate in the same way in more than one revision,
    through being independently introduced.  In this case we would
    apparently have to make an arbitrary choice; I suppose branches
    could prefer to assume lines originated in their own history.

  - It might be useful to directly indicate which mergers included
    which lines.  We do have that information in the revision history
    though, so there seems no need to store it for every line.

* Should we also store full-texts as a transitional step?

* Storing the annotations with the text is reasonably simple and
  compact, but means that we always need to process the annotation
  structure even when we only want the text.  In particular it means
  that full-texts cannot just simply be copied out but rather composed
  from chunks.  That seems inefficient since it is probably common to
  only want the text.


321 by Martin Pool doc: revfile storage and related things	1	********
	2	Revfiles
	3	********
	4
	5	The unit for compressed storage in bzr is a revfile, whose design
	6	was suggested by Matt Mackall.
	7
	8
	9	Requirements
	10	============
	11
	12	Compressed storage is a tradeoff between several goals:
	13
	14	* Reasonably compact storage of long histories.
	15
	16	* Robustness and simplicity.
	17
	18	* Fast extraction of versions and addition of new versions (preferably
	19	without rewriting the whole file, or reading the whole history.)
	20
	21	* Fast and precise annotations.
	22
	23	* Storage of files of at least a few hundred MB.
	24
	25
	26	Design
	27	======
	28
	29	revfiles store the history of a single logical file, which is
	30	identified in bzr by its file-id. In this sense they are similar to
	31	an RCS or CVS ``,v`` file or an SCCS sfile.
	32
	33	Each state of the file is called a text.
	34
	35	Renaming, adding and deleting this file is handled at a higher level
	36	by the inventory system, and is outside the scope of the revfile. The
	37	revfile name is typically based on the file id which is itself
	38	typically based on the name the file had when it was first added. But
	39	this is purely cosmetic.
	40
	41	For example a file now called ``frob.c`` may have the id
	42	``frobber.c-12873`` because it was originally called
	43	``frobber.c``. Its texts are kept in the revfile
	44	``.bzr/revfiles/frobber.c-12873.revs``.
	45
	46	When the file is deleted from the inventory the revfile does not
	47	change. It's just not used in reproducing trees from that point
	48	onwards.
	49
	50	The revfile does not record the date when the text was added, a commit
	51	message, properties, or any other metadata. That is handled in the
	52	higher-level revision history.
	53
	54	Inventories and other metadata files that vary from one version to the
	55	next can themselves be stored in revfiles.
	56
	57	revfiles store files as simple byte streams, with no consideration of
	58	translating character sets, line endings, or keywords. Those are also
	59	handled at a higher level. However, the revfile may make use of
	60	knowledge that a file is line-based in generating a diff.
	61
	62	(The Python builtin difflib is too slow when generating a purely
	63	byte-by-byte delta so we always make a line-by-line diff; when this
	64	is fixed it may be feasible to use line-by-line diffs for all
65	files.)
66
67	Files whose text does not change from one revision to the next are
68	stored as just a single text in the revfile. This can happen even if
69	the file was renamed or other properties were changed in the
325 by Martin Pool - more revfile design notes	70	inventory.
	71
	72	The revfile is held on disk as two files: an index and a data
	73	file. The index file is short and always read completely into memory;
	74	the data file is much longer and only the relevant bits of it,
	75	identified by the index file, need to be read.
	76
	77	In previous versions, the index file identified texts by their
	78	SHA-1 digest. This was unsatisfying for two reasons. Firstly it
	79	assumes that SHA-1 will not collide, which is not an assumption we
	80	wish to make in long-lived files. Secondly for annotations we need
	81	to be able to map from file versions back to a revision.
	82
	83	Texts are identified by the name of the revfile and a UUID
	84	corresponding to the first revision in which they were first
	85	introduced. This means that given a text we can identify which
	86	revision it belongs to, and annotations can use the index within the
	87	revfile to identify where a region was first introduced.
	88
	89	We cannot identify texts by the integer revision number, because
	90	that would limit us to only referring to a file in a particular
	91	branch.
	92
	93	I'd like to just use the revision-id, but those are variable-length
	94	strings, and I'd like the revfile index to be fixed-length and
	95	relatively short. UUIDs can be encoded in binary as only 16 bytes.
	96	Perhaps we should just use UUIDs for revisions and be done?
	97
	98	This is meant to scale to hold 100,000 revisions of a single file, by
	99	which time the index file will be ~4.8MB and a bit big to read
	100	sequentially.
	101
	102	Some of the reserved fields could be used to implement a (semi?)
	103	balanced tree indexed by SHA1 so we can much more efficiently find the
	104	index associated with a particular hash. For 100,000 revs we would be
	105	able to find it in about 17 random reads, which is not too bad.
	106
	107	This performs pretty well except when trying to calculate deltas of
	108	really large files. For that the main thing would be to plug in
	109	something faster than difflib, which is after all pure Python.
	110	Another approach is to just store the gzipped full text of big files,
	111	though perhaps that's too perverse?
	112
	113
321 by Martin Pool doc: revfile storage and related things	114
	115
	116	Skip-deltas
	117	-----------
	118
	119	Because the basis of a delta does not need to be the text's logical
325 by Martin Pool - more revfile design notes	120	predecessor, we can adjust the deltas to avoid ever needing to apply
325 by Martin Pool - more revfile design notes	121	too many deltas to reproduce a particular file.
321 by Martin Pool doc: revfile storage and related things	122
	123
	124	Annotations
	125	-----------
	126
325 by Martin Pool - more revfile design notes	127	Annotations indicate which revision of a file first inserted a line
	128	(or region of bytes).
	129
	130	Given a string, we can write annotations on it like so: a sequence of
	131	(index, length) pairs, giving the index of the revision which
	132	introduced the next run of length bytes. The sum of the lengths
	133	must equal the length of the string. For text files the regions will
	134	typically fall on line breaks. This can be transformed in memory to
	135	other structures, such as a list of (index, content) pairs.
	136
	137	When a line was inserted from a merge revision then the annotation for
	138	that line should still be the source in the merged branch, rather than
	139	just being the revision in which the merge took place.
	140
	141	They can cheaply be calculated when inserting a new text, but are
	142	expensive to calculate after the fact because that requires searching
	143	back through all previous text and all texts which were merged in. It
	144	therefore seems sensible to calculate them once and store them.
	145
	146	To do this we need two operators which update an existing annotated
	147	file:
	148
	149	A. Given an annotated file and a working text, update the annotation to
	150	mark regions inserted in the working file as new in this revision.
	151
	152	B. Given two annotated files, merge them to produce an annotated
	153	result. When there are conflicts, both texts should be included
	154	and annotated.
	155
	156	These may be repeated: after a merge there may be another merge, or
	157	there may be manual fixups or conflict resolutions.
	158
	159	So what we require is given a diff or a diff3 between two files, map
	160	the regions of bytes changed into corresponding updates to the origin
	161	annotations.
321 by Martin Pool doc: revfile storage and related things	162
328 by Martin Pool - more documentation of revfile+annotation	163	Annotations can also be delta-compressed; we only need to add new
	164	annotation data when there is a text insertion.
	165
	166	(It is possible in a merge to have a change of annotation when
	167	there is no text change, though this seems unlikely. This can
	168	still be represented as a "pointless" delta, plus an update to the
	169	annotations.)
	170
	171
	172
	173	Tools
	174	-----
	175
	176	The revfile module can be invoked as a program to give low-level
	177	access for data recovery, debugging, etc.
	178
	179
	180
	181	Format
	182	======
	183
	184	Index file
	185	----------
	186
	187	The index file is a series of fixed-length records::
	188
	189	byte[16] UUID of revision
	190	byte[20] SHA-1 of expanded text (as binary, not hex)
	191	uint32 flags: 1=zlib compressed
	192	uint32 sequence number this is based on, or -1 for full text
	193	uint32 offset in text file of start
	194	uint32 length of compressed delta in text file
	195	uint32[3] reserved
	196
	197	Total 64 bytes.
	198
	199	The header is also 64 bytes, for tidyness and easy calculation. For
	200	this format the header must be ``bzr revfile v2\n`` padded with
	201	``\xff`` to 64 bytes.
	202
	203	The first record after the header is index 0. A record's base index
	204	must be less than its own index.
	205
	206	The SHA-1 is redundant with the inventory but stored just as a check
	207	on the compression methods and so that the file can be validated
	208	without reference to any other information.
	209
	210	Each byte in the text file should be included by at most one delta.
	211
	212
	213	Deltas
	214	------
	215
	216	Deltas to the text are stored as a series of variable-length records::
	217
	218	uint32 idx
	219	uint32 m
	220	uint32 n
	221	uint32 l
	222	byte[l] new
	223
	224	This describes a change originally introduced in the revision
	225	described by idx in the index.
	226
227	This indicates that the region [m:n] of the input file should be
228	replaced by the text new. If m==n this is a pure insertion of l
229	bytes. If l==0 this is a pure deletion of (n-m) bytes.
230
231
321 by Martin Pool doc: revfile storage and related things	232
	233	Open issues
	234	===========
	235
	236	* revfiles use unsigned 32-bit integers both in diffs and the index.
	237	This should be more than enough for any reasonable source file but
	238	perhaps not enough for large binaries that are frequently committed.
	239
	240	Perhaps for those files there should be an option to continue to use
	241	the text-store. There is unlikely to be any benefit in holding
	242	deltas between them, and deltas will anyhow be hard to calculate.
	243
	244	* The append-only design does not allow for destroying committed data,
	245	as when confidential information is accidentally added. That could
	246	be fixed by creating the fixed repository as a separate branch, into
	247	which only the preserved revisions are exported.
325 by Martin Pool - more revfile design notes	248
	249	* Should annotations also indicate where text was deleted?
	250
	251	* This design calls for only one annotation per line, which seems
	252	standard. However, this is lacking in at least two cases:
	253
	254	- Lines which originate in the same way in more than one revision,
	255	through being independently introduced. In this case we would
	256	apparently have to make an arbitrary choice; I suppose branches
	257	could prefer to assume lines originated in their own history.
	258
	259	- It might be useful to directly indicate which mergers included
	260	which lines. We do have that information in the revision history
	261	though, so there seems no need to store it for every line.
328 by Martin Pool - more documentation of revfile+annotation	262
	263	* Should we also store full-texts as a transitional step?
	264
	265	* Storing the annotations with the text is reasonably simple and
	266	compact, but means that we always need to process the annotation
	267	structure even when we only want the text. In particular it means
	268	that full-texts cannot just simply be copied out but rather composed
	269	from chunks. That seems inefficient since it is probably common to
	270	only want the text.
329 by Martin Pool - refactor command functions into command classes	271