~bzr-pqm/bzr/bzr.dev : contents of doc/revfile.txt at revision 1398

~bzr-pqm/bzr/bzr.dev : (revision 1398)

********
Revfiles
********

The unit for compressed storage in bzr is a *revfile*, whose design
was suggested by Matt Mackall.

This document describes version 1 of the file, and has some notes on
what might be done in version 2.


Requirements
============

Compressed storage is a tradeoff between several goals:

* Reasonably compact storage of long histories.

* Robustness and simplicity.

* Fast extraction of versions and addition of new versions (preferably
  without rewriting the whole file, or reading the whole history.)

* Fast and precise annotations.

* Storage of files of at least a few hundred MB.

* Lossless in useful ways: we can extract a series of texts and write
  them back out without losing any information.


Design
======

revfiles store the history of a single logical file, which is
identified in bzr by its file-id.  In this sense they are similar to
an RCS or CVS ``,v`` file or an SCCS sfile.

Each state of the file is called a *text*. 

Renaming, adding and deleting this file is handled at a higher level
by the inventory system, and is outside the scope of the revfile.  The
revfile name is typically based on the file id which is itself
typically based on the name the file had when it was first added.  But
this is purely cosmetic.

    For example a file now called ``frob.c`` may have the id
    ``frobber.c-12873`` because it was originally called
    ``frobber.c``.  Its texts are kept in the revfile
    ``.bzr/revfiles/frobber.c-12873.revs``.

When the file is deleted from the inventory the revfile does not
change.  It's just not used in reproducing trees from that point
onwards.

The revfile does not record the date when the text was added, a commit
message, properties, or any other metadata.  That is handled in the
higher-level revision history.

Inventories and other metadata files that vary from one version to the
next can themselves be stored in revfiles.

revfiles store files as simple byte streams, with no consideration of
translating character sets, line endings, or keywords.  Those are also
handled at a higher level.  However, the revfile may make use of
knowledge that a file is line-based in generating a diff.  

   (The Python builtin difflib is too slow when generating a purely
   byte-by-byte delta so we always make a line-by-line diff; when this
   is fixed it may be feasible to use line-by-line diffs for all
   files.)

Files whose text does not change from one revision to the next are
stored as just a single text in the revfile.  This can happen even if
the file was renamed or other properties were changed in the
inventory.

The revfile is held on disk as two files: an *index* and a *data*
file.  The index file is short and always read completely into memory;
the data file is much longer and only the relevant bits of it,
identified by the index file, need to be read.

  This design is similar to that of Netscape `mail summary files`_, in
  that there is a small index which can always be read into memory and
  that quickly identifies where to look in the main file.  They differ
  in many other ways though, most particularly that the index is not
  just a cache but holds precious data in its own right.

.. _`mail summary files`: http://www.jwz.org/doc/mailsum.html

This is meant to scale to hold 100,000 revisions of a single file, by
which time the index file will be ~4.8MB and a bit big to read
sequentially.

Some of the reserved fields could be used to implement a (semi?)
balanced tree indexed by SHA1 so we can much more efficiently find the
index associated with a particular hash.  For 100,000 revs we would be
able to find it in about 17 random reads, which is not too bad.  On
the other hand that would compromise the append-only indexing, and
100,000 revs is a fairly extreme case.

This performs pretty well except when trying to calculate deltas of
really large files.  For that the main thing would be to plug in
something faster than difflib, which is after all pure Python.
Another approach is to just store the gzipped full text of big files,
though perhaps that's too perverse?


Identifying texts
-----------------

In the current version, texts are identified by their SHA-1.  


Skip-deltas
-----------

Because the basis of a delta does not need to be the text's logical
predecessor, we can adjust the deltas to avoid ever needing to apply
too many deltas to reproduce a particular file.  


Tools
-----

The revfile module can be invoked as a program to give low-level
access for data recovery, debugging, etc.




Open issues
===========

* revfiles use unsigned 32-bit integers both in diffs and the index.
  This should be more than enough for any reasonable source file but
  perhaps not enough for large binaries that are frequently committed.

  Perhaps for those files there should be an option to continue to use
  the text-store.  There is unlikely to be any benefit in holding
  deltas between them, and deltas will anyhow be hard to calculate. 

* The append-only design does not allow for destroying committed data,
  as when confidential information is accidentally added.  That could
  be fixed by creating the fixed repository as a separate branch, into
  which only the preserved revisions are exported.

* Should we also store full-texts as a transitional step?

321 by Martin Pool doc: revfile storage and related things	1	********
	2	Revfiles
	3	********
	4
	5	The unit for compressed storage in bzr is a revfile, whose design
	6	was suggested by Matt Mackall.
	7
390 by Martin Pool - Update revfile docs; most of what's in there is speculative	8	This document describes version 1 of the file, and has some notes on
	9	what might be done in version 2.
	10
321 by Martin Pool doc: revfile storage and related things	11
	12	Requirements
	13	============
	14
	15	Compressed storage is a tradeoff between several goals:
	16
	17	* Reasonably compact storage of long histories.
	18
	19	* Robustness and simplicity.
	20
	21	* Fast extraction of versions and addition of new versions (preferably
	22	without rewriting the whole file, or reading the whole history.)
	23
	24	* Fast and precise annotations.
	25
	26	* Storage of files of at least a few hundred MB.
	27
390 by Martin Pool - Update revfile docs; most of what's in there is speculative	28	* Lossless in useful ways: we can extract a series of texts and write
	29	them back out without losing any information.
	30
321 by Martin Pool doc: revfile storage and related things	31
	32	Design
	33	======
	34
	35	revfiles store the history of a single logical file, which is
	36	identified in bzr by its file-id. In this sense they are similar to
	37	an RCS or CVS ``,v`` file or an SCCS sfile.
	38
	39	Each state of the file is called a text.
	40
	41	Renaming, adding and deleting this file is handled at a higher level
	42	by the inventory system, and is outside the scope of the revfile. The
	43	revfile name is typically based on the file id which is itself
	44	typically based on the name the file had when it was first added. But
	45	this is purely cosmetic.
	46
	47	For example a file now called ``frob.c`` may have the id
	48	``frobber.c-12873`` because it was originally called
	49	``frobber.c``. Its texts are kept in the revfile
	50	``.bzr/revfiles/frobber.c-12873.revs``.
	51
	52	When the file is deleted from the inventory the revfile does not
	53	change. It's just not used in reproducing trees from that point
	54	onwards.
	55
	56	The revfile does not record the date when the text was added, a commit
	57	message, properties, or any other metadata. That is handled in the
	58	higher-level revision history.
	59
	60	Inventories and other metadata files that vary from one version to the
	61	next can themselves be stored in revfiles.
	62
	63	revfiles store files as simple byte streams, with no consideration of
	64	translating character sets, line endings, or keywords. Those are also
	65	handled at a higher level. However, the revfile may make use of
	66	knowledge that a file is line-based in generating a diff.
	67
	68	(The Python builtin difflib is too slow when generating a purely
	69	byte-by-byte delta so we always make a line-by-line diff; when this
	70	is fixed it may be feasible to use line-by-line diffs for all
	71	files.)
	72
	73	Files whose text does not change from one revision to the next are
	74	stored as just a single text in the revfile. This can happen even if
	75	the file was renamed or other properties were changed in the
325 by Martin Pool - more revfile design notes	76	inventory.
	77
	78	The revfile is held on disk as two files: an index and a data
	79	file. The index file is short and always read completely into memory;
	80	the data file is much longer and only the relevant bits of it,
	81	identified by the index file, need to be read.
	82
390 by Martin Pool - Update revfile docs; most of what's in there is speculative	83	This design is similar to that of Netscape `mail summary files`_, in
	84	that there is a small index which can always be read into memory and
	85	that quickly identifies where to look in the main file. They differ
	86	in many other ways though, most particularly that the index is not
	87	just a cache but holds precious data in its own right.
	88
	89	.. _`mail summary files`: http://www.jwz.org/doc/mailsum.html
325 by Martin Pool - more revfile design notes	90
	91	This is meant to scale to hold 100,000 revisions of a single file, by
	92	which time the index file will be ~4.8MB and a bit big to read
	93	sequentially.
	94
	95	Some of the reserved fields could be used to implement a (semi?)
	96	balanced tree indexed by SHA1 so we can much more efficiently find the
	97	index associated with a particular hash. For 100,000 revs we would be
390 by Martin Pool - Update revfile docs; most of what's in there is speculative	98	able to find it in about 17 random reads, which is not too bad. On
	99	the other hand that would compromise the append-only indexing, and
	100	100,000 revs is a fairly extreme case.
325 by Martin Pool - more revfile design notes	101
	102	This performs pretty well except when trying to calculate deltas of
	103	really large files. For that the main thing would be to plug in
	104	something faster than difflib, which is after all pure Python.
	105	Another approach is to just store the gzipped full text of big files,
	106	though perhaps that's too perverse?
	107
	108
390 by Martin Pool - Update revfile docs; most of what's in there is speculative	109	Identifying texts
	110	-----------------
	111
	112	In the current version, texts are identified by their SHA-1.
321 by Martin Pool doc: revfile storage and related things	113
	114
	115	Skip-deltas
	116	-----------
	117
	118	Because the basis of a delta does not need to be the text's logical
325 by Martin Pool - more revfile design notes	119	predecessor, we can adjust the deltas to avoid ever needing to apply
325 by Martin Pool - more revfile design notes	120	too many deltas to reproduce a particular file.
321 by Martin Pool doc: revfile storage and related things	121
321 by Martin Pool doc: revfile storage and related things	122
390 by Martin Pool - Update revfile docs; most of what's in there is speculative	123	Tools
	124	-----
	125
	126	The revfile module can be invoked as a program to give low-level
	127	access for data recovery, debugging, etc.
	128
	129
	130
	131
321 by Martin Pool doc: revfile storage and related things	132	Open issues
	133	===========
	134
	135	* revfiles use unsigned 32-bit integers both in diffs and the index.
	136	This should be more than enough for any reasonable source file but
	137	perhaps not enough for large binaries that are frequently committed.
	138
	139	Perhaps for those files there should be an option to continue to use
	140	the text-store. There is unlikely to be any benefit in holding
	141	deltas between them, and deltas will anyhow be hard to calculate.
	142
	143	* The append-only design does not allow for destroying committed data,
	144	as when confidential information is accidentally added. That could
	145	be fixed by creating the fixed repository as a separate branch, into
	146	which only the preserved revisions are exported.
325 by Martin Pool - more revfile design notes	147
328 by Martin Pool - more documentation of revfile+annotation	148	* Should we also store full-texts as a transitional step?