******** Revfiles ******** The unit for compressed storage in bzr is a *revfile*, whose design was suggested by Matt Mackall. Requirements ============ Compressed storage is a tradeoff between several goals: * Reasonably compact storage of long histories. * Robustness and simplicity. * Fast extraction of versions and addition of new versions (preferably without rewriting the whole file, or reading the whole history.) * Fast and precise annotations. * Storage of files of at least a few hundred MB. Design ====== revfiles store the history of a single logical file, which is identified in bzr by its file-id. In this sense they are similar to an RCS or CVS ``,v`` file or an SCCS sfile. Each state of the file is called a *text*. Renaming, adding and deleting this file is handled at a higher level by the inventory system, and is outside the scope of the revfile. The revfile name is typically based on the file id which is itself typically based on the name the file had when it was first added. But this is purely cosmetic. For example a file now called ``frob.c`` may have the id ``frobber.c-12873`` because it was originally called ``frobber.c``. Its texts are kept in the revfile ``.bzr/revfiles/frobber.c-12873.revs``. When the file is deleted from the inventory the revfile does not change. It's just not used in reproducing trees from that point onwards. The revfile does not record the date when the text was added, a commit message, properties, or any other metadata. That is handled in the higher-level revision history. Inventories and other metadata files that vary from one version to the next can themselves be stored in revfiles. revfiles store files as simple byte streams, with no consideration of translating character sets, line endings, or keywords. Those are also handled at a higher level. However, the revfile may make use of knowledge that a file is line-based in generating a diff. (The Python builtin difflib is too slow when generating a purely byte-by-byte delta so we always make a line-by-line diff; when this is fixed it may be feasible to use line-by-line diffs for all files.) Files whose text does not change from one revision to the next are stored as just a single text in the revfile. This can happen even if the file was renamed or other properties were changed in the inventory. The revfile is held on disk as two files: an *index* and a *data* file. The index file is short and always read completely into memory; the data file is much longer and only the relevant bits of it, identified by the index file, need to be read. In previous versions, the index file identified texts by their SHA-1 digest. This was unsatisfying for two reasons. Firstly it assumes that SHA-1 will not collide, which is not an assumption we wish to make in long-lived files. Secondly for annotations we need to be able to map from file versions back to a revision. Texts are identified by the name of the revfile and a UUID corresponding to the first revision in which they were first introduced. This means that given a text we can identify which revision it belongs to, and annotations can use the index within the revfile to identify where a region was first introduced. We cannot identify texts by the integer revision number, because that would limit us to only referring to a file in a particular branch. I'd like to just use the revision-id, but those are variable-length strings, and I'd like the revfile index to be fixed-length and relatively short. UUIDs can be encoded in binary as only 16 bytes. Perhaps we should just use UUIDs for revisions and be done? This is meant to scale to hold 100,000 revisions of a single file, by which time the index file will be ~4.8MB and a bit big to read sequentially. Some of the reserved fields could be used to implement a (semi?) balanced tree indexed by SHA1 so we can much more efficiently find the index associated with a particular hash. For 100,000 revs we would be able to find it in about 17 random reads, which is not too bad. This performs pretty well except when trying to calculate deltas of really large files. For that the main thing would be to plug in something faster than difflib, which is after all pure Python. Another approach is to just store the gzipped full text of big files, though perhaps that's too perverse? Skip-deltas ----------- Because the basis of a delta does not need to be the text's logical predecessor, we can adjust the deltas to avoid ever needing to apply too many deltas to reproduce a particular file. Annotations ----------- Annotations indicate which revision of a file first inserted a line (or region of bytes). Given a string, we can write annotations on it like so: a sequence of *(index, length)* pairs, giving the *index* of the revision which introduced the next run of *length* bytes. The sum of the lengths must equal the length of the string. For text files the regions will typically fall on line breaks. This can be transformed in memory to other structures, such as a list of *(index, content)* pairs. When a line was inserted from a merge revision then the annotation for that line should still be the source in the merged branch, rather than just being the revision in which the merge took place. They can cheaply be calculated when inserting a new text, but are expensive to calculate after the fact because that requires searching back through all previous text and all texts which were merged in. It therefore seems sensible to calculate them once and store them. To do this we need two operators which update an existing annotated file: A. Given an annotated file and a working text, update the annotation to mark regions inserted in the working file as new in this revision. B. Given two annotated files, merge them to produce an annotated result. When there are conflicts, both texts should be included and annotated. These may be repeated: after a merge there may be another merge, or there may be manual fixups or conflict resolutions. So what we require is given a diff or a diff3 between two files, map the regions of bytes changed into corresponding updates to the origin annotations. Annotations can also be delta-compressed; we only need to add new annotation data when there is a text insertion. (It is possible in a merge to have a change of annotation when there is no text change, though this seems unlikely. This can still be represented as a "pointless" delta, plus an update to the annotations.) Tools ----- The revfile module can be invoked as a program to give low-level access for data recovery, debugging, etc. Format ====== Index file ---------- The index file is a series of fixed-length records:: byte[16] UUID of revision byte[20] SHA-1 of expanded text (as binary, not hex) uint32 flags: 1=zlib compressed uint32 sequence number this is based on, or -1 for full text uint32 offset in text file of start uint32 length of compressed delta in text file uint32[3] reserved Total 64 bytes. The header is also 64 bytes, for tidyness and easy calculation. For this format the header must be ``bzr revfile v2\n`` padded with ``\xff`` to 64 bytes. The first record after the header is index 0. A record's base index must be less than its own index. The SHA-1 is redundant with the inventory but stored just as a check on the compression methods and so that the file can be validated without reference to any other information. Each byte in the text file should be included by at most one delta. Deltas ------ Deltas to the text are stored as a series of variable-length records:: uint32 idx uint32 m uint32 n uint32 l byte[l] new This describes a change originally introduced in the revision described by *idx* in the index. This indicates that the region [m:n] of the input file should be replaced by the text *new*. If m==n this is a pure insertion of l bytes. If l==0 this is a pure deletion of (n-m) bytes. Open issues =========== * revfiles use unsigned 32-bit integers both in diffs and the index. This should be more than enough for any reasonable source file but perhaps not enough for large binaries that are frequently committed. Perhaps for those files there should be an option to continue to use the text-store. There is unlikely to be any benefit in holding deltas between them, and deltas will anyhow be hard to calculate. * The append-only design does not allow for destroying committed data, as when confidential information is accidentally added. That could be fixed by creating the fixed repository as a separate branch, into which only the preserved revisions are exported. * Should annotations also indicate where text was deleted? * This design calls for only one annotation per line, which seems standard. However, this is lacking in at least two cases: - Lines which originate in the same way in more than one revision, through being independently introduced. In this case we would apparently have to make an arbitrary choice; I suppose branches could prefer to assume lines originated in their own history. - It might be useful to directly indicate which mergers included which lines. We do have that information in the revision history though, so there seems no need to store it for every line. * Should we also store full-texts as a transitional step? * Storing the annotations with the text is reasonably simple and compact, but means that we always need to process the annotation structure even when we only want the text. In particular it means that full-texts cannot just simply be copied out but rather composed from chunks. That seems inefficient since it is probably common to only want the text.