~bzr-pqm/bzr/bzr.dev


Viewing changes to doc/revfile.txt

Committer: Martin Pool
Date: 2005-09-02 01:56:05 UTC
Revision ID: mbp@sourcefrog.net-20050902015604-3e3003f71665950b

- message typo

files added:
HACKING

Makefile

bzr-man.py

bzrlib/atomicfile.py

bzrlib/builtins.py

bzrlib/changeset.py

bzrlib/commit.py

bzrlib/delta.py

bzrlib/externalcommand.py

bzrlib/fetch.py

bzrlib/hashcache.py

bzrlib/help.py

bzrlib/intset.py

bzrlib/lock.py

bzrlib/log.py

bzrlib/merge.py

bzrlib/merge3.py

bzrlib/merge_core.py

bzrlib/meta_store.py

bzrlib/missing.py

bzrlib/msgeditor.py

bzrlib/patch.py

bzrlib/plugin.py

bzrlib/plugins

bzrlib/plugins/__init__.py

bzrlib/progress.py

bzrlib/selftest

bzrlib/selftest/TestUtil.py

bzrlib/selftest/__init__.py

bzrlib/selftest/blackbox.py

bzrlib/selftest/plugins.py

bzrlib/selftest/test_merge_core.py

bzrlib/selftest/test_parent.py

bzrlib/selftest/test_smart_add.py

bzrlib/selftest/testbranch.py

bzrlib/selftest/testdiff.py

bzrlib/selftest/testfetch.py

bzrlib/selftest/testhashcache.py

bzrlib/selftest/testinv.py

bzrlib/selftest/testlog.py

bzrlib/selftest/testmerge3.py

bzrlib/selftest/testrevision.py

bzrlib/selftest/testrevisionnamespaces.py

bzrlib/selftest/teststatus.py

bzrlib/selftest/versioning.py

bzrlib/selftest/whitebox.py

bzrlib/shellcomplete.py

bzrlib/status.py

bzrlib/ui.py

bzrlib/upgrade.py

bzrlib/util

bzrlib/util/__init__.py

bzrlib/util/effbot

bzrlib/util/effbot/__init__.py

bzrlib/util/effbot/org

bzrlib/util/effbot/org/__init__.py

bzrlib/util/effbot/org/gzip_consumer.py

bzrlib/util/effbot/org/http_client.py

bzrlib/util/effbot/org/http_manager.py

bzrlib/weave.py

bzrlib/weavefile.py

bzrlib/workingtree.py

contrib

contrib/add-bzr-to-baz

contrib/bash

contrib/bash/bzr

contrib/bash/bzr.simple

contrib/create_bzr_rollup.py

contrib/emacs

contrib/emacs/bzr-mode.el

contrib/fortune

contrib/pwclient.full

contrib/pwk

contrib/upload-bzr.dev

contrib/zsh

contrib/zsh/_bzr

doc/revfile-annotation.txt

doc/revfile.txt

doc/split-join-files.txt

doc/switch-in-branch.txt

notes/inventory-v2-sample.xml

notes/inventory-v2.rnc

notes/revfile.txt

notes/schemas.xml

patches

patches/annotate3.patch

patches/annotate4.patch

patches/cache-remote-revisions.diff

patches/find-touching-from-seq.diff

patches/meta-data-in-inventory.patch

patches/ndiff.patch

patches/pending-merge.patch

patches/plugins-no-plugins.patch

patches/progress.diff

patches/symlink-support.patch

testsweet.py

tools

tools/convertfile.py

tools/convertinv.py

tools/history2revfiles.py

tools/history2weaves.py

tools/http_client.py

tools/testweave.py

tools/weavebench.py

tools/weavemerge.sh

tutorial.txt

files removed:
bzrlib/tests.py

doc/faq.txt

doc/quickref.txt

test.sh

files renamed:
elementtree/ => bzrlib/util/elementtree/

urlgrabber/ => bzrlib/util/urlgrabber/

files modified:
.bzrignore

.rsyncexclude

NEWS

README

TODO

build-api

bzrlib/__init__.py

bzrlib/add.py

bzrlib/branch.py

bzrlib/check.py

bzrlib/commands.py

bzrlib/diff.py

bzrlib/errors.py

bzrlib/info.py

bzrlib/inventory.py

bzrlib/mdiff.py

bzrlib/newinventory.py

bzrlib/osutils.py

bzrlib/remotebranch.py

bzrlib/revfile.py

bzrlib/revision.py

bzrlib/store.py

bzrlib/textui.py

bzrlib/trace.py

bzrlib/tree.py

bzrlib/util/elementtree/ElementTree.py

bzrlib/util/urlgrabber/keepalive.py

bzrlib/xml.py

doc/Makefile

doc/formats.txt

doc/index.txt

doc/merge.txt

doc/tagging.txt

doc/todo-from-arch.txt

setup.py

testbzr


doc/revfile.txt

********
Revfiles
********

The unit for compressed storage in bzr is a *revfile*, whose design
was suggested by Matt Mackall.

This document describes version 1 of the file, and has some notes on
what might be done in version 2.

Requirements
============

Compressed storage is a tradeoff between several goals:

* Reasonably compact storage of long histories.

* Robustness and simplicity.

* Fast extraction of versions and addition of new versions (preferably
  without rewriting the whole file, or reading the whole history.)

* Fast and precise annotations.

* Storage of files of at least a few hundred MB.

* Lossless in useful ways: we can extract a series of texts and write
  them back out without losing any information.

Design
======

Revfiles store the history of a single logical file, which is
identified in bzr by its file-id. In this sense they are similar to
an RCS or CVS ``,v`` file or an SCCS sfile.

Each state of the file is called a *text*.

Renaming, adding and deleting this file are handled at a higher level
by the inventory system, and are outside the scope of the revfile. The
revfile name is typically based on the file id, which is itself
typically based on the name the file had when it was first added. But
this is purely cosmetic.

For example a file now called ``frob.c`` may have the id
``frobber.c-12873`` because it was originally called
``frobber.c``. Its texts are kept in the revfile
``.bzr/revfiles/frobber.c-12873.revs``.

When the file is deleted from the inventory the revfile does not
change. It's just not used in reproducing trees from that point
onwards.

The revfile does not record the date when the text was added, a commit
message, properties, or any other metadata. That is handled in the
higher-level revision history.

Inventories and other metadata files that vary from one version to the
next can themselves be stored in revfiles.

revfiles store files as simple byte streams, with no consideration of
translating character sets, line endings, or keywords. Those are also
handled at a higher level. However, the revfile may make use of
knowledge that a file is line-based in generating a diff.

(Python's standard-library difflib is too slow when generating a purely
byte-by-byte delta, so we always make a line-by-line diff; when this
is fixed it may be feasible to use byte-by-byte diffs for all
files.)
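
A line-by-line delta of the kind described above can be sketched with
difflib as follows. This is only an illustration: the function names
and the ``(start, end, replacement_lines)`` delta shape are assumptions
made for the example, not the encoding revfiles actually use.

```python
import difflib


def line_delta(old_bytes, new_bytes):
    """Return edits (start, end, replacement_lines) that turn old into new.

    Hypothetical delta shape for illustration; start/end index lines of old.
    """
    old_lines = old_bytes.splitlines(keepends=True)
    new_lines = new_bytes.splitlines(keepends=True)
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Replace old lines i1..i2 with new lines j1..j2 (either may be empty).
            delta.append((i1, i2, new_lines[j1:j2]))
    return delta


def apply_line_delta(old_bytes, delta):
    """Rebuild the new text from the old text and a delta from line_delta()."""
    old_lines = old_bytes.splitlines(keepends=True)
    out = []
    pos = 0
    for start, end, replacement in delta:
        out.extend(old_lines[pos:start])   # unchanged lines before the edit
        out.extend(replacement)            # inserted or replacing lines
        pos = end                          # skip the lines that were replaced
    out.extend(old_lines[pos:])
    return b"".join(out)
```

Because line endings are kept on every line, applying the delta
reproduces the new text byte for byte, which matches the "simple byte
streams" requirement above.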

Files whose text does not change from one revision to the next are
stored as just a single text in the revfile. This can happen even if
the file was renamed or other properties were changed in the
inventory.

The revfile is held on disk as two files: an *index* and a *data*
file. The index file is short and always read completely into memory;
the data file is much longer and only the relevant bits of it,
identified by the index file, need to be read.
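
The index/data split can be illustrated with a deliberately simplified
sketch. The record layout here (a data offset and a length, each an
unsigned 32-bit big-endian integer) and all function names are
assumptions for the example; the real index records carry more fields
than this.

```python
import io
import struct

# Hypothetical index record, NOT the real revfile layout:
# (offset into data file, length of stored text), both unsigned 32-bit.
RECORD = struct.Struct(">II")


def append_text(index_file, data_file, text):
    """Append text to the data file and a matching record to the index."""
    data_file.seek(0, io.SEEK_END)
    offset = data_file.tell()
    data_file.write(text)
    index_file.seek(0, io.SEEK_END)
    index_file.write(RECORD.pack(offset, len(text)))


def load_index(index_file):
    """Read the whole (short) index into memory as a list of tuples."""
    index_file.seek(0)
    raw = index_file.read()
    return [RECORD.unpack_from(raw, pos)
            for pos in range(0, len(raw), RECORD.size)]


def read_text(index, data_file, number):
    """Read only the bytes of one text, located through the index."""
    offset, length = index[number]
    data_file.seek(offset)
    return data_file.read(length)
```

Both files are only ever appended to, and extracting a text touches
just the slice of the data file that the index points at.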

This design is similar to that of Netscape `mail summary files`_, in
that there is a small index which can always be read into memory and
that quickly identifies where to look in the main file. They differ
in many other ways though, most particularly that the index is not
just a cache but holds precious data in its own right.

.. _`mail summary files`: http://www.jwz.org/doc/mailsum.html

This is meant to scale to hold 100,000 revisions of a single file, by
which time the index file will be ~4.8MB and a bit big to read
sequentially.

Some of the reserved fields could be used to implement a (semi?)
balanced tree indexed by SHA1 so we can much more efficiently find the
index associated with a particular hash. For 100,000 revs we would be
able to find it in about 17 random reads, which is not too bad. On
the other hand that would compromise the append-only indexing, and
100,000 revs is a fairly extreme case.
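
The two figures quoted above can be checked with a little arithmetic.
The sketch assumes "~4.8MB" means decimal megabytes and that a lookup
in a balanced binary tree costs one random read per level; the variable
names are only for illustration.

```python
import math

revisions = 100_000
index_size_bytes = 4_800_000  # "~4.8MB", assumed to be decimal megabytes

# Implied size of one index record at that scale.
bytes_per_record = index_size_bytes / revisions

# Depth of a balanced binary tree over that many entries,
# i.e. the number of random reads for one lookup.
tree_depth = math.ceil(math.log2(revisions))

print(bytes_per_record, tree_depth)
```

The implied record size is 48 bytes, and the tree depth rounds up to
17, which is where the "about 17 random reads" estimate comes from.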

This performs pretty well except when trying to calculate deltas of
really large files. For that the main thing would be to plug in
something faster than difflib, which is after all pure Python.
Another approach is to just store the gzipped full text of big files,
though perhaps that's too perverse?

Identifying texts
-----------------

In the current version, texts are identified by their SHA-1.
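
As a small illustration of identifying texts by SHA-1, the sketch below
hashes each text with ``hashlib`` and uses the digest to avoid storing
an unchanged text twice. ``text_id`` and ``TextTable`` are hypothetical
names for this example, not part of bzrlib.

```python
import hashlib


def text_id(text):
    """Return the hex SHA-1 of a text given as bytes."""
    return hashlib.sha1(text).hexdigest()


class TextTable:
    """Hypothetical in-memory table mapping SHA-1 to text number."""

    def __init__(self):
        self.texts = []    # text number -> bytes
        self.by_sha = {}   # hex SHA-1 -> text number

    def add(self, text):
        """Store a text once; return the number of the new or existing text."""
        sha = text_id(text)
        if sha not in self.by_sha:
            self.by_sha[sha] = len(self.texts)
            self.texts.append(text)
        return self.by_sha[sha]
```

Adding the same bytes twice returns the same text number, which is the
behaviour described earlier for files whose text does not change
between revisions.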

Skip-deltas
-----------

Because the basis of a delta does not need to be the text's logical
predecessor, we can adjust the deltas to avoid ever needing to apply
too many deltas to reproduce a particular file.
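
One possible basis-selection policy is sketched below. It is borrowed
from the skip-delta scheme used by Subversion (clear the lowest set bit
of the text number) and is shown only to illustrate the idea; this
document does not say which policy revfiles use, and the function names
are hypothetical.

```python
def skip_basis(number):
    """Return the basis text number for a delta, or None for a full text.

    Assumed policy: text 0 is stored in full, and every other text is a
    delta against the number obtained by clearing its lowest set bit.
    """
    if number == 0:
        return None
    return number & (number - 1)


def delta_chain(number):
    """List the texts needed to rebuild `number`, newest first."""
    chain = [number]
    basis = skip_basis(number)
    while basis is not None:
        chain.append(basis)
        basis = skip_basis(basis)
    return chain
```

Under this policy the number of deltas applied equals the number of set
bits in the text number, so it grows only logarithmically with the
length of the history. For example, text 13 is rebuilt from texts 0, 8
and 12 rather than from all twelve predecessors.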

Tools
-----

The revfile module can be invoked as a program to give low-level
access for data recovery, debugging, etc.

Open issues
===========

* revfiles use unsigned 32-bit integers both in diffs and the index.
  This should be more than enough for any reasonable source file but
  perhaps not enough for large binaries that are frequently committed.

  Perhaps for those files there should be an option to continue to use
  the text-store. There is unlikely to be any benefit in holding
  deltas between them, and deltas will anyhow be hard to calculate.

* The append-only design does not allow for destroying committed data,
  as when confidential information is accidentally added. That could
  be fixed by creating the fixed repository as a separate branch, into
  which only the preserved revisions are exported.

* Should we also store full-texts as a transitional step?
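
The 32-bit limit raised in the first open issue can be made concrete
with the ``struct`` module. The big-endian byte order and the helper
names are assumptions for this sketch; the point is only that an offset
or length past 4 GiB cannot be represented in such a field.

```python
import struct

U32_MAX = 2**32 - 1  # largest value an unsigned 32-bit field can hold


def fits_u32(value):
    """True if value can be stored in an unsigned 32-bit field."""
    return 0 <= value <= U32_MAX


def pack_u32(value):
    """Pack value as 4 bytes (byte order assumed big-endian here)."""
    if not fits_u32(value):
        raise ValueError("value %d does not fit in 32 bits" % value)
    return struct.pack(">I", value)
```

A data file that grows past ``U32_MAX`` bytes would need offsets that
no longer fit, which is why large, frequently committed binaries are
the worrying case.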
