~bzr-pqm/bzr/bzr.dev


Viewing changes to doc/revfile-annotation.txt

Committer: mbp at sourcefrog
Date: 2005-04-07 06:16:02 UTC
Revision ID: mbp@sourcefrog.net-20050407061602-6b7da239ef883b0817b7fcd3

more XML performance tests

files added:
bzrlib/tests.py

doc/faq.txt

doc/quickref.txt

doc/roadmap.txt

doc/testing.txt

doc/work-order.txt

test.sh

files removed:
.rsyncexclude

HACKING

TODO

bzr-man.py

bzrlib/atomicfile.py

bzrlib/changeset.py

bzrlib/commit.py

bzrlib/delta.py

bzrlib/fetch.py

bzrlib/hashcache.py

bzrlib/help.py

bzrlib/intset.py

bzrlib/lock.py

bzrlib/log.py

bzrlib/mdiff.py

bzrlib/merge.py

bzrlib/merge3.py

bzrlib/merge_core.py

bzrlib/meta_store.py

bzrlib/missing.py

bzrlib/patch.py

bzrlib/plugin.py

bzrlib/plugins

bzrlib/plugins/__init__.py

bzrlib/plugins/checkperms

bzrlib/progress.py

bzrlib/revfile.py

bzrlib/selftest

bzrlib/selftest/__init__.py

bzrlib/selftest/blackbox.py

bzrlib/selftest/plugins.py

bzrlib/selftest/testbranch.py

bzrlib/selftest/testdiff.py

bzrlib/selftest/testhashcache.py

bzrlib/selftest/testinv.py

bzrlib/selftest/testlog.py

bzrlib/selftest/testmerge3.py

bzrlib/selftest/testrevision.py

bzrlib/selftest/testrevisionnamespaces.py

bzrlib/selftest/teststatus.py

bzrlib/selftest/versioning.py

bzrlib/selftest/whitebox.py

bzrlib/status.py

bzrlib/textinv.py

bzrlib/upgrade.py

bzrlib/util

bzrlib/util/__init__.py

bzrlib/util/effbot

bzrlib/util/effbot/__init__.py

bzrlib/util/effbot/org

bzrlib/util/effbot/org/__init__.py

bzrlib/util/effbot/org/gzip_consumer.py

bzrlib/util/effbot/org/http_client.py

bzrlib/util/effbot/org/http_manager.py

bzrlib/util/urlgrabber

bzrlib/util/urlgrabber/__init__.py

bzrlib/util/urlgrabber/byterange.py

bzrlib/util/urlgrabber/grabber.py

bzrlib/util/urlgrabber/keepalive.py

bzrlib/util/urlgrabber/mirror.py

bzrlib/util/urlgrabber/progress.py

bzrlib/weave.py

bzrlib/weavefile.py

bzrlib/workingtree.py

contrib

contrib/add-bzr-to-baz

contrib/bash

contrib/bash/bzr

contrib/bash/bzr.simple

contrib/create_bzr_rollup.py

contrib/emacs

contrib/emacs/bzr-mode.el

contrib/fortune

contrib/pwclient.full

contrib/pwk

contrib/upload-bzr.dev

contrib/zsh

contrib/zsh/_bzr

doc/revfile-annotation.txt

doc/revfile.txt

doc/split-join-files.txt

doc/switch-in-branch.txt

notes/revfile.txt

patches

patches/annotate3.patch

patches/annotate4.patch

patches/cache-remote-revisions.diff

patches/find-touching-from-seq.diff

patches/meta-data-in-inventory.patch

patches/ndiff.patch

patches/pending-merge.patch

patches/plugins-no-plugins.patch

patches/progress.diff

patches/symlink-support.patch

testbzr

testsweet.py

tools

tools/convertfile.py

tools/convertinv.py

tools/history2revfiles.py

tools/testweave.py

tools/weavebench.py

tools/weavemerge.sh

tutorial.txt

files renamed:
bzrlib/util/elementtree/ => elementtree/

files modified:
.bzrignore

NEWS

README

build-api

bzrlib/__init__.py

bzrlib/add.py

bzrlib/branch.py

bzrlib/check.py

bzrlib/commands.py

bzrlib/diff.py

bzrlib/errors.py

bzrlib/info.py

bzrlib/inventory.py

bzrlib/newinventory.py

bzrlib/osutils.py

bzrlib/remotebranch.py

bzrlib/revision.py

bzrlib/store.py

bzrlib/textui.py

bzrlib/trace.py

bzrlib/tree.py

bzrlib/xml.py

doc/Makefile

doc/bitkeeper.txt

doc/formats.txt

doc/index.txt

doc/interrupted.txt

doc/merge.txt

doc/python.txt

doc/random.txt

doc/tagging.txt

doc/todo-from-arch.txt

elementtree/ElementTree.py

setup.py


doc/revfile-annotation.txt

==============================
Extension to store annotations
==============================

We might extend the revfile format in a future version to also store
annotations. *This is not implemented yet.*

In previous versions, the index file identified texts by their
SHA-1 digest. This was unsatisfying for two reasons. Firstly it
assumes that SHA-1 will not collide, which is not an assumption we
wish to make in long-lived files. Secondly for annotations we need
to be able to map from file versions back to a revision.

Texts are identified by the name of the revfile and a UUID
corresponding to the revision in which they were first introduced.
This means that given a text we can identify which revision it
belongs to, and annotations can use the index within the revfile to
identify where a region was first introduced.

We cannot identify texts by the integer revision number, because
that would limit us to only referring to a file in a particular
branch.

I'd like to just use the revision-id, but those are variable-length
strings, and I'd like the revfile index to be fixed-length and
relatively short. UUIDs can be encoded in binary as only 16 bytes.
Perhaps we should just use UUIDs for revisions and be done?
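The size argument can be illustrated with a short sketch. It assumes a
randomly generated UUID, since this note does not say how UUIDs would be
assigned to revisions, and it borrows the revision-id of the commit shown
at the top of this page as the variable-length example:

```python
import uuid

# A real revision-id (from this commit): a variable-length string.
revision_id = "mbp@sourcefrog.net-20050407061602-6b7da239ef883b0817b7fcd3"

# Hypothetical: a freshly generated UUID standing in for that revision.
rev_uuid = uuid.uuid4()

# The binary form is always exactly 16 bytes, whatever the UUID's value.
uuid_bytes = rev_uuid.bytes
assert len(uuid_bytes) == 16

# The 16 bytes round-trip back to the same UUID without loss.
assert uuid.UUID(bytes=uuid_bytes) == rev_uuid

print(len(revision_id), len(uuid_bytes))
```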

Annotations
-----------

Annotations indicate which revision of a file first inserted a line
(or region of bytes).

Given a string, we can write annotations on it like so: a sequence of
*(index, length)* pairs, giving the *index* of the revision which
introduced the next run of *length* bytes. The sum of the lengths
must equal the length of the string. For text files the regions will
typically fall on line breaks. This can be transformed in memory to
other structures, such as a list of *(index, content)* pairs.
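A minimal sketch of that in-memory transformation follows. The function
name ``expand_annotations`` and the sample data are made up for
illustration and are not part of any implemented format:

```python
def expand_annotations(text, runs):
    """Turn (index, length) runs over `text` into (index, content) pairs."""
    # The run lengths must account for every byte of the text exactly once.
    if sum(length for _, length in runs) != len(text):
        raise ValueError("run lengths must sum to the length of the text")
    pairs = []
    pos = 0
    for index, length in runs:
        pairs.append((index, text[pos:pos + length]))
        pos += length
    return pairs


# Three lines: the first and last came from index 0, the middle from index 2.
text = b"hello\nworld\nagain\n"
runs = [(0, 6), (2, 6), (0, 6)]
pairs = expand_annotations(text, runs)
```

Here the runs fall on line breaks, as the note expects for text files, but
nothing in the sketch depends on that.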

When a line is inserted by a merge revision, the annotation for that
line should still name the source revision in the merged branch,
rather than just the revision in which the merge took place.

Annotations can be calculated cheaply when inserting a new text, but
are expensive to calculate after the fact because that requires
searching back through all previous texts and all texts which were
merged in. It therefore seems sensible to calculate them once and
store them.

To do this we need two operators which update an existing annotated
file:

A. Given an annotated file and a working text, update the annotation to
   mark regions inserted in the working file as new in this revision.

B. Given two annotated files, merge them to produce an annotated
   result. When there are conflicts, both texts should be included
   and annotated.

These may be repeated: after a merge there may be another merge, or
there may be manual fixups or conflict resolutions.

So what we require is a way, given a diff or a diff3 between two
files, to map the regions of bytes changed into corresponding updates
to the origin annotations.
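Operator A can be sketched as follows. This is only an illustration under
several assumptions: it works on whole lines rather than byte regions, it
uses Python's ``difflib.SequenceMatcher`` as a stand-in for whatever diff
algorithm would really be used, and the name ``annotate_working_text`` is
hypothetical:

```python
import difflib


def annotate_working_text(annotated, working_lines, new_index):
    """Carry annotations over to a working text (operator A).

    `annotated` is a list of (index, line) pairs for the old text.
    Lines unchanged by the diff keep their old index; lines inserted or
    replaced in the working text are marked with `new_index`.
    """
    old_lines = [line for _, line in annotated]
    matcher = difflib.SequenceMatcher(
        None, old_lines, working_lines, autojunk=False)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged region: keep the original origin annotations.
            result.extend(annotated[i1:i2])
        else:
            # 'insert' and 'replace' add new lines; for 'delete' the
            # range j1:j2 is empty, so nothing is added.
            result.extend((new_index, line) for line in working_lines[j1:j2])
    return result


old = [(0, "a\n"), (0, "b\n"), (1, "c\n")]
working = ["a\n", "x\n", "c\n"]
updated = annotate_working_text(old, working, new_index=2)
```

Operator B would need a three-way diff and is not covered by this sketch.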

Annotations can also be delta-compressed; we only need to add new
annotation data when there is a text insertion.

(It is possible in a merge to have a change of annotation when
there is no text change, though this seems unlikely. This can
still be represented as a "pointless" delta, plus an update to the
annotations.)

Index file
----------

In a proposed (not implemented) storage with annotations, the index
file is a series of fixed-length records::

    byte[16]    UUID of revision
    byte[20]    SHA-1 of expanded text (as binary, not hex)
    uint32      flags: 1=zlib compressed
    uint32      sequence number this is based on, or -1 for full text
    uint32      offset in text file of start
    uint32      length of compressed delta in text file
    uint32[3]   reserved

Total 64 bytes.

The header is also 64 bytes, for tidiness and easy calculation. For
this format the header must be ``bzr revfile v2\n`` padded with
``\xff`` to 64 bytes.

The first record after the header is index 0. A record's base index
must be less than its own index.
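The record layout can be checked with a sketch using Python's ``struct``
module. Two things here are assumptions, not part of the proposal: the
byte order (big-endian is used) and reading the base sequence number as a
signed 32-bit value so that -1 can be stored directly. The helper names
are hypothetical:

```python
import hashlib
import struct
import uuid

# UUID, SHA-1, flags, base, offset, length, then 12 reserved bytes.
# ">" = big-endian with no padding (assumed; the note gives no byte order).
# The base is packed as signed ("i") so that -1 means "full text" (assumed).
RECORD = struct.Struct(">16s20sIiII12x")

HEADER = b"bzr revfile v2\n".ljust(64, b"\xff")
FLAG_ZLIB = 1


def pack_record(rev_uuid, text, flags, base, offset, length):
    """Build one 64-byte index record for the expanded `text`."""
    sha = hashlib.sha1(text).digest()  # binary digest, not hex
    return RECORD.pack(rev_uuid.bytes, sha, flags, base, offset, length)


def unpack_record(index_bytes, idx):
    """Read record `idx` (0-based, counted after the header)."""
    start = len(HEADER) + idx * RECORD.size
    return RECORD.unpack_from(index_bytes, start)


rev = uuid.uuid4()
index_file = HEADER + pack_record(rev, b"hello\n", FLAG_ZLIB, -1, 0, 14)
raw_uuid, sha, flags, base, offset, length = unpack_record(index_file, 0)
```

Because both the header and every record are 64 bytes, record *i* always
starts at byte ``64 * (i + 1)`` of the index file.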

The SHA-1 is redundant with the inventory but stored just as a check
on the compression methods and so that the file can be validated
without reference to any other information.

Each byte in the text file should be included by at most one delta.

Deltas
------

In a proposed (not implemented) storage with annotations, deltas to
the text are stored as a series of variable-length records::

    uint32      idx
    uint32      m
    uint32      n
    uint32      l
    byte[l]     new

This describes a change originally introduced in the revision
described by *idx* in the index.

This indicates that the region [m:n] of the input file should be
replaced by the text *new*. If m==n this is a pure insertion of l
bytes. If l==0 this is a pure deletion of (n-m) bytes.
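The following sketch parses and applies such a series of records. It
assumes big-endian integers, and that all *m* and *n* offsets refer to the
original input and appear in increasing, non-overlapping order; the note
does not specify either point. The function names are hypothetical:

```python
import struct

# idx, m, n, l as four unsigned 32-bit integers (byte order assumed).
DELTA_HEADER = struct.Struct(">IIII")


def parse_delta(data):
    """Yield (idx, m, n, new) for each variable-length record in `data`."""
    pos = 0
    while pos < len(data):
        idx, m, n, l = DELTA_HEADER.unpack_from(data, pos)
        pos += DELTA_HEADER.size
        new = data[pos:pos + l]
        if len(new) != l:
            raise ValueError("truncated delta record")
        pos += l
        yield idx, m, n, new


def apply_delta(source, data):
    """Replace each region [m:n] of `source` with the record's new text."""
    out = []
    cursor = 0
    for idx, m, n, new in parse_delta(data):
        if not (cursor <= m <= n <= len(source)):
            raise ValueError("delta regions must be ordered and in range")
        out.append(source[cursor:m])  # copy the untouched bytes
        out.append(new)               # empty for a pure deletion
        cursor = n                    # n == m for a pure insertion
    out.append(source[cursor:])
    return b"".join(out)


source = b"hello world\n"
delta = (DELTA_HEADER.pack(2, 0, 0, 3) + b">> "         # pure insertion
         + DELTA_HEADER.pack(1, 6, 11, 5) + b"there")   # replacement
result = apply_delta(source, delta)
```

The *idx* field is not needed to rebuild the text; it is what would let an
annotation be derived from the delta itself.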


Open issues
-----------

* Storing the annotations with the text is reasonably simple and
  compact, but means that we always need to process the annotation
  structure even when we only want the text. In particular it means
  that full-texts cannot simply be copied out but must instead be
  composed from chunks. That seems inefficient since it is probably
  common to only want the text.

* Should annotations also indicate where text was deleted?

* This design calls for only one annotation per line, which seems
  standard. However, this is lacking in at least two cases:

  - Lines which originate in the same way in more than one revision,
    through being independently introduced. In this case we would
    apparently have to make an arbitrary choice; I suppose branches
    could prefer to assume lines originated in their own history.

  - It might be useful to directly indicate which mergers included
    which lines. We do have that information in the revision history
    though, so there seems no need to store it for every line.
