~bzr-pqm/bzr/bzr.dev : contents of doc/revfile-annotation.txt at revision 1398

~bzr-pqm/bzr/bzr.dev : (revision 1398)

==============================
Extension to store annotations
==============================

We might extend the revfile format in a future version to also store
annotations.  *This is not implemented yet.*

In previous versions, the  index file identified texts by their
SHA-1 digest.  This was unsatisfying for two reasons.  Firstly it
assumes that SHA-1 will not collide, which is not an assumption we
wish to make in long-lived files.  Secondly for annotations we need
to be able to map from file versions back to a revision.

Texts are identified by the name of the revfile and a UUID
corresponding to the first revision in which they were first
introduced.  This means that given a text we can identify which
revision it belongs to, and annotations can use the index within the
revfile to identify where a region was first introduced.

  We cannot identify texts by the integer revision number, because
  that would limit us to only referring to a file in a particular
  branch.

  I'd like to just use the revision-id, but those are variable-length
  strings, and I'd like the revfile index to be fixed-length and
  relatively short.  UUIDs can be encoded in binary as only 16 bytes.
  Perhaps we should just use UUIDs for revisions and be done?



Annotations
-----------

Annotations indicate which revision of a file first inserted a line
(or region of bytes).

Given a string, we can write annotations on it like so: a sequence of
*(index, length)* pairs, giving the *index* of the revision which
introduced the next run of *length* bytes.  The sum of the lengths
must equal the length of the string.  For text files the regions will
typically fall on line breaks.  This can be transformed in memory to
other structures, such as a list of *(index, content)* pairs.

When a line was inserted from a merge revision then the annotation for
that line should still be the source in the merged branch, rather than
just being the revision in which the merge took place.

They can cheaply be calculated when inserting a new text, but are
expensive to calculate after the fact because that requires searching
back through all previous text and all texts which were merged in.  It
therefore seems sensible to calculate them once and store them.

To do this we need two operators which update an existing annotated
file:

A. Given an annotated file and a working text, update the annotation to
   mark regions inserted in the working file as new in this revision.

B. Given two annotated files, merge them to produce an annotated
   result.    When there are conflicts, both texts should be included
   and annotated.

These may be repeated: after a merge there may be another merge, or
there may be manual fixups or conflict resolutions.

So what we require is given a diff or a diff3 between two files, map
the regions of bytes changed into corresponding updates to the origin
annotations.

Annotations can also be delta-compressed; we only need to add new
annotation data when there is a text insertion.

    (It is possible in a merge to have a change of annotation when
    there is no text change, though this seems unlikely.  This can
    still be represented as a "pointless" delta, plus an update to the
    annotations.)

Index file
----------

In a proposed (not implemented) storage with annotations, the index
file is a series of fixed-length records::

  byte[16]     UUID of revision
  byte[20]     SHA-1 of expanded text (as binary, not hex)
  uint32       flags: 1=zlib compressed
  uint32       sequence number this is based on, or -1 for full text
  uint32       offset in text file of start
  uint32       length of compressed delta in text file
  uint32[3]    reserved

Total 64 bytes.

The header is also 64 bytes, for tidyness and easy calculation.  For
this format the header must be ``bzr revfile v2\n`` padded with
``\xff`` to 64 bytes.

The first record after the header is index 0.  A record's base index
must be less than its own index.

The SHA-1 is redundant with the inventory but stored just as a check
on the compression methods and so that the file can be validated
without reference to any other information.

Each byte in the text file should be included by at most one delta.


Deltas
------

In a proposed (not implemented) storage with annotations, deltas to
the text are stored as a series of variable-length records::

  uint32        idx
  uint32        m
  uint32        n
  uint32        l
  byte[l]       new

This describes a change originally introduced in the revision
described by *idx* in the index.

This indicates that the region [m:n] of the input file should be
replaced by the text *new*.  If m==n this is a pure insertion of l
bytes.  If l==0 this is a pure deletion of (n-m) bytes.





Open issues
-----------


* Storing the annotations with the text is reasonably simple and
  compact, but means that we always need to process the annotation
  structure even when we only want the text.  In particular it means
  that full-texts cannot just simply be copied out but rather composed
  from chunks.  That seems inefficient since it is probably common to
  only want the text.

* Should annotations also indicate where text was deleted?

* This design calls for only one annotation per line, which seems
  standard.  However, this is lacking in at least two cases:

  - Lines which originate in the same way in more than one revision,
    through being independently introduced.  In this case we would
    apparently have to make an arbitrary choice; I suppose branches
    could prefer to assume lines originated in their own history.

  - It might be useful to directly indicate which mergers included
    which lines.  We do have that information in the revision history
    though, so there seems no need to store it for every line.


391 by Martin Pool - split out notes on storing annotations in revfiles	1	==============================
	2	Extension to store annotations
	3	==============================
	4
	5	We might extend the revfile format in a future version to also store
	6	annotations. This is not implemented yet.
	7
	8	In previous versions, the index file identified texts by their
	9	SHA-1 digest. This was unsatisfying for two reasons. Firstly it
	10	assumes that SHA-1 will not collide, which is not an assumption we
	11	wish to make in long-lived files. Secondly for annotations we need
	12	to be able to map from file versions back to a revision.
	13
	14	Texts are identified by the name of the revfile and a UUID
	15	corresponding to the first revision in which they were first
	16	introduced. This means that given a text we can identify which
	17	revision it belongs to, and annotations can use the index within the
	18	revfile to identify where a region was first introduced.
	19
	20	We cannot identify texts by the integer revision number, because
	21	that would limit us to only referring to a file in a particular
	22	branch.
	23
	24	I'd like to just use the revision-id, but those are variable-length
	25	strings, and I'd like the revfile index to be fixed-length and
	26	relatively short. UUIDs can be encoded in binary as only 16 bytes.
	27	Perhaps we should just use UUIDs for revisions and be done?
	28
	29
	30
	31	Annotations
	32	-----------
	33
	34	Annotations indicate which revision of a file first inserted a line
	35	(or region of bytes).
	36
	37	Given a string, we can write annotations on it like so: a sequence of
	38	(index, length) pairs, giving the index of the revision which
	39	introduced the next run of length bytes. The sum of the lengths
	40	must equal the length of the string. For text files the regions will
	41	typically fall on line breaks. This can be transformed in memory to
	42	other structures, such as a list of (index, content) pairs.
	43
	44	When a line was inserted from a merge revision then the annotation for
	45	that line should still be the source in the merged branch, rather than
	46	just being the revision in which the merge took place.
	47
	48	They can cheaply be calculated when inserting a new text, but are
	49	expensive to calculate after the fact because that requires searching
	50	back through all previous text and all texts which were merged in. It
	51	therefore seems sensible to calculate them once and store them.
	52
	53	To do this we need two operators which update an existing annotated
	54	file:
	55
	56	A. Given an annotated file and a working text, update the annotation to
	57	mark regions inserted in the working file as new in this revision.
	58
	59	B. Given two annotated files, merge them to produce an annotated
	60	result. When there are conflicts, both texts should be included
	61	and annotated.
	62
	63	These may be repeated: after a merge there may be another merge, or
	64	there may be manual fixups or conflict resolutions.
65
66	So what we require is given a diff or a diff3 between two files, map
67	the regions of bytes changed into corresponding updates to the origin
68	annotations.
69
70	Annotations can also be delta-compressed; we only need to add new
71	annotation data when there is a text insertion.
72
73	(It is possible in a merge to have a change of annotation when
74	there is no text change, though this seems unlikely. This can
75	still be represented as a "pointless" delta, plus an update to the
76	annotations.)
77
78	Index file
79	----------
80
81	In a proposed (not implemented) storage with annotations, the index
82	file is a series of fixed-length records::
83
84	byte[16] UUID of revision
85	byte[20] SHA-1 of expanded text (as binary, not hex)
86	uint32 flags: 1=zlib compressed
87	uint32 sequence number this is based on, or -1 for full text
88	uint32 offset in text file of start
89	uint32 length of compressed delta in text file
90	uint32[3] reserved
91
92	Total 64 bytes.
93
94	The header is also 64 bytes, for tidyness and easy calculation. For
95	this format the header must be ``bzr revfile v2\n`` padded with
96	``\xff`` to 64 bytes.
97
98	The first record after the header is index 0. A record's base index
99	must be less than its own index.
100
101	The SHA-1 is redundant with the inventory but stored just as a check
102	on the compression methods and so that the file can be validated
103	without reference to any other information.
104
105	Each byte in the text file should be included by at most one delta.
106
107
108	Deltas
109	------
110
111	In a proposed (not implemented) storage with annotations, deltas to
112	the text are stored as a series of variable-length records::
113
114	uint32 idx
115	uint32 m
116	uint32 n
117	uint32 l
118	byte[l] new
119
120	This describes a change originally introduced in the revision
121	described by idx in the index.
122
123	This indicates that the region [m:n] of the input file should be
124	replaced by the text new. If m==n this is a pure insertion of l
125	bytes. If l==0 this is a pure deletion of (n-m) bytes.
126
127
128
129
130
131	Open issues
132	-----------
133
134
135	* Storing the annotations with the text is reasonably simple and
136	compact, but means that we always need to process the annotation
137	structure even when we only want the text. In particular it means
138	that full-texts cannot just simply be copied out but rather composed
139	from chunks. That seems inefficient since it is probably common to
140	only want the text.
141
142	* Should annotations also indicate where text was deleted?
143
144	* This design calls for only one annotation per line, which seems
145	standard. However, this is lacking in at least two cases:
146
147	- Lines which originate in the same way in more than one revision,
148	through being independently introduced. In this case we would
149	apparently have to make an arbitrary choice; I suppose branches
150	could prefer to assume lines originated in their own history.
151
152	- It might be useful to directly indicate which mergers included
153	which lines. We do have that information in the revision history
154	though, so there seems no need to store it for every line.
155