~bzr-pqm/bzr/bzr.dev : contents of doc/compared-codeville.txt at revision 496

~bzr-pqm/bzr/bzr.dev : (revision 496)

Codeville
*********

Documentation on how this actually works is pretty scarce to say the
least.

I *think* I understand their merge algorithm though, and it's pretty
clever.  Basically we do a two-way merge between annotated forms of
the two files: that is, with each line marked with the revision in
which it last changed.  (I am simplifying here by speaking of lines
and changes, but I don't think it causes any essential problem.)

Now we walk through each file, line by line.  If the change that
introduced the line state in branch A is already merged into branch B,
then we can just take B.

Now is this actually better?

It may be better in several ways:

* Do not need to choose just a single ancestor, but rather can
  take advantage of all possible previous changes.

* Can handle OTHER containing changes which have been merged into
  THIS, but have then been overwritten.

* Can handle cherrypicks(!) by remembering which lines came in from
  that cherrypick; then we don't need to merge them again.

Some questions:

* Do we actually need to store the annotations, or can we just infer
  it at the time we do the merge?

* Can this be accomodated in something like an SCCS weave format?  I
  think something like a weave may work, in as much as it is basically
  a concatenation of annotations, but I don't know if it represents
  merges well enough to cope.

Can this handle binaries or type-specific merges, and if so how?
Unmergeable binaries are easy: just get the user to pick one.  Things
like XML are harder; we probably need to punt out to a type-specific
three-way merge.  Of course this approach does not forbid also
offering a 3-way merge.

----

I suppose this could be accomodated by an annotation cache on top of
plain history storage, or by using a storage format such as a weave
that can efficiently produce annotation information.

That is to say there is nothing inherently necessary about remembering
the line history at the point when it is committed, except that it
might be more efficient to do this once and remember it than to 

----

There is another interesting approach that can be used even in a tool
that does not inherently remember annotations:

Given two files to merge, find all regions of difference.  For each
such, try to find a common ancestor having the same content for the
region.  Subdivide the region if necessary.

This naive approach is probably infeasible, since it would mean
checking every possible predecessor.

----

Rather than storing or calculating annotations, we could try using a
complex weave, which allows one file version to be represented as a
weave of multiple disjoint previous versions.  It sounds complex but
it might work.

Essentially we store each file as a selection of lines that should be
turned on in that file.  These files might come from any of the
predecessors that were merged into that file.  Complex to get right
but it might work.

This is written in terms of lines, but it might make more sense to
just use byte ranges: perhaps more efficient when handling long files,
and makes binaries less of a special case.

codeville in fact does *not* seem to do this, though to me it seems
like a fairly natural corollary of their design.

This seems to imply holding the file text and ancestry of every branch
that ever merged into this one, rather than just finding them if we
later want them.  Hm.  That is nice in terms of doing smart merges.
That possibly causes trouble in terms of having a name for these
branches floating around inside our space, and making sure we don't
clash with them.  It may make sense in terms of having a working
directory be just a view into a shared database, looking at a
particular line of development.

Indeed the main difficulty seems to be of naming branches in this
space.  Perhaps we should move back to using repositories and named
branches within them, but not rely on branch names being unique out of
the context of a single repository.

Wow, this seems to open a big can of worms.

----

So the conclusion is that this is very cool, but it does not require a
fundamental change of model and can be implemented later.

6 by mbp at sourcefrog import all docs from arch	1	Codeville
	2	*********
	3
	4	Documentation on how this actually works is pretty scarce to say the
	5	least.
	6
	7	I think I understand their merge algorithm though, and it's pretty
	8	clever. Basically we do a two-way merge between annotated forms of
	9	the two files: that is, with each line marked with the revision in
	10	which it last changed. (I am simplifying here by speaking of lines
	11	and changes, but I don't think it causes any essential problem.)
	12
	13	Now we walk through each file, line by line. If the change that
	14	introduced the line state in branch A is already merged into branch B,
	15	then we can just take B.
	16
	17	Now is this actually better?
	18
	19	It may be better in several ways:
	20
	21	* Do not need to choose just a single ancestor, but rather can
	22	take advantage of all possible previous changes.
	23
	24	* Can handle OTHER containing changes which have been merged into
	25	THIS, but have then been overwritten.
	26
	27	* Can handle cherrypicks(!) by remembering which lines came in from
	28	that cherrypick; then we don't need to merge them again.
	29
	30	Some questions:
	31
	32	* Do we actually need to store the annotations, or can we just infer
	33	it at the time we do the merge?
	34
	35	* Can this be accomodated in something like an SCCS weave format? I
	36	think something like a weave may work, in as much as it is basically
	37	a concatenation of annotations, but I don't know if it represents
	38	merges well enough to cope.
	39
	40	Can this handle binaries or type-specific merges, and if so how?
	41	Unmergeable binaries are easy: just get the user to pick one. Things
	42	like XML are harder; we probably need to punt out to a type-specific
	43	three-way merge. Of course this approach does not forbid also
	44	offering a 3-way merge.
	45
	46	----
	47
	48	I suppose this could be accomodated by an annotation cache on top of
	49	plain history storage, or by using a storage format such as a weave
	50	that can efficiently produce annotation information.
	51
	52	That is to say there is nothing inherently necessary about remembering
	53	the line history at the point when it is committed, except that it
	54	might be more efficient to do this once and remember it than to
	55
	56	----
	57
	58	There is another interesting approach that can be used even in a tool
	59	that does not inherently remember annotations:
	60
	61	Given two files to merge, find all regions of difference. For each
	62	such, try to find a common ancestor having the same content for the
	63	region. Subdivide the region if necessary.
	64
65	This naive approach is probably infeasible, since it would mean
66	checking every possible predecessor.
67
68	----
69
28 by Martin Pool doc: notes on implementing codeville-style merge on	70	Rather than storing or calculating annotations, we could try using a
	71	complex weave, which allows one file version to be represented as a
	72	weave of multiple disjoint previous versions. It sounds complex but
	73	it might work.
	74
	75	Essentially we store each file as a selection of lines that should be
	76	turned on in that file. These files might come from any of the
	77	predecessors that were merged into that file. Complex to get right
	78	but it might work.
	79
	80	This is written in terms of lines, but it might make more sense to
	81	just use byte ranges: perhaps more efficient when handling long files,
	82	and makes binaries less of a special case.
	83
	84	codeville in fact does not seem to do this, though to me it seems
	85	like a fairly natural corollary of their design.
	86
	87	This seems to imply holding the file text and ancestry of every branch
	88	that ever merged into this one, rather than just finding them if we
	89	later want them. Hm. That is nice in terms of doing smart merges.
	90	That possibly causes trouble in terms of having a name for these
	91	branches floating around inside our space, and making sure we don't
	92	clash with them. It may make sense in terms of having a working
	93	directory be just a view into a shared database, looking at a
	94	particular line of development.
	95
	96	Indeed the main difficulty seems to be of naming branches in this
	97	space. Perhaps we should move back to using repositories and named
	98	branches within them, but not rely on branch names being unique out of
	99	the context of a single repository.
	100
	101	Wow, this seems to open a big can of worms.
	102
	103	----
	104
6 by mbp at sourcefrog import all docs from arch	105	So the conclusion is that this is very cool, but it does not require a
6 by mbp at sourcefrog import all docs from arch	106	fundamental change of model and can be implemented later.