~bzr-pqm/bzr/bzr.dev : contents of doc/developers/incremental-push-pull.txt at revision 4634.108.9

~bzr-pqm/bzr/bzr.dev : (revision 4634.108.9)

Incremental push/pull
=====================

This use case covers pulling in or pushing out some number of revisions which
is typically a small fraction of the number already present in the target
repository. Pushing and pulling are defined as branch level operations for ease
of interaction with VCS systems that have no repository abstraction (such as
bzr-svn or GNU Arch) but within bzrlib's core they are currently the
responsibility of the Repository object.

Functional Requirements
-----------------------

A push or pull operation must:
 * Copy all the data to reconstruct the selected revisions in the target
   branch. This is the goal of push and pull after all.
 * Reject corrupt data. As bzr has no innate mechanism for discarding corrupted
   data, corrupted data should not be incorporated accidentally.

Factors which should add work for push/pull
-------------------------------------------

 * Baseline overhead: The time to connect to both branches.
 * Actual new data in the revisions being pulled (drives the amount of data to
   move around, includes the commit messages etc)
 * Number of revisions in the two repositories (scaling affects the
   determination of what revisions to move around).

Push/pull overview
------------------

1. New data is identified in the source repository.
2. That data is read from the source repository.
3. The same data is verified and written to the target repository in such a
   manner that its not visible to readers until its ready for use.

New data identification
~~~~~~~~~~~~~~~~~~~~~~~

We have a single top level data object: revisions. Everything else is
subordinate to revisions, so determining the revisions to propagate should be
all thats needed. This depends on revisions with partial data - such as those
with no signature - being flagged in some efficient manner.

We could do this in two manners: determine revisions to sync and signatures to sync in two passes, or change the 'value' of a revision implicitly when the signature is different. E.g. by using merkle hash trees with the signature data a separate component the signatures will naturally be identified to sync.

We want to only exchange data proportional to the number of new revisions and
signatures in the system though. One way to achieve this for revisions is to
walk the graph out from the desired tips until the surface area intersection is
found. For signatures a set difference seems to be needed as there is no DAG of signatures: the presence of one has no implications on the presence of another, so a full pass over the set of signatures would be required to confirm no new signatures are needed (let alone replaced signatures).

IFF we can determine 'new revisions' and 'new signatures' without full graph access then we can scale acceptable for push and pull.

Ghosts are revisions which are not present in a particular repository. Filling ghosts refers to removing ghosts in the target repository when the ghost is present in the source repository. Filling ghosts can be either an explicit or implicit action. The common case is no ghosts.

Set synchronisation approaches
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A set synchronisation approach is one which synchronises two sets without
regard for innate structure. This can be very efficient but requires adding a
new node to be processed with every commit. Caching of the results of the
various set based syncs I've seen is possible but because the data structures
look different depending on the tip revision being synced up to the cache needs
to be very complex. I recommend not using such an approach for the common case
pull because of the failure to scale. We can use such an approach for
synchronisation of new signatures and ghosts, which should be an explicit
option in both cases.

DAG synchronisation approaches
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A DAG based approach to synchronistion is one that uses the DAG structure to
determine the difference in present nodes. It can as a result operate from the
tip of the DAG backwards. A dag based approach should allow incremental access
to data and not require a full-graph scan for incremental operations.

File level scaling
^^^^^^^^^^^^^^^^^^

We should read roughly as much of the revision level graph as is needed from
each repository to determine the node difference.  If requested we should
perform a detailed scan to pick up ghost revisions and revisions which have had
signatures added. This should not be the default as it requires full history
access in both cases.

Expected file IO and access pattern:

 * Common case: repo with many branches of one project, to the same.

   1. Source and Target branch tips read.
   2. Find the tip of each branch in their repo (will require reading some of
      the revision graph but is typically near the end of the graph).
   3. Read and parse increasing amounts of the revision graph until one is
      found to be a subset of the other, or a complete list of revisions to be
      transmitted is created.

 * Uncommon cases: 
   
   1. Repositories with many projects or branches which are very old may
      require reading a lot of unrelated graph data.

   1. Initial push/pull scenarios should not require reading an entire graph.


API scaling
^^^^^^^^^^^

 1. Get branch tips.
 2. Determine one sided graph difference. To avoid obtaining a full graph over
    the wire this needs to be done without reference to the full graph, and
    with some logarthmic scaling algorithm. There are several already available
    for this. 

With ghost and new-signature detection:

 * File IO access pattern will read the entire graph on the 'target' side - if
   no ghosts are present then stop, otherwise seek the new revisions on the
   source side with the regular algorithm and also explicitly search for the
   ghost points from the target; plus a set difference search is needed on
   signatures.

 * Semantic level can probably be tuned, but as its also complex I suggest
   deferring analysis for optimal behaviour of this use case.


Data reading
~~~~~~~~~~~~

When transferring information about a revision the graph of data for the
revision is walked: revision -> inventory, revision -> matching signature,
inventory -> file ids:revision pairs.

File level scaling
^^^^^^^^^^^^^^^^^^

As we're reading already committed data, as long as nothing is mutating data on
disk reading should be race free. We will:

 - read each revision object
 - read the matching inventory delta
 - attempt to read a signature object
 - parse the inventory delta
 - read the fileid:revisionid compressed chunk for each line in the inventory
   delta

Theres no point validating that the data read is valid, as transmission through to the client writing the data might invalidate it; we need to validate before we write.

API scaling
^^^^^^^^^^^

Given that we have established the revisions needed, a single API call should
suffice to obtain all data; the API should present the data in such an order
that it can be validated as it arrives and thus not require large scale
buffering on disk. Specifically each item of data should be validatable (e.g.
for some file data we want the fileid:revisionid:validationhash + content).


Data Verification and writing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

New data written to a repository should be completed intact when it is made
visible. This suggests that either all the data for a revision must be made
atomically visible (e.g. by renaming a single file) or the leaf nodes of the
reference graph must become visible first.

Data is referred to via the following graph:
revision -> revision
revision -> signature
revision -> inventory
inventory -> fileid:revisionid
fileid:revisionid -> fileid:revisionid

Data is verifiable via a different ordering:
signature -> revision -> inventory -> fileid:revisionid texts.

We dont gpg verify each revision today; this analysis only speaks to hash
verification of contents.

To validate a revision we need to validate the data it refers to. But to
validate the contents of a revision we need the new texts in the inventory for
the revision - to check a fileid:revisionid we need to know the expected sha1
of the full text and thus also need to read the delta chain to construct the
text as we accept it to determine if its valid. Providing separate validators
for the chosen representation would address this.
e.g: For an inventory entry FILEID:REVISIONID we store the validator of the
full text :SHA1:. If we also stored the validator of the chosen disk
representation (:DELTASHA1:) we could validate the transmitted representation
without expanding the delta in the common case. If that failed we could expand
the delta chain and try against the full text validator, and finally fail. As
different delta generators might generate different deltas, :DELTASHA1: should
not become part of the revision validator, only the inventory disk encoding. In
a related manner a transmission format that allowed cheap validation of content
without applying locally stored deltas would be advantageous because no local
reads would be incurred to validate new content. For instance, always sending a
full text for any file, possibly with a delta-chain when transmitting multiple
revisionids of the file, would allow this. (git pack-files have this property).

Overview summary
^^^^^^^^^^^^^^^^

A single-file local format would allow safe atomic addition of data while
allowing optimisal transmission order of data. Failing this the validation of
data should be tuned to not require reading local texts during data addition
even in the presence of delta chains. We should have transmission-validators
separate from content validators that allow validation of the delta-transmitted
form of objects.

File level scaling
^^^^^^^^^^^^^^^^^^

* Every new file text requires transmission and local serialisation.
* Every commit requires transmission and storage of a revision, signature and inventory.

Thus 4000 commits to a 50000 path tree of 10 files on averages requires (with
knits) between 26 writes (2*(3+10)) and 80006 (2*(4000*10 + 3)) writes. In all
cases there are 4000 * 13 distinct objects to record.

Grouping data by fileid, content and metadata, gives the figures above.
Data grouping:

* File per full identifier (fileid:revisionid:meta|content): 104000
* Delta-chain per object: object id count * constant overhead per object id 
  (26 -> 80006)
* Collation/pack file: 1

Performance for these depends heavily on implementation:
 - Using full ids we could name by validator or by id, giving best performance
   that depends on either receiving data in validator order or in id order.
 - using delta-chain per object we get least seek overhead and syscall overhead
   if we recieve in topological order within the object id, and object ids in
   lexical order.
 - Using a collation/pack file we can stream it into place and validate as we go,
   giving near ideal performance.

API scaling
^^^^^^^^^^^

The api for writing new data recieved over the network will need to be geared
to the transmission and local storage method. What we need is for the
transmission method to reasonably closely match the desired write ordering
locally. This suggests that once we decide on the best local storage means we
should design the api.


take N commits from A to B, if B is local then merge changes into the tree.
copy ebough data to recreate snapshots
avoid ending up wth corrupt/bad data

Notes from London
-----------------

 #. setup

   look at graph of revisions for ~N comits to deretmine eligibility for 
   if preserve mainline is on, check LH only

    identify objects to send that are not on the client repo
      - revision - may be proportional to the graph
      - inventory - proportional to work
      - texts     - proportional to work
      - signatures - ???

 #. data transmission

  * send data proportional to the new information
  * validate the data:

   #. validate the sha1 of the full text of each transmitted text.
   #. validate the sha1:name mapping in each newly referenced inventory item.
   #. validate the sha1 of the XML of each inventory against the revision.
      **this is proportional to tree size and must be fixed**

 #. write the data to the local repo.
    The API should output the file texts needed by the merge as by product of the transmission

 #. tree application

Combine the output from the transmission step with additional 'new work data' for anything already in the local repository that is new in this tree.
should write new files and stat existing files proportional to the count of the new work and the size of the full texts.

2481.1.5 by Robert Collins Incremental push-pull performance anlysis draft.	1	Incremental push/pull
2506.1.1 by Alexander Belchenko sanitize developers docs	2	=====================
2481.1.5 by Robert Collins Incremental push-pull performance anlysis draft.	3
	4	This use case covers pulling in or pushing out some number of revisions which
	5	is typically a small fraction of the number already present in the target
	6	repository. Pushing and pulling are defined as branch level operations for ease
	7	of interaction with VCS systems that have no repository abstraction (such as
	8	bzr-svn or GNU Arch) but within bzrlib's core they are currently the
	9	responsibility of the Repository object.
	10
	11	Functional Requirements
2506.1.1 by Alexander Belchenko sanitize developers docs	12	-----------------------
2481.1.5 by Robert Collins Incremental push-pull performance anlysis draft.	13
	14	A push or pull operation must:
	15	* Copy all the data to reconstruct the selected revisions in the target
	16	branch. This is the goal of push and pull after all.
	17	* Reject corrupt data. As bzr has no innate mechanism for discarding corrupted
	18	data, corrupted data should not be incorporated accidentally.
	19
	20	Factors which should add work for push/pull
2506.1.1 by Alexander Belchenko sanitize developers docs	21	-------------------------------------------
2481.1.5 by Robert Collins Incremental push-pull performance anlysis draft.	22
	23	* Baseline overhead: The time to connect to both branches.
	24	* Actual new data in the revisions being pulled (drives the amount of data to
	25	move around, includes the commit messages etc)
	26	* Number of revisions in the two repositories (scaling affects the
	27	determination of what revisions to move around).
	28
	29	Push/pull overview
2506.1.1 by Alexander Belchenko sanitize developers docs	30	------------------
2481.1.5 by Robert Collins Incremental push-pull performance anlysis draft.	31
	32	1. New data is identified in the source repository.
	33	2. That data is read from the source repository.
	34	3. The same data is verified and written to the target repository in such a
	35	manner that its not visible to readers until its ready for use.
	36
	37	New data identification
2506.1.1 by Alexander Belchenko sanitize developers docs	38	~~~~~~~~~~~~~~~~~~~~~~~
2481.1.5 by Robert Collins Incremental push-pull performance anlysis draft.	39
	40	We have a single top level data object: revisions. Everything else is
4031.3.1 by Frank Aspell Fixing various typos	41	subordinate to revisions, so determining the revisions to propagate should be
2481.1.5 by Robert Collins Incremental push-pull performance anlysis draft.	42	all thats needed. This depends on revisions with partial data - such as those
	43	with no signature - being flagged in some efficient manner.
	44
	45	We could do this in two manners: determine revisions to sync and signatures to sync in two passes, or change the 'value' of a revision implicitly when the signature is different. E.g. by using merkle hash trees with the signature data a separate component the signatures will naturally be identified to sync.
	46
	47	We want to only exchange data proportional to the number of new revisions and
	48	signatures in the system though. One way to achieve this for revisions is to
	49	walk the graph out from the desired tips until the surface area intersection is
	50	found. For signatures a set difference seems to be needed as there is no DAG of signatures: the presence of one has no implications on the presence of another, so a full pass over the set of signatures would be required to confirm no new signatures are needed (let alone replaced signatures).
	51
	52	IFF we can determine 'new revisions' and 'new signatures' without full graph access then we can scale acceptable for push and pull.
	53
	54	Ghosts are revisions which are not present in a particular repository. Filling ghosts refers to removing ghosts in the target repository when the ghost is present in the source repository. Filling ghosts can be either an explicit or implicit action. The common case is no ghosts.
	55
	56	Set synchronisation approaches
	57	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	58
	59	A set synchronisation approach is one which synchronises two sets without
	60	regard for innate structure. This can be very efficient but requires adding a
	61	new node to be processed with every commit. Caching of the results of the
	62	various set based syncs I've seen is possible but because the data structures
	63	look different depending on the tip revision being synced up to the cache needs
	64	to be very complex. I recommend not using such an approach for the common case
	65	pull because of the failure to scale. We can use such an approach for
	66	synchronisation of new signatures and ghosts, which should be an explicit
	67	option in both cases.
	68
	69	DAG synchronisation approaches
	70	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	71
	72	A DAG based approach to synchronistion is one that uses the DAG structure to
	73	determine the difference in present nodes. It can as a result operate from the
	74	tip of the DAG backwards. A dag based approach should allow incremental access
	75	to data and not require a full-graph scan for incremental operations.
	76
	77	File level scaling
	78	^^^^^^^^^^^^^^^^^^
	79
	80	We should read roughly as much of the revision level graph as is needed from
	81	each repository to determine the node difference. If requested we should
	82	perform a detailed scan to pick up ghost revisions and revisions which have had
	83	signatures added. This should not be the default as it requires full history
	84	access in both cases.
	85
	86	Expected file IO and access pattern:
	87
	88	* Common case: repo with many branches of one project, to the same.
	89
	90	1. Source and Target branch tips read.
	91	2. Find the tip of each branch in their repo (will require reading some of
	92	the revision graph but is typically near the end of the graph).
	93	3. Read and parse increasing amounts of the revision graph until one is
	94	found to be a subset of the other, or a complete list of revisions to be
	95	transmitted is created.
	96
	97	* Uncommon cases:
	98
	99	1. Repositories with many projects or branches which are very old may
	100	require reading a lot of unrelated graph data.
	101
	102	1. Initial push/pull scenarios should not require reading an entire graph.
	103
	104
	105	API scaling
106	^^^^^^^^^^^
107
108	1. Get branch tips.
109	2. Determine one sided graph difference. To avoid obtaining a full graph over
110	the wire this needs to be done without reference to the full graph, and
111	with some logarthmic scaling algorithm. There are several already available
112	for this.
113
114	With ghost and new-signature detection:
115
116	* File IO access pattern will read the entire graph on the 'target' side - if
117	no ghosts are present then stop, otherwise seek the new revisions on the
118	source side with the regular algorithm and also explicitly search for the
119	ghost points from the target; plus a set difference search is needed on
120	signatures.
121
122	* Semantic level can probably be tuned, but as its also complex I suggest
123	deferring analysis for optimal behaviour of this use case.
124
125
126	Data reading
2506.1.1 by Alexander Belchenko sanitize developers docs	127	~~~~~~~~~~~~
2481.1.5 by Robert Collins Incremental push-pull performance anlysis draft.	128
	129	When transferring information about a revision the graph of data for the
	130	revision is walked: revision -> inventory, revision -> matching signature,
	131	inventory -> file ids:revision pairs.
	132
	133	File level scaling
	134	^^^^^^^^^^^^^^^^^^
	135
	136	As we're reading already committed data, as long as nothing is mutating data on
	137	disk reading should be race free. We will:
	138
	139	- read each revision object
	140	- read the matching inventory delta
	141	- attempt to read a signature object
	142	- parse the inventory delta
	143	- read the fileid:revisionid compressed chunk for each line in the inventory
	144	delta
	145
	146	Theres no point validating that the data read is valid, as transmission through to the client writing the data might invalidate it; we need to validate before we write.
	147
	148	API scaling
	149	^^^^^^^^^^^
	150
	151	Given that we have established the revisions needed, a single API call should
	152	suffice to obtain all data; the API should present the data in such an order
	153	that it can be validated as it arrives and thus not require large scale
	154	buffering on disk. Specifically each item of data should be validatable (e.g.
	155	for some file data we want the fileid:revisionid:validationhash + content).
	156
	157
	158	Data Verification and writing
2506.1.1 by Alexander Belchenko sanitize developers docs	159	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2481.1.5 by Robert Collins Incremental push-pull performance anlysis draft.	160
	161	New data written to a repository should be completed intact when it is made
	162	visible. This suggests that either all the data for a revision must be made
	163	atomically visible (e.g. by renaming a single file) or the leaf nodes of the
	164	reference graph must become visible first.
	165
	166	Data is referred to via the following graph:
	167	revision -> revision
	168	revision -> signature
	169	revision -> inventory
	170	inventory -> fileid:revisionid
	171	fileid:revisionid -> fileid:revisionid
	172
	173	Data is verifiable via a different ordering:
	174	signature -> revision -> inventory -> fileid:revisionid texts.
	175
	176	We dont gpg verify each revision today; this analysis only speaks to hash
	177	verification of contents.
	178
	179	To validate a revision we need to validate the data it refers to. But to
	180	validate the contents of a revision we need the new texts in the inventory for
	181	the revision - to check a fileid:revisionid we need to know the expected sha1
	182	of the full text and thus also need to read the delta chain to construct the
	183	text as we accept it to determine if its valid. Providing separate validators
	184	for the chosen representation would address this.
	185	e.g: For an inventory entry FILEID:REVISIONID we store the validator of the
	186	full text :SHA1:. If we also stored the validator of the chosen disk
	187	representation (:DELTASHA1:) we could validate the transmitted representation
	188	without expanding the delta in the common case. If that failed we could expand
	189	the delta chain and try against the full text validator, and finally fail. As
	190	different delta generators might generate different deltas, :DELTASHA1: should
	191	not become part of the revision validator, only the inventory disk encoding. In
	192	a related manner a transmission format that allowed cheap validation of content
	193	without applying locally stored deltas would be advantageous because no local
	194	reads would be incurred to validate new content. For instance, always sending a
	195	full text for any file, possibly with a delta-chain when transmitting multiple
	196	revisionids of the file, would allow this. (git pack-files have this property).
	197
	198	Overview summary
	199	^^^^^^^^^^^^^^^^
	200
	201	A single-file local format would allow safe atomic addition of data while
	202	allowing optimisal transmission order of data. Failing this the validation of
	203	data should be tuned to not require reading local texts during data addition
	204	even in the presence of delta chains. We should have transmission-validators
	205	separate from content validators that allow validation of the delta-transmitted
	206	form of objects.
	207
	208	File level scaling
	209	^^^^^^^^^^^^^^^^^^
	210
	211	* Every new file text requires transmission and local serialisation.
	212	* Every commit requires transmission and storage of a revision, signature and inventory.
	213
	214	Thus 4000 commits to a 50000 path tree of 10 files on averages requires (with
	215	knits) between 26 writes (2(3+10)) and 80006 (2(4000*10 + 3)) writes. In all
	216	cases there are 4000 * 13 distinct objects to record.
	217
	218	Grouping data by fileid, content and metadata, gives the figures above.
	219	Data grouping:
	220
	221	* File per full identifier (fileid:revisionid:meta\|content): 104000
	222	* Delta-chain per object: object id count * constant overhead per object id
	223	(26 -> 80006)
224	* Collation/pack file: 1
225
226	Performance for these depends heavily on implementation:
227	- Using full ids we could name by validator or by id, giving best performance
228	that depends on either receiving data in validator order or in id order.
229	- using delta-chain per object we get least seek overhead and syscall overhead
230	if we recieve in topological order within the object id, and object ids in
231	lexical order.
232	- Using a collation/pack file we can stream it into place and validate as we go,
233	giving near ideal performance.
234
235	API scaling
236	^^^^^^^^^^^
237
238	The api for writing new data recieved over the network will need to be geared
239	to the transmission and local storage method. What we need is for the
240	transmission method to reasonably closely match the desired write ordering
241	locally. This suggests that once we decide on the best local storage means we
242	should design the api.
2485.4.5 by Robert Collins Incremental push-pull notes.	243
	244
	245	take N commits from A to B, if B is local then merge changes into the tree.
	246	copy ebough data to recreate snapshots
	247	avoid ending up wth corrupt/bad data
	248
2485.4.7 by Robert Collins Review feedback.	249	Notes from London
2506.1.1 by Alexander Belchenko sanitize developers docs	250	-----------------
2485.4.7 by Robert Collins Review feedback.	251
	252	#. setup
	253
	254	look at graph of revisions for ~N comits to deretmine eligibility for
	255	if preserve mainline is on, check LH only
2485.4.5 by Robert Collins Incremental push-pull notes.	256
	257	identify objects to send that are not on the client repo
	258	- revision - may be proportional to the graph
	259	- inventory - proportional to work
	260	- texts - proportional to work
	261	- signatures - ???
	262
2485.4.7 by Robert Collins Review feedback.	263	#. data transmission
	264
	265	* send data proportional to the new information
	266	* validate the data:
	267
	268	#. validate the sha1 of the full text of each transmitted text.
	269	#. validate the sha1:name mapping in each newly referenced inventory item.
	270	#. validate the sha1 of the XML of each inventory against the revision.
2506.1.1 by Alexander Belchenko sanitize developers docs	271	this is proportional to tree size and must be fixed
2485.4.7 by Robert Collins Review feedback.	272
	273	#. write the data to the local repo.
	274	The API should output the file texts needed by the merge as by product of the transmission
	275
	276	#. tree application
	277
	278	Combine the output from the transmission step with additional 'new work data' for anything already in the local repository that is new in this tree.
2485.4.5 by Robert Collins Incremental push-pull notes.	279	should write new files and stat existing files proportional to the count of the new work and the size of the full texts.