~bzr-pqm/bzr/bzr.dev : contents of doc/developers/planned-change-integration.txt at revision 5462.5.1

Integration of performance changes
==================================

Delivering a version of bzr with all our planned changes will require
significant integration work. At a minimum, each change needs to integrate
with some aspect of the bzr version it's merged into, but in reality many of
these changes, while conceptually independent, will have to integrate with
the other changes we have planned before we can have a completed system.

Additionally, changes that alter disk formats are inherently trickier to
integrate because we will often need to alter APIs throughout the code base to
expose the increased or reduced model of the preferred disk format.

You can generate a graph ``performance.png`` in the source tree from the
Graphviz "dot" file ``performance.dot``. This graphs out the dependencies
to let us make accurate assessments of the changes needed in terms of code
and API, hopefully minimising the number of different integration steps we
have to take, while giving us a broad surface area for development. It's
based on the summary of the planned changes, with their expected
collaborators and dependencies, in the next section of this document. Where
a command is listed, the expectation is that all uses of that command
(local, remote, dumb transport and smart transport) are being addressed
together.
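
If Graphviz is installed, the rendering step can be scripted. The following
is a minimal sketch rather than part of bzr: it assumes the standard ``dot``
command is on the ``PATH`` and that ``performance.dot`` is in the current
directory (its exact location in the source tree is not stated here). The
helper names ``dot_command`` and ``render_graph`` are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def dot_command(dot_file="performance.dot", png_file="performance.png"):
    """Return the Graphviz command line that renders dot_file as a PNG."""
    # "-Tpng" selects PNG output and "-o" names the output file.
    return ["dot", "-Tpng", str(dot_file), "-o", str(png_file)]


def render_graph(dot_file="performance.dot", png_file="performance.png"):
    """Render the graph; return False if Graphviz or the input is missing."""
    # Assumption: the caller runs this from the directory holding dot_file.
    if shutil.which("dot") is None or not Path(dot_file).exists():
        return False
    subprocess.run(dot_command(dot_file, png_file), check=True)
    return True
```

Calling ``render_graph()`` with no arguments uses the two file names
mentioned above; pass explicit paths if the files live elsewhere.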


The following provides a summary of the planned changes and their expected
collaborators within the code base, along with an estimate of whether they are
likely to require changes to their collaborators to be considered 'finished'.

 * Use case target APIs: Each of these is likely to alter the Tree interface.
   A few of them focus on Branch and will alter Branch and Repository
   accordingly. As they are targeted APIs we can make deep changes all the way
   down the stack to the underlying representation to make it all fit well.
   Presenting a top level API for many things will be possible now as long as
   the exposed data is audited for things we plan to make optional or remove;
   such things cannot be present in the final API. Writing these APIs now will
   provide strong feedback to the design process for those things which are
   considered optional or removable, so these APIs should be implemented
   before removing existing data or making it optional.

 * Deprecating versioned files as a supported API: This collaborates with the
   Repository API but can probably be done by adding a replacement API for
   places where the versioned-file API is used. We may well want to keep a
   concept of 'a file over time' or 'inventories over time', so the existing
   repository model of exposing versioned file objects may be ok. What we need
   to do is remove the places in the code base that create, remove or
   otherwise manipulate storage in terms of knits rather than talking at the
   level of file ids and revision ids. The current versioned-file API would
   be a burden for implementors of a blob-based repository format, so the
   removal of callers, and the deprecation of those parts of the API, should
   be done before creating a blob-based repository format.

 * Creating a revision validator: Revision validators may depend on storage
   layer changes to inventories, so while we can create a revision validator
   API now, we cannot create the final one until the inventory structural
   changes are completed.

 * Annotation caching API: This API is a prerequisite for new repository
   formats. If written after they are introduced we may find that the
   repository is lacking in functionality, so the API should be implemented
   first.

 * ``_iter_changes``-based merging: If the current ``_iter_changes`` API is
   insufficient, we should know about that before designing the disk format for
   generating fast ``_iter_changes`` output.

 * Network-efficient revision graph API: This influences what questions we will
   want to ask a local repository very quickly; as such it's a driver for the
   new repository format and should be in place first if possible. It's
   probably not sufficiently different from local operations to make this a
   hard ordering, though.

 * Working tree disk ordering: Knowing the expected order for disk operations
   may influence the needed use case specific APIs, so having a solid
   understanding of what is optimal - and why - and whether it is pessimal on
   non-Linux-kernel platforms is rather important.

 * Be able to version files greater than memory in size: This cannot be
   achieved until all parts of the library which deal with user files are able
   to provide access to files larger than memory. Many strategies can be
   considered for this, such as temporary files on disk or memory mapping.
   We should have enough of a design laid out that developers of repository and
   tree logic are able to start exposing APIs, and considering requirements
   related to them, to let this happen.

 * Per-file graph access API: This should be implemented on top of or as part
   of the newer API for accessing data about a file over time. It can easily
   be a separate step, but as it's in the same area of the library it should
   not be done in parallel.

 * Repository stacking API: The key dependency/change required for this is that
   repositories must individually be happy with having partial data, e.g. many
   ghosts. However, the way the API needs to be used should be driven from the
   command layer in, because it's unclear at the moment what will work best.

 * Revision stream API: This API will become clear as we streamline commands.
   On the data insertion side, commit will want to generate new data. The
   commands pull, bundle, merge, push and possibly uncommit will want to copy
   existing data in a streaming fashion.

 * New container format: It's hard to tell what the right way to structure the
   layering is. Probably having smooth layering down to the point that code
   wants to operate on the containers directly will make this clearer. As
   bundles will become a read-only branch & repository, the smart server wants
   streaming containers, and we are planning a pack-based repository, it
   appears that we will have three different direct container users. However,
   the bundle user may in fact be fake, because it really is a repository.

 * Separation of annotation cache: Making the disk changes to achieve this
   depends on the new API being created. Bundles probably want to be
   annotation-free, so they are a form of implementation of this and will need
   the on-demand annotation facility.

 * Repository operation disk ordering: Dramatically changing the ordering of
   disk operations requires a new repository format. We have most of the
   analysis done to be able to specify the desired ordering, so it should be
   possible to write such a format now based on the container logic, but
   without any of the inventory representation or delta representation changes.
   This would, for instance, involve ordering the existing diffs in reverse
   order when combining packs.

 * Inventory representation: This has a dependency on what data is
   dropped from the core and what is kept. Without those changes being known we
   can implement a new representation, but it won't be a final one. One of the
   services the new inventory representation is expected to deliver is
   validators for subtrees: a means of comparing just subtrees of two
   inventories without comparing all the data within those subtrees.

 * Delta storage optimisation: This has a strict dependency on a new repository
   format. Optimisation takes many forms; we probably cannot complete the
   desired optimisations under knits, though we could use xdelta within a
   knit-variation.

 * Greatest distance from origin cache: The potential users of this exist
   today and it can likely be implemented immediately, but we are not sure
   that it's needed anymore, so it is being shelved.

 * Removing derivable data: It's very hard to do this while the derived data is
   exposed in APIs but not used by commands. Implementing the targeted APIs
   for our core use cases should allow us to remove accidental use of derived
   data, making only explicit uses of it visible, and isolating the impact of
   removing it, which lets us experiment sensibly. This covers both dropping
   the per-file merge graph and the hash-based-names proposals.
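
The ordering constraints stated in the summary above can be checked
mechanically. The following is an illustrative sketch only, not the content
of ``performance.dot``: the labels and edges in ``PREREQUISITES`` are one
reading of the firmer dependencies in the list (softer preferences, such as
the network-efficient revision graph API, are left out), and the names
``PREREQUISITES``, ``integration_order`` and ``respects`` are hypothetical.
It uses ``graphlib`` from the Python standard library (3.9 or later).

```python
from graphlib import TopologicalSorter

# Each key must wait for every item in its set. These edges are an
# interpretation of the summary above, not an authoritative dependency list.
PREREQUISITES = {
    "blob-based repository format": {"deprecate versioned-file API"},
    "new repository format": {"annotation caching API"},
    "separation of annotation cache": {"annotation caching API"},
    "repository operation disk ordering": {"new container format"},
    "delta storage optimisation": {"new repository format"},
    "fast _iter_changes disk format": {"_iter_changes based merging"},
    "removing derivable data": {"use case target APIs"},
    "final inventory representation": {"removing derivable data"},
    "final revision validator": {"final inventory representation"},
}


def integration_order(prerequisites):
    """Return one order in which every change follows its prerequisites."""
    # static_order() yields prerequisites before the items that need them.
    return list(TopologicalSorter(prerequisites).static_order())


def respects(order, prerequisites):
    """Check that each item appears after all of its prerequisites."""
    position = {name: index for index, name in enumerate(order)}
    return all(
        position[dep] < position[item]
        for item, deps in prerequisites.items()
        for dep in deps
    )
```

``integration_order(PREREQUISITES)`` returns one valid order; several
orders can satisfy the same constraints, and ``graphlib`` raises
``CycleError`` if the constraints contradict each other.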
