~bzr-pqm/bzr/bzr.dev : contents of doc/developers/planned-performance-changes.txt at revision 3688.3.3

~bzr-pqm/bzr/bzr.dev : (revision 3688.3.3)

Planned changes to the bzr core
===============================

Delivering the best possible performance requires changing the bzr core design
from that present in 0.16. Some of these changes are incremental and can be
done with no impact on disk format. Many of them however do require changes to
the disk format, and these can be broken into two sets of changes, those which
are sufficiently close to the model bzr uses today to interoperate with the
0.16 disk formats, and those that are not able to interoperate with the 0.16
disk formats - specifically some planned changes may result in data which
cannot be exported to bzr 0.16's disk formats and then imported back to the new
format without losing critical information. If/when this takes place it will be
essentially a migration for users to switch from their bzr 0.16 repository to a
bzr that supports them. We plan to batch all such changes into one large
'experimental' repository format, which will be complete stable and usable
before we migrate it to become a supported format. Getting new versions of bzr
in widespread use at that time will be very important, otherwise the user base
may be split in two - users that have upgraded and users that have not.

The following changes are grouped according to their compatability impact:
library only, disk format but interoperable, disk format interoperability
unknown, and disk format, not interoperable.

Library changes
---------------

These changes will change bzrlib's API but will not affect the disk format and
thus do not pose a significant migration issue.

 * For our 20 core use cases, we plan to add targeted API's to bzrlib that are
   repository-representation agnostic. These will instead reflect the shape of
   data access most optimal for that case.

 * Deprecate 'versioned files' as a library concept. Instead of asking for
   information about a file-over-time as a special case, we will move to an API
   that assumes less coupling between the historical information and the
   ability to obtain texts/deltas etc. Specifically, we need to remove all
   API's that act in terms of on disk representation except those within a
   given repository implementation.

 * Create a validator for revisions that is more amenable to use by other parts
   of the code base than just the gpg signing facility. This can be done today
   without changing disk, possibly with a performance hit until the disk
   formats match the validatory logic. It will be hard to tell if we have the
   right routine for that until all the disk changes are complete, so while
   this is a library only change, its likely one that will be delayed to near
   the end of the process.

 * Add an explicit API for managing cached annotations. While annotations are
   considered a cache this is not exposed in such a way that cache operations
   like 'drop the cache' can be performed. On current disk formats the cache is
   mandatory, but an API to manage would allow refreshing of the cache (e.g.
   after ghosts are filled in in baz conversions).

 * Use the _iter_changes API to perform merges. This is a small change that may
   remove the need to use inventories in merge, making a dramatic difference to
   merge performance once the tree shape comparison optimisations are
   implemented.

 * Create a network-efficient revision graph API. This is the logic at the
   start of push and pull operations, which currently scales O(graph size).
   Fixing the scaling can be done, but there are tradeoffs to latency and
   performance to consider, making it a little tricky to get right.

 * Working tree disk operation ordering. We plan to change the order in which
   some operations are done (specifically TreeTransform ones) to improve
   performance. There is already a 66% performance boost in that area going
   through review.

 * Stop requiring full memory copies of files. Currently bzr requires that it
   can hold 3 copies of any file its versioning in memory. Solving this is
   tricky, particularly without performance regressions on small files, but
   without solving it versioning of .iso and other large objects will continue
   to be extremely painful.

 * Add an API for per-file graph access that alllows incremental access and is
   suitable for on-demand generation if desired.

 * Repository stacking API. Allowing multiple databases to be stacked to give a
   single 'repository' will allow implementation of some long desired features
   like history horizons, and bundle usage where the bundle is not added to the
   local repository just to examine its contents.

 * Revision data manipulation API. We need a single streaming API for adding
   data to or getting it from a repository. This will need to allow hints such
   as 'optimise for size', or 'optimise for fast-addition' to meet the various
   users planned, but it is a core part of the library today, and its not
   sufficiently clean to let us simplify/remove a lot of related code today.

Interoperable disk changes
--------------------------

 * New container format to allow single-file description of multiple named
   objects. This will provide the basis for transmission of revisions over the
   network, the new bundle format, and possibly a new repository format as
   well. [Core implemented] 

 * Separate the annotation cache from the storage of actual file texts and make
   the annotation style, and when to do it, configurable. This will reduce data
   sent over the wire when repositories have had 'needs-annotations' turned
   off, which very large trees may choose to do - generating just-in-time
   annotations may be desirable for those trees (even when performing
   annotation based merges).

 * Repository disk operation ordering. The order that tasks access data within
   the repository and the layout of the data should be harmonised. This will
   require disk format changes but does not inherently alter the model, so its
   straight forward to export from a repository that has been optimised in this
   way to a 0.16 based repository.

 * Inventory representation. An inventory is a logical description of the shape
   of a version controlled tree. Currently we operate on the whole inventory as
   a tree broken down per directory, but we store it as a flat file. This scale
   very poorly as even a minor change between inventories requires us to scan
   the entire file, and in large trees this is many megabytes of data to
   consider. We are investigating the exact form, but the intent is to change
   the serialisation of inventories so that comparing two inventories can be
   done in some smaller time - e.g. O(log N) scaling. Whatever form this takes,
   a repository that can export it directly will be able to perform operations
   between two historical trees much more efficiently than the current
   repositories.

 * Delta storage optimisation. We plan to change the delta storage logic to use
   a binary delta like xdelta rather than using line based deltas from python.
   These binary deltas could be done along ancestry ordering, or other
   arbitrary patterns chosen for their intended use. Line based deltas will
   still be created for cached annotations. This is still under some discussion.
   http://bazaar-vcs.org/PerformanceRoadmap/Xdelta

 * Greatest distance from origin cache. This is a possible change to introduce,
   but it may be unnecessary - listed here for completeness till it has been
   established as [un]needed.

Possibly non-interoperable disk changes
---------------------------------------

 * Removing of derivable data from the core of bzr. Much of the data that bzr
   stores is derivable from the users source files. For instance the
   annotations that record who introduced a line. Given the full history for a
   repository we can recreate that at any time. We want to remove the
   dependence of the core of bzr on any data that is derivable, because doing
   this will give us the freedom to:

   * Improve the derivation algorithm over time.
   * Deal with bugs in the derivation algorithms without having 'corrupt
     repositories' or such things.

   However, some of the data that is technically derived, like the per-file
   merge graph, is both considered core, and can be generated differently when
   certain circumstances arive, by bzr 0.16. Any change to the 'core' status of
   that data will discard data that cannot be recreated and thus lead to the
   inability to export from a format where that is derived data to bzr 0.16's
   formats without errors occuring in those circumstances. Some of the data
   that may be considered for this includes:

   * Per file merge graphs
   * Annotations

Non-interoperable disk changes
------------------------------

 * Drop the per-file merge graph 'cache' currently held in the FILE-ID.kndx
   files. A specific case of removing derivable data, this may allow smaller
   inventory metadata and also make it easier to allow two different trees (in
   terms of last-change made, e.g. if one is a working tree) to be compared
   using a hash-tree style approach.

 * Use hash based names for some objects in the bzr database. Because it would force
   total-knowledge-of-history on the graph revision objects will not be namable
   via hash's and neither will revisio signatures. Other than that though we
   can in principle use hash's e.g. SHA1 for everything else. There are many
   unanswered questions about hash based naming related to locality of
   reference impacts, which need to be answered before this becomes a definite
   item.

2485.4.9 by Robert Collins Create the top level changes-list from the London sprint for reference.	1	Planned changes to the bzr core
2506.1.3 by Alexander Belchenko merge bzr.dev; fix ReST formatting in planned-performance-changes.txt	2	===============================
2485.4.9 by Robert Collins Create the top level changes-list from the London sprint for reference.	3
	4	Delivering the best possible performance requires changing the bzr core design
	5	from that present in 0.16. Some of these changes are incremental and can be
	6	done with no impact on disk format. Many of them however do require changes to
	7	the disk format, and these can be broken into two sets of changes, those which
	8	are sufficiently close to the model bzr uses today to interoperate with the
	9	0.16 disk formats, and those that are not able to interoperate with the 0.16
	10	disk formats - specifically some planned changes may result in data which
	11	cannot be exported to bzr 0.16's disk formats and then imported back to the new
	12	format without losing critical information. If/when this takes place it will be
	13	essentially a migration for users to switch from their bzr 0.16 repository to a
	14	bzr that supports them. We plan to batch all such changes into one large
	15	'experimental' repository format, which will be complete stable and usable
	16	before we migrate it to become a supported format. Getting new versions of bzr
	17	in widespread use at that time will be very important, otherwise the user base
	18	may be split in two - users that have upgraded and users that have not.
	19
	20	The following changes are grouped according to their compatability impact:
	21	library only, disk format but interoperable, disk format interoperability
	22	unknown, and disk format, not interoperable.
	23
	24	Library changes
2506.1.3 by Alexander Belchenko merge bzr.dev; fix ReST formatting in planned-performance-changes.txt	25	---------------
2485.4.9 by Robert Collins Create the top level changes-list from the London sprint for reference.	26
	27	These changes will change bzrlib's API but will not affect the disk format and
	28	thus do not pose a significant migration issue.
	29
	30	* For our 20 core use cases, we plan to add targeted API's to bzrlib that are
	31	repository-representation agnostic. These will instead reflect the shape of
	32	data access most optimal for that case.
	33
	34	* Deprecate 'versioned files' as a library concept. Instead of asking for
	35	information about a file-over-time as a special case, we will move to an API
	36	that assumes less coupling between the historical information and the
	37	ability to obtain texts/deltas etc. Specifically, we need to remove all
	38	API's that act in terms of on disk representation except those within a
	39	given repository implementation.
	40
	41	* Create a validator for revisions that is more amenable to use by other parts
	42	of the code base than just the gpg signing facility. This can be done today
	43	without changing disk, possibly with a performance hit until the disk
	44	formats match the validatory logic. It will be hard to tell if we have the
	45	right routine for that until all the disk changes are complete, so while
	46	this is a library only change, its likely one that will be delayed to near
	47	the end of the process.
	48
	49	* Add an explicit API for managing cached annotations. While annotations are
	50	considered a cache this is not exposed in such a way that cache operations
	51	like 'drop the cache' can be performed. On current disk formats the cache is
	52	mandatory, but an API to manage would allow refreshing of the cache (e.g.
	53	after ghosts are filled in in baz conversions).
	54
	55	* Use the _iter_changes API to perform merges. This is a small change that may
	56	remove the need to use inventories in merge, making a dramatic difference to
2485.4.10 by Robert Collins Review feedback.	57	merge performance once the tree shape comparison optimisations are
2485.4.10 by Robert Collins Review feedback.	58	implemented.
2485.4.9 by Robert Collins Create the top level changes-list from the London sprint for reference.	59
	60	* Create a network-efficient revision graph API. This is the logic at the
	61	start of push and pull operations, which currently scales O(graph size).
	62	Fixing the scaling can be done, but there are tradeoffs to latency and
	63	performance to consider, making it a little tricky to get right.
	64
	65	* Working tree disk operation ordering. We plan to change the order in which
	66	some operations are done (specifically TreeTransform ones) to improve
	67	performance. There is already a 66% performance boost in that area going
	68	through review.
	69
	70	* Stop requiring full memory copies of files. Currently bzr requires that it
	71	can hold 3 copies of any file its versioning in memory. Solving this is
	72	tricky, particularly without performance regressions on small files, but
	73	without solving it versioning of .iso and other large objects will continue
	74	to be extremely painful.
	75
	76	* Add an API for per-file graph access that alllows incremental access and is
	77	suitable for on-demand generation if desired.
	78
	79	* Repository stacking API. Allowing multiple databases to be stacked to give a
	80	single 'repository' will allow implementation of some long desired features
	81	like history horizons, and bundle usage where the bundle is not added to the
	82	local repository just to examine its contents.
	83
	84	* Revision data manipulation API. We need a single streaming API for adding
	85	data to or getting it from a repository. This will need to allow hints such
	86	as 'optimise for size', or 'optimise for fast-addition' to meet the various
	87	users planned, but it is a core part of the library today, and its not
	88	sufficiently clean to let us simplify/remove a lot of related code today.
	89
	90	Interoperable disk changes
2506.1.3 by Alexander Belchenko merge bzr.dev; fix ReST formatting in planned-performance-changes.txt	91	--------------------------
2485.4.9 by Robert Collins Create the top level changes-list from the London sprint for reference.	92
	93	* New container format to allow single-file description of multiple named
	94	objects. This will provide the basis for transmission of revisions over the
	95	network, the new bundle format, and possibly a new repository format as
2522.3.1 by Robert Collins Draft proposed integration order for performance changes.	96	well. [Core implemented]
2485.4.9 by Robert Collins Create the top level changes-list from the London sprint for reference.	97
	98	* Separate the annotation cache from the storage of actual file texts and make
	99	the annotation style, and when to do it, configurable. This will reduce data
	100	sent over the wire when repositories have had 'needs-annotations' turned
	101	off, which very large trees may choose to do - generating just-in-time
	102	annotations may be desirable for those trees (even when performing
	103	annotation based merges).
	104
	105	* Repository disk operation ordering. The order that tasks access data within
	106	the repository and the layout of the data should be harmonised. This will
	107	require disk format changes but does not inherently alter the model, so its
	108	straight forward to export from a repository that has been optimised in this
	109	way to a 0.16 based repository.
	110
	111	* Inventory representation. An inventory is a logical description of the shape
	112	of a version controlled tree. Currently we operate on the whole inventory as
	113	a tree broken down per directory, but we store it as a flat file. This scale
	114	very poorly as even a minor change between inventories requires us to scan
	115	the entire file, and in large trees this is many megabytes of data to
	116	consider. We are investigating the exact form, but the intent is to change
	117	the serialisation of inventories so that comparing two inventories can be
	118	done in some smaller time - e.g. O(log N) scaling. Whatever form this takes,
	119	a repository that can export it directly will be able to perform operations
	120	between two historical trees much more efficiently than the current
	121	repositories.
	122
	123	* Delta storage optimisation. We plan to change the delta storage logic to use
2485.4.10 by Robert Collins Review feedback.	124	a binary delta like xdelta rather than using line based deltas from python.
	125	These binary deltas could be done along ancestry ordering, or other
	126	arbitrary patterns chosen for their intended use. Line based deltas will
	127	still be created for cached annotations. This is still under some discussion.
	128	http://bazaar-vcs.org/PerformanceRoadmap/Xdelta
2485.4.9 by Robert Collins Create the top level changes-list from the London sprint for reference.	129
	130	* Greatest distance from origin cache. This is a possible change to introduce,
	131	but it may be unnecessary - listed here for completeness till it has been
	132	established as [un]needed.
	133
	134	Possibly non-interoperable disk changes
2506.1.3 by Alexander Belchenko merge bzr.dev; fix ReST formatting in planned-performance-changes.txt	135	---------------------------------------
2485.4.9 by Robert Collins Create the top level changes-list from the London sprint for reference.	136
	137	* Removing of derivable data from the core of bzr. Much of the data that bzr
	138	stores is derivable from the users source files. For instance the
	139	annotations that record who introduced a line. Given the full history for a
	140	repository we can recreate that at any time. We want to remove the
	141	dependence of the core of bzr on any data that is derivable, because doing
	142	this will give us the freedom to:
	143
	144	* Improve the derivation algorithm over time.
	145	* Deal with bugs in the derivation algorithms without having 'corrupt
	146	repositories' or such things.
	147
	148	However, some of the data that is technically derived, like the per-file
	149	merge graph, is both considered core, and can be generated differently when
	150	certain circumstances arive, by bzr 0.16. Any change to the 'core' status of
	151	that data will discard data that cannot be recreated and thus lead to the
	152	inability to export from a format where that is derived data to bzr 0.16's
	153	formats without errors occuring in those circumstances. Some of the data
	154	that may be considered for this includes:
	155
	156	* Per file merge graphs
	157	* Annotations
	158
	159	Non-interoperable disk changes
2506.1.3 by Alexander Belchenko merge bzr.dev; fix ReST formatting in planned-performance-changes.txt	160	------------------------------
2485.4.9 by Robert Collins Create the top level changes-list from the London sprint for reference.	161
	162	* Drop the per-file merge graph 'cache' currently held in the FILE-ID.kndx
	163	files. A specific case of removing derivable data, this may allow smaller
	164	inventory metadata and also make it easier to allow two different trees (in
	165	terms of last-change made, e.g. if one is a working tree) to be compared
	166	using a hash-tree style approach.
	167
	168	* Use hash based names for some objects in the bzr database. Because it would force
	169	total-knowledge-of-history on the graph revision objects will not be namable
	170	via hash's and neither will revisio signatures. Other than that though we
	171	can in principle use hash's e.g. SHA1 for everything else. There are many
	172	unanswered questions about hash based naming related to locality of
	173	reference impacts, which need to be answered before this becomes a definite
	174	item.