~bzr-pqm/bzr/bzr.dev

Viewing changes to doc/developers/planned-change-integration.txt

Committer: Robert Collins
Date: 2007-06-19 00:48:22 UTC
mto: This revision was merged to the branch mainline in revision 2539.
Revision ID: robertc@robertcollins.net-20070619004822-wsop5g2arwu1lti4

Draft proposed integration order for performance changes.

files added:
doc/developers/planned-change-integration.txt

files modified:
.bzrignore

Makefile

doc/developers/performance-roadmap.txt

doc/developers/performance.dot

doc/developers/planned-performance-changes.txt

doc/developers/planned-change-integration.txt

Integration of performance changes
==================================

To deliver a version of bzr with all our planned changes will require
significant integration work. Minimally each change needs to integrate with
some aspect of the bzr version it's merged into, but in reality many of these
changes, while conceptually independent, will in fact have to integrate with
the other changes we have planned before we can have a completed system.

Additionally, changes that alter disk formats are inherently more tricky to
integrate because we will often need to alter APIs throughout the code base to
expose the increased or reduced model of the preferred disk format.

The dot file performance.dot graphs out the dependencies to let us make
accurate assessments of the changes needed in terms of code and API, hopefully
minimising the number of different integration steps we have to take, while
giving us a broad surface area for development. It's based on a summary in the
next section of this document of the planned changes with their expected
collaborators and dependencies. Where a command is listed, the expectation is
that all uses of that command - local, remote, dumb transport and smart
transport - are being addressed together.
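
As a rough illustration of how a dependency graph of this kind yields an
integration order, here is a minimal Python sketch. The edges are one reading
of the summary later in this document, not a copy of performance.dot, and the
short change names are invented for the example.

```python
from graphlib import TopologicalSorter

# Keys are planned changes; values are the changes that should land first.
# Both the short names and the edges are assumptions made for this sketch.
DEPENDS_ON = {
    "remove-derivable-data": {"use-case-target-apis"},
    "new-repository-format": {
        "deprecate-versioned-files",
        "annotation-caching-api",
        "new-container-format",
    },
    "separate-annotation-cache": {"annotation-caching-api"},
    "repository-disk-ordering": {"new-repository-format"},
    "delta-storage-optimisation": {"new-repository-format"},
    "inventory-representation": {"remove-derivable-data"},
    "revision-validator": {"inventory-representation"},
}


def integration_order(depends_on):
    """Return one valid landing order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for step, change in enumerate(integration_order(DEPENDS_ON), start=1):
        print(step, change)
```

Several orders can satisfy the same constraints; the sketch returns only one
of them, and soft orderings (such as the revision graph API) are left out.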

The following provides a summary of the planned changes and their expected
collaborators within the code base, along with an estimate of whether they are
likely to require changes to their collaborators to be considered 'finished'.

* Use case target APIs: Each of these is likely to alter the Tree interface.
  A few of them focus on Branch and will alter Branch and Repository
  accordingly. As they are targeted APIs we can make deep changes all the way
  down the stack to the underlying representation to make it all fit well.
  Presenting a top level API for many things will be possible now as long as
  the exposed data is audited for things we plan to make optional, or remove:
  such things cannot be present in the final API. Writing these APIs now will
  provide strong feedback to the design process for those things which are
  considered optional or removable, so these APIs should be implemented
  before removing or making optional existing data.

* Deprecating versioned files as a supported API: This collaborates with the
  Repository API but can probably be done by adding a replacement API for
  places where the versioned-file API is used. We may well want to keep a
  concept of 'a file over time' or 'inventories over time', so the existing
  repository model of exposing versioned file objects may be ok; what we need
  to ensure we do is remove the places in the code base where you create or
  remove or otherwise describe manipulation of the storage by knit rather than
  talking at the level of file ids and revision ids. The current
  versioned-file API would be a burden for implementors of a blob-based
  repository format, so the removal of callers, and deprecation of those parts
  of the API, should be done before creating a blob-based repository format.

* Creating a revision validator: Revision validators may depend on storage
  layer changes to inventories, so while we can create a revision validator
  API, we cannot create the final one until we have the inventory structural
  changes completed.

* Annotation caching API: This API is a prerequisite for new repository
  formats. If written after they are introduced we may find that the
  repository is lacking in functionality, so the API should be implemented
  first.

* _iter_changes based merging: If the current _iter_changes_ API is
  insufficient, we should know about that before designing the disk format for
  generating fast _iter_changes_ output.

* Network-efficient revision graph API: This influences what questions we will
  want to ask a local repository very quickly; as such it's a driver for the
  new repository format and should be in place first if possible. It's probably
  not sufficiently different from local operations to make this a hard ordering
  though.

* Working tree disk ordering: Knowing the expected order for disk operations
  may influence the needed use case specific APIs, so having a solid
  understanding of what is optimal - and why - and whether it is pessimal on
  non-Linux platforms is rather important.

* Be able to version files greater than memory in size: This cannot be
  achieved until all parts of the library which deal with user files are able
  to provide access to files larger than memory. Many strategies can be
  considered for this - such as temporary files on disk, memory mapping etc.
  We should have enough of a design laid out that developers of repository and
  tree logic are able to start exposing APIs, and considering requirements
  related to them, to let this happen.

* Per-file graph access API: This should be implemented on top of or as part
  of the newer API for accessing data about a file over time. It can easily be
  a separate step, but as it's in the same area of the library it should not
  be done in parallel.

* Repository stacking API: The key dependency/change required for this is that
  repositories must individually be happy with having partial data - e.g. many
  ghosts. However the way the API needs to be used should be driven from the
  command layer in, because it's unclear at the moment what will work best.

* Revision stream API: This API will become clear as we streamline commands.
  On the data insertion side, commit will want to generate new data. The
  commands pull, bundle, merge, push, and possibly uncommit will want to copy
  existing data in a streaming fashion.

* New container format: It's hard to tell what the right way to structure the
  layering is. Probably having smooth layering down to the point that code
  wants to operate on the containers directly will make this more clear. As
  bundles will become a read-only branch & repository, the smart server wants
  streaming-containers, and we are planning a pack-based repository, it
  appears that we will have three different direct container users. However,
  the bundle user may in fact be fake - because it really is a repository.

* Separation of annotation cache: Making the disk changes to achieve this
  depends on the new API being created. Bundles probably want to be
  annotation-free, so they are a form of implementation of this and will need
  the on-demand annotation facility.

* Repository operation disk ordering: Dramatically changing the ordering of
  disk operations requires a new repository format. We have most of the
  analysis done to be able to specify the desired ordering, so it should be
  possible to write such a format now based on the container logic, but
  without any of the inventory representation or delta representation changes.
  This would for instance involve pack combining ordering the existing diffs
  in reverse order.

* Inventory representation: This has a dependency on what data is
  dropped from the core and what is kept. Without those changes being known we
  can implement a new representation, but it won't be a final one. One of the
  services the new inventory representation is expected to deliver is one of
  validators for subtrees -- a means of comparing just subtrees of two
  inventories without comparing all the data within that subtree.

* Delta storage optimisation: This has a strict dependency on a new repository
  format. Optimisation takes many forms - we probably cannot complete the
  desired optimisations under knits though we could use xdelta within a
  knit-variation.

* Greatest distance from origin cache: The potential users of this exist
  today, and it can likely be implemented immediately, but we are not sure
  that it's needed anymore, so it is being shelved.

* Removing derivable data: It's very hard to do this while the derived data is
  exposed in APIs but not used by commands. Implementing the targeted APIs
  for our core use cases should allow us to remove accidental use of derived
  data, making only explicit uses of it visible, and isolating the impact of
  removing it, allowing us to experiment sensibly. This covers both dropping
  the per-file merge graph and the hash-based-names proposals.
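
The subtree validators mentioned under inventory representation above could
work along the following lines. This is only a sketch under stated
assumptions: the nested-dict inventory model, the use of SHA-1, and the
function names are all hypothetical and are not bzr's actual design.

```python
import hashlib


def validator(node):
    """Return a hex digest summarising a file entry or a whole subtree.

    A file is modelled as bytes (standing in for its content hash) and a
    directory as a dict mapping names to child nodes. This layout is
    hypothetical and chosen only to keep the sketch short.
    """
    if isinstance(node, bytes):
        return hashlib.sha1(b"file\0" + node).hexdigest()
    digest = hashlib.sha1(b"dir\0")
    for name in sorted(node):
        digest.update(name.encode("utf-8") + b"\0")
        digest.update(validator(node[name]).encode("ascii") + b"\0")
    return digest.hexdigest()


def changed_paths(old, new, prefix=""):
    """Yield paths that differ, skipping subtrees whose validators match.

    Validators are recomputed on every call here; a real format would
    presumably store them alongside each directory entry instead.
    """
    if validator(old) == validator(new):
        return
    if isinstance(old, bytes) or isinstance(new, bytes):
        yield prefix or "/"
        return
    for name in sorted(set(old) | set(new)):
        path = prefix + "/" + name
        if name not in old or name not in new:
            yield path
        else:
            yield from changed_paths(old[name], new[name], path)
```

With two inventories that differ only under one directory, the comparison
descends into that directory alone and never enumerates the unchanged
siblings.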
