Integration of performance changes
==================================

Delivering a version of bzr with all our planned changes will require
significant integration work. Minimally, each change needs to integrate with
some aspect of the bzr version it is merged into, but in reality many of these
changes, while conceptually independent, will in fact have to integrate with
the other changes we have planned before we can have a completed system.

Additionally, changes that alter disk formats are inherently more tricky to
integrate, because we will often need to alter APIs throughout the code base
to expose the increased or reduced model of the preferred disk format.

The dot file performance.dot graphs out the dependencies to let us make
accurate assessments of the changes needed in terms of code and API, hopefully
minimising the number of different integration steps we have to take, while
giving us a broad surface area for development. It is based on a summary, in
the next section of this document, of the planned changes with their expected
collaborators and dependencies. Where a command is listed, the expectation is
that all uses of that command - local, remote, dumb transport and smart
transport - are being addressed together.

The following provides a summary of the planned changes and their expected
collaborators within the code base, along with an estimate of whether they are
likely to require changes to their collaborators to be considered 'finished'.

* Use case target APIs: Each of these is likely to alter the Tree interface.
  A few of them focus on Branch and will alter Branch and Repository
  accordingly. As they are targeted APIs, we can make deep changes all the way
  down the stack to the underlying representation to make it all fit well.
  Presenting a top level API for many things will be possible now, as long as
  the exposed data is audited for things we plan to make optional or remove:
  such things cannot be present in the final API. Writing these APIs now will
  provide strong feedback to the design process for those things which are
  considered optional or removable, so these APIs should be implemented
  before removing existing data or making it optional.

* Deprecating versioned files as a supported API: This collaborates with the
  Repository API, but can probably be done by adding a replacement API for the
  places where the versioned-file API is used. We may well want to keep a
  concept of 'a file over time' or 'inventories over time', so the existing
  repository model of exposing versioned file objects may be OK; what we need
  to ensure is that we remove the places in the code base that create, remove,
  or otherwise manipulate the storage at the knit level rather than talking in
  terms of file ids and revision ids (a rough sketch of such a replacement API
  appears after this list). The current versioned-file API would be a burden
  for implementors of a blob-based repository format, so the removal of
  callers, and the deprecation of those parts of the API, should be done
  before creating a blob-based repository format.

* Creating a revision validator: Revision validators may depend on storage
  layer changes to inventories, so while we can create a revision validator
  API, we cannot create the final one until the inventory structural changes
  are completed.

* Annotation caching API: This API is a prerequisite for new repository
  formats. If it is written after they are introduced, we may find that the
  repository is lacking in functionality, so the API should be implemented
  first (a sketch appears after this list).

* _iter_changes based merging: If the current _iter_changes API is
  insufficient, we should know about that before designing the disk format for
  generating fast _iter_changes output (a sketch of merging driven by a change
  stream appears after this list).

* Network-efficient revision graph API: This influences what questions we will
  want to ask a local repository very quickly; as such it's a driver for the
  new repository format and should be in place first if possible. It's
  probably not sufficiently different from local operations to make this a
  hard ordering, though (a batched-query sketch appears after this list).

* Working tree disk ordering: Knowing the expected order for disk operations
  may influence the needed use-case-specific APIs, so having a solid
  understanding of what is optimal - and why - and whether it is pessimal on
  non-Linux platforms is rather important.

* Be able to version files greater than memory in size: This cannot be
  achieved until all parts of the library which deal with user files are able
  to provide access to files larger than memory. Many strategies can be
  considered for this - such as temporary files on disk, memory mapping, etc.
  We should have enough of a design laid out that developers of repository and
  tree logic are able to start exposing APIs, and considering requirements
  related to them, to let this happen.

* Per-file graph access API: This should be implemented on top of, or as part
  of, the newer API for accessing data about a file over time. It can easily
  be a separate step, but as it's in the same area of the library it should
  not be done in parallel.

* Repository stacking API: The key dependency/change required for this is that
  repositories must individually be happy with having partial data - e.g. many
  ghosts. However, the way the API needs to be used should be driven from the
  command layer in, because it's unclear at the moment what will work best (a
  fallback-lookup sketch appears after this list).

* Revision stream API: This API will become clear as we streamline commands.
  On the data insertion side, commit will want to generate new data. The
  commands pull, bundle, merge, push, and possibly uncommit will want to copy
  existing data in a streaming fashion (sketched after this list).

* New container format: It's hard to tell what the right way to structure the
  layering is. Probably having smooth layering down to the point that code
  wants to operate on the containers directly will make this clearer (a
  minimal container sketch appears after this list). As bundles will become a
  read-only branch & repository, the smart server wants streaming containers,
  and we are planning a pack-based repository, it appears that we will have
  three different direct container users. However, the bundle user may in fact
  be fake - because it really is a repository.

* Separation of annotation cache: Making the disk changes to achieve this
  depends on the new API being created. Bundles probably want to be
  annotation-free, so they are a form of implementation of this and will need
  the on-demand annotation facility.

* Repository operation disk ordering: Dramatically changing the ordering of
  disk operations requires a new repository format. We have most of the
  analysis done to be able to specify the desired ordering, so it should be
  possible to write such a format now based on the container logic, but
  without any of the inventory representation or delta representation changes.
  This would, for instance, involve pack combining ordering the existing diffs
  in reverse order.

* Inventory representation: This has a dependency on what data is dropped from
  the core and what is kept. Without those changes being known we can
  implement a new representation, but it won't be a final one. One of the
  services the new inventory representation is expected to deliver is
  validators for subtrees -- a means of comparing just subtrees of two
  inventories without comparing all the data within those subtrees (a
  hash-based sketch appears after this list).

* Delta storage optimisation: This has a strict dependency on a new repository
  format. Optimisation takes many forms - we probably cannot complete the
  desired optimisations under knits, though we could use xdelta within a
  knit-variation.

* Greatest distance from origin cache: The potential users of this exist today
  and it could likely be implemented immediately, but we are not sure that
  it's needed anymore, so it is being shelved (a sketch of the computation the
  cache would store appears after this list).

* Removing derivable data: It's very hard to do this while the derived data is
  exposed in APIs but not used by commands. Implementing the targeted APIs for
  our core use cases should allow us to remove accidental use of derived data,
  making only explicit uses of it visible, and isolating the impact of
  removing it, allowing us to experiment sensibly. This covers both dropping
  the per-file merge graph and the hash-based-names proposals.
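
The sketches below try to make the shape of some of the planned APIs above
more concrete. They are illustrative only: every class, function and data
layout in them is an assumption made for this document rather than existing
bzrlib code, and the real implementations will differ.

For the deprecation of versioned files, the essential property is that callers
ask for texts purely by file id and revision id and never see knit (or any
other storage) details. A minimal sketch of such a replacement API, with
invented names, might look like::

    # Illustrative only: 'TextAccess' and its methods are invented names,
    # not existing bzrlib APIs.
    class TextAccess(object):
        """Text storage addressed purely by (file_id, revision_id)."""

        def __init__(self):
            self._texts = {}

        def add_lines(self, file_id, revision_id, lines):
            # Nothing here exposes deltas, knit indices or other storage
            # details, so a blob-based format could implement the same calls.
            self._texts[(file_id, revision_id)] = list(lines)

        def get_lines(self, file_id, revision_id):
            return self._texts[(file_id, revision_id)]


    store = TextAccess()
    store.add_lines('file-a', 'rev-1', ['hello\n'])
    print(store.get_lines('file-a', 'rev-1'))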
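
For the annotation caching API, the idea is that annotation is something a
repository derives on demand and optionally caches, rather than something
baked into the storage. A minimal sketch, again with invented names, assuming
the expensive annotate step is supplied as a callable::

    # Illustrative only: the names are invented; the expensive annotate step
    # is supplied as a callable so any storage can sit behind it.
    class AnnotationCache(object):

        def __init__(self, annotate_fn):
            self._annotate = annotate_fn   # (file_id, revision_id) -> annotations
            self._cache = {}

        def annotate(self, file_id, revision_id):
            key = (file_id, revision_id)
            if key not in self._cache:
                # Cache miss: derive the annotation and remember it.  A real
                # cache would serialise this to disk rather than hold it here.
                self._cache[key] = self._annotate(file_id, revision_id)
            return self._cache[key]


    def slow_annotate(file_id, revision_id):
        # Stand-in for walking the per-file graph and diffing each version.
        return [('rev-1', 'first line\n'), (revision_id, 'second line\n')]

    cache = AnnotationCache(slow_annotate)
    print(cache.annotate('file-a', 'rev-9'))
    print(cache.annotate('file-a', 'rev-9'))   # second call comes from the cache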
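
For _iter_changes based merging, the point is that merge logic can be driven
from a stream of per-file change records instead of whole-inventory
comparison. The record layout below is invented for illustration and is not
the exact _iter_changes tuple format::

    # Illustrative only: 'Change' is an invented record, not the exact
    # _iter_changes tuple layout.
    from collections import namedtuple

    Change = namedtuple('Change', 'file_id paths changed_content kind')

    def plan_merge_actions(changes):
        """Decide an action per changed file from a change stream.

        A real merger would also consult the BASE, THIS and OTHER trees;
        this only shows merge logic being driven by per-file change records
        rather than by whole-inventory comparison.
        """
        actions = {}
        for change in changes:
            old_path, new_path = change.paths
            if old_path is None:
                actions[change.file_id] = 'add %s' % new_path
            elif new_path is None:
                actions[change.file_id] = 'delete %s' % old_path
            elif change.changed_content:
                actions[change.file_id] = 'merge text of %s' % new_path
            else:
                actions[change.file_id] = 'keep %s' % new_path
        return actions

    changes = [
        Change('f1', (None, 'new.txt'), True, 'file'),
        Change('f2', ('old.txt', 'old.txt'), True, 'file'),
        Change('f3', ('gone.txt', None), False, 'file'),
    ]
    print(plan_merge_actions(changes))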
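
For the network-efficient revision graph API, the key property is that
questions are asked about whole batches of revisions at once, so a smart
server needs few round trips. A sketch over a simple in-memory parent map (the
class and method names are assumptions, not existing bzrlib code)::

    # Illustrative only: the class and method names are assumptions, not
    # existing bzrlib code; the parent map is held in memory here.
    class BatchedGraph(object):

        def __init__(self, parents):
            self._parents = parents   # revision_id -> tuple of parent ids

        def get_parent_map(self, revision_ids):
            """Answer a whole batch of ids in one call (one round trip)."""
            return dict((revision_id, self._parents[revision_id])
                        for revision_id in revision_ids
                        if revision_id in self._parents)

        def heads(self, revision_ids):
            """Return the ids in revision_ids not reachable from another of them."""
            candidates = set(revision_ids)
            for revision_id in revision_ids:
                todo = list(self._parents.get(revision_id, ()))
                seen = set()
                while todo:
                    current = todo.pop()
                    if current in seen:
                        continue
                    seen.add(current)
                    candidates.discard(current)
                    todo.extend(self._parents.get(current, ()))
            return candidates


    graph = BatchedGraph({'A': (), 'B': ('A',), 'C': ('A',), 'D': ('B', 'C')})
    print(graph.get_parent_map(['B', 'D']))
    print(graph.heads(['B', 'C', 'D']))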
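
For the repository stacking API, the sketch below shows only the fallback
behaviour: reads that miss the local repository are answered by the repository
it is stacked on, while writes always land locally. Everything here is an
invented stand-in for the real interfaces, with repositories modelled as
dicts::

    # Illustrative only: repositories are modelled as dicts and the names
    # are invented, not existing bzrlib APIs.
    class StackedRepository(object):

        def __init__(self, local, fallbacks=()):
            self._local = local               # revision_id -> revision text
            self._fallbacks = list(fallbacks)

        def get_revision(self, revision_id):
            try:
                return self._local[revision_id]
            except KeyError:
                # Partial local data is fine: ask the stacked-on repositories.
                for fallback in self._fallbacks:
                    try:
                        return fallback.get_revision(revision_id)
                    except KeyError:
                        continue
                raise

        def add_revision(self, revision_id, text):
            # Writes always land locally; the fallback is read-only to us.
            self._local[revision_id] = text


    base = StackedRepository({'rev-1': 'initial import'})
    stacked = StackedRepository({}, fallbacks=[base])
    stacked.add_revision('rev-2', 'local change')
    print(stacked.get_revision('rev-1'))   # served by the fallback
    print(stacked.get_revision('rev-2'))   # served locally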
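
For the revision stream API, the useful shape is a generator of records that
the target inserts as they arrive, so pull, push, bundle and friends never
hold the whole data set in memory. A sketch with repositories as plain dicts
and an invented record layout::

    # Illustrative only: repositories are plain dicts and the record layout
    # is invented.
    def revision_stream(source, revision_ids):
        """Yield ('revision', revision_id, text) records one at a time."""
        for revision_id in revision_ids:
            yield ('revision', revision_id, source[revision_id])

    def insert_stream(target, stream):
        """Write each record as it arrives, never holding the full set."""
        for kind, revision_id, text in stream:
            if kind == 'revision':
                target[revision_id] = text


    source = {'rev-1': 'first', 'rev-2': 'second'}
    target = {}
    insert_stream(target, revision_stream(source, ['rev-1', 'rev-2']))
    print(sorted(target))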
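
For the new container format, one plausible layering is "named byte records
that can be written and read as a stream", with bundles, the smart server and
a pack-based repository all sitting on top of it. The format below is not the
real bzr container specification, just an illustration of that layering::

    # Illustrative only: this is not the real bzr container specification,
    # just a minimal 'named byte records as a stream' layering.
    from io import BytesIO

    def write_container(stream, records):
        """records: iterable of (name, payload) byte-string pairs."""
        for name, payload in records:
            stream.write(name + b' ' + str(len(payload)).encode('ascii') + b'\n')
            stream.write(payload)
        stream.write(b'END\n')

    def read_container(stream):
        """Yield (name, payload) pairs until the end marker is seen."""
        while True:
            header = stream.readline()
            if header == b'END\n':
                return
            name, length = header.rstrip(b'\n').rsplit(b' ', 1)
            yield name, stream.read(int(length))


    buf = BytesIO()
    write_container(buf, [(b'rev-1', b'revision data'), (b'inv-1', b'inventory')])
    buf.seek(0)
    for name, payload in read_container(buf):
        print(name, len(payload))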
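
For the inventory representation work, a subtree validator can be sketched as
a recursive hash: each directory's validator is a hash over its children's
names and validators, so two inventories agree on a whole subtree exactly when
the two stored directory validators are equal. The inventory layout below is
invented for the example::

    # Illustrative only: the inventory layout is invented; the point is the
    # recursive hash, not the data structure.
    from hashlib import sha1

    def subtree_validator(inventory, path=''):
        """inventory maps a directory path to {child_name: file_sha1_or_None}."""
        entries = inventory.get(path, {})
        pieces = []
        for name in sorted(entries):
            child_path = path + '/' + name if path else name
            if entries[name] is None:            # child is a directory
                child_validator = subtree_validator(inventory, child_path)
            else:                                # child is a file: use its sha1
                child_validator = entries[name]
            pieces.append('%s %s\n' % (name, child_validator))
        return sha1(''.join(pieces).encode('utf-8')).hexdigest()

    inv_a = {'': {'doc': None, 'setup.py': 'f' * 40},
             'doc': {'index.txt': 'a' * 40}}
    inv_b = {'': {'doc': None, 'setup.py': '0' * 40},   # only setup.py differs
             'doc': {'index.txt': 'a' * 40}}
    # With validators stored in the inventory, comparing the 'doc' subtrees
    # is a single string comparison; here we recompute them to demonstrate.
    print(subtree_validator(inv_a, 'doc') == subtree_validator(inv_b, 'doc'))  # True
    print(subtree_validator(inv_a) == subtree_validator(inv_b))                # False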
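
Finally, the (shelved) greatest distance from origin cache would simply store,
per revision, the length of the longest path back to a revision with no
parents. A sketch of that computation over an invented parent map; a real
cache would compute this incrementally and avoid deep recursion::

    # Illustrative only: the parent map is invented and held in memory.
    def greatest_distances(parents):
        """parents maps revision_id -> tuple of parent revision ids."""
        distances = {}

        def distance(revision_id):
            if revision_id not in distances:
                parent_ids = parents[revision_id]
                if not parent_ids:
                    # An origin revision sits at distance 0.
                    distances[revision_id] = 0
                else:
                    distances[revision_id] = 1 + max(
                        distance(parent) for parent in parent_ids)
            return distances[revision_id]

        for revision_id in parents:
            distance(revision_id)
        return distances

    graph = {'A': (), 'B': ('A',), 'C': ('A',), 'D': ('B', 'C'), 'E': ('D',)}
    # A is 0, B and C are 1, D is 2 and E is 3.
    print(greatest_distances(graph))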