1
Planned changes to the bzr core
2
===============================
4
Delivering the best possible performance requires changing the bzr core design
5
from that present in 0.16. Some of these changes are incremental and can be
6
done with no impact on disk format. Many of them however do require changes to
7
the disk format, and these can be broken into two sets of changes, those which
8
are sufficiently close to the model bzr uses today to interoperate with the
9
0.16 disk formats, and those that are not able to interoperate with the 0.16
10
disk formats - specifically some planned changes may result in data which
11
cannot be exported to bzr 0.16's disk formats and then imported back to the new
12
format without losing critical information. If/when this takes place it will be
13
essentially a migration for users to switch from their bzr 0.16 repository to a
14
bzr that supports them. We plan to batch all such changes into one large
15
'experimental' repository format, which will be complete stable and usable
16
before we migrate it to become a supported format. Getting new versions of bzr
17
in widespread use at that time will be very important, otherwise the user base
18
may be split in two - users that have upgraded and users that have not.
20
The following changes are grouped according to their compatability impact:
21
library only, disk format but interoperable, disk format interoperability
22
unknown, and disk format, not interoperable.
27
These changes will change bzrlib's API but will not affect the disk format and
28
thus do not pose a significant migration issue.
30
* For our 20 core use cases, we plan to add targeted API's to bzrlib that are
31
repository-representation agnostic. These will instead reflect the shape of
32
data access most optimal for that case.
34
* Deprecate 'versioned files' as a library concept. Instead of asking for
35
information about a file-over-time as a special case, we will move to an API
36
that assumes less coupling between the historical information and the
37
ability to obtain texts/deltas etc. Specifically, we need to remove all
38
API's that act in terms of on disk representation except those within a
39
given repository implementation.
41
* Create a validator for revisions that is more amenable to use by other parts
42
of the code base than just the gpg signing facility. This can be done today
43
without changing disk, possibly with a performance hit until the disk
44
formats match the validatory logic. It will be hard to tell if we have the
45
right routine for that until all the disk changes are complete, so while
46
this is a library only change, its likely one that will be delayed to near
47
the end of the process.
49
* Add an explicit API for managing cached annotations. While annotations are
50
considered a cache this is not exposed in such a way that cache operations
51
like 'drop the cache' can be performed. On current disk formats the cache is
52
mandatory, but an API to manage would allow refreshing of the cache (e.g.
53
after ghosts are filled in in baz conversions).
55
* Use the _iter_changes API to perform merges. This is a small change that may
56
remove the need to use inventories in merge, making a dramatic difference to
57
merge performance once the tree shape comparison optimisations are
60
* Create a network-efficient revision graph API. This is the logic at the
61
start of push and pull operations, which currently scales O(graph size).
62
Fixing the scaling can be done, but there are tradeoffs to latency and
63
performance to consider, making it a little tricky to get right.
65
* Working tree disk operation ordering. We plan to change the order in which
66
some operations are done (specifically TreeTransform ones) to improve
67
performance. There is already a 66% performance boost in that area going
70
* Stop requiring full memory copies of files. Currently bzr requires that it
71
can hold 3 copies of any file its versioning in memory. Solving this is
72
tricky, particularly without performance regressions on small files, but
73
without solving it versioning of .iso and other large objects will continue
74
to be extremely painful.
76
* Add an API for per-file graph access that alllows incremental access and is
77
suitable for on-demand generation if desired.
79
* Repository stacking API. Allowing multiple databases to be stacked to give a
80
single 'repository' will allow implementation of some long desired features
81
like history horizons, and bundle usage where the bundle is not added to the
82
local repository just to examine its contents.
84
* Revision data manipulation API. We need a single streaming API for adding
85
data to or getting it from a repository. This will need to allow hints such
86
as 'optimise for size', or 'optimise for fast-addition' to meet the various
87
users planned, but it is a core part of the library today, and its not
88
sufficiently clean to let us simplify/remove a lot of related code today.
90
Interoperable disk changes
91
--------------------------
93
* New container format to allow single-file description of multiple named
94
objects. This will provide the basis for transmission of revisions over the
95
network, the new bundle format, and possibly a new repository format as
96
well. [Core implemented]
98
* Separate the annotation cache from the storage of actual file texts and make
99
the annotation style, and when to do it, configurable. This will reduce data
100
sent over the wire when repositories have had 'needs-annotations' turned
101
off, which very large trees may choose to do - generating just-in-time
102
annotations may be desirable for those trees (even when performing
103
annotation based merges).
105
* Repository disk operation ordering. The order that tasks access data within
106
the repository and the layout of the data should be harmonised. This will
107
require disk format changes but does not inherently alter the model, so its
108
straight forward to export from a repository that has been optimised in this
109
way to a 0.16 based repository.
111
* Inventory representation. An inventory is a logical description of the shape
112
of a version controlled tree. Currently we operate on the whole inventory as
113
a tree broken down per directory, but we store it as a flat file. This scale
114
very poorly as even a minor change between inventories requires us to scan
115
the entire file, and in large trees this is many megabytes of data to
116
consider. We are investigating the exact form, but the intent is to change
117
the serialisation of inventories so that comparing two inventories can be
118
done in some smaller time - e.g. O(log N) scaling. Whatever form this takes,
119
a repository that can export it directly will be able to perform operations
120
between two historical trees much more efficiently than the current
123
* Delta storage optimisation. We plan to change the delta storage logic to use
124
a binary delta like xdelta rather than using line based deltas from python.
125
These binary deltas could be done along ancestry ordering, or other
126
arbitrary patterns chosen for their intended use. Line based deltas will
127
still be created for cached annotations. This is still under some discussion.
128
http://bazaar-vcs.org/PerformanceRoadmap/Xdelta
130
* Greatest distance from origin cache. This is a possible change to introduce,
131
but it may be unnecessary - listed here for completeness till it has been
132
established as [un]needed.
134
Possibly non-interoperable disk changes
135
---------------------------------------
137
* Removing of derivable data from the core of bzr. Much of the data that bzr
138
stores is derivable from the users source files. For instance the
139
annotations that record who introduced a line. Given the full history for a
140
repository we can recreate that at any time. We want to remove the
141
dependence of the core of bzr on any data that is derivable, because doing
142
this will give us the freedom to:
144
* Improve the derivation algorithm over time.
145
* Deal with bugs in the derivation algorithms without having 'corrupt
146
repositories' or such things.
148
However, some of the data that is technically derived, like the per-file
149
merge graph, is both considered core, and can be generated differently when
150
certain circumstances arive, by bzr 0.16. Any change to the 'core' status of
151
that data will discard data that cannot be recreated and thus lead to the
152
inability to export from a format where that is derived data to bzr 0.16's
153
formats without errors occuring in those circumstances. Some of the data
154
that may be considered for this includes:
156
* Per file merge graphs
159
Non-interoperable disk changes
160
------------------------------
162
* Drop the per-file merge graph 'cache' currently held in the FILE-ID.kndx
163
files. A specific case of removing derivable data, this may allow smaller
164
inventory metadata and also make it easier to allow two different trees (in
165
terms of last-change made, e.g. if one is a working tree) to be compared
166
using a hash-tree style approach.
168
* Use hash based names for some objects in the bzr database. Because it would force
169
total-knowledge-of-history on the graph revision objects will not be namable
170
via hash's and neither will revisio signatures. Other than that though we
171
can in principle use hash's e.g. SHA1 for everything else. There are many
172
unanswered questions about hash based naming related to locality of
173
reference impacts, which need to be answered before this becomes a definite