~bzr-pqm/bzr/bzr.dev : revision 2485.4.9

Planned changes to the bzr core
-------------------------------

Delivering the best possible performance requires changing the bzr core design
from that present in 0.16. Some of these changes are incremental and can be
done with no impact on the disk format. Many of them, however, do require
changes to the disk format, and these fall into two sets: those which are
sufficiently close to the model bzr uses today to interoperate with the 0.16
disk formats, and those which cannot interoperate with them. Specifically, some
planned changes may result in data which cannot be exported to bzr 0.16's disk
formats and then imported back to the new format without losing critical
information. If and when this takes place, users will essentially have to
migrate from their bzr 0.16 repository to a version of bzr that supports the
new format. We plan to batch all such changes into one large 'experimental'
repository format, which will be complete, stable and usable before we promote
it to a supported format. Getting new versions of bzr into widespread use at
that time will be very important; otherwise the user base may be split in two:
users that have upgraded and users that have not.

The following changes are grouped according to their compatibility impact:
library only, disk format but interoperable, disk format interoperability
unknown, and disk format, not interoperable.

Library changes
===============

These changes will change bzrlib's API but will not affect the disk format and
thus do not pose a significant migration issue.

* For our 20 core use cases, we plan to add targeted APIs to bzrlib that are
  repository-representation agnostic. These will instead reflect the shape of
  data access best suited to each case.

* Deprecate 'versioned files' as a library concept. Instead of asking for
  information about a file-over-time as a special case, we will move to an API
  that assumes less coupling between the historical information and the
  ability to obtain texts/deltas etc. Specifically, we need to remove all
  APIs that act in terms of on-disk representation except those within a
  given repository implementation.

* Create a validator for revisions that is more amenable to use by other parts
  of the code base than just the gpg signing facility. This can be done today
  without changing the disk format, possibly with a performance hit until the
  disk formats match the validator logic. It will be hard to tell if we have
  the right routine for that until all the disk changes are complete, so while
  this is a library-only change, it is likely one that will be delayed to near
  the end of the process.

* Add an explicit API for managing cached annotations. While annotations are
  considered a cache, this is not exposed in such a way that cache operations
  like 'drop the cache' can be performed. On current disk formats the cache is
  mandatory, but an API to manage it would allow refreshing of the cache (e.g.
  after ghosts are filled in during baz conversions).

* Use the _iter_changes API to perform merges. This is a small change that may
  remove the need to use inventories in merge, making a dramatic difference to
  merge performance.

* Create a network-efficient revision graph API. This is the logic at the
  start of push and pull operations, which currently scales O(graph size).
  Fixing the scaling can be done, but there are tradeoffs in latency and
  performance to consider, making it a little tricky to get right.

* Working tree disk operation ordering. We plan to change the order in which
  some operations are done (specifically TreeTransform ones) to improve
  performance. There is already a 66% performance boost in that area going
  through review.

* Stop requiring full memory copies of files. Currently bzr requires that it
  can hold 3 copies of any file it is versioning in memory. Solving this is
  tricky, particularly without performance regressions on small files, but
  without solving it, versioning of .iso and other large objects will continue
  to be extremely painful.

* Add an API for per-file graph access that allows incremental access and is
  suitable for on-demand generation if desired.

* Repository stacking API. Allowing multiple databases to be stacked to give a
  single 'repository' will allow implementation of some long desired features
  like history horizons, and bundle usage where the bundle is not added to the
  local repository just to examine its contents.

* Revision data manipulation API. We need a single streaming API for adding
  data to or getting it from a repository. This will need to allow hints such
  as 'optimise for size', or 'optimise for fast-addition' to meet the needs of
  the various planned users. This is a core part of the library today, but it
  is not sufficiently clean to let us simplify/remove a lot of related code.
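
To make the repository stacking item above more concrete, here is a minimal
sketch of the lookup behaviour it implies. It is not bzrlib code: the class
``StackedRepository`` and its methods are hypothetical names, and plain
dictionaries stand in for the underlying databases. The only idea it
illustrates is that reads fall through an ordered stack of stores, so a
bundle can be examined without being added to the local repository.

```python
class StackedRepository:
    """Present several revision stores as one (hypothetical sketch).

    Each store is any mapping from revision id to revision data. Stores
    are consulted in the order given, so earlier stores shadow later ones.
    """

    def __init__(self, *stores):
        self._stores = list(stores)

    def has_revision(self, revision_id):
        # True if any store in the stack holds the revision.
        return any(revision_id in store for store in self._stores)

    def get_revision(self, revision_id):
        # Return the data from the first store that holds the revision.
        for store in self._stores:
            if revision_id in store:
                return store[revision_id]
        raise KeyError(revision_id)
```

Stacking a bundle on top of a local store, for example
``StackedRepository(bundle, local)``, would let both sets of revisions be
read through one object while leaving ``local`` unmodified.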

Interoperable disk changes
==========================

* New container format to allow single-file description of multiple named
  objects. This will provide the basis for transmission of revisions over the
  network, the new bundle format, and possibly a new repository format as
  well.

* Separate the annotation cache from the storage of actual file texts and make
  the annotation style, and when to do it, configurable. This will reduce data
  sent over the wire when repositories have had 'needs-annotations' turned
  off, which very large trees may choose to do - generating just-in-time
  annotations may be desirable for those trees (even when performing
  annotation based merges).

* Repository disk operation ordering. The order that tasks access data within
  the repository and the layout of the data should be harmonised. This will
  require disk format changes but does not inherently alter the model, so it is
  straightforward to export from a repository that has been optimised in this
  way to a 0.16 based repository.

* Inventory representation. An inventory is a logical description of the shape
  of a version controlled tree. Currently we operate on the whole inventory as
  a tree broken down per directory, but we store it as a flat file. This scales
  very poorly, as even a minor change between inventories requires us to scan
  the entire file, and in large trees this is many megabytes of data to
  consider. We are investigating the exact form, but the intent is to change
  the serialisation of inventories so that comparing two inventories can be
  done in some smaller time - e.g. O(log N) scaling. Whatever form this takes,
  a repository that can export it directly will be able to perform operations
  between two historical trees much more efficiently than the current
  repositories.

* Delta storage optimisation. We plan to change the delta storage logic to use
  a binary delta like xdelta rather than using ancestry-graph driven line
  based deltas. Line based deltas will still be created for cached
  annotations.

* Greatest distance from origin cache. This is a possible change to introduce,
  but it may be unnecessary - listed here for completeness till it has been
  established as [un]needed.
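
The container format item in the list above describes a single file holding
multiple named objects. As a rough illustration of that idea only, the sketch
below writes each object as a length-prefixed record. The record layout and
the function names ``write_container`` and ``read_container`` are assumptions
made for this example; they do not describe the format bzr will actually use.

```python
import io
import struct

# Assumed record header: two big-endian 32-bit lengths (name, body).
_HEADER = ">II"
_HEADER_SIZE = struct.calcsize(_HEADER)


def write_container(records):
    """Serialise an iterable of (name, body_bytes) pairs into one bytestring."""
    out = io.BytesIO()
    for name, body in records:
        encoded_name = name.encode("utf-8")
        out.write(struct.pack(_HEADER, len(encoded_name), len(body)))
        out.write(encoded_name)
        out.write(body)
    return out.getvalue()


def read_container(data):
    """Yield (name, body_bytes) pairs back out, one record at a time."""
    stream = io.BytesIO(data)
    while True:
        header = stream.read(_HEADER_SIZE)
        if not header:
            return  # clean end of container
        if len(header) != _HEADER_SIZE:
            raise ValueError("truncated record header")
        name_len, body_len = struct.unpack(_HEADER, header)
        encoded_name = stream.read(name_len)
        body = stream.read(body_len)
        if len(encoded_name) != name_len or len(body) != body_len:
            raise ValueError("truncated record")
        yield encoded_name.decode("utf-8"), body
```

Because records are read sequentially and each carries its own length, the
same layout could in principle be consumed from a network stream as well as
from a file, which is the property the item above relies on.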

Possibly non-interoperable disk changes
=======================================

* Removing of derivable data from the core of bzr. Much of the data that bzr
  stores is derivable from the user's source files. For instance, the
  annotations that record who introduced a line can be recreated at any time
  given the full history of a repository. We want to remove the dependence of
  the core of bzr on any data that is derivable, because doing this will give
  us the freedom to:

  * Improve the derivation algorithm over time.
  * Deal with bugs in the derivation algorithms without having 'corrupt
    repositories' or such things.

  However, some of the data that is technically derived, like the per-file
  merge graph, is both considered core and can be generated differently by
  bzr 0.16 when certain circumstances arise. Any change to the 'core' status of
  that data will discard data that cannot be recreated, and thus lead to the
  inability to export from a format where that is derived data to bzr 0.16's
  formats without errors occurring in those circumstances. Some of the data
  that may be considered for this includes:

  * Per file merge graphs
  * Annotations
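
To illustrate why annotations count as derivable data, the sketch below
rebuilds per-line annotations from nothing but the stored texts of each
revision. It is a simplification under stated assumptions: the history is
strictly linear (no merges), the function name ``annotate`` is hypothetical,
and Python's standard ``difflib`` stands in for whatever diff algorithm bzr
uses.

```python
from difflib import SequenceMatcher


def annotate(history):
    """Derive (revision_id, line) pairs for the newest text.

    history is a list of (revision_id, lines) tuples, oldest first, and is
    assumed to be linear. Each line is attributed to the revision that
    introduced it.
    """
    annotated = []   # annotations for the previous text
    previous = []    # lines of the previous text
    for revision_id, lines in history:
        matcher = SequenceMatcher(None, previous, lines, autojunk=False)
        current = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Unchanged lines keep their earlier attribution.
                current.extend(annotated[i1:i2])
            elif tag in ("insert", "replace"):
                # New or rewritten lines belong to this revision.
                current.extend((revision_id, line) for line in lines[j1:j2])
            # "delete": the removed lines simply drop out.
        annotated = current
        previous = lines
    return annotated
```

Since the result depends only on the texts and the diff algorithm, improving
or fixing the algorithm changes the derived annotations without touching any
stored data, which is the freedom described above.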

Non-interoperable disk changes
==============================

* Drop the per-file merge graph 'cache' currently held in the FILE-ID.kndx
  files. A specific case of removing derivable data, this may allow smaller
  inventory metadata and also make it easier to allow two different trees (in
  terms of last-change made, e.g. if one is a working tree) to be compared
  using a hash-tree style approach.

* Use hash based names for some objects in the bzr database. Because it would
  force total-knowledge-of-history on the graph, revision objects will not be
  nameable via hashes, and neither will revision signatures. Other than that,
  though, we can in principle use hashes, e.g. SHA1, for everything else. There
  are many unanswered questions about hash based naming related to locality of
  reference impacts, which need to be answered before this becomes a definite
  item.
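
The two items above both lean on hash based naming, so a small sketch of a
hash-tree style comparison may help. Everything here is an assumption made
for illustration: trees are modelled as nested dictionaries, SHA1 digests
name files and directories, and the functions ``build`` and ``changed_paths``
are hypothetical rather than part of bzrlib.

```python
import hashlib


def build(tree):
    """Turn a nested dict (name -> bytes or dict) into (digest, children).

    children is None for files. A directory's digest covers the names and
    digests of its entries, so equal digests imply equal subtrees.
    """
    if isinstance(tree, bytes):
        return (hashlib.sha1(b"file\0" + tree).hexdigest(), None)
    children = {name: build(entry) for name, entry in tree.items()}
    digest = hashlib.sha1(b"dir\0")
    for name in sorted(children):
        child_digest = children[name][0]
        digest.update(name.encode("utf-8") + b"\0")
        digest.update(child_digest.encode("ascii") + b"\n")
    return (digest.hexdigest(), children)


def changed_paths(old, new, prefix=""):
    """List paths that differ, skipping subtrees whose digests match."""
    if old is not None and new is not None and old[0] == new[0]:
        return []  # identical subtree: no need to look inside
    if old is None or new is None or old[1] is None or new[1] is None:
        return [prefix]  # added, removed, modified file, or kind change
    paths = []
    for name in sorted(set(old[1]) | set(new[1])):
        child = prefix + "/" + name if prefix else name
        paths.extend(changed_paths(old[1].get(name), new[1].get(name), child))
    return paths
```

With digests computed once per tree, the comparison only descends into
directories that actually differ, instead of scanning every entry.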