..
   3350.3.1 by Robert Collins
   Draft up an interface for repository streams that is more capable than the

==================
Repository Streams
==================

Status
======

:Date: 2008-04-11

This document describes the proposed programming interface for streaming
data from and into repositories. It should provide a single interface for
both pulling data from and inserting data into a Bazaar repository.

.. contents::


Motivation
==========

The goal is to eliminate the current requirement that extracting data
from a repository means either using a slow format, or knowing the
formats of both the source repository and the target repository.


Use Cases
=========

Here's a brief description of use cases this interface is intended to
support.

Fetch operations
----------------

We fetch data between repositories as part of push/pull/branch operations.
Fetching data is currently a very interactive process with lots of
requests. For performance, having the data supplied in a stream will
improve push and pull to remote servers. For purely local operations the
streaming logic should help reduce memory pressure. In fetch operations
we always know the formats of both the source and target.

Smart server operations
~~~~~~~~~~~~~~~~~~~~~~~

With the smart server we support one streaming format, but this is only
usable when both the client and server have the same model of data, and
it requires non-optimal IO ordering for pack to pack operations. Ideally
we can provide both optimal IO ordering for the pack to pack case, and
correct ordering for pack to knits.

Bundles
-------

Bundles also create a stream of data for revisions from a repository.
Unlike fetch operations we do not know the format of the target at the
time the stream is created. It would be good to be able to treat bundles
as frozen branches and repositories, so a serialised stream should be
suitable for this.

Data conversion
---------------

At this point we are not trying to integrate data conversion into this
interface, though it is likely possible.


Characteristics
===============

Some key aspects of the described interface are discussed in this section.

Single round trip
-----------------

All users of this interface should be able to create an appropriate
stream from a single round trip.

Forward-only reads
------------------

There should be no need to seek in a stream when inserting data from it
into a repository. This places an ordering constraint on streams which
some repositories do not need.


Serialisation
=============

At this point serialisation of a repository stream has not been specified.
Some considerations about serialisation are worth noting, however.

Weaves
------

While there shouldn't be too many users of weave repositories anymore,
avoiding pathological behaviour when a weave is being read is a good idea.
Having the weave itself embedded in the stream is very straightforward
and does not need expensive on-the-fly extraction and re-diffing to take
place.

Bundles
-------

Being able to perform random reads from a repository stream which is a
bundle would allow stacking a bundle and a real repository together. This
will need the pack container format to be used in such a way that we can
avoid reading more data than needed within the pack container's readv
interface.


Specification
=============

This describes the interface for requesting a stream, and the programming
interface a stream must provide. Streams that have been serialised should
expose the same interface.

Requesting a stream
-------------------

To request a stream, three parameters are needed:

* A revision search to select the revisions to include.
* A data ordering flag. There are two values for this - 'unordered' and
  'topological'. 'unordered' streams are useful when inserting into
  repositories that have the ability to perform atomic insertions.
  'topological' streams are useful when converting data, or when
  inserting into repositories that cannot perform atomic insertions (such
  as knit or weave based repositories).
* A complete_inventory flag. When provided this flag signals the stream
  generator to include all the data needed to construct the inventory of
  each revision included in the stream, rather than just deltas. This is
  useful when converting data from a repository with a different
  inventory serialisation, as pure deltas would not be able to be
  reconstructed.


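The difference between the two ordering values can be illustrated with a
small sketch. It is not part of the proposed interface: the
``topological_order`` and ``is_topological`` helpers and the sample keys
are hypothetical, and the sketch assumes only that every entry has a key
and a tuple of parent keys. It uses ``graphlib`` from the Python standard
library (3.9 or later). In a 'topological' stream each entry's parents
arrive before the entry itself.

```python
from graphlib import TopologicalSorter


def topological_order(parent_map):
    """Return the keys ordered so that every parent precedes its children.

    parent_map maps key -> tuple of parent keys (a hypothetical shape).
    Parents that are not themselves in parent_map are ignored.
    """
    graph = {
        key: [parent for parent in parents if parent in parent_map]
        for key, parents in parent_map.items()
    }
    # TopologicalSorter treats the mapped values as predecessors, so
    # parents are emitted before the keys that depend on them.
    return list(TopologicalSorter(graph).static_order())


def is_topological(keys, parent_map):
    """Check that no key appears before one of its in-stream parents."""
    seen = set()
    for key in keys:
        for parent in parent_map.get(key, ()):
            if parent in parent_map and parent not in seen:
                return False
        seen.add(key)
    return True


# Sample data: a linear history listed in an arbitrary ('unordered') order.
parent_map = {
    ("rev-3",): (("rev-2",),),
    ("rev-1",): (),
    ("rev-2",): (("rev-1",),),
}
ordered = topological_order(parent_map)
```

An 'unordered' stream is free to emit the keys in whatever order is
cheapest to read, which in this sketch corresponds to iterating
``parent_map`` directly.
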
Structure of a stream
---------------------

A stream is an object. It can be consistency checked via the ``check``
method (which consumes the stream). The ``iter_contents`` method can be
used to iterate the contents of the stream. The contents of the stream are
a series of top level records, each of which contains one or more
bytestrings (potentially as a delta against another item in the
repository) and some optional metadata.


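A minimal in-memory sketch of this shape follows. The class names
``Record`` and ``Stream`` are hypothetical, entries are simplified to
``(key, bytes)`` pairs, and the invariant that ``check`` enforces (no
repeated key) is an assumption made for illustration, since this document
does not say what the consistency check covers. The point it shows is
that ``check`` and ``iter_contents`` share one forward-only iterator, so
checking a stream uses it up.

```python
class Record:
    """One top level record: a key prefix plus its entries (hypothetical)."""

    def __init__(self, key_prefix, entries):
        self.key_prefix = key_prefix
        self.entries = iter(entries)


class Stream:
    """Forward-only stream of records that can be consumed exactly once."""

    def __init__(self, records):
        self._records = iter(records)

    def iter_contents(self):
        # Hands out the single underlying iterator, so iterating it
        # consumes the stream.
        return self._records

    def check(self):
        """Consume the stream and return the number of entries seen.

        Assumption: 'consistent' here only means that no full key
        (key_prefix + entry key) appears twice.
        """
        seen = set()
        for record in self.iter_contents():
            for key, _data in record.entries:
                full_key = record.key_prefix + key
                if full_key in seen:
                    raise ValueError("duplicate key %r" % (full_key,))
                seen.add(full_key)
        return len(seen)


stream = Stream([
    Record(("file-id-1",), [(("rev-1",), b"hello\n"),
                            (("rev-2",), b"hello world\n")]),
    Record(("file-id-2",), [(("rev-1",), b"other\n")]),
])
checked = stream.check()
leftover = list(stream.iter_contents())
```
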
Consuming a stream
------------------

To consume a stream, obtain an iterator from the stream's
``iter_contents`` method. This iterator will yield the top level records.
Each record has two attributes. One is ``key_prefix`` which is a tuple key
prefix for the names of each of the bytestrings in the record. The other
attribute is ``entries``, an iterator of the individual items in the
record. Each item that the iterator yields is a factory which has metadata
about the entry and the ability to return the compressed bytes. This
factory can be decorated to allow obtaining different representations (for
example from a compressed knit fulltext to a plain fulltext).

In pseudocode::

  stream = repository.get_repository_stream(search, UNORDERED, False)
  for record in stream.iter_contents():
      for factory in record.entries:
          # The storage kind names the compression used for this entry.
          compression = factory.storage_kind
          print("Object %s, compression type %s, %d bytes long." % (
              record.key_prefix + factory.key,
              compression, len(factory.get_bytes_as(compression))))

This structure should allow stream adapters to be written which can coerce
all records to the type of compression that a particular client needs. For
instance, inserting into weaves requires fulltexts, so a stream would be
adapted for weaves by an adapter that takes a stream, and the target
weave, and then uses the target weave to reconstruct fulltexts (which is
all that the weave inserter would ask for). In a similar approach, a
stream could internally delta compress many fulltexts and be able to
answer both fulltext and compressed record requests without extra IO.

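The simplest form of such an adapter, one that decorates a factory so it
can answer ``fulltext`` requests, might look like the sketch below. All
of the names are hypothetical, and ``zlib-fulltext`` is an invented
storage kind standing in for a compressed representation; it is not one
of the kinds this document defines and it does not reproduce the real
knit encoding. The delta-based weave adapter described above would need
access to the target weave as well, which this sketch leaves out.

```python
import zlib


class CompressedFactory:
    """Stand-in for a factory whose bytes are stored zlib-compressed.

    'zlib-fulltext' is a hypothetical storage kind for illustration only.
    """

    storage_kind = "zlib-fulltext"

    def __init__(self, key, parents, compressed):
        self.key = key
        self.parents = parents
        self._compressed = compressed

    def get_bytes_as(self, storage_kind):
        if storage_kind != self.storage_kind:
            raise ValueError("unavailable representation: %s" % storage_kind)
        return self._compressed


class FulltextAdapter:
    """Decorates a factory so that it can also answer 'fulltext' requests."""

    storage_kind = "fulltext"

    def __init__(self, factory):
        self._factory = factory
        self.key = factory.key
        self.parents = factory.parents

    def get_bytes_as(self, storage_kind):
        if storage_kind == "fulltext":
            raw = self._factory.get_bytes_as(self._factory.storage_kind)
            return zlib.decompress(raw)
        # Any other request is passed through to the wrapped factory.
        return self._factory.get_bytes_as(storage_kind)


def adapt_to_fulltext(entries):
    """Lazily wrap each entry, preserving forward-only iteration."""
    for factory in entries:
        if factory.storage_kind == "fulltext":
            yield factory
        else:
            yield FulltextAdapter(factory)


original = CompressedFactory(("rev-1",), (), zlib.compress(b"hello\n"))
adapted = list(adapt_to_fulltext([original]))
text = adapted[0].get_bytes_as("fulltext")
```

Because the adapter is a generator over ``entries``, a consumer still
reads the record front to back and never needs to seek.
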
factory metadata
~~~~~~~~~~~~~~~~

Valid attributes on the factory are:

* sha1: Optional ascii representation of the sha1 of the bytestring (after
  delta reconstruction).
* storage_kind: Required kind of storage compression that has been used
  on the bytestring. One of ``mpdiff``, ``knit-annotated-ft``,
  ``knit-annotated-delta``, ``knit-ft``, ``knit-delta``, ``fulltext``.
* parents: Required graph parents to associate with this bytestring.
* compressor_data: Required opaque data relevant to the storage_kind.
  (This is set to None when the compressor has no special state needed)
* key: The key for this bytestring. Like each parent this is a tuple that
  should have the key_prefix prepended to it to give the unified
  repository key name.

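As a rough illustration of how these attributes fit together, the sketch
below defines a hypothetical ``FulltextFactory`` carrying them, builds
unified keys by prepending a record's ``key_prefix``, and verifies the
optional ``sha1`` with ``hashlib``. The class, the helper functions and
the sample keys are assumptions made for this example, not part of the
proposed interface.

```python
import hashlib


class FulltextFactory:
    """Hypothetical factory carrying the attributes listed above."""

    storage_kind = "fulltext"   # required
    compressor_data = None      # required; None as no state is needed

    def __init__(self, key, parents, text, sha1=None):
        self.key = key          # tuple, relative to the record's key_prefix
        self.parents = parents  # tuple of parent keys, also relative
        self.sha1 = sha1        # optional ascii hex digest of the text
        self._text = text

    def get_bytes_as(self, storage_kind):
        if storage_kind != "fulltext":
            raise ValueError("unavailable representation: %s" % storage_kind)
        return self._text


def unified_key(key_prefix, key):
    """Prepend the record's key_prefix to a key or parent tuple."""
    return key_prefix + key


def sha1_matches(factory):
    """True when no sha1 was supplied, or when it matches the fulltext."""
    if factory.sha1 is None:
        return True
    digest = hashlib.sha1(factory.get_bytes_as("fulltext")).hexdigest()
    return digest == factory.sha1


text = b"hello\n"
factory = FulltextFactory(("rev-2",), (("rev-1",),), text,
                          sha1=hashlib.sha1(text).hexdigest())
key_prefix = ("file-id-1",)
full_key = unified_key(key_prefix, factory.key)
full_parents = [unified_key(key_prefix, p) for p in factory.parents]
```
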
..
   vim: ft=rst tw=74 ai