199
199
than repeated reentry into the index - so having indices that support
200
200
iteration in such a style would be useful eventually.
205
An initial implementation of indexing can probably get away with a small
206
number of primitives. Assuming we have write once index files:
211
This should be done by creating an ``IndexBuilder`` and then calling
212
``insert(key, value)`` many times. (Indices that support sorting,
213
topological sorting etc, will want specialised insert methods).
215
When the keys have all been added, a ``finish`` method should be called,
216
which will return a file stream to read the index data from.
218
Retrieve entries from the index
219
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
221
This should allow random access to the index using readv, so we probably
222
want to open the index on a ``Transport``, then use ``iter_entries(keys)``,
223
which can return an iterator that yields ``(key, value)`` pairs in
224
whatever order makes sense for the index.
229
Merging of N indices requires a concordance of the keys of the index. So
230
we should offer a ``iter_all_entries`` call that has the same return type
231
as the ``iter_entries`` call.
233
202
Changing our current indexes
234
203
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
249
218
We could discard the per-knit .kndx by writing a new index at the end of
250
219
every bzr transaction indexing the new data introduced by the bzr
251
operation. e.g. at the end of fetch.
253
We can keep the knit data file if the new index was a specialised index
254
with parent pointers that are native pointers inside the index values -
256
* list of byte locations for the parent keys entries in this index or
257
[-1] for not present in the index (its just a name to be pointed at)
258
* byte offset for the data record in the knit
259
* byte length for the data record in the knit
260
* byte locations for parent key it is compressed against, -1 for full
262
* sha1sum ? (Do we have sufficient sha1 pointers to not need this in the
264
* noeol will need a flag too as that does not appear to be in the zip
267
Separation of concerns, and having something that can be used outside
268
knits suggests splitting this differently. Lets build an index that can
269
store a graph efficiently. So the index itself understands:
273
And then in the value we can serialise:
274
* byte offset for the data record in the knit
275
* byte length for the data record in the knit
276
* full text/not full text. (no less general than knit indices).
277
* sha1sum ? (Do we have sufficient sha1 pointers to not need this in the
279
* noeol will need a flag too as that does not appear to be in the zip
282
Trading off some complexity we could have the index understand:
284
* A list of node-referencing lists (e.g. 2 lists of parents)
286
And then in the value we serialise:
287
* byte offset for the data record in the knit
288
* byte length for the data record in the knit
289
* sha1sum ? (Do we have sufficient sha1 pointers to not need this in the
291
* noeol will need a flag too as that does not appear to be in the zip
293
In this scenario we will have the first parents list be the graph
294
parents, and the second parents list be the compression parents. (empty
297
Index merging can take place easily because all the data that we may
298
choose to dictionary compress within the index is maintained by the index,
299
the only data in the value for each entry is data solely relevant to the
302
We could map knit indices to this by:
303
- giving ghosts their own record with -1 as the byte offset
304
- making operations like get_parents resolve pointers
220
operation. e.g. at the end of fetch. This can be based on the new
221
``GraphIndex`` index type.
223
Encoding a knit entry into a ``GraphIndex`` can be done as follows:
225
* Change the key to include a prefix of the knit name, to allow filtering
226
out of data from different knits.
227
* Encode the parents from the knit as the zeroth node reference list.
228
* If the knit hunk was delta compressed encode the node it was delta
229
compressed against as the 1st node reference list (otherwise the 1st
230
node reference list will be empty to indicate no compression parents).
231
* For the value encode similarly to the current knit format the byte
232
offset for the data record in the knit, the byte length for the data
233
record in the knit and the no-end-of-line flag.
306
235
Its important to note that knit repositories cannot be regenerated by
307
scanning .knits, data from .kndx is needed too, so a .knit based store still
308
requires all the information that the current .kndx contains.
236
scanning .knits, so a mapped index is still irreplaceable and must be
237
transmitted on push/pull.
310
239
A potential improvement exists by specialising this further to not record
311
240
data that is not needed - e.g. an index of revisions does not need to
312
241
support a pointer to a parent compressed text as revisions.knit is not
313
242
delta-compressed ever. Likewise signatures do not need the parent pointers
314
as there is no 'signature graph'.
243
at all as there is no 'signature graph'.