~bzr-pqm/bzr/bzr.dev

« back to all changes in this revision

Viewing changes to doc/developers/repository.txt

Committer: Robert Collins
Date: 2007-07-13 15:05:36 UTC
mto: (2592.5.3 pack-repository)
mto: This revision was merged to the branch mainline in revision 2624.
Revision ID: robertc@robertcollins.net-20070713150536-hqtkufys7aiqxl1t

Cleanup docs.

files added:
doc/developers/indices.txt

files modified:
NEWS

bzrlib/index.py

doc/developers/index.txt

doc/developers/repository.txt

Show diffs side-by-side

added added

removed removed

doc/developers/repository.txt

199

than repeated reentry into the index - so having indices that support

200

iteration in such a style would be useful eventually.

201

202

Services

203

~~~~~~~~

204

205

An initial implementation of indexing can probably get away with a small

206

number of primitives. Assuming we have write once index files:

207

208

Build index

209

^^^^^^^^^^^

210

211

This should be done by creating an ``IndexBuilder`` and then calling

212

``insert(key, value)`` many times. (Indices that support sorting,

213

topological sorting etc, will want specialised insert methods).

214

215

When the keys have all been added, a ``finish`` method should be called,

216

which will return a file stream to read the index data from.

217

218

Retrieve entries from the index

219

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

220

221

This should allow random access to the index using readv, so we probably

222

want to open the index on a ``Transport``, then use ``iter_entries(keys)``,

223

which can return an iterator that yields ``(key, value)`` pairs in

224

whatever order makes sense for the index.

225

226

Merging of indices

227

^^^^^^^^^^^^^^^^^^

228

229

Merging of N indices requires a concordance of the keys of the index. So

230

we should offer a ``iter_all_entries`` call that has the same return type

231

as the ``iter_entries`` call.

232

233

202

Changing our current indexes

234

203

~~~~~~~~~~~~~~~~~~~~~~~~~~~~

235

204

248

217

249

218

We could discard the per-knit .kndx by writing a new index at the end of

250

219

every bzr transaction indexing the new data introduced by the bzr

251

operation. e.g. at the end of fetch.

252

253

We can keep the knit data file if the new index was a specialised index

254

with parent pointers that are native pointers inside the index values -

255

something like:

256

* list of byte locations for the parent keys entries in this index or

257

[-1] for not present in the index (its just a name to be pointed at)

258

* byte offset for the data record in the knit

259

* byte length for the data record in the knit

260

* byte locations for parent key it is compressed against, -1 for full

261

text

262

* sha1sum ? (Do we have sufficient sha1 pointers to not need this in the

263

index?)

264

* noeol will need a flag too as that does not appear to be in the zip

265

data.

266

267

Separation of concerns, and having something that can be used outside

268

knits suggests splitting this differently. Lets build an index that can

269

store a graph efficiently. So the index itself understands:

270

* key

271

* parents list

272

* value

273

And then in the value we can serialise:

274

* byte offset for the data record in the knit

275

* byte length for the data record in the knit

276

* full text/not full text. (no less general than knit indices).

277

* sha1sum ? (Do we have sufficient sha1 pointers to not need this in the

278

index?)

279

* noeol will need a flag too as that does not appear to be in the zip

280

data.

281

282

Trading off some complexity we could have the index understand:

283

* key

284

* A list of node-referencing lists (e.g. 2 lists of parents)

285

* value

286

And then in the value we serialise:

287

* byte offset for the data record in the knit

288

* byte length for the data record in the knit

289

* sha1sum ? (Do we have sufficient sha1 pointers to not need this in the

290

index?)

291

* noeol will need a flag too as that does not appear to be in the zip

292

data.

293

In this scenario we will have the first parents list be the graph

294

parents, and the second parents list be the compression parents. (empty

295

for full text)

296

297

Index merging can take place easily because all the data that we may

298

choose to dictionary compress within the index is maintained by the index,

299

the only data in the value for each entry is data solely relevant to the

300

knit data file.

301

302

We could map knit indices to this by:

303

- giving ghosts their own record with -1 as the byte offset

304

- making operations like get_parents resolve pointers

220

operation. e.g. at the end of fetch. This can be based on the new

221

``GraphIndex`` index type.

222

223

Encoding a knit entry into a ``GraphIndex`` can be done as follows:

224

225

* Change the key to include a prefix of the knit name, to allow filtering

226

out of data from different knits.

227

* Encode the parents from the knit as the zeroth node reference list.

228

* If the knit hunk was delta compressed encode the node it was delta

229

compressed against as the 1st node reference list (otherwise the 1st

230

node reference list will be empty to indicate no compression parents).

231

* For the value encode similarly to the current knit format the byte

232

offset for the data record in the knit, the byte length for the data

233

record in the knit and the no-end-of-line flag.

305

234

306

235

Its important to note that knit repositories cannot be regenerated by

307

scanning .knits, data from .kndx is needed too, so a .knit based store still

308

requires all the information that the current .kndx contains.

236

scanning .knits, so a mapped index is still irreplaceable and must be

237

transmitted on push/pull.

309

238

310

239

A potential improvement exists by specialising this further to not record

311

240

data that is not needed - e.g. an index of revisions does not need to

312

241

support a pointer to a parent compressed text as revisions.knit is not

313

242

delta-compressed ever. Likewise signatures do not need the parent pointers

314

as there is no 'signature graph'.

243

at all as there is no 'signature graph'.

315

244

316

245

Data

317

246

----

Older »