~bzr-pqm/bzr/bzr.dev : contents of doc/developers/knitpack.txt at revision 2960

==========================
KnitPack repository format
==========================

.. contents::

Using KnitPack repositories
===========================

Motivation
----------

KnitPack is a new repository format for Bazaar, which is expected to be
faster both locally and over the network, is usually more compact, and
will work with more FTP servers.

Our benchmarking results to date have been very promising. We fully expect
to make a pack-based format the default in the near future.  We would
therefore like as many people as possible using KnitPack repositories,
benchmarking the results and telling us where improvements are still needed.

Preparation
-----------

A small percentage of existing repositories may have some inconsistent
data within them. It is a good idea to check the integrity of your
repositories before migrating them to knitpack format. To do this, run::

  bzr check

If that reports a problem, run this command::

  bzr reconcile

Note that this can take many hours for repositories with deep history,
so be sure to set aside some time for it if it is required.

Creating a new knitpack branch
------------------------------

If you're starting a project from scratch, it's easy to make it a
``knitpack`` one. Here's how::

  cd my-stuff
  bzr init --knitpack-experimental
  bzr add
  bzr commit -m "initial import"

In other words, use the normal sequence of commands but add the
``--knitpack-experimental`` option to the ``init`` command.

Creating a new knitpack repository
----------------------------------

If you're starting a project from scratch and wish to use a shared repository
for branches, you can make it a ``knitpack`` repository like this::

  cd my-repo
  bzr init-repo --knitpack-experimental .
  cd my-stuff
  bzr init
  bzr add
  bzr commit -m "initial import"

In other words, use the normal sequence of commands but add the
``--knitpack-experimental`` option to the ``init-repo`` command.

Upgrading an existing branch or repository to knitpack format
-------------------------------------------------------------

If you have an existing branch and wish to migrate it to
a ``knitpack`` format, use the ``upgrade`` command like this::

  bzr upgrade --knitpack-experimental path-to-my-branch

If you are using a shared repository, run::

  bzr upgrade --knitpack-experimental ROOT_OF_REPOSITORY

to upgrade the history database. Note that this will not
alter the format of each branch, so you will also need to
upgrade each branch individually if you are upgrading from
an old (e.g. < 0.17) bzr. Branches created by more recent
versions of bzr will already be in our latest branch format,
which adds support for tags.

Starting a new knitpack branch from one in an older format
----------------------------------------------------------

This can be done in one of several ways:

1. Create a new branch and pull into it
2. Create a standalone branch and upgrade its format
3. Create a knitpack shared repository and branch into it

Here are the commands for using the ``pull`` approach::

    bzr init --knitpack-experimental my-new-branch
    cd my-new-branch
    bzr pull my-source-branch

Here are the commands for using the ``upgrade`` approach::

    bzr branch my-source-branch my-new-branch
    cd my-new-branch
    bzr upgrade --knitpack-experimental .

Here are the commands for the shared repository approach::

  cd my-repo
  bzr init-repo --knitpack-experimental .
  bzr branch my-source-branch my-new-branch
  cd my-new-branch
 
As a reminder, any of the above approaches can fail if the source branch
has inconsistent data within it and hasn't been reconciled yet. Please
be sure to check that before reporting problems.

Testing packs for bzr-svn users
-------------------------------

If you are using ``bzr-svn`` or are testing the prototype subtree support,
you can still use and assist in testing KnitPacks. The commands to use
are identical to the ones given above except that the name of the format
to use is ``knitpack-subtree-experimental``.

WARNING: Note that the subtree formats, ``dirstate-subtree`` and
``knitpack-subtree-experimental``, are **not** production strength yet and
may cause unexpected problems. They are required for the bzr-svn
plug-in but should otherwise only be used by people happy to live on the
bleeding edge. If you are using bzr-svn, you're on the bleeding edge anyway.
:-)

Reporting problems
------------------

If you need any help or encounter any problems, please contact the developers
via the usual ways, i.e. chat to us on IRC or send a message to our mailing
list. See http://bazaar-vcs.org/BzrSupport for contact details.


Technical notes
===============

Bazaar 0.92 adds a new format (experimental at first) implemented in
``bzrlib/repofmt/pack_repo.py``.

This format provides a knit-like interface which is quite compatible
with knit format repositories: you can get a VersionedFile for a
particular file-id, or for revisions, or for the inventory, even though
these do not correspond to single files on disk.

The on-disk format is that the repository directory contains these
files and subdirectories:

==================== =============================================
packs/               completed readonly packs
indices/             indices for completed packs
upload/              temporary files for packs currently being 
                     written
obsolete_packs/      packs that have been repacked and are no 
                     longer normally needed
pack-names           index of all live packs
lock/                lockdir
==================== =============================================

Note that for consistency we always write "indices" not "indexes".

This is implemented on top of pack files, which are written once from
start to end, then left alone.  A pack consists of a body file, plus
several index files.  There are four index files for each pack, which
have the same basename and an extension indicating the purpose of the
index:

======== ========== ======================== ==========================
extn     Purpose    Key                      References
======== ========== ======================== ==========================
``.tix`` File texts ``file_id, revision_id`` per-file parents,
                                             compression base
``.six`` Signatures ``revision_id,``         -
``.rix`` Revisions  ``revision_id,``         revision parents
``.iix`` Inventory  ``revision_id,``         revision parents,
                                             compression base
======== ========== ======================== ==========================

Indices are accessed through the ``bzrlib.index.GraphIndex`` class.  
Indices are stored as sorted files on disk.  Each line is one record,
and contains:

 * key fields
 * a value string - for all these indices, this is an ascii decimal pair
   of "offset length" giving the position of the referenced data within
   the pack body file
 * a list of zero or more reference lists
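
As a minimal sketch of how such a value string can be used, the
following Python parses an "offset length" pair and reads the matching
bytes from a pack body.  The helper names ``parse_value`` and
``read_record`` are hypothetical and are not part of ``bzrlib``; real
code goes through ``GraphIndex`` and must also cope with the absent
entries described later::

  import io


  def parse_value(value):
      """Split an ascii "offset length" value string into two ints."""
      offset, length = value.split(" ")
      return int(offset), int(length)


  def read_record(body_file, value):
      """Return the bytes of the record that `value` points at."""
      offset, length = parse_value(value)
      body_file.seek(offset)
      return body_file.read(length)


  # An in-memory file standing in for a pack body (illustrative data).
  body = io.BytesIO(b"headerfirst-recordsecond-record")
  print(read_record(body, "6 12"))

Here the value ``"6 12"`` selects the twelve bytes that start at offset
6 of the stand-in body.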

The reference lists let a graph be stored within the index.  Each
reference list entry points to another entry in the same index.  The
references are represented as a byte offset for the target within the
index file.

When a compression base is given, it indicates that the body of the text
or inventory is a forward delta from the referenced revision.  The
compression base list must have length 0 or 1.

Like packs, indices are written only once and then left unmodified.  A
GraphIndex builder is a mutable in-memory graph that can be sorted,
cross-referenced and written out when the write group completes.

There can also be index entries with a value of 'a' for absent.  These
records exist just to be pointed to in a graph.  This is used, for
example, to give the revision-parent pointer when the parent revision is
in a previous pack.

The data content for each record is a knit data chunk.  The knits are
always unannotated - the annotations must be generated when needed.
(We'd like to cache/memoize the annotations.)  The data chunks can be
moved between packs without needing to recompress them.

It is not possible to regenerate an index from the body file, because
the index contains information that is not stored in the body.
(In particular, the per-file graph is only stored in the index.)
We would like to change this in a future format.

The lock is a regular LockDir lock.  It is held for a much reduced
scope: only while updating the ``pack-names`` file.  The bulk of the
insertion can be done without the repository locked.  This is an
implementation detail; the repository user should still call
``repository.lock_write`` at the regular time but be aware this does not
correspond to a physical mutex.

Read locks control caching but do not affect writers.

The newly-added repository write group concept is very important to
KnitPack repositories.  When ``start_write_group`` is called, a new
temporary pack is created and all modifications to the repository will 
go into it until either ``commit_write_group`` or ``abort_write_group``
is called, at which time it is either finished and moved into place or
discarded respectively.  Write groups cannot be nested; only one can be
underway at a time on a Repository instance, and they must occur within
a write lock.
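
The calling sequence described above can be sketched as follows.  This
is only an illustration: the helper ``insert_in_write_group`` and its
``insert`` callback are hypothetical, and the sketch assumes nothing
about the repository beyond the ``lock_write``, ``start_write_group``,
``commit_write_group``, ``abort_write_group`` and ``unlock`` methods::

  def insert_in_write_group(repository, insert):
      """Run insert(repository) inside a write lock and a write group."""
      repository.lock_write()
      try:
          repository.start_write_group()
          try:
              insert(repository)
          except Exception:
              # Discard the temporary pack; nothing becomes visible.
              repository.abort_write_group()
              raise
          else:
              # Finish the pack and move it into place.
              repository.commit_write_group()
      finally:
          repository.unlock()

Keeping the write group strictly inside the lock, and never starting a
second group before the first has been committed or aborted, follows
the constraints listed above.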

Normally the data for each revision will be entirely within a single
pack but this is not required.

When a pack is finished, it gets a final name based on the md5 of all
the data written into the pack body file.
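
One way to derive such a name is to keep a running md5 while the body
is written, so that no second pass over the data is needed.  The sketch
below is an assumption-laden illustration, not the ``bzrlib`` code: the
class ``HashingWriter`` is hypothetical, and using the hex digest
directly as the name is an assumption (the text above only says the
name is *based on* the md5)::

  import hashlib


  class HashingWriter(object):
      """Collect written chunks and keep a running md5 of them."""

      def __init__(self):
          self._md5 = hashlib.md5()
          self._chunks = []

      def write(self, data):
          # Update the digest with exactly the bytes that are stored.
          self._md5.update(data)
          self._chunks.append(data)

      def final_name(self):
          # Assumption: the hex digest itself serves as the pack name.
          return self._md5.hexdigest()


  writer = HashingWriter()
  writer.write(b"first chunk")
  writer.write(b"second chunk")
  print(writer.final_name())

Because the name depends only on the body content, two packs with
identical bodies would receive the same name under this scheme.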

The ``pack-names`` file gives the list of all finished non-obsolete
packs.  (This should always be the same as the list of files in the
``packs/`` directory, but the file is needed for readonly http clients
that can't easily list directories, and it includes other information.)
The constraint on the ``pack-names`` list is that every file mentioned
must exist in the ``packs/`` directory.  

In rare cases, when a writer is interrupted, about-to-be-removed packs
may still be present in the directory but removed from the list.
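
The two cases above can be told apart with a simple set comparison.
The function ``compare_pack_lists`` below is a hypothetical helper, not
part of ``bzrlib``; it assumes the caller has already reduced both the
``pack-names`` entries and the directory listing to bare pack names::

  def compare_pack_lists(listed, present):
      """Compare names from pack-names with names found in packs/.

      Returns (missing, leftover): `missing` names break the
      constraint, `leftover` names are the tolerated remains of an
      interrupted writer.
      """
      listed = set(listed)
      present = set(present)
      missing = sorted(listed - present)
      leftover = sorted(present - listed)
      return missing, leftover


  missing, leftover = compare_pack_lists(["aa", "bb"], ["aa", "bb", "cc"])
  print(missing, leftover)

In this example nothing is missing, and ``cc`` is reported as a
leftover pack that is no longer listed.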

As well as the list of names, the pack-names file also contains the
size, in bytes, of each of the four indices.  This is used to bootstrap
bisection search within the indices.

In normal use, one pack will be created for each commit to a repository.
This would build up to an inefficient number of files over time, so a
``repack`` operation is available to recombine them, by producing larger
files containing data on multiple revisions.  This can be done manually
by running ``bzr pack``, and it also may happen automatically when a
write group is committed.

The repacking strategy used at the moment tries to balance not doing too
much work during commit with not having too many small files left in the
repository.  The algorithm is roughly this: the total number of
revisions in the repository is expressed as a decimal number, e.g.
"532".  Then we'll repack until we have five packs containing a hundred
revisions each, three packs containing ten revisions each, and two packs
with single revisions.  This means that each revision will normally
initially be created in a single-revision pack, then moved to a
ten-revision pack, then to a hundred-revision pack, and so on.
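
The target distribution in that description can be computed from the
decimal digits of the revision count.  The following sketch shows only
that arithmetic; the function name ``target_pack_sizes`` is
hypothetical, and the code makes no attempt to model which existing
packs actually get combined::

  def target_pack_sizes(total_revisions):
      """Return the pack sizes (in revisions) the strategy aims for."""
      sizes = []
      digits = str(total_revisions)
      for position, digit in enumerate(digits):
          # The leftmost digit counts the largest packs.
          pack_size = 10 ** (len(digits) - position - 1)
          sizes.extend([pack_size] * int(digit))
      return sizes


  print(target_pack_sizes(532))

For 532 revisions this yields five entries of 100, three of 10 and two
of 1, matching the example above; the number of packs is therefore the
sum of the digits of the revision count.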

As with other repositories, in normal use data is only inserted.
However, in some circumstances we may want to garbage-collect or prune
existing data, or reconcile indices.

  vim: tw=72 ft=rest expandtab
