~bzr-pqm/bzr/bzr.dev : contents of doc/developers/repository.txt at revision 6079

============
Repositories
============

Status
======

:Date: 2007-07-08

This document describes the services repositories offer and need to offer
within bzrlib.


.. contents::


Motivation
==========

To provide clarity to API and performance tradeoff decisions by
centralising the requirements placed upon repositories.


Terminology
===========

A **repository** is a store of historical data for bzr.


Command Requirements
====================

==================  ====================
Command             Needed services
==================  ====================
Add                 None
Annotate            Annotated file texts, revision details
Branch              Fetch, Revision parents, Inventory contents, All file texts
Bundle              Maximally compact diffs (file and inventory), Revision graph
                    difference, Revision texts.
Commit              Insert new texts, insert new inventory via delta, insert
                    revision, insert signature
Fetching            Revision graph difference, ghost identification, stream data
                    introduced by a set of revisions in some cheap form, insert
                    data from a stream, validate data during insertion.
Garbage Collection  Exclusive lock the repository preventing readers.
Revert              Delta from working tree to historical tree, and then
                    arbitrary file access to obtain the texts of differing
                    files.
Uncommit            Revision graph access.
Status              Revision graph access, revision text access, file
                    fingerprint information, inventory differencing.
Diff                As status but also file text access.
Merge               As diff but needs up to twice as many file texts -
                    base and other for each changed file. Also an initial
                    fetch is needed.
Log                 Revision graph (entire at the moment) access,
                    sometimes status between adjacent revisions. Log of a
                    file needs per-file-graph. Dominator caching or
                    similar tools may be needed to prevent entire graph
                    access.
Missing             Revision graph access, and revision texts to show
                    output.
Update              As for merge, but twice.
==================  ====================

Data access patterns
====================

Ideally we can make our data access for commands such as branch dovetail
well with the native storage in the repository, in the common case. Doing
this may require choosing the behaviour of some commands to allow us to
have a smaller range of access patterns which we can optimise more
heavily. Alternatively, if each command is very predictable in its data
access pattern, we may be able to hint to the low level layers which
pattern is needed on a per command basis to get efficient behaviour.

===================  ===================================================
Command              Data access pattern
===================  ===================================================
Annotate-cached      Find text name in an inventory, Recreate one text,
                     recreate annotation regions
Annotate-on demand   Find file id from name, then breadth-first pre-order
                     traversal of versions-of-the-file until the annotation
                     is complete.
Branch               Fetch, possibly taking a copy of any file present in a
                     nominated revision when it is validated during fetch.
Bundle               Revision-graph as for fetch; then inventories for
                     selected revision_ids to determine file texts, then
                     mp-parent deltas for all determined file texts.
Commit               Something like basis-inventories read to determine
                     per-file graphs, insertion of new texts (which may
                     be delta compressed), generation of annotation
                     regions if the repository is configured to do so,
                     finalisation of the inventory pointing at all the new
                     texts and finally a revision and possibly signature.
Fetching             Revision-graph searching to find the graph difference.
                     Scan the inventory data introduced during the selected
                     revisions, and grab the on disk data for the found
                     file texts, annotation region data, per-file-graph
                     data, piling all this into a stream.
Garbage Collection   Basically a mass fetch of all the revisions which
                     branches point at, then a bait and switch with the old
                     repository thus removing unreferenced data.
Revert               Revision graph access for the revision being reverted
                     to, inventory extraction of that revision,
                     dirblock-order file text extract for files that were
                     different.
Uncommit             Revision graph access to synthesise pending-merges:
                     linear access down the left-hand side, with
                     is_ancestor checks between all the found
                     non-left-hand-side parents.
Status               Lookup the revisions added by pending merges and their
                     commit messages. Then an inventory difference between
                     the trees involved, which may include a working tree.
                     If there is a working tree involved then the file
                     fingerprint for cache-misses on files will be needed.
                     Note that dirstate caches most of this, making
                     repository performance largely irrelevant; but if it
                     were fast enough, dirstate might be able to be simpler.
Diff                 As status but also file text access for every file
                     that is different - either one text (working tree
                     diff) or a diff of two (revision to revision diff).
Merge                As diff but needs up to twice as many file texts -
                     base and other for each changed file. Also an initial
                     fetch is needed. Note that the access pattern is
                     probably id-based at the moment, but that may be
                     'fixed' with the iter_changes based merge. Also note
                     that while the texts from OTHER are the ones accessed,
                     this is equivalent to the **newest** form of each text
                     changed from BASE to OTHER. And as the repository
                     looks at when data is introduced, this should be the
                     pattern we focus on for merge.
Log                  Revision graph (entire at the moment) access, log of a
                     file wants a per-file-graph. Log -v will want
                     newest-first inventory deltas between revisions.
Missing              Revision graph access, breadth-first pre-order.
Update               As for merge, but twice.
===================  ===================================================

Patterns used
-------------

Note that these patterns can be changed by changing what we store. For
instance, if the repository satisfies mpdiff requests, then bundle can be
defined in terms of mpdiff lookups rather than the file text lookups
needed to create mpdiffs. If the repository satisfies full text requests
only, then you need the topological access to build up the desired
mpdiffs.

=========================================== =========
Pattern                                     Commands
=========================================== =========
Single file text                            annotate, diff
Files present in one revision               branch
Newest form of files altered by revisions   merge, update?
Topological access to file versions/deltas  annotate-uncached
Stream all data required to recreate revs   branch (lightweight)
Stream file texts in topological order      bundle
Write full versions of files, inv, rev, sig commit
Write deltas of files, inv for one tree     commit
Stream all data introduced by revs          fetch
Regenerate/combine deltas of many trees     fetch, pack
Reconstruct all texts and validate trees    check, fetch
Revision graph walk                         fetch, pack, uncommit,
                                            annotate-uncached,
                                            merge, log, missing
Top down access multiple invs concurrently  status, diff, merge?, update?
Concurrent access to N file texts           diff, merge
Iteration of inventory deltas               log -v, fetch?
=========================================== =========

Facilities to scale well
========================

Indices
-------

We want sub-linear access to all data in the repository. This suggests
that everything is indexed to some degree.

Often we know the kind of data we are accessing, which allows us to
partition our indices if that will help (e.g. by reducing the total index
size for queries that only care about the revision graph).

Indices that support our data access patterns will usually display
increased locality of reference, reducing the impact of large indices
without needing careful page size management or other tricks.

We need repository wide indices. For the current repositories this is
achieved by dividing the keyspace (revisions, signatures, inventories,
per-fileid) and then having an append only index within each keyspace.
For pack based repositories we will want some means to query the index of
each component pack, presumably as a single logical index.
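
As a rough illustration of the "single logical index" idea, the sketch
below chains lookups across several per-pack indices. The class name
``CombinedIndexSketch`` and the use of plain dicts as stand-ins for
per-pack indices are assumptions made for this example, not bzrlib API.

.. code-block:: python

    class CombinedIndexSketch(object):
        """Present several per-pack indices as one logical index.

        Each component index is modelled as a plain dict mapping a key to
        a value; a real index would live on disk.
        """

        def __init__(self, indices):
            self._indices = list(indices)

        def lookup(self, keys):
            """Yield (key, value) for each requested key that is present."""
            remaining = set(keys)
            for index in self._indices:
                if not remaining:
                    # All keys found: later indices need not be consulted.
                    break
                found = [key for key in remaining if key in index]
                for key in found:
                    yield key, index[key]
                remaining.difference_update(found)

Earlier indices win when a key appears in more than one component, and
the loop stops as soon as every requested key has been found.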

It would be nice if indexing was made cleanly separate from storage. That
suggests indices should not know the meaning of the lookup; indices which
offer particular ordering or graph walking facilities will clearly need
that information, but perhaps they don't need to know the semantics?

Index size
~~~~~~~~~~

Smaller indexes are good. We could go with one big index, or a different
index for different operation styles. As multiple indices will occupy more
space in total, we should think carefully before adding indices.

Index ordering
~~~~~~~~~~~~~~

Looking at the data access patterns some operations such as graph walking
can clearly be made more efficient by offering direct iteration rather
than repeated reentry into the index - so having indices that support
iteration in such a style would be useful eventually.
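
To illustrate the difference, the sketch below walks a revision graph one
generation at a time, so the index is queried once per generation rather
than once per revision. ``lookup_parents`` is a hypothetical callable
(not a bzrlib API) that takes a set of keys and returns a dict mapping
each known key to a tuple of parent keys.

.. code-block:: python

    def iter_ancestry(lookup_parents, heads):
        """Yield every revision reachable from heads, breadth-first.

        lookup_parents is called once per generation with the whole set
        of pending keys, not once per revision.
        """
        seen = set()
        pending = set(heads)
        while pending:
            parent_map = lookup_parents(pending)
            seen.update(pending)
            next_pending = set()
            for key in sorted(pending):
                yield key
                # Keys missing from the index (ghosts) have no known
                # parents, so the walk simply stops at them.
                next_pending.update(parent_map.get(key, ()))
            pending = next_pending - seen

An index that supported this style natively could go further and perform
the whole walk internally, avoiding even the per-generation reentry.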

Changing our current indexes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We can consider introducing cleaner indices in advance of a full pack
based repository.

There are many possibilities for this, but I've chosen one that seems ok
to me for illustration.

A key element is to consider when indices are updated. I think that the
update style proposed for pack based repositories - write once, then when
we group data again rewrite a new single index - is sufficient.

Replace .kndx
^^^^^^^^^^^^^

We could discard the per-knit .kndx by writing a new index at the end of
every bzr transaction, indexing the new data introduced by the bzr
operation (e.g. at the end of fetch). This can be based on the new
``GraphIndex`` index type.

Encoding a knit entry into a ``GraphIndex`` can be done as follows:

* Change the key to include a prefix of the knit name, to allow filtering
  out of data from different knits.
* Encode the parents from the knit as the zeroth node reference list.
* If the knit hunk was delta compressed encode the node it was delta
  compressed against as the 1st node reference list (otherwise the 1st
  node reference list will be empty to indicate no compression parents).
* For the value, encode (similarly to the current knit format) the byte
  offset of the data record in the knit, the byte length of the data
  record in the knit, and the no-end-of-line flag.
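
The encoding above can be sketched as a function producing a (key, value,
reference lists) triple. The function name, the tuple-shaped keys and the
exact value layout (a flag character followed by space-separated offset
and length) are assumptions for illustration; the actual serialisation is
not specified here.

.. code-block:: python

    def encode_knit_entry(knit_name, version_id, parents,
                          compression_parent, offset, length, no_eol):
        """Build an index node for one knit record.

        Returns (key, value, reference_lists) where reference_lists[0]
        holds the graph parents and reference_lists[1] holds the
        compression parent, if any.
        """
        def make_key(version):
            # Prefix with the knit name so different knits can be
            # filtered apart.
            return (knit_name, version)

        key = make_key(version_id)
        graph_refs = tuple(make_key(parent) for parent in parents)
        if compression_parent is None:
            # Empty list: the record has no compression parent.
            delta_refs = ()
        else:
            delta_refs = (make_key(compression_parent),)
        # Hypothetical value layout: flag, byte offset, byte length.
        flag = 'N' if no_eol else ' '
        value = '%s%d %d' % (flag, offset, length)
        return key, value, (graph_refs, delta_refs)

For example, a delta-compressed record for ``rev-2`` of ``file-id`` built
on ``rev-1`` would carry ``('file-id', 'rev-1')`` in both reference lists.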

It's important to note that knit repositories cannot be regenerated by
scanning .knits, so a mapped index is still irreplaceable and must be
transmitted on push/pull.

A potential improvement exists by specialising this further to not record
data that is not needed - e.g. an index of revisions does not need to
support a pointer to a parent compressed text as revisions.knit is not
delta-compressed ever. Likewise signatures do not need the parent pointers
at all as there is no 'signature graph'.

Data
----

Moving to pack based repositories
---------------------------------

We have a number of challenges to solve.

Naming of files
~~~~~~~~~~~~~~~

As long as the file name is unique it does not really matter. It might be
interesting to have it be deterministic based on content, but there are no
specific problems we have solved by doing that, and doing so would require
hashing the full file. OTOH hashing the full file is a cheap way to detect
bit-errors in transfer (such as windows corruption). Non-reused file names
are required for data integrity, as clients having read an index will
readv at arbitrary times later.
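
If content-based names were wanted, a minimal sketch might look like the
following. The choice of SHA-1, the ``.pack`` suffix and both function
names are assumptions for this example; the text above does not settle on
a hash or a naming scheme.

.. code-block:: python

    import hashlib


    def content_based_name(pack_bytes, suffix='.pack'):
        """Return a file name derived from the full content of a pack.

        The same bytes always give the same name, and different content
        gives a different name, so names are not reused for new data.
        """
        return hashlib.sha1(pack_bytes).hexdigest() + suffix


    def verify_transfer(name, received_bytes, suffix='.pack'):
        """Return True if received_bytes still hash to the given name."""
        return content_based_name(received_bytes, suffix) == name

The cost is the one noted above: the whole file has to be hashed before
its final name is known.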

Discovery of files
~~~~~~~~~~~~~~~~~~

With non-listable transports, how should the collection of pack/index
files be found? Initially, record a list of all the pack/index files from
write actions. (Require writable transports to be listable.) We can then
use a heuristic to statically combine pack/index files later.

Housing files
~~~~~~~~~~~~~

Combining indices on demand
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Merging data on push
~~~~~~~~~~~~~~~~~~~~

A trivial implementation would be to make a pack which has just the data
needed for the push, then send that. More sophisticated things would be
streaming single-pass creation, and also using this as an opportunity to
increase the packedness of the local repo.

Choosing compression/delta support
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Caching and writing of data
===========================

Repositories try to provide a consistent view of the data within them
within a 'lock context'.

Locks
-----

Locks come in two flavours - read locks and write locks. Read locks allow
data to be read from the repository. Write locks allow data to be read and
signal that you intend to write data at some point. The actual writing of
data must take place within a Write Group.

Write locks provide a cache of repository data during the period of the
write lock, and allow write_groups to be acquired. For some repositories
the presence of a write lock is exclusive to a single client; for others,
which are lock free or use server side locks (e.g. svn), the write lock
simply provides the cache context.

Write Groups
------------

Write groups are the only allowed means for inserting data into a
repository.  These are created by ``start_write_group``, and concluded by
either ``commit_write_group`` or ``abort_write_group``.  A write lock must
be held on the repository for the entire duration.  At most one write
group can be active on a repository at a time.
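
The rules above can be modelled with a small toy class. Only
``start_write_group``, ``commit_write_group`` and ``abort_write_group``
are named in this document; ``ToyRepository``, ``lock_write``, ``unlock``,
``add_text`` and the in-memory storage are assumptions made for this
sketch. The toy buffers data until commit, so it models a repository with
atomic insertion, which not every format provides.

.. code-block:: python

    class ToyRepository(object):
        """In-memory model of the write lock / write group rules."""

        def __init__(self):
            self._write_locked = False
            self._pending = None  # Data inserted by the active group.
            self.texts = {}

        def lock_write(self):
            self._write_locked = True

        def unlock(self):
            if self._pending is not None:
                raise RuntimeError('write group still active')
            self._write_locked = False

        def start_write_group(self):
            if not self._write_locked:
                raise RuntimeError('a write lock must be held')
            if self._pending is not None:
                raise RuntimeError('a write group is already active')
            self._pending = {}

        def add_text(self, key, text):
            if self._pending is None:
                raise RuntimeError('no active write group')
            self._pending[key] = text

        def commit_write_group(self):
            if self._pending is None:
                raise RuntimeError('no active write group')
            self.texts.update(self._pending)
            self._pending = None

        def abort_write_group(self):
            if self._pending is None:
                raise RuntimeError('no active write group')
            self._pending = None


    # Typical calling pattern: lock, open a group, commit or abort.
    repo = ToyRepository()
    repo.lock_write()
    try:
        repo.start_write_group()
        try:
            repo.add_text(('file-id', 'rev-1'), b'hello\n')
        except Exception:
            repo.abort_write_group()
            raise
        else:
            repo.commit_write_group()
    finally:
        repo.unlock()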

Write groups signal to the repository the window during which data is
actively being inserted. Several write groups could be committed during a
single lock.

There is no guarantee that data inserted during a write group will be
invisible in the repository if the write group is not committed.
Specifically repositories without atomic insertion facilities will be
writing data as it is inserted within the write group, and may not be able
to revert that data - e.g. in the event of a dropped SFTP connection in a
knit repository, inserted file data will be visible in the repository. Some
repositories have an atomic insertion facility, and for those
all-or-nothing will apply.

The precise meaning of a write group is format specific. For instance a
knit based repository treats the write group methods as dummy calls,
simply meeting the API that clients will use. A pack based repository will
open a new pack container at the start of a write group, and rename it
into place at commit time.
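
The pack-style behaviour can be sketched with ordinary files: data is
written under a temporary name and only renamed to its final name at
commit. The class name, directory layout and file names are hypothetical,
and a real pack container has its own on-disk format that is not modelled
here.

.. code-block:: python

    import os
    import tempfile


    class PackWriteGroupSketch(object):
        """Write data to a temporary file, renaming it on commit."""

        def __init__(self, directory):
            self._directory = directory
            self._file = None
            self._temp_path = None

        def start_write_group(self):
            fd, self._temp_path = tempfile.mkstemp(
                suffix='.tmp', dir=self._directory)
            self._file = os.fdopen(fd, 'wb')

        def write(self, data):
            self._file.write(data)

        def commit_write_group(self, final_name):
            self._file.close()
            final_path = os.path.join(self._directory, final_name)
            # Readers never see a partially written file under the
            # final name.
            os.rename(self._temp_path, final_path)
            self._file = None
            return final_path

        def abort_write_group(self):
            self._file.close()
            os.remove(self._temp_path)
            self._file = None

Aborting simply discards the temporary file, which is what gives this
style its all-or-nothing behaviour.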


..
   vim: ft=rst tw=74 ai

