~bzr-pqm/bzr/bzr.dev : contents of doc/formats.txt at revision 1064

*****************
Bazaar-NG formats
*****************

.. contents::

Since branches are working directories there is just a single
directory format.

There is one metadata directory called ``.bzr`` at the top of each
tree.  Control files inside ``.bzr`` are never touched by patches and
should not normally be edited by the user.

These files are designed so that repository-level operations are ACID
without depending on atomic operations spanning multiple files.  There
are two particular cases: aborting a transaction in the middle, and
contention from multiple processes.  We also need to be careful to
flush files to disk at appropriate points; even this may not be
totally safe if the filesystem does not guarantee ordering between
multiple file changes, so we need to be sure to roll back.

The design must also be such that the directory can simply be copied
and that hardlinked directories will work.  (So we must always replace
files, never just append.)
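
The replace-never-append rule can be sketched in Python as follows.  This
is only an illustration, not bzr's actual code: ``replace_file`` is a
hypothetical helper name, and it assumes a platform where ``os.replace``
renames atomically within one filesystem.  Because the new contents go
into a fresh file that is then renamed over the old name, a hardlinked
copy of the old file keeps its old contents.

```python
import os
import tempfile


def replace_file(path, data):
    """Write ``data`` (bytes) to ``path`` by renaming a new file into place.

    Hypothetical helper for illustration; the existing file is never
    opened for writing, so hardlinks to it are left untouched.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so that the final
    # rename does not cross a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            # Ask the OS to push the contents to disk before the rename.
            os.fsync(f.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Roll back: remove the half-written temporary file if it remains.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A file that only ever grows, such as a history list, would be extended
under this scheme by reading the old contents, adding the new line, and
calling ``replace_file`` with the result.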

A cache is kept under here of easily-accessible information about
previous revisions.  This should be under a single directory so that
it can be easily identified, excluded from backups, removed, etc.
This might contain pristine trees from previous revisions, manifests
and inventories, etc.  It might also contain working directories when
building a commit, etc.  Call this maybe ``cache`` or ``tmp``.

I wonder if we should use .zip files for revisions and cacherevs
rather than tar files so that random access is easier/more efficient.
There is a Python library ``zipfile``.
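
As a rough sketch of why zip might suit random access: a zip archive has
a central directory, so a single member can be read by name without
scanning the archive from the start, whereas a tar file is read
sequentially.  The helper names and the member names below are made up
for the example.

```python
import io
import zipfile


def build_zip(files):
    """Return the bytes of a zip archive holding ``files`` (name -> bytes)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()


def read_member(zip_bytes, name):
    """Read one member by name, located through the central directory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return zf.read(name)


if __name__ == "__main__":
    archive = build_zip({"src/main.c": b"int main(void) { return 0; }\n",
                         "README": b"hello\n"})
    print(read_member(archive, "README"))
```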


Signing XML files
*****************

bzr relies on storing hashes or GPG signatures of various XML files.
There can be multiple equivalent representations of the same XML tree,
but these will have different byte-by-byte hashes.

Once signed files are written out, they must be stored byte-for-byte
and never re-encoded or renormalized, because that would break their
hash or signature.
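
A small illustration of the problem, assuming SHA-1 as the hash (the
text above does not name one) and using invented element and attribute
names: the two documents below parse to the same tree but hash
differently.

```python
import hashlib
import xml.etree.ElementTree as ET

# Two serializations of one tree: attribute order, spacing, and the
# empty-element form differ, but the parsed content is the same.
doc_a = b'<revision committer="mbp" timestamp="1"/>'
doc_b = b'<revision timestamp="1"  committer="mbp"></revision>'


def same_element(x, y):
    """Compare two parsed elements by tag, attributes, text and children."""
    if x.tag != y.tag or x.attrib != y.attrib:
        return False
    if (x.text or "") != (y.text or ""):
        return False
    if len(x) != len(y):
        return False
    return all(same_element(cx, cy) for cx, cy in zip(x, y))


if __name__ == "__main__":
    print(same_element(ET.fromstring(doc_a), ET.fromstring(doc_b)))
    print(hashlib.sha1(doc_a).hexdigest())
    print(hashlib.sha1(doc_b).hexdigest())
```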




Branch metadata
***************

All of the following are inside ``.bzr``.

``README``
  Tells people not to touch anything here.

``branch-format``
  Identifies the parent as a Bazaar-NG branch; contains the overall
  branch metadata format as a string.

``pristine-directory``
  Identifies that this is a pristine directory and may not be
  committed to.

``patches/``
  Directory containing all patches applied to this branch, one per
  file.  Patches are stored as compressed deltas.  We also store the
  hash of the delta, hash of the before and after manifests, and
  optionally a GPG signature.

``cache/``
  Contains various cached data that can be destroyed and will be
  recreated.  (It should not be modified.)

``cache/pristine/``
  Contains cached full trees for selected previous revisions, used
  when generating diffs, etc.

``cache/inventory/``
  Contains cached inventories of previous revisions.

``cache/snapshot/``
  Contains tarballs of cached revisions of the tree, named by their
  revision id.  These can also be removed, but 

``patch-history``
  File containing the UUIDs of all patches taken in this branch,
  in the order they were taken.
  Each commit adds exactly one line to this file; lines are
  never removed or reordered.

``merged-patches``
  List of foreign patches that have been merged into this branch.
  Must have no entries in common with ``patch-history``.  Commits that
  include merges add to this file; lines are never removed or
  reordered.

``pending-merges`` 
  List (one per line) of non-mainline revisions that
  have been merged and are waiting to be committed.

``branch-name``
  User-qualified name of the branch, for the purpose of describing the
  origin of patches, e.g. ``mbp@sourcefrog.net/distcc--main``.

``friends``
  List of branches from which we have pulled; file containing a list
  of pairs of branch-name and location.

``parent``
  Default pull/push target.

``pending-inventory``
  Mapping from UUIDs to file names in the current working directory.

``branch-lock``
  Lock held while modifying the branch, to protect against clashing
  updates.


Locking
*******

Is locking a good strategy?  Perhaps some kind of read-copy-update or
seq-lock based mechanism would work better?

If we do use a locking algorithm, is it OK to rely on filesystem
locking or do we need our own mechanism?  I think most hosts should
have reasonable ``flock()`` or equivalent, even on NFS.  One risk is
that on NFS it is easy to have broken locking and not know it, so it
might be better to have something that will fail safe.

Filesystem locks go away if the machine crashes or the process is
terminated; this is a feature in that we do not need to deal with
stale locks, but also a drawback in that the lock itself does not
indicate that cleanup may be needed.

robertc points out that tla converged on renaming a directory as a
mechanism: this is one thing which is known to be atomic on almost all
filesystems.  Apparently renaming files, creating directories, making
symlinks, etc. are not good enough.
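
One way a directory-rename lock could look is sketched below.  This is
not tla's actual scheme; the class name and the ``lock-free`` and
``lock-held`` directory names are assumptions made for the example.
Whoever renames the directory holds the lock; everyone else finds the
source name gone.  Unlike a filesystem lock, the renamed directory
survives a crash, so a stale lock stays visible as a sign that cleanup
may be needed.

```python
import os
import tempfile


class DirRenameLock:
    """Lock taken by renaming a directory (hypothetical sketch)."""

    def __init__(self, base):
        self.free = os.path.join(base, "lock-free")
        self.held = os.path.join(base, "lock-held")

    def initialise(self):
        """Create the lock in its unlocked state."""
        os.mkdir(self.free)

    def acquire(self):
        """Try to take the lock; return True on success, False if held."""
        try:
            os.rename(self.free, self.held)
        except FileNotFoundError:
            # Another process already renamed the directory away.
            return False
        return True

    def release(self):
        """Give the lock back by renaming the directory to its free name."""
        os.rename(self.held, self.free)


if __name__ == "__main__":
    lock = DirRenameLock(tempfile.mkdtemp())
    lock.initialise()
    print(lock.acquire())
    print(lock.acquire())
    lock.release()
```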



Delta
*****

XML document plus a bag of patches, expressing the difference between
two revisions.  May be a partial delta.

* list of entries

  * entry
  
    * parent directory (if any)
    * before-name or null if new
    * after-name or null if deleted
    * uuid
    * type (dir, file, symlink, ...)
    * patch type (patch, full-text, xdelta, ...)
    * patch filename (?)
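
The entry fields listed above might map onto a record like the
following.  The class, its attribute names, and the ``change`` helper
are hypothetical; the real format is the XML document, not this Python
type.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeltaEntry:
    """One entry of a delta, mirroring the fields in the list above."""
    uuid: str
    kind: str                              # "dir", "file", "symlink", ...
    parent_dir: Optional[str] = None       # None if there is no parent
    before_name: Optional[str] = None      # None if the entry is new
    after_name: Optional[str] = None       # None if the entry is deleted
    patch_type: Optional[str] = None       # "patch", "full-text", "xdelta", ...
    patch_filename: Optional[str] = None

    def change(self):
        """Classify the entry from its before and after names."""
        if self.before_name is None and self.after_name is None:
            raise ValueError("entry has neither a before nor an after name")
        if self.before_name is None:
            return "added"
        if self.after_name is None:
            return "deleted"
        if self.before_name != self.after_name:
            return "renamed"
        return "kept"
```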


Inventory
*********

XML document; series of entries.  (Quite similar to the svn
``entries`` file; perhaps should even have that name.)
Stored identified by its hash.
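
Storing a document "identified by its hash" could be sketched like
this.  The in-memory dictionary, the ``HashStore`` name, and the choice
of SHA-1 are all assumptions for the example; the point is only that
the key is derived from the exact stored bytes and can be rechecked on
the way out.

```python
import hashlib


class HashStore:
    """Toy store keyed by the hash of the stored bytes."""

    def __init__(self):
        self._blobs = {}

    def add(self, data):
        """Store ``data`` (bytes) and return the key it is filed under."""
        key = hashlib.sha1(data).hexdigest()
        self._blobs[key] = data
        return key

    def get(self, key):
        """Return the bytes for ``key``, verifying them against the key."""
        data = self._blobs[key]
        if hashlib.sha1(data).hexdigest() != key:
            raise ValueError("stored bytes do not match their hash")
        return data
```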

An inventory is stored for each recorded revision, and a
``pending-inventory`` is kept for a working directory.

Inventories always have the same id as the revision they correspond
to.  bzr up to 0.0.5 explicitly stores an inventory-id; in future
versions this may be implied.



Revision
********

XML document.  Stored identified by its hash.

committer
  RFC-2822-style name of the committer.  Should match the key used to
  sign the revision.

comment
  multi-line free-form text; whitespace and line breaks preserved

timestamp
  As floating-point seconds since epoch.

branch name
  Name of the branch to which this was originally committed.    

  (I'm not totally satisfied that this is the right way to do it; the
  results will be a bit weird when a series of revisions pass through
  variously named branches.)

inventory_hash
  Acts as a pointer to the inventory for this revision.

parents
  Zero, one, or more references to parent revisions.   For each 
  the revision-id and the revision file's hash are given.  The first
  parent is by convention the revision in whose working tree the
  new revision was created.

precursor
  Must be equal to the first parent, if any are given.  For
  compatibility with bzr 0.0.5 and earlier; eventually will be
  removed. 

merged-branches
  Revision ids of complete branches merged into this revision.  If a
  revision is listed, that revision and transitively its predecessor
  and all other merged-branches are merged.  This is empty except
  where whole-branch merges have occurred.

merged-patches
  Revision ids of cherry-picked patches.  Patches whose branches are
  merged need not be listed here.  Listing a revision ID implies that
  only the change of that particular revision from its predecessor has
  been merged in.   This is empty except where cherry-picks have
  occurred.

The transitive closure avoids Arch's problem of needing to list a
large number of previous revisions.  As ddaa writes:

    Continuation revisions (created by tla tag or baz branch) are associated
    to a patchlog whose New-patches header lists the revisions associated to
    all the patchlogs present in the tree. That was introduced as an
    optimisation so the set of patchlogs in any revision could be determined
    solely by examining the patchlogs of ancestor revisions in the same
    branch. This behaves well as long as the total count of patchlog is
    reasonably small or new branches are not very frequent.

    A continuation revision on $tree currently creates a patchlog of
    about 500K. This patchlog is present in all descendent of the revision,
    and all revisions that merges it.

It may be useful at some times to keep a cache of all the branches, or
all the revisions, present in the history of a branch, so that we do
not need to walk the whole history of the branch to build this list.
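
A sketch of the history walk and of such a cache, under the assumption
that the parent references are available as a plain mapping from
revision id to a list of parent ids (``parents_of`` and ``ancestry``
are names invented here):

```python
def ancestry(revision_id, parents_of, cache=None):
    """Return the set of revision ids reachable from ``revision_id``.

    ``parents_of`` maps a revision id to its parent ids.  ``cache`` is an
    optional dict; a result found there is returned without a new walk.
    """
    if cache is None:
        cache = {}
    if revision_id in cache:
        return cache[revision_id]
    seen = set()
    todo = [revision_id]
    while todo:
        rev = todo.pop()
        if rev in seen:
            continue
        seen.add(rev)
        # Follow every parent, so merged lines of history are included.
        todo.extend(parents_of.get(rev, ()))
    cache[revision_id] = frozenset(seen)
    return cache[revision_id]
```

Each revision then only has to name its direct parents; the full set is
derived by the walk rather than listed in every revision.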
  

----

Proposed changes
****************

* Don't store parent-id in all revisions, but rather have <DIRECTORY>
  nodes that contain entries for children?

* Assign an id to the root of the tree, perhaps listed in the top of
  the inventory?
