~bzr-pqm/bzr/bzr.dev : contents of doc/formats.txt at revision 1185.33.86

*****************
Bazaar-NG formats
*****************

.. contents::

Since branches are working directories there is just a single
directory format.

There is one metadata directory called ``.bzr`` at the top of each
tree.  Control files inside ``.bzr`` are never touched by patches and
should not normally be edited by the user.

These files are designed so that repository-level operations are ACID
without depending on atomic operations spanning multiple files.  There
are two particular cases: aborting a transaction in the middle, and
contention from multiple processes.  We also need to be careful to
flush files to disk at appropriate points; even this may not be
totally safe if the filesystem does not guarantee ordering between
multiple file changes, so we need to be sure to roll back.

The design must also be such that the directory can simply be copied
and that hardlinked directories will work.  (So we must always replace
files, never just append.)
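
A minimal sketch of the "replace, never append" rule, assuming a
hypothetical helper named ``replace_file`` (it is not part of bzr).  The
new contents go to a temporary file in the same directory, which is then
renamed over the old file, so a hardlinked copy keeps the old bytes:

```python
import os
import tempfile


def replace_file(path, data):
    """Write ``data`` (bytes) to ``path`` via a temporary file and a rename.

    Hypothetical helper: the old file is never modified in place, so any
    hardlink to it keeps seeing the old contents.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the target so that the final
    # rename stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the contents to disk before renaming
        os.replace(tmp_path, path)
    except BaseException:
        # Roll back: remove the partial temporary file if it still exists.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Adding a line to a file such as ``patch-history`` would then mean reading
the old contents, appending in memory, and calling ``replace_file`` with
the full new contents.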

A cache of easily accessible information about previous revisions is
kept under ``.bzr``.  It should live in a single directory so that
it can be easily identified, excluded from backups, removed, etc.
It might contain pristine trees from previous revisions, manifests
and inventories, etc.  It might also contain working directories used
while building a commit.  Call this maybe ``cache`` or ``tmp``.

I wonder if we should use .zip files for revisions and cacherevs
rather than tar files so that random access is easier/more efficient.
There is a Python library ``zipfile``.
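
To illustrate the random-access point: a zip archive carries an index of
its members, so one file can be read without unpacking the rest, whereas
a tar file generally has to be scanned from the start.  The sketch below
uses only the standard ``zipfile`` module; the member names and the
``read_member`` helper are made up for the example:

```python
import io
import zipfile


def read_member(archive_bytes, name):
    """Return one member of a zip archive without unpacking the others."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return archive.read(name)


# Build a small in-memory archive standing in for a cached revision.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("README", b"hello\n")
    archive.writestr("src/main.c", b"int main(void) { return 0; }\n")

print(read_member(buffer.getvalue(), "src/main.c"))
```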


Signing XML files
*****************

bzr relies on storing hashes or GPG signatures of various XML files.
There can be multiple equivalent representations of the same XML tree,
but these will have different byte-by-byte hashes.

Once signed files are written out, they must be stored byte-for-byte
and never re-encoded or renormalized, because that would break their
hash or signature.
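
The problem can be seen with two serializations of one element.  This is
only an illustration: the element and attribute names are invented, and
SHA-1 is assumed as the hash for the sake of the example:

```python
import hashlib
import xml.etree.ElementTree as ET

# Two serializations of the same XML tree (attribute quoting and
# whitespace inside the tag differ).
first = b'<revision committer="a@example.com" timestamp="1.0"/>'
second = b"<revision  committer='a@example.com' timestamp='1.0' />"

same_tree = ET.fromstring(first).attrib == ET.fromstring(second).attrib
same_hash = hashlib.sha1(first).hexdigest() == hashlib.sha1(second).hexdigest()

print(same_tree)  # True: a parser sees identical attributes
print(same_hash)  # False: the bytes, and therefore the hashes, differ
```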




Branch metadata
***************

All of the following live inside ``.bzr``:

``README``
  Tells people not to touch anything here.

``branch-format``
  Identifies the parent as a Bazaar-NG branch; contains the overall
  branch metadata format as a string.

``pristine-directory``
  Identifies that this is a pristine directory and may not be
  committed to.

``patches/``
  Directory containing all patches applied to this branch, one per
  file.  Patches are stored as compressed deltas.  We also store the
  hash of the delta, hash of the before and after manifests, and
  optionally a GPG signature.

``cache/``
  Contains various cached data that can be destroyed and will be
  recreated.  (It should not be modified.)

``cache/pristine/``
  Contains cached full trees for selected previous revisions, used
  when generating diffs, etc.

``cache/inventory/``
  Contains cached inventories of previous revisions.

``cache/snapshot/``
  Contains tarballs of cached revisions of the tree, named by their
  revision id.  These can also be removed, but 

``patch-history``
  File containing the UUIDs of all patches taken in this branch,
  in the order they were taken.
  Each commit adds exactly one line to this file; lines are
  never removed or reordered.

``merged-patches``
  List of foreign patches that have been merged into this branch.
  Must have no entries in common with ``patch-history``.  Commits that
  include merges add to this file; lines are never removed or
  reordered.

``pending-merges`` 
  List (one per line) of non-mainline revisions that
  have been merged and are waiting to be committed.

``branch-name``
  User-qualified name of the branch, for the purpose of describing the
  origin of patches, e.g. ``mbp@sourcefrog.net/distcc--main``.

``friends``
  List of branches from which we have pulled; file containing a list
  of pairs of branch-name and location.

``parent``
  Default pull/push target.

``pending-inventory``
  Mapping from UUIDs to file names in the current working directory.

``branch-lock``
  Lock held while modifying the branch, to protect against clashing
  updates.


Locking
*******

Is locking a good strategy?  Perhaps some kind of read-copy-update or
seq-lock based mechanism would work better?

If we do use a locking algorithm, is it OK to rely on filesystem
locking or do we need our own mechanism?  I think most hosts should
have reasonable ``flock()`` or equivalent, even on NFS.  One risk is
that on NFS it is easy to have broken locking and not know it, so it
might be better to have something that will fail safe.

Filesystem locks go away if the machine crashes or the process is
terminated; this can be a feature in that we do not need to deal with
stale locks, but also a drawback in that the lock itself does not
indicate that cleanup may be needed.

robertc points out that tla converged on renaming a directory as a
mechanism: this is one thing which is known to be atomic on almost all
filesystems.  Apparently renaming files, creating directories, making
symlinks, etc. are not good enough.
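
A rough sketch of the directory-rename idea, not a description of what
tla or bzr actually do.  The function names, the ``holder`` file, and the
use of ``branch-lock`` as a directory name are assumptions made for the
example.  The lock directory is prepared under a private name and then
renamed into place; the rename fails if a non-empty directory is already
there:

```python
import os
import tempfile


def acquire_lock(control_dir, lock_name="branch-lock"):
    """Try to take the lock; return True on success, False if it is held."""
    lock_path = os.path.join(control_dir, lock_name)
    # Prepare a private, non-empty directory first ...
    pending = tempfile.mkdtemp(dir=control_dir, prefix="pending-lock-")
    info_path = os.path.join(pending, "holder")
    with open(info_path, "w") as info:
        info.write("pid %d\n" % os.getpid())
    try:
        # ... then rename it into place.  The rename fails if a non-empty
        # directory already exists at lock_path, i.e. if the lock is held.
        os.rename(pending, lock_path)
        return True
    except OSError:
        os.unlink(info_path)
        os.rmdir(pending)
        return False


def release_lock(control_dir, lock_name="branch-lock"):
    """Release the lock by renaming it away, then deleting the remains."""
    lock_path = os.path.join(control_dir, lock_name)
    released = "%s.released-%d" % (lock_path, os.getpid())
    os.rename(lock_path, released)  # the lock is free from this point on
    os.unlink(os.path.join(released, "holder"))
    os.rmdir(released)
```

Unlike ``flock()``, a lock of this kind survives a crash, so it does signal
that cleanup may be needed, at the cost of having to detect and break
stale locks.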



Delta
*****

XML document plus a bag of patches, expressing the difference between
two revisions.  May be a partial delta.

* list of entries

  * entry
  
    * parent directory (if any)
    * before-name or null if new
    * after-name or null if deleted
    * uuid
    * type (dir, file, symlink, ...)
    * patch type (patch, full-text, xdelta, ...)
    * patch filename (?)


Inventory
*********

XML document; series of entries.  (Quite similar to the svn
``entries`` file; perhaps should even have that name.)
It is stored and identified by its hash.

An inventory is stored for recorded revisions, also a
``pending-inventory`` for a working directory.

Inventories always have the same id as the revision they correspond
to.  bzr up to 0.0.5 explicitly stores an inventory-id; in future
versions this may be implied.



Revision
********

XML document, stored and identified by its hash.

committer
  RFC-2822-style name of the committer.  Should match the key used to
  sign the revision.

comment
  multi-line free-form text; whitespace and line breaks preserved

timestamp
  As floating-point seconds since epoch.

branch name
  Name of the branch to which this was originally committed.    

  (I'm not totally satisfied that this is the right way to do it; the
  results will be a bit weird when a series of revisions pass through
  variously named branches.)

inventory_hash
  Acts as a pointer to the inventory for this revision.

parents
  Zero, one, or more references to parent revisions.   For each 
  the revision-id and the revision file's hash are given.  The first
  parent is by convention the revision in whose working tree the
  new revision was created.

precursor
  Must be equal to the first parent, if any are given.  For
  compatibility with bzr 0.0.5 and earlier; eventually will be
  removed. 

merged-branches
  Revision ids of complete branches merged into this revision.  If a
  revision is listed, that revision and, transitively, its predecessors
  and all of their merged-branches are merged.  This is empty except
  where merges have occurred.

merged-patches
  Revision ids of cherry-picked patches.  Patches whose branches are
  merged need not be listed here.  Listing a revision ID implies that
  only the change of that particular revision from its predecessor has
  been merged in.   This is empty except where cherry-picks have
  occurred.

The transitive closure avoids Arch's problem of needing to list a
large number of previous revisions.  As ddaa writes:

    Continuation revisions (created by tla tag or baz branch) are associated
    to a patchlog whose New-patches header lists the revisions associated to
    all the patchlogs present in the tree. That was introduced as an
    optimisation so the set of patchlogs in any revision could be determined
    solely by examining the patchlogs of ancestor revisions in the same
    branch. This behaves well as long as the total count of patchlog is
    reasonably small or new branches are not very frequent.

    A continuation revision on $tree currently creates a patchlog of
    about 500K. This patchlog is present in all descendent of the revision,
    and all revisions that merges it.

It may be useful at times to keep a cache of all the branches, or
all the revisions, present in the history of a branch, so that we do
not need to walk the whole history of the branch to build this list.
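
As a sketch of what such a cache would hold, the set of revisions in a
branch's history is the transitive closure over parent references.  The
``parents_of`` mapping and the revision ids below are hypothetical
stand-ins for data read from the revision files:

```python
def ancestry(revision_id, parents_of):
    """Return the set of revision ids reachable from ``revision_id``.

    ``parents_of`` is a hypothetical mapping from a revision id to the
    list of its parent revision ids.
    """
    seen = set()
    pending = [revision_id]
    while pending:
        current = pending.pop()
        if current in seen:
            continue
        seen.add(current)
        pending.extend(parents_of.get(current, []))
    return seen


# r3 has two parents: the mainline r2 and the merged revision m1.
graph = {"r1": [], "r2": ["r1"], "m1": ["r1"], "r3": ["r2", "m1"]}
cache = {tip: ancestry(tip, graph) for tip in graph}
print(sorted(cache["r3"]))  # ['m1', 'r1', 'r2', 'r3']
```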
  

----

Proposed changes
****************

* Don't store parent-id in all revisions, but rather have <DIRECTORY>
  nodes that contain entries for children?

* Assign an id to the root of the tree, perhaps listed in the top of
  the inventory?
