Interrupted operations
**********************

Problem: interrupted operations
===============================

Many version control systems tend to have trouble when operations are
interrupted. This can happen in various ways:

* user hits Ctrl-C

* program hits a bug and aborts

* machine crashes

* network goes down

* tree is naively copied (e.g. by cp/tar) while an operation is in
  progress

We can reduce the window during which operations can be interrupted:
most importantly, by receiving everything off the network into a
staging area, so that network interruptions won't leave a job half
complete. But it is not possible to totally avoid this, because the
power can always fail.

I think we can reasonably rely on flushing to stable storage at
various points, and trust that such files will be accessible when we
come back up.
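
One concrete reading of "flushing to stable storage": sync both the new
file's contents and the directory entry that points at it. A minimal
sketch, with a hypothetical helper name that is not part of bzr::

    import os

    def write_durably(path, data):
        """Write ``data`` (bytes) to a new file at ``path`` and flush it
        to stable storage, so it survives a crash once we return."""
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
        try:
            os.write(fd, data)
            os.fsync(fd)        # flush the file contents to disk
        finally:
            os.close(fd)
        # also fsync the containing directory, so the new name is durable
        dir_fd = os.open(os.path.dirname(path) or '.', os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)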

I think by using this and building from the bottom up there are never
any broken pointers in the branch metadata: first we add the file
versions, then the inventory, then the revision and signature, then
link them into the revision history. The worst that can happen is
that there will be some orphaned files if this is interrupted at any
point.
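
In outline, that ordering might look like the sketch below; the
``branch`` object and its store methods are illustrative stand-ins,
not the real bzr API::

    def commit_bottom_up(branch, new_texts, inventory, revision, signature):
        """Order the writes so that nothing reachable ever points at
        data that has not been written and flushed yet."""
        # 1. file texts: referenced by the inventory, so they go in first
        for text in new_texts:
            branch.text_store.add(text)
        # 2. the inventory, which refers to the file texts
        branch.inventory_store.add(inventory)
        # 3. the revision and its signature, which refer to the inventory
        branch.revision_store.add(revision)
        branch.signature_store.add(signature)
        # 4. only now link the revision into the revision history; until
        #    this step nothing reachable mentions the new data, so an
        #    interruption leaves orphaned files at worst
        branch.append_revision(revision.revision_id)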

rsync is just impossible in the general case: it reads the files in a
fairly unpredictable order, so what it copies may not be a tree that
existed at any particular point in time. If people want to make
backups or replicate using rsync they need to treat it like any other
database and either

* make a copy which will not be updated, and rsync from that

* lock the database while rsyncing

The operating system facilities are not sufficient to protect against
all of these. We cannot satisfactorily commit a whole atomic
transaction in one step.

Operations might be updating either the metadata or the working copy.

The working copy is in some ways more difficult:

* Other processes are allowed to modify it from time to time in
  arbitrary ways.

  If they modify it while Bazaar is working then they will lose, but
  we should at least try to make sure there is no corruption.

* We can't atomically replace the whole working copy. We can
  (semi) atomically update particular files (see the sketch after
  this list).

* If the working copy files are in a weird state it is hard to know
  whether that occurred because bzr's work was interrupted or because
  the user changed them.

  (A reasonable user might run ``bzr revert`` if they notice
  something like this has happened, but it would be nice to avoid
  it.)
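
One way to read "(semi) atomically update particular files": write the
new content to a temporary file in the same directory, flush it, and
rename it over the old file, so readers see either the old version or
the complete new one. A minimal sketch; the helper name is
hypothetical::

    import os
    import tempfile

    def replace_file(path, new_bytes):
        """Replace ``path`` with ``new_bytes``; an interruption leaves
        either the old file or the new one, plus at worst a stray
        temporary file."""
        dirname = os.path.dirname(path) or '.'
        fd, tmp_path = tempfile.mkstemp(prefix='.bzr-tmp-', dir=dirname)
        try:
            os.write(fd, new_bytes)
            os.fsync(fd)            # new content is on disk before the rename
        finally:
            os.close(fd)
        os.rename(tmp_path, path)   # atomic on POSIX filesystems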

We don't want to leave things in a broken state.


Solution: write-ahead journaling?
=================================

One possible solution might be write-ahead journaling:

Before beginning a change, write and flush to disk a description of
what change will be made.

Every bzr operation checks this journal; if there are any pending
operations waiting then they are completed first, before proceeding
with whatever the user wanted. (Perhaps this should be in a
separate ``bzr recover``, but I think it's better to just do it,
perhaps with a warning.)

The descriptions written into the journal need to be simple enough
that they can safely be re-run in a totally different context. They
must not depend on any external resources which might have gone
away.
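
A rough sketch of the shape this could take; the journal location, the
entry format, and the dispatch table are all made up for illustration::

    import json
    import os

    JOURNAL = '.bzr/journal'   # hypothetical location, not the real layout

    # hypothetical dispatch table: change kind -> function that (re)applies it
    REPLAYERS = {}

    def recover():
        """Run at the start of every operation: finish any pending change."""
        if not os.path.exists(JOURNAL):
            return
        with open(JOURNAL) as f:
            entry = json.load(f)
        REPLAYERS[entry['kind']](entry)   # entries must be self-contained
        os.remove(JOURNAL)

    def run_journaled(entry):
        """Record what will be done, flush that record, then do it."""
        recover()                          # complete interrupted work first
        with open(JOURNAL, 'w') as f:      # 1. write the intent...
            json.dump(entry, f)
            f.flush()
            os.fsync(f.fileno())           # ...and flush it to stable storage
        REPLAYERS[entry['kind']](entry)    # 2. make the actual change
        os.remove(JOURNAL)                 # 3. done; clear the journal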

If we can do anything without depending on journaling we should.

It may be that the only case where we cannot get by with just
ordering is in updating the working copy; the user might get into a
difficult situation where they have pulled in a change and only half
the working copy has been updated. One solution would be to remove
the working copy files, or mark them readonly, while this is in
progress. We don't want people accidentally writing to a file that
needs to be overwritten.

Or perhaps, in this particular case, it is OK to leave them pointing
to an old state, and let people revert if they're sure they want the
new one? Sounds dangerous.

Aaron points out that this basically sounds like changesets. So
before updating the history, we first calculate the changeset and
write it out to stable storage as a single file. We then apply the
changeset, possibly updating several files. Each command should check
whether such an application was in progress.
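
In the same spirit as the journal sketch above, specialized to this
case; the file name and the callables are hypothetical::

    import os

    PENDING = '.bzr/pending-changeset'   # hypothetical single-file changeset

    def pull(branch, compute_changeset, apply_changeset):
        """Write the changeset out before touching the working copy."""
        if os.path.exists(PENDING):
            # a previous application was interrupted; finish it first
            with open(PENDING, 'rb') as f:
                apply_changeset(f.read())
            os.remove(PENDING)
        cset = compute_changeset(branch)   # 1. calculate the changeset
        with open(PENDING, 'wb') as f:     # 2. write it to stable storage
            f.write(cset)
            f.flush()
            os.fsync(f.fileno())
        apply_changeset(cset)              # 3. update the affected files
        os.remove(PENDING)                 # 4. the application is complete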