~bzr-pqm/bzr/bzr.dev : contents of doc/developers/content-filtering.txt at revision 6613

*****************
Content Filtering
*****************

Content filtering is the feature by which Bazaar can do line-ending
conversion or keyword expansion so that the files that appear in the
working tree are not precisely the same as the files stored in the
repository.

This document describes the implementation; see the user guide for how to
use it.


We distinguish between the *canonical form*, which is stored in the
repository, and the *convenient form*, which is stored in the working
tree.  The convenient form will, for example, use OS-local newline
conventions or have keywords expanded; the canonical form will not.  We
use these names rather than, e.g., "filtered" and "unfiltered" because
filters are applied both when reading and when writing, so those names
might cause confusion.

Content filtering is only active on working trees that support it, which
is format 2a and later.

Content filtering is configured by rules that match file patterns.
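
As a rough illustration of pattern-based rules, the sketch below looks
up preferences for a path by matching glob patterns with Python's
``fnmatch`` module.  The rule table, the ``eol`` preference values and
the ``preferences_for`` helper are hypothetical names chosen for this
example; Bazaar's own rule storage and matching are not shown here::

    from fnmatch import fnmatchcase

    # Hypothetical rule table: (glob pattern, preferences).  The first
    # matching pattern wins.
    RULES = [
        ("*.txt", {"eol": "native"}),
        ("*.py", {"eol": "lf"}),
    ]


    def preferences_for(path, rules=RULES):
        """Return the preferences of the first rule matching the file name."""
        name = path.rsplit("/", 1)[-1]
        for pattern, prefs in rules:
            if fnmatchcase(name, pattern):
                return prefs
        # No rule matched: no filtering preferences apply.
        return {}

A path that matches no pattern gets an empty preference dict, which in
this sketch stands for "no filters configured".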

Filters
*******

Filters come in pairs: a read filter (reading convenient->canonical) and
a write filter (writing canonical->convenient).  There is no requirement
that they be symmetric or that their output be determined solely by
their input, though in general both these properties will hold.  Filters
are allowed to change the size of the content, and filters such as
line-ending conversion commonly will.

Filters are fed a sequence of byte chunks (so that they don't have to
hold the whole file in memory).  There is no guarantee that the chunks
will be aligned with line endings.  Write filters are passed a context
object through which they can obtain some information about, e.g., which
file they're working on.  (See the ``bzrlib.filters`` docstring.)
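
A minimal sketch of such a filter pair, written as generators over byte
chunks, might look like the following.  The function names and the
unused ``context`` parameter are assumptions made for illustration and
are not the filters shipped in ``bzrlib.filters``.  The read filter
shows one way to cope with a CRLF pair that is split across two chunks::

    def crlf_to_lf(chunks, context=None):
        """Read filter: convenient form (CRLF) -> canonical form (LF)."""
        pending_cr = False
        for chunk in chunks:
            if pending_cr:
                # Re-attach the CR held back from the previous chunk.
                chunk = b"\r" + chunk
                pending_cr = False
            if chunk.endswith(b"\r"):
                # The matching LF may be at the start of the next chunk,
                # so hold this CR back until we have seen it.
                chunk = chunk[:-1]
                pending_cr = True
            if chunk:
                yield chunk.replace(b"\r\n", b"\n")
        if pending_cr:
            # A lone CR at the very end of the content is kept as-is.
            yield b"\r"


    def lf_to_crlf(chunks, context=None):
        """Write filter: canonical form (LF) -> convenient form (CRLF)."""
        for chunk in chunks:
            yield chunk.replace(b"\n", b"\r\n")

Feeding the chunks ``b"a\r"`` and ``b"\nb"`` to ``crlf_to_lf`` yields
content that joins to ``b"a\nb"``.  The pair is not symmetric: a
working file containing a bare LF, such as ``b"a\nb"``, comes back as
``b"a\r\nb"`` after a read followed by a write.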

These are at the moment strictly *content* filters: they can't make
changes to the tree like changing the execute bit, file types, or
adding/removing entries.

Conventions
***********

bzrlib interfaces that aren't explicitly specified to deal with the
convenient form should return the canonical form.  Whenever we have the
SHA1 hash of a file, it's the hash of the canonical form.


Dirstate interactions
*********************

The dirstate file should store, in the column for the working copy, the
cached hash and size of the canonical form, and the packed stat
fingerprint for which that cache is valid.  This implies that the stored
size will in general be different from the size in the packed stat.
(However, the current implementation may not always do this correctly -
see <https://bugs.launchpad.net/bzr/+bug/418439>.)
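
The caching scheme described above can be sketched roughly as follows.
This is not the dirstate code: ``stat_fingerprint``,
``canonical_sha1_and_size`` and ``CanonicalCache`` are hypothetical
names, the fields used in the fingerprint are an assumption, and the
canonical form is assumed to be the file content with CRLF converted to
LF::

    import hashlib
    import os


    def stat_fingerprint(path):
        """Hypothetical fingerprint; the choice of fields is an assumption."""
        st = os.stat(path)
        return (st.st_size, st.st_mtime_ns, st.st_ino)


    def canonical_sha1_and_size(path):
        """Hash and size after CRLF -> LF conversion (whole file in memory)."""
        with open(path, "rb") as f:
            canonical = f.read().replace(b"\r\n", b"\n")
        return hashlib.sha1(canonical).hexdigest(), len(canonical)


    class CanonicalCache:
        """Cache canonical (sha1, size), valid for one stat fingerprint."""

        def __init__(self, compute):
            self._compute = compute   # path -> (sha1 hex digest, size)
            self._entries = {}        # path -> (fingerprint, (sha1, size))
            self.misses = 0

        def lookup(self, path):
            fingerprint = stat_fingerprint(path)
            entry = self._entries.get(path)
            if entry is not None and entry[0] == fingerprint:
                # The file looks unchanged, so the cached values still apply.
                return entry[1]
            self.misses += 1
            value = self._compute(path)
            self._entries[path] = (fingerprint, value)
            return value

Note that the cached size is the canonical size, while the first field
of the fingerprint is the on-disk size, so the two numbers differ
whenever a filter changes the length of the content.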

The dirstate is given a SHA1Provider instance by its tree.  This class
can calculate the (canonical) hash and size given a filename.  This
provides a hook by which the working tree can make sure that when the
dirstate needs to get the hash of the file, it takes the filters into
account.
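
To make the idea of such a hook concrete, here is a small sketch of a
provider that streams a file through a read filter and reports the hash
and size of the result.  ``FilteringSHA1Provider``, its
``sha1_and_size`` method and the deliberately simplified ``drop_cr``
filter are hypothetical; they illustrate the approach and do not
reproduce the bzrlib class or its method names::

    import hashlib


    def drop_cr(chunks):
        """Simplified read filter: remove every CR byte (chunk-safe)."""
        for chunk in chunks:
            yield chunk.replace(b"\r", b"")


    class FilteringSHA1Provider:
        """Hypothetical provider of the canonical hash and size of a file."""

        def __init__(self, read_filter, chunk_size=65536):
            self._read_filter = read_filter
            self._chunk_size = chunk_size

        def _file_chunks(self, path):
            # Read the convenient form from disk in fixed-size chunks.
            with open(path, "rb") as f:
                while True:
                    chunk = f.read(self._chunk_size)
                    if not chunk:
                        return
                    yield chunk

        def sha1_and_size(self, path):
            """Return (sha1 hex digest, size) of the canonical form."""
            digest = hashlib.sha1()
            size = 0
            for chunk in self._read_filter(self._file_chunks(path)):
                digest.update(chunk)
                size += len(chunk)
            return digest.hexdigest(), size

Because the filter is applied chunk by chunk, the provider never needs
to hold the whole file in memory.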


User interface
**************

Most commands that deal with the text of files present the
canonical form.  Some have options to choose which form is shown.


Performance considerations
**************************

Content filters can have serious performance implications.  For example,
getting the size of (the canonical form of) a file is easy and fast when
there are no content filters: we simply stat it.  However, when there
are filters that might change the size of the file, determining the
length of the canonical form requires reading in and filtering the whole
file.
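
The difference in cost can be seen in a short sketch.  The helper names
``file_chunks`` and ``canonical_size`` are hypothetical, and
``read_filter`` is assumed to be a callable that maps an iterable of
byte chunks to an iterable of byte chunks::

    import os


    def file_chunks(path, chunk_size=65536):
        """Yield the on-disk content of a file in fixed-size chunks."""
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    return
                yield chunk


    def canonical_size(path, read_filter=None):
        """Return the size of the canonical form of the file at path."""
        if read_filter is None:
            # No filter: the canonical form is the file itself, so a
            # single stat call is enough.
            return os.stat(path).st_size
        # With a filter, the whole file must be read and filtered.
        return sum(len(chunk) for chunk in read_filter(file_chunks(path)))

The unfiltered branch costs one system call regardless of file size,
whereas the filtered branch is linear in the size of the file.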

Formats from 1.14 onwards support content filtering, so having fast
paths for the case where content filtering is not possible is not
generally worthwhile.  In fact, such fast paths are probably harmful,
because they add extra cases that must be covered by tests and tuned
for performance.

The code needs to be fast even when filters are in use, and can then
possibly do a bit less work when there are no filters configured.


Future ideas and open issues
****************************

* We might benefit from having filters declare some of their properties
  statically, for example that they're deterministic or can round-trip
  or won't change the length of the file.  However, common cases like
  crlf conversion are not guaranteed to round-trip and may change the
  length, so perhaps adding separate cases will just complicate the code
  and tests.  So overall this does not seem worthwhile.

* In a future workingtree format, it might be better not to separately
  store the working-copy hash and size, but rather just a stat fingerprint
  at which point it was known to have the same canonical form as the
  basis tree.

* It may be worthwhile to have a virtual Tree-like object that does
  filtering, so there's a clean separation of filtering from the on-disk
  state and the meaning of any object is clear.  This would have some
  risk of bugs where either code holds the wrong object, or their state
  becomes inconsistent.

  This would be useful in allowing you to get a filtered view of a
  historical tree, e.g. to export it or diff it.  At the moment export
  needs to have its own code to do the filtering.

  The convenient-form tree would talk to disk, and the canonical-form
  tree would sit on top of that and be used by most other bzr code.

  If we do this, we'd need to handle the fact that the on-disk tree,
  which generally deals with all of the IO and generally works entirely
  in convenient form, would also need to be told the canonical hash to
  store in the dirstate.  This can perhaps be handled by the
  SHA1Provider or a similar hook.

* Content filtering at the moment is a bit specific to on-disk trees:
  for instance ``SHA1Provider`` goes directly to disk, but it seems like
  this is not necessary.


See also
********

* http://wiki.bazaar.canonical.com/LineEndings

* http://wiki.bazaar.canonical.com/LineEndings/Roadmap

* `Developer Documentation <index.html>`_

* ``bzrlib.filters``

.. vim: ft=rst tw=72
