4632.2.4
by Martin Pool
Some developer docs about content filtering |

*****************
Content Filtering
*****************

Content filtering is the feature by which Bazaar can do line-ending
conversion or keyword expansion so that the files that appear in the
working tree are not precisely the same as the files stored in the
repository.

This document describes the implementation; see the user guide for how to
use it.


We distinguish between the *canonical form*, which is stored in the
repository, and the *convenient form*, which is stored in the working tree.
The convenient form will, for example, use OS-local newline conventions or
have keywords expanded, and the canonical form will not. We use these
names rather than, e.g., "filtered" and "unfiltered" because filters are
applied when both reading and writing, so those names might cause
confusion.

Content filtering is only active on working trees that support it, which
is format 2a and later.

Content filtering is configured by rules that match file patterns.

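As a rough illustration of pattern-based rules, the sketch below looks up
filter preferences for a path by matching its basename against glob
patterns. The ``RULES`` table, the preference names and the
``preferences_for`` function are assumptions made up for this example;
they are not Bazaar's actual rules format or API.

.. code-block:: python

    import fnmatch

    # Hypothetical rule table: (file pattern, filter preferences).
    # The preference names are illustrative only.
    RULES = [
        ("*.txt", {"eol": "native"}),
        ("*.sh", {"eol": "lf"}),
    ]


    def preferences_for(path, rules=RULES):
        """Return the preferences of the first rule matching path."""
        basename = path.rsplit("/", 1)[-1]
        for pattern, prefs in rules:
            if fnmatch.fnmatchcase(basename, pattern):
                return prefs
        # No rule matched: no filtering is requested for this file.
        return {}

In this sketch the first matching rule wins and a file that matches no
pattern gets no filters at all.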

Filters
*******

Filters come in pairs: a read filter (reading convenient->canonical) and
a write filter (writing canonical->convenient). There is no requirement
that they be symmetric or that they be deterministic from the input,
though in general both these properties will be true. Filters are
allowed to change the size of the content, and things like line-ending
conversion commonly will.

Filters are fed a sequence of byte chunks (so that they don't have to
hold the whole file in memory). There is no guarantee that the chunks
will be aligned with line endings. Write filters are passed a context
object through which they can obtain information such as which file
they're working on. (See the ``bzrlib.filters`` docstring.)

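Because chunk boundaries need not fall on line endings, a line-ending
filter has to cope with a CRLF pair that is split across two chunks. The
following is a minimal standalone sketch of a filter pair, not the
``bzrlib.filters`` implementation: the function names are hypothetical
and the context object is left out for brevity.

.. code-block:: python

    def to_canonical(chunks):
        """Read filter sketch: convert CRLF to LF, chunk by chunk."""
        pending_cr = False
        for chunk in chunks:
            if pending_cr:
                # Re-attach the "\r" held back from the previous chunk.
                chunk = b"\r" + chunk
                pending_cr = False
            if chunk.endswith(b"\r"):
                # The matching "\n" may arrive in the next chunk, so hold
                # the trailing "\r" back until we can see what follows.
                chunk = chunk[:-1]
                pending_cr = True
            if chunk:
                yield chunk.replace(b"\r\n", b"\n")
        if pending_cr:
            # The input ended with a lone "\r": pass it through unchanged.
            yield b"\r"


    def to_convenient(chunks):
        """Write filter sketch: convert LF to CRLF, chunk by chunk."""
        for chunk in chunks:
            yield chunk.replace(b"\n", b"\r\n")

Feeding the chunks ``[b"a\r", b"\nb\r\n"]`` through ``to_canonical``
produces output that joins to ``b"a\nb\n"``, although the first CRLF is
split across the two chunks. The pair is also not symmetric: a working
file with bare LF line endings comes back from a read followed by a
write with CRLF endings and a different size.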

These are at the moment strictly *content* filters: they can't make
changes to the tree like changing the execute bit, file types, or
adding/removing entries.

Conventions
***********

bzrlib interfaces that aren't explicitly specified to deal with the
convenient form should return the canonical form. Whenever we have the
SHA1 hash of a file, it's the hash of the canonical form.


Dirstate interactions
*********************

The dirstate file should store, in the column for the working copy, the cached
hash and size of the canonical form, and the packed stat fingerprint for
which that cache is valid. This implies that the stored size will
in general be different to the size in the packed stat. (However, it
may not always do this correctly - see
<https://bugs.launchpad.net/bzr/+bug/418439>.)

The dirstate is given a ``SHA1Provider`` instance by its tree. This class
can calculate the (canonical) hash and size given a filename. This
provides a hook by which the working tree can make sure that when the
dirstate needs to get the hash of the file, it takes the filters into
account.

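One way to picture such a provider is a small class that streams the
on-disk (convenient) file through a read filter and hashes the result.
This is only a sketch under stated assumptions: ``FilteredSHA1Provider``
and its method names are hypothetical and do not reproduce the bzrlib
class or its signatures.

.. code-block:: python

    import hashlib


    class FilteredSHA1Provider(object):
        """Hypothetical provider hashing the canonical form of a file."""

        def __init__(self, read_filter, chunk_size=65536):
            # read_filter: callable taking an iterable of byte chunks in
            # convenient form and yielding chunks in canonical form.
            self._read_filter = read_filter
            self._chunk_size = chunk_size

        def _file_chunks(self, path):
            # Yield the file in fixed-size chunks so that it never has
            # to be held in memory as a whole.
            with open(path, "rb") as f:
                while True:
                    chunk = f.read(self._chunk_size)
                    if not chunk:
                        return
                    yield chunk

        def sha1_and_size(self, path):
            """Return (hex SHA1, size) of the canonical form of path."""
            digest = hashlib.sha1()
            size = 0
            for chunk in self._read_filter(self._file_chunks(path)):
                digest.update(chunk)
                size += len(chunk)
            return digest.hexdigest(), size

The hash and the size are computed in the same pass, so both values
describe the canonical form even when the size reported by ``stat`` for
the working file differs.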

User interface
**************

Most commands that deal with the text of files present the
canonical form. Some have options to choose which form to present.


Performance considerations
**************************

Content filters can have serious performance implications. For example,
getting the size of (the canonical form of) a file is easy and fast when
there are no content filters: we simply stat it. However, when there
are filters that might change the size of the file, determining the
length of the canonical form requires reading in and filtering the whole
file.

Formats from 1.14 onwards support content filtering, so having fast
paths for the case where content filtering is not possible is not
generally worthwhile. In fact, they're probably harmful, because they
add extra edge cases to test coverage and performance.

We need to have things be fast even when filters are in use and then
possibly do a bit less work when there are no filters configured.

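The cost difference described in this section can be sketched as
follows. The ``canonical_size`` function and its ``read_filters``
argument are assumptions for illustration, not bzrlib code: with no
filters configured a ``stat`` call is enough, while any configured
filter forces a full read of the file.

.. code-block:: python

    import os


    def canonical_size(path, read_filters, chunk_size=65536):
        """Return the size in bytes of the canonical form of path.

        read_filters is a (possibly empty) list of callables, each
        taking an iterable of byte chunks and yielding new chunks.
        """
        if not read_filters:
            # No filters configured: the canonical and convenient forms
            # are identical, so the stat size is the answer.
            return os.stat(path).st_size

        def file_chunks():
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(chunk_size), b""):
                    yield chunk

        # Filters may change the length, so the whole file has to be
        # read and filtered before the size is known.
        chunks = file_chunks()
        for read_filter in read_filters:
            chunks = read_filter(chunks)
        return sum(len(chunk) for chunk in chunks)

The only special case here is an empty filter list on a tree that does
support filtering; there is no separate path for formats where
filtering is impossible.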

Future ideas and open issues
****************************

* We might benefit from having filters declare some of their properties
  statically, for example that they're deterministic or can round-trip
  or won't change the length of the file. However, common cases like
  crlf conversion are not guaranteed to round-trip and may change the
  length, so perhaps adding separate cases will just complicate the code
  and tests. So overall this does not seem worthwhile.

* In a future workingtree format, it might be better not to separately
  store the working-copy hash and size, but rather just a stat fingerprint
  at which point it was known to have the same canonical form as the
  basis tree.

* It may be worthwhile to have a virtual Tree-like object that does
  filtering, so there's a clean separation of filtering from the on-disk
  state and the meaning of any object is clear. This would have some
  risk of bugs where either code holds the wrong object, or their state
  becomes inconsistent.

  This would be useful in allowing you to get a filtered view of a
  historical tree, e.g. to export it or diff it. At the moment export
  needs to have its own code to do the filtering.

  The convenient-form tree would talk to disk, and the canonical-form
  tree would sit on top of that and be used by most other bzr code.

  If we do this, we'd need to handle the fact that the on-disk tree,
  which generally deals with all of the IO and generally works entirely
  in convenient form, would also need to be told the canonical hash to
  store in the dirstate. This can perhaps be handled by the
  ``SHA1Provider`` or a similar hook.

* Content filtering at the moment is a bit specific to on-disk trees:
  for instance ``SHA1Provider`` goes directly to disk, but it seems like
  this is not necessary.


See also
********

* http://wiki.bazaar.canonical.com/LineEndings

* http://wiki.bazaar.canonical.com/LineEndings/Roadmap

* `Developer Documentation <index.html>`_

* ``bzrlib.filters``

.. vim: ft=rst tw=72