~bzr-pqm/bzr/bzr.dev

2664.9.1 by Martin Pool
Add notes on file and directory last modified (with John)
1
==============================
2
Computing last_modified values
3
==============================
4
5
Introduction
6
------------
7
8
Bazaar (through at least 0.19) computes a ``last_modified``
9
attribute for all inventory entries and stores it at commit time.
10
This is the ``revision_id`` that last changed or merged the file.  It is
11
used in knit and weave repositories to look up the file text, and to index
12
into the file graph.  It's also used to determine which revisions of the
13
file text to pull during ``fetch``.
14
15
This data is not natively stored by most other systems so we need to
16
synthesize it during conversion.
17
18
This is a case of non-core data that we might wish to treat as cached,
19
rather than always stored.
20
21
Definition
22
----------
23
24
Take the set of all "heads": all the versions of these files in parent
25
trees.
26
27
Reduce the heads by eliminating any whose last_modified is an ancestor of
28
the last_modified of any other head.
29
30
If there is still more than one head, a new last_modified is assigned.
31
This points to the merge point in the file graph.
32
33
If the file text and properties are the same as the sole remaining head, 
34
its last_modified is inherited. Property changes include executable bit,
35
filename, and containing directory.
36
37
Otherwise, a new last_modified is used.
38
39
(This is meant to be the simplest statement, but it may not be the most
40
efficient algorithm; anything that gives equivalent results can be used.)
41
42
43
Generation in commit
44
--------------------
45
46
Commit and converters both need this when writing into Bazaar native
47
formats.
48
49
This is an O(tree) operation because it needs to check for files with
50
multiple heads.  It could be reduced to O(changed_or_merged_files) if that
51
was faster to determine.  So it needs to be fast.
52
53
For the single-parent commit case, we just need to determine which files have
54
changed compared to the parent.  If the file was changed, it gets the
55
revision id of the new revision; otherwise it inherits the value from the
56
parent tree.  
57
58
In the multi-parent commit case (commit of a merge), it can take the value
59
from any of the parent trees, or of the new revision.
60
61
Commit in a dirstate tree should be able to do this more easily by looking
62
at a row of the dirstate to get the per-file parents.  It still needs to
63
look at the revision or file graph information to work out whether heads can be
64
eliminated as previously merged.  At the moment ``find_previous_heads`` works on
65
inventories, so needs to spend considerable effort building whole
66
inventories, including files that are not modified or merged.  (Called
67
from ``record_entry_contents``.)  It might be better to have the commit
68
builder pass in the per-entry parents so that dirstate can generate just
69
those that are necessary.  (See also the spec for
70
``iter_changes_multiple_parents``.)
71
72
If merge used a per-file graph then it would know when one version fully
73
supersedes another, and it could emit only a single parent.  Merge could
74
in fact do this even when not using per-file graphs.  In the current
75
dirstate format we need to store the full data for all trees because they
76
can be extracted from the dirstate, but it could mark some parents as
77
already merged.  
78
79
Alternatively, we could change the dirstate to include
80
only the base and current trees, and cache the merged-in parents
81
elsewhere. 
82
83
(Offtopic other dirstate changes: we could also omit the working-copy
84
hash, and just have a stat-fingerprint of when it was last known equal to
85
the basis revision.  That reduces the amount of data stored and possibly
86
makes it simpler to update, and shouldn't penalize common cases.)
87
88
89
Generation during conversion
90
----------------------------
91
92
Accessing a foreign branch requires synthesizing this information. 
93
If last_modified is removed from a future bzr version, we will also need
94
to synthesize it to pull back to earlier formats.
95
96
Because last_modified is not natively stored in the foreign branches, we
97
want to take advantage of any conversion we've already done, so that we
98
don't need to recursively generate them on every access.  We'd
99
prefer to find a revision that's already converted to a Bazaar inventory
100
within another related repository, such as the target of a conversion.
101
102
103
Avoiding last_modified
104
----------------------
105
106
last_modified is potentially expensive to determine and we may not want to
107
store it in inventories in future.  Therefore we should use it only when
108
necessary:
109
110
* When writing out an inventory format that includes it.
111
112
* In Bazaar formats that use it as a key for the file text or file
113
  ancestry.  This should be hidden behind the Repository/RevisionTree
114
  interface.
115
116
* When a user operation specifically requires the last_modified (e.g.
117
  hypothetical annotate directory).
118
119
We already do this in most cases.
120
121
122
Compared to annotate
123
--------------------
124
125
126
Use cases
127
---------
128
129
Cases to test
130
-------------
131
132
1. Single parent, unmodified file
133
2. Single parent, modified file
134
3. Two parents, one descended from the other, modified in one parent only
135
4. Two parents, one descended from the other, modified in one parent only,
136
   but also modified locally.
137
5. Two parents, not descended from each other, modified in one parent only.
138
6. Two parents, not descended from each other, modified in one parent only,
139
   but also modified locally.
140
7. Two parents, modified in both to different values.
141
8. Two parents, modified in both to the same value.
142
9. Two parents, modified in both, and reverted in both back to the
143
   original text.
144
10. Three parents, modified in only one
145
11. Three parents, modified in only one, also modified locally.
146
12. Three parents, modified in 2
147
13. Three parents, modified in 2, and locally.
148
14. Three parents, modified in 2, but one is a descendant of the other.
149
150
151
152
Performance considerations
153
--------------------------
154
155
Often we'll want the last_modified information for multiple files, perhaps
156
everything in a directory or in a whole tree.  It may be more efficient
157
for the api to accommodate this.  Often the last_modified will be similar
158
for multiple files, and if we process them all at once we can avoid some
159
repeated work in calculating their heads.
160
161
162
Open questions
163
--------------
164
165
* How does caching ``find_heads`` interact with cherry-picks?
166
167
168
169
Possible structure
170
==================
171
172
For a single file, if I am different from all parents, 'new'. (Do not need
173
to evaluate last modified).
174
175
..   
176
  vim: ft=rst tw=74