2621.1.1
by Aaron Bentley
Add performance analysis of diff |
1 |
diff Performance Analysis |
2 |
========================= |
|
3 |
||
4 |
.. contents:: :local: |
|
5 |
||
6 |
Minimal Work |
|
7 |
------------ |
|
8 |
||
9 |
Reuse of historical comparisons |
|
10 |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
|
11 |
A significant part of the work done by diff is sequence matching. This |
|
12 |
scales O(n^2) with the number of lines in the file. Therefore, it |
|
13 |
is worthwile to avoid content comparisons as much as possible. |
|
14 |
||
15 |
Our current knit format contains content comparisons, and this data can |
|
16 |
be converted into lists of matching blocks. Other future formats such as |
|
17 |
mpdiff may also support such conversion. So it is possible to reuse past |
|
18 |
comparisons. |
|
19 |
||
20 |
It is also possible to combine sequential comparisons. So given a comparison |
|
21 |
of "foo" to "bar", and "bar" to "baz", it is possible to derive a comparison of |
|
22 |
"foo" to "baz". |
|
23 |
||
24 |
Reuse of historical comparisons will scale with the number of uncommon |
|
25 |
build-parents between the two historical revisions. This will typically be |
|
26 |
proportional to the amount of change that the file has undergone. Therefore, |
|
27 |
in the common case, reuse of historical comparisons will scale with the |
|
28 |
amount of change. |
|
29 |
||
30 |
The downside of such reuse is that it ties the comparison to the historical |
|
31 |
data. But given the performance improvement, it seems to be worth |
|
32 |
consideration. Fresh comparisons can be performed if the user requests them. |
|
33 |
||
34 |
It may also be possible to accelerate comparisons by including annotation data, |
|
35 |
thus increasing the number of unique lines. |
|
36 |
||
37 |
Historical Tree Against Historical Tree |
|
38 |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
|
39 |
This operation should be strictly proportional to the amount of change, because |
|
40 |
a comparison has already been done at commit time. Achieving that performance |
|
41 |
requires the committed data to be properly structured, so that the comparison |
|
42 |
can be extracted and combined with other comparisons. This comparision |
|
43 |
extraction should be possible at the inventory and file-content levels. |
|
44 |
||
45 |
Minimum work: |
|
46 |
||
47 |
1. Extract and combine inventory comparisons |
|
48 |
2. Extract and combine text comparisions for modified texts |
|
49 |
||
50 |
Basis Against Historical Tree |
|
51 |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
|
52 |
This is another case of Historical Tree Against Historical Tree. |
|
53 |
||
54 |
Basis Against Basis |
|
55 |
~~~~~~~~~~~~~~~~~~~ |
|
56 |
This is another case of Historical Tree Against Historical Tree. |
|
57 |
||
58 |
Working Tree Against Basis |
|
59 |
~~~~~~~~~~~~~~~~~~~~~~~~~~ |
|
60 |
This must scale with the number of versioned files, unless the user indicates |
|
61 |
that only certain files should be compared. |
|
62 |
||
63 |
Performance can be further improved by caching comparisons to avoid repeating |
|
4853.1.1
by Patrick Regan
Removed trailing whitespace from files in doc directory |
64 |
them. Caching could potentially be performed by ``diff`` and perhaps by |
2621.1.1
by Aaron Bentley
Add performance analysis of diff |
65 |
``merge``. Merge is aware of the relationship of a text merge's result to |
66 |
the THIS value, and the THIS value is generally the basis value. So the |
|
67 |
comparison is latent, but present. The only issue is extracting it. |
|
68 |
||
69 |
The cache could be indexed by sha1sum pairs. It could also be indexed by |
|
70 |
file-id, to facilitate removal of stale data. |
|
71 |
||
72 |
Minimum work: |
|
73 |
||
74 |
1. Scan working tree for modified files |
|
75 |
2. Retrieve cached comparisons |
|
76 |
3. Perform comparisons on files with no cached comparisons |
|
77 |
4. Cache comparisons for files with no cached comparisons |
|
78 |
||
79 |
Working Tree Against Historical Tree |
|
80 |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
|
81 |
This can be structured as a comparison of working tree against basis tree, |
|
82 |
followed by basis tree against historical tree. Therefore, it combines the |
|
83 |
performance characteristics of "Working Tree Against Basis" with "Basis Against |
|
84 |
Historical Tree". |
|
85 |
||
86 |
Working Tree Against Working Tree |
|
87 |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
|
88 |
This can be structured as two comparisons against basis, and one comparison |
|
89 |
of basis against basis. Its performance is therefore similar to Working Tree |
|
90 |
Against Historical Tree. |
|
91 |
||
92 |
API Changes |
|
93 |
----------- |
|
94 |
||
95 |
Desired API: |
|
96 |
||
97 |
- Tree.get_comparision(file_id, tree) |
|
98 |
||
99 |
This probably entails: |
|
100 |
||
101 |
- WorkingTree.store_comparison(file_id, revision_id, sha1, comparison) |
|
102 |
- WorkingTree.get_comparison(file_id, revision_id, sha1) |
|
103 |
- Repository.get_comparision(file_id, revision_id, revision_id) |
|
104 |
- merge_comparisions(comparison, comparision) |
|
105 |
||
106 |
Storage considerations |
|
107 |
---------------------- |
|
108 |
It must be cheap (e.g. scale with number of intermediate revisions) to perform |
|
109 |
comparison of two historical texts. It must be cheap to perform comparison of |
|
110 |
the inventories of two historical trees. |