1185.1.29
by Robert Collins
merge merge tweaks from aaron, which includes latest .dev |
1 |
Use of hashes in Bazaar-NG |
2 |
************************** |
|
3 |
||
4 |
* http://infohost.nmt.edu/~val/review/hash.html |
|
5 |
* http://infohost.nmt.edu/~val/review/hash2.html |
|
6 |
||
7 |
The main attraction of hashes in bazaar-ng is as an easy way to get |
|
8 |
universally-unique IDs, or at least with a low chance of collision: |
|
9 |
||
10 |
The first paper is a bit paranoid; the second has some sensible |
|
11 |
advice: |
|
12 |
||
13 |
1. Will compare-by-hash provide significant benefit -- save time, |
|
14 |
bandwidth, etc? |
|
15 |
||
16 |
2. Is the system usable if hash collisions can be generated at will? |
|
17 |
||
18 |
3. Can the hashes be regenerated with a different algorithm at any |
|
19 |
time? |
|
20 |
||
21 |
We should try to abide by these rules. I think they are possibly too |
|
22 |
paranoid -- a real break of SHA-1 would have much wider security |
|
23 |
implications -- but if a design that respects them is practical, it |
|
24 |
should be preferred. |
|
25 |
||
26 |
The first is probably true; the third is just a matter of making sure |
|
27 |
we allow for the choice of hash to be varied in the format. |
|
28 |
||
29 |
There are actually two variations on the second: |
|
30 |
||
31 |
2a. Is the system safe if an attacker can generate hash collisions? |
|
32 |
||
33 |
2b. Is the system safe if a user's own files contain collisions. |
|
34 |
||
35 |
Regardless of cryptographic weakness, SHA-1 is unlikely to |
|
36 |
"accidentally" collide, but it's possible that someone will |
|
37 |
intentionally generate collisions (in research on SHA) and then want |
|
38 |
to store them. It would be unfortunate if that did not work. |
|
39 |
||
40 |
An advantage of naming by hash is that it lets us store only a single |
|
41 |
copy of identical files, but we have already decided__ that disk space |
|
42 |
is pretty cheap. It is perhaps enough to have a single copy of files |
|
43 |
that do not change from one tree revision to the next. |
|
44 |
||
45 |
__ costs.html |
|
46 |
||
47 |
As far as an attacker: we will not automatically trust that ids from |
|
48 |
one branch have the same value in another. It is possible for a |
|
49 |
branch to contain "lies" about its history or contents, but that |
|
50 |
doesn't corrupt anything else. It may confuse or mislead someone who |
|
51 |
looks at the branch, but there is no substitute for human review |
|
52 |
anyhow. |
|
53 |
||
54 |
------- |
|
55 |
||
56 |
The safest position may be to never rely on identifying content by |
|
57 |
hash. Rather, things which need a universally unique ID should get a |
|
58 |
UUID instead. |
|
59 |
||
60 |
This has a slight advantage that the id can be stored directly in the |
|
61 |
object it refers to, when that's useful. |
|
62 |
||
63 |
So a `Revision` holds a UUID for the `Inventory`. |
|
64 |
||
65 |
An inventory holds `InventoryEntry` objects, each with |
|
66 |
||
67 |
* file-id |
|
68 |
* filename (location in tree) |
|
69 |
* type (file, dir, etc) |
|
70 |
* text-id (uuid identifying the text) |
|
71 |
* text-sha1 |
|
72 |
* text-length (for catching bugs) |
|
73 |
* parent-file-id |