~bzr-pqm/bzr/bzr.dev

Viewing changes to doc/hashes.txt

Committer: mbp at sourcefrog
Date: 2005-04-04 09:50:24 UTC
Revision ID: mbp@sourcefrog.net-20050404095024-4646dbcc42eada9e

workaround for python2.3 difflib bug

files added:
.bzrignore

NEWS

README

build-api

bzrlib

bzrlib/__init__.py

bzrlib/add.py

bzrlib/branch.py

bzrlib/check.py

bzrlib/commands.py

bzrlib/diff.py

bzrlib/errors.py

bzrlib/info.py

bzrlib/inventory.py

bzrlib/newinventory.py

bzrlib/osutils.py

bzrlib/revision.py

bzrlib/store.py

bzrlib/tests.py

bzrlib/textui.py

bzrlib/trace.py

bzrlib/tree.py

bzrlib/xml.py

doc/Makefile

doc/adoption.txt

doc/bitkeeper.txt

doc/changelogs.txt

doc/cherry-picking.txt

doc/cmdref.txt

doc/common-format.txt

doc/compared-aegis.txt

doc/compared-codeville.txt

doc/compared-cvsnt.txt

doc/compared-opencm.txt

doc/compared-prcs.txt

doc/compared-teamware.txt

doc/compression.txt

doc/config-specs.txt

doc/conflicts.txt

doc/costs.txt

doc/darcs.txt

doc/deadly-sins.txt

doc/default.css

doc/design.txt

doc/extra-commands.txt

doc/faq.txt

doc/formats.txt

doc/hashes.txt

doc/ignore.txt

doc/index.txt

doc/interrupted.txt

doc/intro.txt

doc/inventory.txt

doc/join-branches.txt

doc/kill-version.txt

doc/layers.txt

doc/library-interface.txt

doc/merge.txt

doc/mirroring.txt

doc/monotone.txt

doc/news.txt

doc/optional-edit.txt

doc/partial-commit.txt

doc/pool.txt

doc/purpose.txt

doc/python.txt

doc/quickref.txt

doc/quilt.txt

doc/random.txt

doc/requirements.txt

doc/revision-syntax.txt

doc/roadmap.txt

doc/rollup.txt

doc/scalability.txt

doc/security.txt

doc/shared-branches.txt

doc/short-demo.txt

doc/supportability.txt

doc/svk.txt

doc/tagging.txt

doc/taxonomy.txt

doc/testing.txt

doc/thanks.txt

doc/todo-from-arch.txt

doc/unchanged.txt

doc/unrelated-merge.txt

doc/usability.txt

doc/use-cases.txt

doc/web-interface.txt

doc/work-order.txt

doc/workflow.txt

doc/yaml.txt

elementtree

elementtree/ElementTree.py

elementtree/__init__.py

notes

notes/new-inventory-sample.xml

notes/performance.txt

setup.py

files removed:
.bzrignore

knit.py

testknit.py

testsweet.py

woolyweave.py

doc/hashes.txt

Use of hashes in Bazaar-NG
**************************

* http://infohost.nmt.edu/~val/review/hash.html
* http://infohost.nmt.edu/~val/review/hash2.html

The main attraction of hashes in bazaar-ng is as an easy way to get
universally-unique IDs, or at least IDs with a low chance of collision.

Of the two papers linked above, the first is a bit paranoid; the
second has some sensible advice:

1. Will compare-by-hash provide significant benefit -- save time,
   bandwidth, etc?

2. Is the system usable if hash collisions can be generated at will?

3. Can the hashes be regenerated with a different algorithm at any
   time?

We should try to abide by these rules.  I think they are possibly too
paranoid -- a real break of SHA-1 would have much wider security
implications -- but if a design that respects them is practical, it
should be preferred.

The answer to the first question is probably yes; the third is just a
matter of making sure we allow for the choice of hash to be varied in
the format.
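One way to leave room for a different hash later is to record the
algorithm name next to every digest.  The sketch below is only an
illustration under that assumption: the ``algorithm:hexdigest`` layout
and the helper names ``tagged_digest`` and ``verify`` are hypothetical
and are not part of any bazaar-ng format.  It uses modern Python 3 and
the standard ``hashlib`` module.

```python
import hashlib


def tagged_digest(data: bytes, algorithm: str = "sha1") -> str:
    """Return '<algorithm>:<hexdigest>' so the algorithm travels with the value."""
    return f"{algorithm}:{hashlib.new(algorithm, data).hexdigest()}"


def verify(data: bytes, tagged: str) -> bool:
    """Recompute the digest with whatever algorithm the tag names."""
    algorithm, _, expected = tagged.partition(":")
    return hashlib.new(algorithm, data).hexdigest() == expected


text = b"hello\n"
old = tagged_digest(text)            # hypothetical default: SHA-1
new = tagged_digest(text, "sha256")  # same text, regenerated with another hash
```

Because a reader dispatches on the tag, values written with the old
algorithm and values regenerated with the new one can coexist while a
branch is being converted.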

There are actually two variations on the second:

2a. Is the system safe if an attacker can generate hash collisions?

2b. Is the system safe if a user's own files contain collisions?

Regardless of cryptographic weakness, SHA-1 is unlikely to
"accidentally" collide, but it's possible that someone will
intentionally generate collisions (in research on SHA) and then want
to store them.  It would be unfortunate if that did not work.

An advantage of naming by hash is that it lets us store only a single
copy of identical files, but we have already decided__ that disk space
is pretty cheap.  It is perhaps enough to have a single copy of files
that do not change from one tree revision to the next.

__ costs.html
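To make the trade-off concrete, here is a minimal sketch of what
"naming by hash" means.  ``HashNamedStore`` is a hypothetical toy class,
not bazaar-ng code; it keeps texts in a dictionary keyed by their SHA-1
so that identical content is stored once.

```python
import hashlib


class HashNamedStore:
    """Toy in-memory store in which the SHA-1 of a text is its name."""

    def __init__(self):
        self._texts = {}

    def add(self, data: bytes) -> str:
        name = hashlib.sha1(data).hexdigest()
        # Identical content maps to the same name, so it is kept only once.
        # The flip side: a *different* text with a colliding SHA-1 would be
        # silently dropped here, which is the concern raised in 2b.
        self._texts.setdefault(name, data)
        return name

    def __len__(self) -> int:
        return len(self._texts)


store = HashNamedStore()
first = store.add(b"unchanged file\n")   # as seen in one tree revision
second = store.add(b"unchanged file\n")  # same content in the next revision
```

Both calls return the same name and the store holds a single copy,
which is the space saving described above.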

As for an attacker: we will not automatically trust that ids from
one branch have the same value in another.  It is possible for a
branch to contain "lies" about its history or contents, but that
doesn't corrupt anything else.  It may confuse or mislead someone who
looks at the branch, but there is no substitute for human review
anyhow.

-------

The safest position may be to never rely on identifying content by
hash.  Rather, things which need a universally unique ID should get a
UUID instead.

This has a slight advantage that the id can be stored directly in the
object it refers to, when that's useful.

So a `Revision` holds a UUID for the `Inventory`.
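The point about storing the id inside the object can be illustrated
with a short sketch.  Everything here is an assumption made for the
example: the dictionary layout, the key names and the use of JSON are
hypothetical and do not describe the real bazaar-ng storage format.

```python
import json
import uuid


def new_inventory(entries):
    """Build an inventory record that carries its own assigned id."""
    return {"inventory-id": str(uuid.uuid4()), "entries": entries}


def new_revision(inventory, message):
    """A revision refers to its inventory by UUID, not by content hash."""
    return {
        "revision-id": str(uuid.uuid4()),
        "inventory-id": inventory["inventory-id"],
        "message": message,
    }


inventory = new_inventory([{"filename": "README"}])
revision = new_revision(inventory, "initial commit")

# The id survives a round trip because it is part of the serialized object.
restored = json.loads(json.dumps(inventory, sort_keys=True))
```

A content hash cannot be embedded this way, since writing the hash into
the object would change the bytes being hashed; an assigned UUID has no
such circularity.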

An inventory holds `InventoryEntry` objects, each with

* file-id
* filename (location in tree)
* type (file, dir, etc)
* text-id (uuid identifying the text)
* text-sha1
* text-length (for catching bugs)
* parent-file-id
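The field list above could be modelled roughly as follows.  This is a
sketch in modern Python 3, not the actual ``bzrlib`` class: the
attribute names, the ``kind`` spelling for "type", the use of UUIDs for
file-id, and the helpers ``make_file_entry`` and ``check_text`` are all
assumptions for illustration.  The text is located by ``text_id``;
``text_sha1`` and ``text_length`` are used only to check it.

```python
import hashlib
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class InventoryEntry:
    file_id: str
    filename: str                  # location in tree
    kind: str                      # "type" above: "file", "dir", etc
    text_id: Optional[str]         # uuid identifying the text
    text_sha1: Optional[str]
    text_length: Optional[int]     # for catching bugs
    parent_file_id: Optional[str]


def make_file_entry(filename: str, text: bytes,
                    parent_file_id: Optional[str] = None) -> InventoryEntry:
    """Create an entry for a regular file, assigning fresh ids."""
    return InventoryEntry(
        file_id=str(uuid.uuid4()),
        filename=filename,
        kind="file",
        text_id=str(uuid.uuid4()),
        text_sha1=hashlib.sha1(text).hexdigest(),
        text_length=len(text),
        parent_file_id=parent_file_id,
    )


def check_text(entry: InventoryEntry, text: bytes) -> bool:
    """Verify a retrieved text against the recorded length and SHA-1."""
    return (len(text) == entry.text_length
            and hashlib.sha1(text).hexdigest() == entry.text_sha1)


entry = make_file_entry("doc/hashes.txt", b"Use of hashes\n")
```

In this arrangement the hash is a consistency check rather than a name,
so two texts whose SHA-1 values collide would still receive distinct
``text_id`` values and could both be stored.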
