6
by mbp at sourcefrog
import all docs from arch |
1 |
Compression |
2 |
*********** |
|
3 |
||
4 |
At the moment bzr just stores the full text of every file state ever. |
|
5 |
(Files unmodified from one revision are stored only once.) |
|
6 |
||
7 |
This is simple to write, and adequate for even reasonably large trees |
|
8 |
with many versions. Disk is very cheap. |
|
9 |
||
10 |
Eventually we might like something more compressed, but this is |
|
11 |
neither an interesting nor urgent problem. (Not "interesting" in the |
|
12 |
sense that doing it is just a matter of coding; there is no |
|
13 |
theoretical problem or risk.) |
|
14 |
||
15 |
There are various possibilities: |
|
16 |
||
17 |
* Store history of each file in RCS, relying on RCS to do line-by-line |
|
18 |
delta compression. (Does not handle binaries very well.) |
|
19 |
||
20 |
OpenCMS paper has a horror story about using RCS for file storage. |
|
21 |
||
22 |
* Store full copies of each file in a container with gzip compression, |
|
23 |
which should fairly efficiently eliminate unchanged areas. This |
|
24 |
works on binaries, and gives compression of file text as a side |
|
25 |
benefit. |
|
26 |
||
27 |
(Note that ``.zip`` will *not* do for this, because every file is |
|
28 |
compressed independently.) |
|
29 |
||
30 |
The OpenCMS paper notes that RCS storage is only 20% more efficient |
|
31 |
than gzip'd storage of individual file versions. |
|
32 |
||
33 |
* Store history of each file in SCCS; allows quick retrieval of any |
|
34 |
previous state and may give more efficient storage than RCS. Allows |
|
35 |
for divergent branches within a single file. |
|
36 |
||
37 |
* Store xdeltas between consecutive file states. |
|
38 |
||
39 |
* Store xdeltas according to a spanning delta algorithm; this probably |
|
40 |
requires files are stored with some kind of sequence number so that |
|
41 |
we can predict related version names. |
|
42 |
||
43 |
* Store in something like XDFS. |
|
44 |
||
45 |
* Any of the above, but with the final storage in some kind of |
|
46 |
database: psql, sqlite, mysql, whatever is convenient. |
|
47 |
||
48 |
It should be something that is safe across system crashes, which |
|
49 |
rules out tdb at the moment. |
|
50 |
||
51 |
---- |
|
52 |
||
53 |
These properties are seen as desirable in darcs and arch: |
|
54 |
||
55 |
* Passive HTTP downloads: without requiring any server-side |
|
56 |
intelligence, a client can get updates from one version to the next |
|
57 |
by requesting a set of self-contained files. |
|
58 |
||
59 |
The number of files necessary to do this must not be unfeasibly |
|
60 |
large, and the size of each of those files should be proportionate |
|
61 |
to the amount of actual change. In other words the data downloaded |
|
62 |
should be of comparable size to the actual edits between the trees. |
|
63 |
||
64 |
||
65 |
* Write-once storage: once a file is written to the repository, it is |
|
66 |
not modified. |