# Copyright (C) 2008 Canonical Ltd
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
#

"""ChunkWriter: write compressed data out with a fixed upper bound."""

import zlib
from zlib import Z_FINISH, Z_SYNC_FLUSH


class ChunkWriter(object):
    """ChunkWriter allows writing of compressed data with a fixed size.

    If less data is supplied than fills a chunk, the chunk is padded with
    NULL bytes. If more data is supplied, then the writer packs as much
    in as it can, but never splits any item it was given.

    The algorithm for packing is open to improvement! Currently it is:
     - write the bytes given
     - if the total seen bytes so far exceeds the chunk size, flush.

    :cvar _max_repack: To fit the maximum number of entries into a node, we
        will sometimes start over and compress the whole list to get tighter
        packing. We get diminishing returns after a while, so this limits the
        number of times we will try.
        In testing, some values for bzr.dev::

                    w/o copy    w/ copy     w/ copy ins w/ copy & save
            repack  time  MB    time  MB    time  MB    time  MB
             1       8.8  5.1    8.9  5.1    9.6  4.4   12.5  4.1
             2       9.6  4.4   10.1  4.3   10.4  4.2   11.1  4.1
             3      10.6  4.2   11.1  4.1   11.2  4.1   11.3  4.1
             4      12.0  4.1
             5      12.6  4.1
            20      12.9  4.1   12.2  4.1   12.3  4.1

        In testing, some values for mysql-unpacked::

                    w/o copy    w/ copy     w/ copy ins w/ copy & save
            repack  time  MB    time  MB    time  MB    time  MB
             1      56.6  16.9              60.7  14.2
             2      59.3  14.1              62.6  13.5  64.3  13.4
             3      64.4  13.5
            20      73.4  13.4

    :cvar _default_min_compression_size: The expected minimum compression
        ratio. While packing nodes into the page, we won't Z_SYNC_FLUSH
        until we have received this much input data. This saves time,
        because we don't bloat the result with SYNC entries (and then need
        to repack), but if it is set too high we will accept data that will
        never fit and trigger a fault later.
    """

    _max_repack = 2
    _default_min_compression_size = 1.8

    def __init__(self, chunk_size, reserved=0):
        """Create a ChunkWriter to write chunk_size chunks.

        :param chunk_size: The total byte count to emit at the end of the
            chunk.
        :param reserved: How many bytes to allow for reserved data. Reserved
            data space can only be written to by calling write() with
            reserved=True.
        """
        self.chunk_size = chunk_size
        self.compressor = zlib.compressobj()
        self.bytes_in = []
        self.bytes_list = []
        self.bytes_out_len = 0
        self.compressed = None
        self.seen_bytes = 0
        self.num_repack = 0
        self.unused_bytes = None
        self.reserved_size = reserved
        self.min_compress_size = self._default_min_compression_size

    def finish(self):
        """Finish the chunk.

        This returns the list of compressed strings for the chunk, the
        bytes that did not fit (or None if everything fit), and the number
        of NULL padding bytes appended.
        """
        self.bytes_in = None  # Free the data cached so far, we don't need it
        out = self.compressor.flush(Z_FINISH)
        self.bytes_list.append(out)
        self.bytes_out_len += len(out)
        if self.bytes_out_len > self.chunk_size:
            raise AssertionError('Somehow we ended up with too much'
                                 ' compressed data, %d > %d'
                                 % (self.bytes_out_len, self.chunk_size))
        # The assertion above guarantees bytes_out_len <= chunk_size, so pad
        # straight up to the chunk boundary. (Taking bytes_out_len modulo
        # chunk_size here would wrongly pad a full extra chunk when the
        # compressed data fits exactly.)
        nulls_needed = self.chunk_size - self.bytes_out_len
        if nulls_needed:
            self.bytes_list.append("\x00" * nulls_needed)
        return self.bytes_list, self.unused_bytes, nulls_needed

    def _recompress_all_bytes_in(self, extra_bytes=None):
        """Recompress the current bytes_in, and optionally more.

        :param extra_bytes: Optional, if supplied we will try to add it with
            Z_SYNC_FLUSH
        :return: (bytes_out, bytes_out_len, compressor)
            bytes_out      is the list of compressed strings returned from
                           the compressor
            bytes_out_len  the total length of those strings
            compressor     an object with everything packed in so far; if
                           extra_bytes was supplied, Z_SYNC_FLUSH has been
                           called on it
        """
        compressor = zlib.compressobj()
        bytes_out = []
        append = bytes_out.append
        compress = compressor.compress
        for accepted_bytes in self.bytes_in:
            out = compress(accepted_bytes)
            if out:
                append(out)
        if extra_bytes:
            out = compress(extra_bytes)
            out += compressor.flush(Z_SYNC_FLUSH)
            if out:
                append(out)
        bytes_out_len = sum(map(len, bytes_out))
        return bytes_out, bytes_out_len, compressor

    def write(self, bytes, reserved=False):
        """Write some bytes to the chunk.

        If the bytes fit, False is returned. Otherwise True is returned
        and the bytes have not been added to the chunk.
        """
        if reserved:
            capacity = self.chunk_size
        else:
            capacity = self.chunk_size - self.reserved_size
        # Check quickly to see if this is likely to put us outside of our
        # budget:
        next_seen_size = self.seen_bytes + len(bytes)
        comp = self.compressor
        if next_seen_size < self.min_compress_size * capacity:
            # No need, we assume this will "just fit"
            out = comp.compress(bytes)
            if out:
                self.bytes_list.append(out)
                self.bytes_out_len += len(out)
            self.bytes_in.append(bytes)
            self.seen_bytes = next_seen_size
        else:
            if self.num_repack >= self._max_repack and not reserved:
                # We already know we don't want to try to fit more
                return True
            # This may or may not fit, try to add it with Z_SYNC_FLUSH
            out = comp.compress(bytes)
            out += comp.flush(Z_SYNC_FLUSH)
            if out:
                self.bytes_list.append(out)
                self.bytes_out_len += len(out)
            if self.bytes_out_len + 10 > capacity:
                # We are over budget, try to squeeze this in without any
                # Z_SYNC_FLUSH calls
                self.num_repack += 1
                (bytes_out, this_len,
                 compressor) = self._recompress_all_bytes_in(bytes)
                if this_len + 10 > capacity:
                    # No way we can add any more, and we need to re-pack
                    # because our compressor is now out of sync.
                    # This seems to be rarely triggered over
                    #   num_repack > _max_repack
                    (bytes_out, this_len,
                     compressor) = self._recompress_all_bytes_in()
                    self.compressor = compressor
                    self.bytes_list = bytes_out
                    self.bytes_out_len = this_len
                    self.unused_bytes = bytes
                    return True
                else:
                    # This fits when we pack it tighter, so use the new packing
                    # There is one Z_SYNC_FLUSH call in
                    # _recompress_all_bytes_in
                    self.compressor = compressor
                    self.bytes_in.append(bytes)
                    self.bytes_list = bytes_out
                    self.bytes_out_len = this_len
            else:
                # It fit, so mark it added
                self.bytes_in.append(bytes)
                self.seen_bytes = next_seen_size
        return False