# Copyright (C) 2006 by Canonical Ltd
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
"""DirState objects record the state of a directory and its bzr metadata.

Pseudo EBNF grammar for the disk format:
MINIKIND = "f" | "d" | "l";
WHOLE NUMBER = {digit}, digit;
dirstate format = header line, full checksum, row count, parent details,
header line = "#bazaar dirstate flat format 1", NL;
full checksum = "adler32: ", ["-"], WHOLE NUMBER, NL;
row count = "num_entries: ", digit, NL;
parent_details = WHOLE NUMBER, NULL, NL; XXX: complete this line
rows = dirname, NULL, basename, NULL, MINIKIND, NULL, fileid_utf8, NULL,
    WHOLE NUMBER (* size *), NULL, packed stat, NULL, symlink value,
PARENT ROW = NULL, revision_utf8, NULL, MINIKIND, NULL, dirname, NULL,
    basename, NULL, WHOLE NUMBER (* size *), NULL, "y" | "n", NULL,
1) Fast end to end use for bzr's top 5 use cases.
2) Fall back to the current object model as needed.
3) Scale usably to the largest trees known today - say 50K entries.
Eventually reuse dirstate objects across locks IFF the dirstate file has not
been modified, but this will require that we flush/ignore cached stat-hit data
because we won't want to restat all files on disk just because a lock was
acquired, yet we cannot trust the data after the previous lock was released.
Memory representation:
vector of all directories, and vector of the children?
root_row = (direntry for root, [parent_direntries_for_root]),
('', ['data for achild', 'data for bchild', 'data for cchild'])
('dir', ['achild', 'cchild', 'echild'])
- single bisect to find N subtrees from a path spec
- in-order for serialisation - this is 'dirblock' grouping.
- insertion of a file '/a' affects only the '/' child-vector, that is, to
  insert 10K elements from scratch does not generate O(N^2) memmoves of a
  single vector, rather of each individual child-vector, which tends to be
  limited to a manageable number. Will scale badly on trees with 10K entries
  in a single directory. Compare with Inventory.InventoryDirectory which has
  a dictionary for the children. No bisect capability, can only probe for
  exact matches, or grab all elements and sort.
- What's the risk of error here? Once we have the base format being processed
  we should have a net win regardless of optimality. So we are going to
  go with what seems reasonable.
maybe we should do a test profile of these core structures - 10K simulated searches/lookups/etc?
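The single-bisect lookup described above can be sketched as follows. The dirblock layout is the illustrative one from the memory-representation example, and `children_of` is a hypothetical helper, not part of this module's API:

```python
import bisect

# Dirblocks kept sorted by dirname, as described above.
dirblocks = [
    ('', ['data for achild', 'data for bchild', 'data for cchild']),
    ('dir', ['achild', 'cchild', 'echild']),
    ('dir/sub', ['one']),
]
# A parallel sorted key vector lets bisect find a block in O(log n).
dirnames = [block[0] for block in dirblocks]

def children_of(dirname):
    """Return the child list for dirname, or None if it has no block."""
    pos = bisect.bisect_left(dirnames, dirname)
    if pos < len(dirnames) and dirnames[pos] == dirname:
        return dirblocks[pos][1]
    return None

print(children_of('dir'))  # ['achild', 'cchild', 'echild']
```

Unlike a dictionary of children, the sorted vector also supports range probes (all blocks under a prefix), which is what "find N subtrees from a path spec" relies on.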
The lifetime of Dirstate objects is currently per lock, but see above for
possible extensions. The lifetime of a row from a dirstate is expected to be
very short in the optimistic case: which we are optimising for. For instance,
subtree status will determine from analysis of the disk data what rows need to
be examined at all, and will be able to determine from a single row whether
that file has altered or not, so we are aiming to process tens of thousands of
entries each second within the dirstate context, before exposing anything to
the larger codebase. This suggests we want the time for a single file
comparison to be < 0.1 milliseconds. That would give us 10000 paths per second
processed, and to scale to 100 thousand we'll need another order of magnitude
beyond that. Now, as the lifetime for all unchanged entries is the time to
parse, stat the file on disk, and then immediately discard, the overhead of
object creation becomes a significant cost.
Figures: Creating a tuple from 3 elements was profiled at 0.0625
microseconds, whereas creating an object which is subclassed from tuple was
0.500 microseconds, and creating an object with 3 elements and slots was 3
microseconds. 0.1 milliseconds is 100 microseconds, and ideally we'll get
down to 10 microseconds for the total processing - having 33% of that be object
creation is a huge overhead. There is a potential cost in using tuples within
each row which is that the conditional code to do comparisons may be slower
than method invocation, but method invocation is known to be slow due to stack
frame creation, so avoiding methods in these tight inner loops is unfortunately
desirable. We can consider a pyrex version of this with objects in future if
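The figures above can be reproduced in spirit (absolute numbers vary by machine and interpreter) with a timeit sketch like the following; RowTuple and RowSlots are illustrative stand-ins, not names from this module:

```python
import timeit

class RowTuple(tuple):
    """A tuple subclass, matching the 'subclassed from tuple' figure."""

class RowSlots(object):
    """A plain object with __slots__, matching the slots figure."""
    __slots__ = ('dirname', 'basename', 'kind')

    def __init__(self, dirname, basename, kind):
        self.dirname = dirname
        self.basename = basename
        self.kind = kind

# Use variables (not literals) so the tuple isn't constant-folded away.
d, b, k = 'dirname', 'basename', 'file'
plain = timeit.timeit('(d, b, k)', globals=globals(), number=100000)
sub = timeit.timeit('RowTuple((d, b, k))', globals=globals(), number=100000)
slots = timeit.timeit('RowSlots(d, b, k)', globals=globals(), number=100000)
print('tuple: %.4fs  tuple subclass: %.4fs  slots: %.4fs'
      % (plain, sub, slots))
```

On any machine the ordering (plain tuple fastest, slotted object slowest) should match the profile figures above, which is why rows are plain tuples despite the uglier comparison code.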
import bzrlib.inventory
from bzrlib.osutils import pathjoin, sha_file, sha_string, walkdirs
class DirState(object):
    """Record directory and metadata state for fast access.

    A dirstate is a specialised data structure for managing local working
    tree state information. It's not yet well defined whether it is platform
    specific, and if it is, how we detect/parameterise that.
    _kind_to_minikind = {'file': 'f', 'directory': 'd', 'symlink': 'l'}
    _minikind_to_kind = {'f': 'file', 'd': 'directory', 'l': 'symlink'}
    _to_yesno = {True: 'y', False: 'n'} # TODO profile the performance gain
    # of using int conversion rather than a dict here. AND BLAME ANDREW IF
    IN_MEMORY_UNMODIFIED = 1
    IN_MEMORY_MODIFIED = 2

    # _header_state and _dirblock_state represent the current state
    # of the dirstate metadata and the per-row data respectively.
    # NOT_IN_MEMORY indicates that no data is in memory
    # IN_MEMORY_UNMODIFIED indicates that what we have in memory
    # is the same as is on disk
    # IN_MEMORY_MODIFIED indicates that we have a modified version
    # of what is on disk.
    # In future we will add more granularity, for instance _dirblock_state
    # will probably support partially-in-memory as a separate variable,
    # allowing for partially-in-memory unmodified and partially-in-memory
        self._header_state = DirState.NOT_IN_MEMORY
        self._dirblock_state = DirState.NOT_IN_MEMORY
        self._state_file = None
    def add_parent_tree(self, tree_id, tree):
        """Add tree as a parent to this dirstate."""
        self._read_dirblocks_if_needed()
        self._parents.append(tree_id)
        self._header_state = DirState.IN_MEMORY_MODIFIED
            self._ghosts.append(tree_id)
    def from_tree(tree, dir_state_filename):
        """Create a dirstate from a bzr Tree.

        :param tree: The tree which should provide parent information and
        # XXX: aka the big ugly.
        result._state_file = open(dir_state_filename, 'wb+')
        _encode = base64.encodestring
        parent_ids = tree.get_parent_ids()
        num_parents = len(parent_ids)
            raise ValueError('Cannot handle more than 3 parents')
        for parent_id in parent_ids:
            parent_trees.append(tree.branch.repository.revision_tree(parent_id))
        # FIXME: is this utf8 safe?
        to_minikind = DirState._kind_to_minikind
        to_yesno = DirState._to_yesno
        st = os.lstat(tree.basedir)
            , 'directory', tree.inventory.root.file_id.encode('utf8')
            , 0 # no point having a size for dirs.
        for parent_tree in parent_trees:
            root_parents.append((
                parent_tree.inventory.root.revision.encode('utf8'),
        root_row = (root_info, root_parents)
        for dirinfo, block in tree.walkdirs():
            # dirinfo is path, id
            # add the row for this block
            dirblocks.append((dirinfo[0], block_row))
            for relpath, name, kind, st, fileid, versionedkind in block:
                    # unversioned file, skip
                # TODO? factor out this loop body as a helper function ?
                dirname, basename = os.path.split(relpath.encode('utf8'))
                    s = tree.get_file_sha1(fileid, relpath)
                elif kind == 'directory':
                    if name in ('.bzr', '.hg', 'CVS', '.svn', '_svn'):
                        raise Exception('skipping dirs not supported yet')
                        # Skip this, and all children
                        to_remove.append((relpath, name, kind, st, abspath))
                elif kind == 'symlink':
                    # sha value of the link target ?!
                    s = os.readlink(abspath)
                for count in xrange(num_parents):
                    parent_entry = parent_trees[count].inventory[fileid]
                        (parent_entry.revision.encode('utf8'),
                         # FIXME: set these from the parent
                         dirname.encode('utf8'), basename.encode('utf8'),
                         parent_entry.text_size,
                         parent_entry.executable,
                         parent_entry.text_sha1,
                row_data = (dirname.encode('utf8'), basename.encode('utf8'),
                    kind, fileid.encode('utf8'), st.st_size, pack_stat(st),
                block_row.append((row_data, parent_info))
            # It isn't safe to remove entries while we are iterating
            # over the same list, so remove them now
            for entry in to_remove:
        #lines.append(result._get_parents_line(parent_ids))
        #lines.append(result._get_ghosts_line([]))
        result._set_data(parent_ids, root_row, dirblocks)
    def get_ghosts(self):
        """Return a list of the parent tree revision ids that are ghosts."""
        self._read_header_if_needed()
        """Serialise the entire dirstate to a sequence of lines."""
        if (self._header_state == DirState.IN_MEMORY_UNMODIFIED and
            self._dirblock_state == DirState.IN_MEMORY_UNMODIFIED):
            # read what's on disk.
            self._state_file.seek(0)
            return self._state_file.readlines()
        lines.append(self._get_parents_line(self.get_parent_ids()))
        lines.append(self._get_ghosts_line(self._ghosts))
        # append the root line which is special cased
        lines.extend(map(self._row_to_line, self._iter_rows()))
        return self._get_output_lines(lines)
    def _get_ghosts_line(self, ghost_ids):
        """Create a line for the state file for ghost information."""
        return '\0'.join([str(len(ghost_ids))] + ghost_ids)

    def _get_parents_line(self, parent_ids):
        """Create a line for the state file for parents information."""
        return '\0'.join([str(len(parent_ids))] + parent_ids)
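The two helpers above write count-prefixed, NULL-delimited lines, which _read_header later parses back. An illustrative round trip, using hypothetical standalone functions rather than this class's methods (the real parser also has to cope with the line's trailing delimiters, which this sketch omits):

```python
def make_parents_line(parent_ids):
    """Serialise parent ids as count, then ids, all NULL-delimited."""
    return '\0'.join([str(len(parent_ids))] + parent_ids)

def parse_parents_line(line):
    """Invert make_parents_line, checking the count prefix."""
    info = line.split('\0')
    num_parents = int(info[0])
    parent_ids = info[1:1 + num_parents]
    assert len(parent_ids) == num_parents, 'incorrect parent info line'
    return parent_ids

line = make_parents_line(['rev-1', 'rev-2'])
print(parse_parents_line(line))  # ['rev-1', 'rev-2']
```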
    def get_parent_ids(self):
        """Return a list of the parent tree ids for the directory state."""
        self._read_header_if_needed()
    def initialize(path):
        """Create a new dirstate on path.

        The new dirstate will be an empty tree - that is, it has no parents,
        and only a root node - which has id ROOT_ID.

        :param path: The name of the file for the dirstate.
        :return: A DirState object.
        # This constructs a new DirState object on a path, sets the _state_file
        # to a new empty file for that path. It then calls _set_data() with our
        # stock empty dirstate information - a root with ROOT_ID, no children,
        # and no parents. Finally it calls save() to ensure that this data will
        result._state_file = open(path, 'wb+')
        # a new root directory, with a pack_stat (the x's) that is just noise and will
        # never match the output of base64 encode.
        root_row_data = ('', '', 'directory', bzrlib.inventory.ROOT_ID, 0, 'x'*32, '')
        root_row = (root_row_data, root_parents)
        empty_tree_dirblocks = [('', [])] # root dir contents - no entries.
        result._set_data([], root_row, empty_tree_dirblocks)
        result._state_file.close()
    def _iter_rows(self):
        """Iterate over all the row data in the dirstate.

        Each yielded item is a tuple of (row_data, parent_data_list).
        self._read_dirblocks_if_needed()
        for directory in self._dirblocks:
            for row in directory[1]:
    def _get_output_lines(self, lines):
        """Format lines for final output.

        :param lines: A sequence of lines containing the parents list and the
        output_lines = ['#bazaar dirstate flat format 1\n']
        lines.append('') # a final newline
        inventory_text = '\0\n\0'.join(lines)
        output_lines.append('adler32: %s\n' % (zlib.adler32(inventory_text),))
        # -3, 1 for num parents, 1 for ghosts, 1 for final newline
        num_entries = len(lines)-3
        output_lines.append('num_entries: %s\n' % (num_entries,))
        output_lines.append(inventory_text)
        """Construct a DirState on the file at path path."""
        result._state_file = open(path, 'rb+')
    def _read_dirblocks_if_needed(self):
        """Read in all the dirblocks from the file if they are not in memory."""
        self._read_header_if_needed()
        if self._dirblock_state == DirState.NOT_IN_MEMORY:
            # the _state_file pointer will be positioned at the start of the
            text = self._state_file.read()
            # TODO: check the adler checksums. adler_measured = zlib.adler32(text)
            fields = text.split('\0')
            # Remove the last blank entry
            trailing = fields.pop()
            assert trailing == ''
            # consider turning fields into a tuple.
            # skip the first field which is the trailing null from the header.
            field_count = len(fields)
            # Each line now has an extra '\n' field which is not used
            # so we just skip over it
            # number of fields per dir_entry + number of fields per parent_entry + newline
            num_parents = len(self._parents)
            entry_size = 7 + (7 * num_parents) + 1
            expected_field_count = entry_size * self._num_entries
            # is the file too short ?
            assert field_count - cur == expected_field_count, \
                'field count incorrect %s != %s' % (expected_field_count, field_count)
            # Fast path the case where there are 1 or 2 parents
                entries = [(fields[pos:pos+7], []) for pos in xrange(cur, field_count, entry_size)]
            elif num_parents == 1:
                entries = [(fields[pos:pos+7], [fields[pos+7:pos+14],])
                    for pos in xrange(cur, field_count, entry_size)]
            elif num_parents == 2:
                entries = [(fields[pos:pos+7], [
                    fields[pos+7:pos+14],
                    fields[pos+14:pos+21],])
                    for pos in xrange(cur, field_count, entry_size)]
                raise NotImplementedError(self._read_dirblocks_if_needed)
                    [fields[chunk:chunk+7] for chunk in xrange(pos, pos+entry_size-1, 7)])
                    for pos in xrange(cur, field_count, entry_size)
            assert len(entries) == self._num_entries, '%s != %s entries' % (len(entries),
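The slicing scheme in the fast paths above can be written generically; this is an illustrative standalone sketch (not the class's code), showing how a flat field list is cut into (entry_fields, [parent_fields, ...]) tuples by stepping entry_size fields at a time:

```python
def split_entries(fields, num_parents, num_entries, cur=0):
    """Slice a flat NULL-split field list into per-entry tuples.

    Each entry occupies 7 fields, plus 7 per parent, plus one
    separator field (the unused trailing newline field).
    """
    entry_size = 7 + 7 * num_parents + 1
    entries = []
    for pos in range(cur, cur + entry_size * num_entries, entry_size):
        row = fields[pos:pos + 7]
        parents = [fields[pos + 7 + 7 * i:pos + 14 + 7 * i]
                   for i in range(num_parents)]
        entries.append((row, parents))
    return entries

# Two entries with one parent each: 2 * (7 + 7 + 1) = 30 fields.
fields = [str(i) for i in range(30)]
entries = split_entries(fields, num_parents=1, num_entries=2)
print(len(entries))  # 2
```

The hard-coded 0/1/2-parent list comprehensions in the real code are just unrolled versions of this loop, trading generality for speed on the common cases.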
            entry_iter = iter(entries)
            self._root_row = entry_iter.next()
            # convert the minikind to kind
            self._root_row[0][2] = self._minikind_to_kind[self._root_row[0][2]]
            # convert the size to an int
            self._root_row[0][4] = int(self._root_row[0][4])
            # TODO parent conversion
            # TODO dirblock population
            for entry in entry_iter:
    def _read_header(self):
        """This reads in the metadata header, and the parent ids.

        After reading in, the file should be positioned at the null
        just before the start of the first record in the file.

        :return: (expected adler checksum, number of entries, parent list)
        parent_line = self._state_file.readline()
        info = parent_line.split('\0')
        num_parents = int(info[0])
        assert num_parents == len(info)-2, 'incorrect parent info line'
        self._parents = [p.decode('utf8') for p in info[1:-1]]
        ghost_line = self._state_file.readline()
        info = ghost_line.split('\0')
        num_ghosts = int(info[1])
        assert num_ghosts == len(info)-3, 'incorrect ghost info line'
        self._ghosts = [p.decode('utf8') for p in info[2:-1]]
        self._header_state = DirState.IN_MEMORY_UNMODIFIED
    def _read_header_if_needed(self):
        """Read the header of the dirstate file if needed."""
        if self._header_state == DirState.NOT_IN_MEMORY:
    def _read_prelude(self):
        """Read in the prelude header of the dirstate file.

        This only reads in the stuff that is not connected to the adler
        checksum. The position will be correct to read in the rest of
        the file and check the checksum after this point.
        The next entry in the file should be the number of parents,
        and their ids, followed by a newline.
        header = self._state_file.readline()
        assert header == '#bazaar dirstate flat format 1\n', \
            'invalid header line: %r' % (header,)
        adler_line = self._state_file.readline()
        assert adler_line.startswith('adler32: '), 'missing adler32 checksum'
        self.adler_expected = int(adler_line[len('adler32: '):-1])
        num_entries_line = self._state_file.readline()
        assert num_entries_line.startswith('num_entries: '), 'missing num_entries line'
        self._num_entries = int(num_entries_line[len('num_entries: '):-1])
    def _row_to_line(self, row):
        """Serialize row to a NULL delimited line ready for _get_output_lines.

        :param row: A row_tuple as defined in the module docstring.
        entire_row = list(row[0])
        for parent_number, parent_data in enumerate(row[1]):
            # (revision, kind, dirname, basename, size, executable_bool, sha1)
            entire_row.extend(parent_data)
            # minikind conversion of the parent
            parent_offset = 7 + parent_number * 7
            entire_row[parent_offset + 1] = DirState._kind_to_minikind[parent_data[1]]
            entire_row[parent_offset + 4] = str(parent_data[4])
            entire_row[parent_offset + 5] = DirState._to_yesno[parent_data[5]]
        # conversion from memory to disk-ready format:
        # minikind conversion of the current row type.
        entire_row[2] = DirState._kind_to_minikind[entire_row[2]]
        entire_row[4] = str(entire_row[4])
        # minikind of parents
        return '\0'.join(entire_row)
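A worked example of the conversion performed by _row_to_line, using hypothetical standalone data and local copies of the minikind/yesno tables (field values here are made up for illustration): kind fields become minikinds, sizes become strings, booleans become 'y'/'n', and everything is NULL-joined.

```python
_kind_to_minikind = {'file': 'f', 'directory': 'd', 'symlink': 'l'}
_to_yesno = {True: 'y', False: 'n'}

# One row with a single parent row appended after it.
row_data = ['dir', 'name', 'file', 'a-file-id', 1234, 'packedstat', 'sha1']
parent = ['rev-1', 'file', 'dir', 'name', 1234, False, 'sha1']

entire_row = list(row_data) + list(parent)
parent_offset = 7  # first parent starts after the 7 row fields
entire_row[parent_offset + 1] = _kind_to_minikind[entire_row[parent_offset + 1]]
entire_row[parent_offset + 4] = str(entire_row[parent_offset + 4])
entire_row[parent_offset + 5] = _to_yesno[entire_row[parent_offset + 5]]
entire_row[2] = _kind_to_minikind[entire_row[2]]
entire_row[4] = str(entire_row[4])
line = '\0'.join(entire_row)
print(line.split('\0')[2], line.split('\0')[8])  # f f
```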
        """Save any pending changes created during this session."""
        if (self._header_state == DirState.IN_MEMORY_MODIFIED or
            self._dirblock_state == DirState.IN_MEMORY_MODIFIED):
            self._state_file.seek(0)
            self._state_file.writelines(self.get_lines())
            self._state_file.flush()
            self._header_state = DirState.IN_MEMORY_UNMODIFIED
            self._dirblock_state = DirState.IN_MEMORY_UNMODIFIED
    def _set_data(self, parent_ids, root_row, dirblocks):
        """Set the full dirstate data in memory.

        This is an internal function used to completely replace the in-memory
        state. It puts the dirstate into state 'full-dirty'.

        :param parent_ids: A list of parent tree revision ids.
        :param root_row: The root row - a tuple of the root direntry and the
            list of matching direntries from the parent_ids trees.
        :param dirblocks: A list containing one tuple for each directory in the
            tree. Each tuple contains the directory path and a list of
            row data in the same format as root_row.
        # our memory copy is now authoritative.
        self._dirblocks = dirblocks
        self._root_row = root_row
        self._header_state = DirState.IN_MEMORY_MODIFIED
        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
        self._parents = list(parent_ids)
    def set_parent_trees(self, trees, ghosts):
        """Set the parent trees for the dirstate.

        :param trees: A list of revision_id, tree tuples. tree must be provided
            even if the revision_id refers to a ghost: supply an empty tree in
        :param ghosts: A list of the revision_ids that are ghosts at the time
        # TODO regenerate self._dirblocks and self._root_row
        self._read_dirblocks_if_needed()
        self._parents = [rev_id for rev_id, tree in trees]
        self._ghosts = list(ghosts)
        self._header_state = DirState.IN_MEMORY_MODIFIED
        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
def pack_stat(st, _encode=base64.encodestring, _pack=struct.pack):
    """Convert stat values into a packed representation."""
    # jam 20060614 it isn't really worth removing more entries if we
    # are going to leave it in packed form.
    # With only st_mtime and st_mode filesize is 5.5M and read time is 275ms
    # With all entries filesize is 5.9M and read time is maybe 280ms
    # - well within the noise margin
    # base64.encode always adds a final newline, so strip it off
    return _encode(_pack('>llllll'
        , st.st_size, st.st_mtime, st.st_ctime
        , st.st_dev, st.st_ino, st.st_mode))[:-1]
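For reference, a Python 3 sketch of what pack_stat produces. This is illustrative rather than byte-identical: base64.b64encode replaces the long-deprecated encodestring (and adds no trailing newline, so no [:-1] strip is needed), and the six fields are masked to unsigned 32 bits rather than packed signed, since modern inode and device numbers can exceed the '>llllll' range.

```python
import base64
import os
import struct

def pack_stat_demo(st):
    """Pack six stat fields into a fixed 32-character base64 string.

    Fields: size, mtime, ctime, dev, ino, mode - each as a
    big-endian 32-bit value (24 bytes packed, 32 chars encoded).
    """
    packed = struct.pack('>LLLLLL',
                         st.st_size & 0xFFFFFFFF,
                         int(st.st_mtime) & 0xFFFFFFFF,
                         int(st.st_ctime) & 0xFFFFFFFF,
                         st.st_dev & 0xFFFFFFFF,
                         st.st_ino & 0xFFFFFFFF,
                         st.st_mode & 0xFFFFFFFF)
    return base64.b64encode(packed)

print(pack_stat_demo(os.stat('.')))
```

Since 24 bytes is a multiple of 3, the base64 output is always exactly 32 characters with no padding, which keeps every packed-stat field in the state file fixed-width.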