~bzr-pqm/bzr/bzr.dev : revision 2322

#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA

"""DirState objects record the state of a directory and its bzr metadata.

Pseudo EBNF grammar for the state file. Fields are separated by NULLs, and
lines by NL. The field delimiters are omitted in the grammar, line delimiters
are not - this is done for clarity of reading. All string data is in utf8.

MINIKIND = "f" | "d" | "l" | "a" | "r" | "t";
NL = "\n";
NULL = "\0";
WHOLE_NUMBER = {digit}, digit;
BOOLEAN = "y" | "n";
REVISION_ID = a non-empty utf8 string;

dirstate format = header line, full checksum, row count, parent_details,
 ghost_details, entries;
header line = "#bazaar dirstate flat format 2", NL;
full checksum = "crc32: ", ["-"], WHOLE_NUMBER, NL;
row count = "num_entries: ", WHOLE_NUMBER, NL;
parent_details = WHOLE_NUMBER, {REVISION_ID}*, NL;
ghost_details = WHOLE_NUMBER, {REVISION_ID}*, NL;
entries = {entry};
entry = entry_key, current_entry_details, {parent_entry_details};
entry_key = dirname, basename, fileid;
current_entry_details = common_entry_details, working_entry_details;
parent_entry_details = common_entry_details, history_entry_details;
common_entry_details = MINIKIND, fingerprint, size, executable;
working_entry_details = packed_stat;
history_entry_details = REVISION_ID;
executable = BOOLEAN;
size = WHOLE_NUMBER;
fingerprint = a nonempty utf8 sequence with meaning defined by minikind;

Given this definition, the following is useful to know:
entry (aka row) - all the data for a given key.
entry[0]: The key (dirname, basename, fileid)
entry[0][0]: dirname
entry[0][1]: basename
entry[0][2]: fileid
entry[1]: The tree(s) data for this path and id combination.
entry[1][0]: The current tree
entry[1][1]: The second tree

For an entry for a tree, we have (using tree 0 - current tree) to demonstrate:
entry[1][0][0]: minikind
entry[1][0][1]: fingerprint
entry[1][0][2]: size
entry[1][0][3]: executable
entry[1][0][4]: packed_stat
OR (for non tree-0)
entry[1][1][4]: revision_id
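
As an illustration of the indexing above, here is a minimal sketch of one
in-memory row with a current tree and a single parent tree. The file id,
fingerprints, packed_stat and revision id are made-up placeholder values, and
the variable names are only for this example.

```python
# A hypothetical row for the path 'dir/file.txt'. All values are placeholders.
key = ('dir', 'file.txt', 'file-id-1')

# Tree 0 (the current tree): the fifth element is the packed_stat.
current_tree = ('f', 'a' * 40, 12, False, 'x' * 32)
# Tree 1 (a parent tree): the fifth element is the revision id instead.
parent_tree = ('f', 'b' * 40, 10, False, 'rev-1')

entry = (key, [current_tree, parent_tree])

# Unpack the key and the per-tree details using the layout described above.
dirname, basename, file_id = entry[0]
minikind, fingerprint, size, executable, packed_stat = entry[1][0]
revision_id = entry[1][1][4]
```

Plain tuples are used here because the design notes later in this docstring
argue that tuple creation is much cheaper than creating objects per row.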

There may be multiple rows at the root, one per id present in the root, so the
in-memory root row is now:
self._dirblocks[0] -> ('', [entry ...]),
and the entries in there are
entries[0][0]: ''
entries[0][1]: ''
entries[0][2]: file_id
entries[1][0]: The tree data for the current tree for this fileid at /
etc.

Kinds:
'r' is a relocated entry: This path is not present in this tree with this id,
    but the id can be found at another location. The fingerprint is used to
    point to the target location.
'a' is an absent entry: In that tree the id is not present at this path.
'd' is a directory entry: This path in this tree is a directory with the
    current file id. There is no fingerprint for directories.
'f' is a file entry: As for directory, but it's a file. The fingerprint is a
    sha1 value.
'l' is a symlink entry: As for directory, but a symlink. The fingerprint is the
    link target.
't' is a reference to a nested subtree; the fingerprint is the referenced
    revision.
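
The list of kinds can be summarised as a lookup from minikind to the meaning
of the fingerprint field. The sketch below is illustrative only: the names
MINIKIND_FINGERPRINT and describe are hypothetical, and treating 'a' as having
no fingerprint is an assumption, since the text above does not say what an
absent entry stores there.

```python
# Meaning of the fingerprint field per minikind, as described above.
# None means the field carries no information for that kind.
MINIKIND_FINGERPRINT = {
    'f': 'sha1 of the file content',
    'l': 'symlink target',
    't': 'referenced revision',
    'r': 'target location of the id',
    'd': None,  # directories have no fingerprint
    'a': None,  # assumption: absent entries have nothing to fingerprint
}


def describe(tree_details):
    # tree_details is (minikind, fingerprint, size, executable, extra).
    minikind, fingerprint = tree_details[0], tree_details[1]
    meaning = MINIKIND_FINGERPRINT[minikind]
    if meaning is None:
        return '%s: no fingerprint' % minikind
    return '%s: %s = %r' % (minikind, meaning, fingerprint)


directory_summary = describe(('d', '', 0, False, 'x' * 32))
symlink_summary = describe(('l', 'target', 6, False, 'x' * 32))
```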

Ordering:

The entries on disk and in memory are ordered according to the following keys:

    directory, as a list of components
    filename
    file-id
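
A minimal sketch of this ordering, assuming '/' is the path separator inside
dirname. The helper name entry_sort_key and the sample keys are hypothetical.
It shows why the directory is compared as a list of components: a plain string
comparison would put 'a-b' before 'a/b', because '-' sorts before '/'.

```python
def entry_sort_key(key):
    # Compare the directory component by component, then filename, then id.
    dirname, basename, file_id = key
    return (dirname.split('/'), basename, file_id)


keys = [
    ('a-b', 'f', 'id-3'),
    ('a/b', 'f', 'id-2'),
    ('', 'a', 'id-1'),
    ('a', 'z', 'id-4'),
]

# Ordering by components keeps everything under 'a' together.
ordered = sorted(keys, key=entry_sort_key)
# Plain tuple/string ordering, shown only for comparison.
plain = sorted(keys)
```

With the component-based key, ('a/b', 'f', 'id-2') sorts directly after the
entries in 'a', while the plain sort places 'a-b' between 'a' and 'a/b'.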

--- Format 1 had the following different definition: ---
rows = dirname, NULL, basename, NULL, MINIKIND, NULL, fileid_utf8, NULL,
    WHOLE NUMBER (* size *), NULL, packed stat, NULL, sha1|symlink target,
    {PARENT ROW}
PARENT ROW = NULL, revision_utf8, NULL, MINIKIND, NULL, dirname, NULL,
    basename, NULL, WHOLE NUMBER (* size *), NULL, "y" | "n", NULL,
    SHA1

PARENT ROWs are emitted for every parent that is not in the ghosts details
line. That is, if the parents are foo, bar, baz, and the ghosts are bar, then
each row will have a PARENT ROW for foo and baz, but not for bar.
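
The foo/bar/baz example can be written out as a small sketch. The variable
names are hypothetical and this is not how the format 1 code was written; it
only shows which parents end up with a PARENT ROW.

```python
parents = ['foo', 'bar', 'baz']
ghosts = ['bar']

# Parents listed in the ghosts details line get no PARENT ROW.
ghost_set = set(ghosts)
parents_with_rows = [p for p in parents if p not in ghost_set]
```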

In any tree, a kind of 'moved' indicates that the fingerprint field
(which we treat as opaque data specific to the 'kind' anyway) has the
details for the id of this row in that tree.

I'm strongly tempted to add an id->path index as well, but I think that
where we need id->path mapping, we also usually read the whole file, so
I'm going to skip that for the moment, as we have the ability to locate
via bisect any path in any tree, and if we look things up by path, we can
accumulate an id->path mapping as we go, which will tend to match what we
looked for.

I plan to implement this asap, so please speak up now to alter/tweak the
design - and once we stabilise on this, I'll update the wiki page for
it.

The rationale for all this is that we want fast operations for the
common case (diff/status/commit/merge on all files) and extremely fast
operations for the less common but still frequent case (status/diff/commit
on specific files). Operations on specific files involve a scan for all
the children of a path, *in every involved tree*, which the current
format did not accommodate.
----

Design priorities:
 1) Fast end to end use for bzr's top 5 use cases (commit/diff/status/merge/???).
 2) Fall back to the current object model as needed.
 3) Scale usably to the largest trees known today - say 50K entries (mozilla
    is an example of this).

Locking:
 Eventually reuse dirstate objects across locks IFF the dirstate file has not
 been modified, but will require that we flush/ignore cached stat-hit data
 because we won't want to restat all files on disk just because a lock was
 acquired, yet we cannot trust the data after the previous lock was released.

Memory representation:
 vector of all directories, and vector of the children ?
   i.e.
     root_entrie = (direntry for root, [parent_direntries_for_root]),
     dirblocks = [
     ('', ['data for achild', 'data for bchild', 'data for cchild']),
     ('dir', ['achild', 'cchild', 'echild'])
     ]
    - single bisect to find N subtrees from a path spec
    - in-order for serialisation - this is 'dirblock' grouping.
    - insertion of a file '/a' affects only the '/' child-vector, that is, to
      insert 10K elements from scratch does not generate O(N^2) memmoves of a
      single vector, rather each individual, which tends to be limited to a
      manageable number. Will scale badly on trees with 10K entries in a
      single directory. Compare with Inventory.InventoryDirectory which has
      a dictionary for the children. No bisect capability, can only probe for
      exact matches, or grab all elements and sort.
    - What's the risk of error here? Once we have the base format being processed
      we should have a net win regardless of optimality. So we are going to
      go with what seems reasonable.
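
The dirblock idea can be sketched as follows. This is a simplified
illustration, not the DirState implementation: the helper names
find_block_index and insert_child are hypothetical, children are plain strings
rather than full entries, and block names are compared as plain strings rather
than as lists of path components.

```python
import bisect

# One (dirname, sorted children) pair per directory, sorted by dirname.
dirblocks = [
    ('', ['achild', 'bchild', 'cchild']),
    ('dir', ['achild', 'cchild', 'echild']),
]


def find_block_index(dirblocks, dirname):
    # A one-element tuple sorts just before any (dirname, children) pair
    # with the same dirname, so bisect_left lands on that block if present.
    index = bisect.bisect_left(dirblocks, (dirname,))
    present = index < len(dirblocks) and dirblocks[index][0] == dirname
    return index, present


def insert_child(dirblocks, dirname, child):
    index, present = find_block_index(dirblocks, dirname)
    if not present:
        dirblocks.insert(index, (dirname, []))
    children = dirblocks[index][1]
    # Only this directory's child vector is shifted by the insert.
    pos = bisect.bisect_left(children, child)
    if pos == len(children) or children[pos] != child:
        children.insert(pos, child)


insert_child(dirblocks, '', 'a')
```

Inserting 'a' into the root block moves only the root's three children; the
'dir' block is untouched, which is the property the bullet list relies on.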

open questions:

maybe we should do a test profile of these core structures - 10K simulated
searches/lookups/etc?

Objects for each row?
The lifetime of Dirstate objects is currently per lock, but see above for
possible extensions. The lifetime of a row from a dirstate is expected to be
very short in the optimistic case, which we are optimising for. For instance,
subtree status will determine from analysis of the disk data what rows need to
be examined at all, and will be able to determine from a single row whether
that file has altered or not, so we are aiming to process tens of thousands of
entries each second within the dirstate context, before exposing anything to
the larger codebase. This suggests we want the time for a single file
comparison to be < 0.1 milliseconds. That would give us 10000 paths per second
processed, and to scale to 100 thousand we'll need another order of magnitude
to do that. Now, as the lifetime for all unchanged entries is the time to
parse, stat the file on disk, and then immediately discard, the overhead of
object creation becomes a significant cost.

Figures: Creating a tuple from 3 elements was profiled at 0.0625
microseconds, whereas creating an object which is subclassed from tuple was
0.500 microseconds, and creating an object with 3 elements and slots was 3
microseconds long. 0.1 milliseconds is 100 microseconds, and ideally we'll get
down to 10 microseconds for the total processing - having roughly a third of
that be object creation is a huge overhead. There is a potential cost in using
tuples within each row which is that the conditional code to do comparisons
may be slower than method invocation, but method invocation is known to be
slow due to stack frame creation, so avoiding methods in these tight inner
loops is unfortunately desirable. We can consider a pyrex version of this with
objects in future if desired.
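
A rough way to repeat this kind of measurement is sketched below, using the
standard timeit module. The class and function names are hypothetical, the
absolute numbers depend entirely on the interpreter and machine, and each
figure includes the cost of one extra function call, so only the relative
sizes are of interest.

```python
import timeit

A, B, C = 'dir', 'name', 'file-id'


class TupleRow(tuple):
    # An object subclassed from tuple.
    pass


class SlotsRow(object):
    # An object with 3 elements and slots.
    __slots__ = ('dirname', 'basename', 'file_id')

    def __init__(self, dirname, basename, file_id):
        self.dirname = dirname
        self.basename = basename
        self.file_id = file_id


def make_tuple():
    return (A, B, C)


def make_subclass():
    return TupleRow((A, B, C))


def make_slots():
    return SlotsRow(A, B, C)


def measure(func, number=20000):
    # Average time per call in microseconds.
    return timeit.timeit(func, number=number) / number * 1e6


results = dict((f.__name__, measure(f))
               for f in (make_tuple, make_subclass, make_slots))
```

Printing results gives one microsecond figure per construction style, which
can be compared against the per-row budget discussed above.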

"""

import base64
import bisect
import errno
import os
from stat import S_IEXEC
import struct
import sys
import time
import zlib

from bzrlib import (
    errors,
    inventory,
    lock,
    osutils,
    trace,
    )


class _Bisector(object):
    """This just keeps track of information as we are bisecting."""


class DirState(object):
    """Record directory and metadata state for fast access.

    A dirstate is a specialised data structure for managing local working
    tree state information. It's not yet well defined whether it is platform
    specific, and if it is how we detect/parameterise that.

    Dirstates use the usual lock_write, lock_read and unlock mechanisms.
    Unlike most bzr disk formats, DirStates must be locked for reading, using
    lock_read. (This is an os file lock internally.) This is necessary
    because the file can be rewritten in place.

    DirStates must be explicitly written with save() to commit changes; just
    unlocking them does not write the changes to disk.
    """

    _kind_to_minikind = {
        'absent': 'a',
        'file': 'f',
        'directory': 'd',
        'relocated': 'r',
        'symlink': 'l',
        'tree-reference': 't',
        }
    _minikind_to_kind = {
        'a': 'absent',
        'f': 'file',
        'd': 'directory',
        'l': 'symlink',
        'r': 'relocated',
        't': 'tree-reference',
        }
    _to_yesno = {True: 'y', False: 'n'} # TODO profile the performance gain
    # of using int conversion rather than a dict here. AND BLAME ANDREW IF
    # it is faster.

    # TODO: jam 20070221 Figure out what to do if we have a record that exceeds
    #       the BISECT_PAGE_SIZE. For now, we just have to make it large enough
    #       that we are sure a single record will always fit.
    BISECT_PAGE_SIZE = 4096

    NOT_IN_MEMORY = 0
    IN_MEMORY_UNMODIFIED = 1
    IN_MEMORY_MODIFIED = 2

    # A pack_stat (the x's) that is just noise and will never match the output
    # of base64 encode.
    NULLSTAT = 'x' * 32
    NULL_PARENT_DETAILS = ('a', '', 0, False, '')

    HEADER_FORMAT_2 = '#bazaar dirstate flat format 2\n'
    HEADER_FORMAT_3 = '#bazaar dirstate flat format 3\n'


    def __init__(self, path):
        """Create a DirState object.

        Attributes of note:

        :attr _root_entrie: The root row of the directory/file information,
            - contains the path to / - '', ''
            - kind of 'directory',
            - the file id of the root in utf8
            - size of 0
            - a packed stat
            - and no sha information.
        :param path: The path at which the dirstate file on disk should live.
        """
        # _header_state and _dirblock_state represent the current state
        # of the dirstate metadata and the per-row data respectively.
        # NOT_IN_MEMORY indicates that no data is in memory
        # IN_MEMORY_UNMODIFIED indicates that what we have in memory
        #   is the same as is on disk
        # IN_MEMORY_MODIFIED indicates that we have a modified version
        #   of what is on disk.
        # In future we will add more granularity, for instance _dirblock_state
        # will probably support partially-in-memory as a separate variable,
        # allowing for partially-in-memory unmodified and partially-in-memory
        # modified states.
        self._header_state = DirState.NOT_IN_MEMORY
        self._dirblock_state = DirState.NOT_IN_MEMORY
        self._dirblocks = []
        self._ghosts = []
        self._parents = []
        self._state_file = None
        self._filename = path
        self._lock_token = None
        self._lock_state = None
        self._id_index = None
        self._end_of_header = None
        self._cutoff_time = None
        self._split_path_cache = {}
        self._bisect_page_size = DirState.BISECT_PAGE_SIZE


    def __repr__(self):
        return "%s(%r)" % \
            (self.__class__.__name__, self._filename)


    def add(self, path, file_id, kind, stat, fingerprint):
        """Add a path to be tracked.

        :param path: The path within the dirstate - '' is the root, 'foo' is the
            path foo within the root, 'foo/bar' is the path bar within foo
            within the root.
        :param file_id: The file id of the path being added.
        :param kind: The kind of the path, as a string like 'file',
            'directory', etc.
        :param stat: The output of os.lstat for the path.
        :param fingerprint: The sha value of the file,
            or the target of a symlink,
            or the referenced revision id for tree-references,
            or '' for directories.
        """
        # adding a file:
        # find the block it's in.
        # find the location in the block.
        # check it's not there
        # add it.
        #------- copied from inventory.make_entry
        # --- normalized_filename wants a unicode basename only, so get one.
        dirname, basename = osutils.split(path)
        # we don't import normalized_filename directly because we want to be
        # able to change the implementation at runtime for tests.
        norm_name, can_access = osutils.normalized_filename(basename)
        if norm_name != basename:
            if can_access:
                basename = norm_name
            else:
                raise errors.InvalidNormalization(path)
        # you should never have files called . or ..; just add the directory
        # in the parent, or according to the special treatment for the root
        if basename == '.' or basename == '..':
            raise errors.InvalidEntryName(path)
        # now that we've normalised, we need the correct utf8 path and
        # dirname and basename elements. This single encode and split should be
        # faster than three separate encodes.
        utf8path = (dirname + '/' + basename).strip('/').encode('utf8')
        dirname, basename = osutils.split(utf8path)
        assert file_id.__class__ == str, \
            "must be a utf8 file_id not %s" % (type(file_id))
        # Make sure the file_id does not exist in this tree
        file_id_entry = self._get_entry(0, fileid_utf8=file_id)
        if file_id_entry != (None, None):
            path = osutils.pathjoin(file_id_entry[0][0], file_id_entry[0][1])
            kind = DirState._minikind_to_kind[file_id_entry[1][0][0]]
            info = '%s:%s' % (kind, path)
            raise errors.DuplicateFileId(file_id, info)
        first_key = (dirname, basename, '')
        block_index, present = self._find_block_index_from_key(first_key)
        if present:
            # check the path is not in the tree
            block = self._dirblocks[block_index][1]
            entry_index, _ = self._find_entry_index(first_key, block)
            while (entry_index < len(block) and
                   block[entry_index][0][0:2] == first_key[0:2]):
                if block[entry_index][1][0][0] not in 'ar':
                    # this path is in the dirstate in the current tree.
                    raise Exception("adding already added path!")
                entry_index += 1
        else:
            # The block where we want to put the file is not present. But it
            # might be because the directory was empty, or not loaded yet. Look
            # for a parent entry, if not found, raise NotVersionedError
            parent_dir, parent_base = osutils.split(dirname)
            parent_block_idx, parent_entry_idx, _, parent_present = \
                self._get_block_entry_index(parent_dir, parent_base, 0)
            if not parent_present:
                raise errors.NotVersionedError(path, str(self))
            self._ensure_block(parent_block_idx, parent_entry_idx, dirname)
        block = self._dirblocks[block_index][1]
        entry_key = (dirname, basename, file_id)
        if stat is None:
            size = 0
            packed_stat = DirState.NULLSTAT
        else:
            size = stat.st_size
            packed_stat = pack_stat(stat)
        parent_info = self._empty_parent_info()
        minikind = DirState._kind_to_minikind[kind]
        if kind == 'file':
            entry_data = entry_key, [
                (minikind, fingerprint, size, False, packed_stat),
                ] + parent_info
        elif kind == 'directory':
            entry_data = entry_key, [
                (minikind, '', 0, False, packed_stat),
                ] + parent_info
        elif kind == 'symlink':
            entry_data = entry_key, [
                (minikind, fingerprint, size, False, packed_stat),
                ] + parent_info
        elif kind == 'tree-reference':
            entry_data = entry_key, [
                (minikind, fingerprint, 0, False, packed_stat),
                ] + parent_info
        else:
            raise errors.BzrError('unknown kind %r' % kind)
        entry_index, present = self._find_entry_index(entry_key, block)
        assert not present, "basename %r already added" % basename
        block.insert(entry_index, entry_data)

        if kind == 'directory':
            # insert a new dirblock
            self._ensure_block(block_index, entry_index, utf8path)
        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
        if self._id_index:
            self._id_index.setdefault(entry_key[2], set()).add(entry_key)

432

433

def _bisect(self, dir_name_list):

434

"""Bisect through the disk structure for specific rows.

435

436

:param dir_name_list: A list of (dir, name) pairs.

437

:return: A dict mapping (dir, name) => entry for found entries. Missing

438

entries will not be in the map.

439

"""

440

self._requires_lock()

441

# We need the file pointer to be right after the initial header block

442

self._read_header_if_needed()

443

# If _dirblock_state was in memory, we should just return info from

444

# there, this function is only meant to handle when we want to read

445

# part of the disk.

446

assert self._dirblock_state == DirState.NOT_IN_MEMORY

447

448

# The disk representation is generally info + '\0\n\0' at the end. But

449

# for bisecting, it is easier to treat this as '\0' + info + '\0\n'

450

# Because it means we can sync on the '\n'

451

state_file = self._state_file

452

file_size = os.fstat(state_file.fileno()).st_size

453

# We end up with 2 extra fields, we should have a trailing '\n' to

454

# ensure that we read the whole record, and we should have a precursor

455

# '' which ensures that we start after the previous '\n'

456

entry_field_count = self._fields_per_entry() + 1

457

458

low = self._end_of_header

459

high = file_size - 1 # Ignore the final '\0'

460

# Map from (dir, name) => entry

461

found = {}

462

463

# Avoid infinite seeking

464

max_count = 30*len(dir_name_list)

465

count = 0

466

# pending is a list of places to look.

467

# each entry is a tuple of low, high, dir_names

468

# low -> the first byte offset to read (inclusive)

469

# high -> the last byte offset (inclusive)

470

# dir_names -> The list of (dir, name) pairs that should be found in

471

# the [low, high] range

472

pending = [(low, high, dir_name_list)]

473

474

page_size = self._bisect_page_size

475

476

fields_to_entry = self._get_fields_to_entry()

477

478

while pending:

479

low, high, cur_files = pending.pop()

480

481

if not cur_files or low >= high:

482

# Nothing to find

483

continue

484

485

count += 1

486

if count > max_count:

487

raise errors.BzrError('Too many seeks, most likely a bug.')

488

489

mid = max(low, (low+high-page_size)//2)

490

491

state_file.seek(mid)

492

# limit the read size, so we don't end up reading data that we have

493

# already read.

494

read_size = min(page_size, (high-mid)+1)

495

block = state_file.read(read_size)

496

497

start = mid

498

entries = block.split('\n')

499

500

if len(entries) < 2:

501

# We didn't find a '\n', so we cannot have found any records.

502

# So put this range back and try again. But we know we have to

503

# increase the page size, because a single read did not contain

504

# a record break (so records must be larger than page_size)

505

page_size *= 2

506

pending.append((low, high, cur_files))

507

continue

508

509

# Check the first and last entries, in case they are partial, or if

510

# we don't care about the rest of this page

511

first_entry_num = 0

512

first_fields = entries[0].split('\0')

513

if len(first_fields) < entry_field_count:

514

# We didn't get the complete first entry

515

# so move start, and grab the next, which

516

# should be a full entry

517

start += len(entries[0])+1

518

first_fields = entries[1].split('\0')

519

first_entry_num = 1

520

521

if len(first_fields) <= 2:

522

# We didn't even get a filename here... what do we do?

523

# Try a large page size and repeat this query

524

page_size *= 2

525

pending.append((low, high, cur_files))

526

continue

527

else:

528

# Find what entries we are looking for, which occur before and

529

# after this first record.

530

after = start

531

first_dir_name = (first_fields[1], first_fields[2])

532

first_loc = bisect.bisect_left(cur_files, first_dir_name)

533

534

# These exist before the current location

535

pre = cur_files[:first_loc]

536

# These occur after the current location, which may be in the

537

# data we read, or might be after the last entry

538

post = cur_files[first_loc:]

539

540

if post and len(first_fields) >= entry_field_count:

541

# We have files after the first entry

542

543

# Parse the last entry

544

last_entry_num = len(entries)-1

545

last_fields = entries[last_entry_num].split('\0')

546

if len(last_fields) < entry_field_count:

547

# The very last hunk was not complete,

548

# read the previous hunk

549

after = mid + len(block) - len(entries[-1])

550

last_entry_num -= 1

551

last_fields = entries[last_entry_num].split('\0')

552

else:

553

after = mid + len(block)

554

555

last_dir_name = (last_fields[1], last_fields[2])

556

last_loc = bisect.bisect_right(post, last_dir_name)

557

558

middle_files = post[:last_loc]

559

post = post[last_loc:]

560

561

if middle_files:

562

# We have files that should occur in this block

563

# (>= first, <= last)

564

# Either we will find them here, or we can mark them as

565

# missing.

566

567

if middle_files[0] == first_dir_name:

568

# We might need to go before this location

569

pre.append(first_dir_name)

570

if middle_files[-1] == last_dir_name:

571

post.insert(0, last_dir_name)

572

573

# Find out what paths we have

574

paths = {first_dir_name:[first_fields]}

575

# last_dir_name might == first_dir_name so we need to be

576

# careful if we should append rather than overwrite

577

if last_entry_num != first_entry_num:

578

paths.setdefault(last_dir_name, []).append(last_fields)

579

for num in xrange(first_entry_num+1, last_entry_num):

580

# TODO: jam 20070223 We are already splitting here, so

581

# shouldn't we just split the whole thing rather

582

# than doing the split again in add_one_record?

583

fields = entries[num].split('\0')

584

dir_name = (fields[1], fields[2])

585

paths.setdefault(dir_name, []).append(fields)

586

587

for dir_name in middle_files:

588

for fields in paths.get(dir_name, []):

589

# offset by 1 because of the opening '\0'

590

# consider changing fields_to_entry to avoid the

591

# extra list slice

592

entry = fields_to_entry(fields[1:])

593

found.setdefault(dir_name, []).append(entry)

594

595

# Now we have split up everything into pre, middle, and post, and

596

# we have handled everything that fell in 'middle'.

597

# We add 'post' first, so that we prefer to seek towards the

598

# beginning, so that we will tend to go as early as we need, and

599

# then only seek forward after that.

600

if post:

601

pending.append((after, high, post))

602

if pre:

603

pending.append((low, start-1, pre))

604

605

# Consider that we may want to return the directory entries in sorted

606

# order. For now, we just return them in whatever order we found them,

607

# and leave it up to the caller if they care if it is ordered or not.

608

return found

609

610

def _bisect_dirblocks(self, dir_list):

611

"""Bisect through the disk structure to find entries in given dirs.

612

613

_bisect_dirblocks is meant to find the contents of directories, which

614

differs from _bisect, which only finds individual entries.

615

616

:param dir_list: A sorted list of directory names ['', 'dir', 'foo'].

617

:return: A map from dir => entries_for_dir

618

"""

619

# TODO: jam 20070223 A lot of the bisecting logic could be shared

620

# between this and _bisect. It would require parameterizing the

621

# inner loop with a function, though. We should evaluate the

622

# performance difference.

623

self._requires_lock()

624

# We need the file pointer to be right after the initial header block

625

self._read_header_if_needed()

626

# If _dirblock_state was in memory, we should just return info from

627

# there, this function is only meant to handle when we want to read

628

# part of the disk.

629

assert self._dirblock_state == DirState.NOT_IN_MEMORY

630

631

# The disk representation is generally info + '\0\n\0' at the end. But

632

# for bisecting, it is easier to treat this as '\0' + info + '\0\n'

633

# Because it means we can sync on the '\n'

634

state_file = self._state_file

635

file_size = os.fstat(state_file.fileno()).st_size

636

# We end up with 2 extra fields, we should have a trailing '\n' to

637

# ensure that we read the whole record, and we should have a precursor

638

# '' which ensures that we start after the previous '\n'

639

entry_field_count = self._fields_per_entry() + 1

640

641

low = self._end_of_header

642

high = file_size - 1 # Ignore the final '\0'

643

# Map from dir => entry

644

found = {}

645

646

# Avoid infinite seeking

647

max_count = 30*len(dir_list)

648

count = 0

649

# pending is a list of places to look.

650

# each entry is a tuple of low, high, dir_names

651

# low -> the first byte offset to read (inclusive)

652

# high -> the last byte offset (inclusive)

653

# dirs -> The list of directories that should be found in

654

# the [low, high] range

655

pending = [(low, high, dir_list)]

656

657

page_size = self._bisect_page_size

658

659

fields_to_entry = self._get_fields_to_entry()

660

661

while pending:

662

low, high, cur_dirs = pending.pop()

663

664

if not cur_dirs or low >= high:

665

# Nothing to find

666

continue

667

668

count += 1

669

if count > max_count:

670

raise errors.BzrError('Too many seeks, most likely a bug.')

671

672

mid = max(low, (low+high-page_size)//2)

673

674

state_file.seek(mid)

675

# limit the read size, so we don't end up reading data that we have

676

# already read.

677

read_size = min(page_size, (high-mid)+1)

678

block = state_file.read(read_size)

679

680

start = mid

681

entries = block.split('\n')

682

683

if len(entries) < 2:

684

# We didn't find a '\n', so we cannot have found any records.

685

# So put this range back and try again. But we know we have to

686

# increase the page size, because a single read did not contain

687

# a record break (so records must be larger than page_size)

688

page_size *= 2

689

pending.append((low, high, cur_dirs))

690

continue

691

692

# Check the first and last entries, in case they are partial, or if

693

# we don't care about the rest of this page

694

first_entry_num = 0

695

first_fields = entries[0].split('\0')

696

if len(first_fields) < entry_field_count:

697

# We didn't get the complete first entry

698

# so move start, and grab the next, which

699

# should be a full entry

700

start += len(entries[0])+1

701

first_fields = entries[1].split('\0')

702

first_entry_num = 1

703

704

if len(first_fields) <= 1:

705

# We didn't even get a dirname here... what do we do?

706

# Try a large page size and repeat this query

707

page_size *= 2

708

pending.append((low, high, cur_dirs))

709

continue

710

else:

711

# Find what entries we are looking for, which occur before and

712

# after this first record.

713

after = start

714

first_dir = first_fields[1]

715

first_loc = bisect.bisect_left(cur_dirs, first_dir)

716

717

# These exist before the current location

718

pre = cur_dirs[:first_loc]

719

# These occur after the current location, which may be in the

720

# data we read, or might be after the last entry

721

post = cur_dirs[first_loc:]

722

723

if post and len(first_fields) >= entry_field_count:

724

# We have records to look at after the first entry

725

726

# Parse the last entry

727

last_entry_num = len(entries)-1

728

last_fields = entries[last_entry_num].split('\0')

729

if len(last_fields) < entry_field_count:

730

# The very last hunk was not complete,

731

# read the previous hunk

732

after = mid + len(block) - len(entries[-1])

733

last_entry_num -= 1

734

last_fields = entries[last_entry_num].split('\0')

735

else:

736

after = mid + len(block)

737

738

last_dir = last_fields[1]

739

last_loc = bisect.bisect_right(post, last_dir)

740

741

middle_files = post[:last_loc]

742

post = post[last_loc:]

743

744

if middle_files:

745

# We have files that should occur in this block

746

# (>= first, <= last)

747

# Either we will find them here, or we can mark them as

748

# missing.

749

750

if middle_files[0] == first_dir:

751

# We might need to go before this location

752

pre.append(first_dir)

753

if middle_files[-1] == last_dir:

754

post.insert(0, last_dir)

755

756

# Find out what paths we have

757

paths = {first_dir:[first_fields]}

758

# last_dir might == first_dir so we need to be

759

# careful if we should append rather than overwrite

760

if last_entry_num != first_entry_num:

761

paths.setdefault(last_dir, []).append(last_fields)

762

for num in xrange(first_entry_num+1, last_entry_num):

763

# TODO: jam 20070223 We are already splitting here, so

764

# shouldn't we just split the whole thing rather

765

# than doing the split again in add_one_record?

766

fields = entries[num].split('\0')

767

paths.setdefault(fields[1], []).append(fields)

768

769

for cur_dir in middle_files:

770

for fields in paths.get(cur_dir, []):

771

# offset by 1 because of the opening '\0'

772

# consider changing fields_to_entry to avoid the

773

# extra list slice

774

entry = fields_to_entry(fields[1:])

775

found.setdefault(cur_dir, []).append(entry)

776

777

# Now we have split up everything into pre, middle, and post, and

778

# we have handled everything that fell in 'middle'.

779

# We add 'post' first, so that we prefer to seek towards the

780

# beginning, so that we will tend to go as early as we need, and

781

# then only seek forward after that.

782

if post:

783

pending.append((after, high, post))

784

if pre:

785

pending.append((low, start-1, pre))

786

787

return found

788

789

def _bisect_recursive(self, dir_name_list):

790

"""Bisect for entries for all paths and their children.

791

792

This will use bisect to find all records for the supplied paths. It

793

will then continue to bisect for any records which are marked as

794

directories. (and renames?)

795

796

:param dir_name_list: A sorted list of (dir, name) pairs

797

eg: [('', 'a'), ('', 'f'), ('a/b', 'c')]

798

:return: A dictionary mapping (dir, name, file_id) => [tree_info]

799

"""

800

# Map from (dir, name, file_id) => [tree_info]

801

found = {}

802

803

found_dir_names = set()

804

805

# Directories that have been read

806

processed_dirs = set()

807

# Get the ball rolling with the first bisect for all entries.

808

newly_found = self._bisect(dir_name_list)

809

810

while newly_found:

811

# Directories that need to be read

812

pending_dirs = set()

813

paths_to_search = set()

814

for entry_list in newly_found.itervalues():

815

for dir_name_id, trees_info in entry_list:

816

found[dir_name_id] = trees_info

817

found_dir_names.add(dir_name_id[:2])

818

is_dir = False

819

for tree_info in trees_info:

820

minikind = tree_info[0]

821

if minikind == 'd':

822

if is_dir:

823

# We already processed this one as a directory,

824

# we don't need to do the extra work again.

825

continue

826

subdir, name, file_id = dir_name_id

827

path = osutils.pathjoin(subdir, name)

828

is_dir = True

829

if path not in processed_dirs:

830

pending_dirs.add(path)

831

elif minikind == 'r':

832

# Rename, we need to directly search the target

833

# which is contained in the fingerprint column

834

dir_name = osutils.split(tree_info[1])

835

if dir_name[0] in pending_dirs:

836

# This entry will be found in the dir search

837

continue

838

# TODO: We need to check if this entry has

839

# already been found. Otherwise we might be

840

# hitting infinite recursion.

841

if dir_name not in found_dir_names:

842

paths_to_search.add(dir_name)

843

# Now we have a list of paths to look for directly, and

844

# directory blocks that need to be read.

845

# newly_found is mixing the keys between (dir, name) and path

846

# entries, but that is okay, because we only really care about the

847

# targets.

848

newly_found = self._bisect(sorted(paths_to_search))

849

newly_found.update(self._bisect_dirblocks(sorted(pending_dirs)))

850

processed_dirs.update(pending_dirs)

851

return found

852

853

def _empty_parent_info(self):

854

return [DirState.NULL_PARENT_DETAILS] * (len(self._parents) -

855

len(self._ghosts))

856

857

def _ensure_block(self, parent_block_index, parent_row_index, dirname):

858

"""Ensure a block for dirname exists.

859

860

This function exists to let callers which know that there is a

861

directory dirname ensure that the block for it exists. This block can

862

fail to exist because of demand loading, or because a directory had no

863

children. In either case it is not an error. It is however an error to

864

call this if there is no parent entry for the directory, and thus the

865

function requires the coordinates of such an entry to be provided.

866

867

The root row is special cased and can be indicated with a parent block

868

and row index of -1

869

870

:param parent_block_index: The index of the block in which dirname's row

871

exists.

872

:param parent_row_index: The index in the parent block where the row

873

exists.

874

:param dirname: The utf8 dirname to ensure there is a block for.

875

:return: The index for the block.

876

"""

877

if dirname == '' and parent_row_index == 0 and parent_block_index == 0:

878

# This is the signature of the root row, and the

879

# contents-of-root row is always index 1

880

return 1

881

# the basename of the directory must be the end of its full name.

882

if not (parent_block_index == -1 and

883

parent_row_index == -1 and dirname == ''):

884

assert dirname.endswith(

885

self._dirblocks[parent_block_index][1][parent_row_index][0][1])

886

block_index, present = self._find_block_index_from_key((dirname, '', ''))

887

if not present:

888

# In future, when doing partial parsing, this should load and

889

# populate the entire block.

890

self._dirblocks.insert(block_index, (dirname, []))

891

return block_index

892

893

def _entries_to_current_state(self, new_entries):

894

"""Load new_entries into self.dirblocks.

895

896

Process new_entries into the current state object, making them the active

897

state. The entries are grouped together by directory to form dirblocks.

898

899

:param new_entries: A sorted list of entries. This function does not sort

900

to prevent unneeded overhead when callers have a sorted list already.

901

:return: Nothing.

902

"""

903

assert new_entries[0][0][0:2] == ('', ''), \

904

"Missing root row %r" % (new_entries[0][0],)

905

# The two blocks here are deliberate: the root block and the

906

# contents-of-root block.

907

self._dirblocks = [('', []), ('', [])]

908

current_block = self._dirblocks[0][1]

909

current_dirname = ''

910

root_key = ('', '')

911

append_entry = current_block.append

912

for entry in new_entries:

913

if entry[0][0] != current_dirname:

914

# new block - different dirname

915

current_block = []

916

current_dirname = entry[0][0]

917

self._dirblocks.append((current_dirname, current_block))

918

append_entry = current_block.append

919

# append the entry to the current block

920

append_entry(entry)

921

self._split_root_dirblock_into_contents()

922

923

def _split_root_dirblock_into_contents(self):

924

"""Split the root dirblocks into root and contents-of-root.

925

926

After parsing by path, we end up with root entries and contents-of-root

927

entries in the same block. This loop splits them out again.

928

"""

929

# The above loop leaves the "root block" entries mixed with the

930

# "contents-of-root block". But we don't want an if check on

931

# all entries, so instead we just fix it up here.

932

assert self._dirblocks[1] == ('', [])

933

root_block = []

934

contents_of_root_block = []

935

for entry in self._dirblocks[0][1]:

936

if not entry[0][1]: # This is a root entry

937

root_block.append(entry)

938

else:

939

contents_of_root_block.append(entry)

940

self._dirblocks[0] = ('', root_block)

941

self._dirblocks[1] = ('', contents_of_root_block)
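The grouping done by `_entries_to_current_state` and the root fix-up in `_split_root_dirblock_into_contents` can be sketched as one standalone function. This is a minimal sketch, not the bzrlib implementation: `entries_to_dirblocks` and the sample entries are hypothetical, and the per-tree details are left empty for brevity.

```python
def entries_to_dirblocks(new_entries):
    """Group sorted (key, details) entries into [(dirname, [entries])] blocks."""
    # Start with two '' blocks: the root block and the contents-of-root block.
    dirblocks = [('', []), ('', [])]
    current_dirname = ''
    current_block = dirblocks[0][1]
    for entry in new_entries:
        dirname = entry[0][0]
        if dirname != current_dirname:
            # A different dirname starts a new block.
            current_dirname = dirname
            current_block = []
            dirblocks.append((current_dirname, current_block))
        current_block.append(entry)
    # The root row and the root's children share dirname '', so they were
    # collected together in block 0; separate them by basename afterwards.
    mixed = dirblocks[0][1]
    dirblocks[0] = ('', [e for e in mixed if not e[0][1]])
    dirblocks[1] = ('', [e for e in mixed if e[0][1]])
    return dirblocks


# Hypothetical sample data: keys are (dirname, basename, file_id).
entries = [
    (('', '', 'root-id'), []),
    (('', 'a', 'a-id'), []),
    (('', 'b', 'b-id'), []),
    (('a', 'x', 'x-id'), []),
]
dirblocks = entries_to_dirblocks(entries)
```

Splitting after the loop, rather than testing every entry inside it, mirrors the reason given in the comments above: the per-entry path stays free of an extra `if`.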

942

943

def _entry_to_line(self, entry):

944

"""Serialize entry to a NULL delimited line ready for _get_output_lines.

945

946

:param entry: An entry_tuple as defined in the module docstring.

947

"""

948

entire_entry = list(entry[0])

949

for tree_number, tree_data in enumerate(entry[1]):

950

# (minikind, fingerprint, size, executable, tree_specific_string)

951

entire_entry.extend(tree_data)

952

# 3 for the key, 5 for the fields per tree.

953

tree_offset = 3 + tree_number * 5

954

# minikind

955

entire_entry[tree_offset + 0] = tree_data[0]

956

# size

957

entire_entry[tree_offset + 2] = str(tree_data[2])

958

# executable

959

entire_entry[tree_offset + 3] = DirState._to_yesno[tree_data[3]]

960

return '\0'.join(entire_entry)
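As an illustration of the serialization above, here is a minimal standalone sketch. `entry_to_line` and `TO_YESNO` are hypothetical stand-ins for the method and for `DirState._to_yesno`, and the sample entry is made up.

```python
# Hypothetical stand-in for DirState._to_yesno.
TO_YESNO = {True: 'y', False: 'n'}


def entry_to_line(entry):
    """Serialize (key, [tree_details, ...]) into one NUL-delimited string."""
    key, trees = entry
    fields = list(key)  # dirname, basename, file_id
    for minikind, fingerprint, size, executable, tree_data in trees:
        # Size and executable are the only fields that need converting.
        fields.extend([minikind, fingerprint, str(size),
                       TO_YESNO[executable], tree_data])
    return '\0'.join(fields)


# Made-up entry with a single (working) tree.
example = (('', 'README', 'readme-id'),
           [('f', 'abc123', 42, False, 'packed-stat')])
line = entry_to_line(example)
```

Splitting `line` on `'\0'` gives back the three key fields followed by five fields per tree.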

961

962

def _fields_per_entry(self):

963

"""How many null separated fields should be in each entry row.

964

965

Each line now has an extra '\n' field which is not used

966

so we just skip over it

967

entry size:

968

3 fields for the key

969

+ number of fields per tree_data (5) * tree count

970

+ newline

971

"""

972

tree_count = 1 + self._num_present_parents()

973

return 3 + 5 * tree_count + 1
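The field arithmetic can be checked in isolation. A minimal sketch, assuming a hypothetical helper `fields_per_entry` that takes the present-parent count directly instead of reading it from a DirState:

```python
def fields_per_entry(num_present_parents):
    """Count the NUL-separated fields in one entry row (hypothetical helper)."""
    key_fields = 3        # dirname, basename, file_id
    fields_per_tree = 5   # minikind, fingerprint, size, executable, stat/revision
    tree_count = 1 + num_present_parents  # working tree plus present parents
    newline_field = 1     # the unused newline field that ends each row
    return key_fields + fields_per_tree * tree_count + newline_field
```

With no present parents a row has 9 fields, and each additional present parent adds 5.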

974

975

def _find_block(self, key, add_if_missing=False):

976

"""Return the block that key should be present in.

977

978

:param key: A dirstate entry key.

979

:return: The block tuple.

980

"""

981

block_index, present = self._find_block_index_from_key(key)

982

if not present:

983

if not add_if_missing:

984

# check to see if key is versioned itself - we might want to

985

# add it anyway, because dirs with no entries don't get a

986

# dirblock at parse time.

987

# This is an uncommon branch to take: most dirs have children,

988

# and most code works with versioned paths.

989

parent_base, parent_name = osutils.split(key[0])

990

if not self._get_block_entry_index(parent_base, parent_name, 0)[3]:

991

# some parent path has not been added - it's an error to add

992

# this child

993

raise errors.NotVersionedError(key[0:2], str(self))

994

self._dirblocks.insert(block_index, (key[0], []))

995

return self._dirblocks[block_index]

996

997

def _find_block_index_from_key(self, key):

998

"""Find the dirblock index for a key.

999

1000

:return: The block index, True if the block for the key is present.

1001

"""

1002

if key[0:2] == ('', ''):

1003

return 0, True

1004

block_index = bisect_dirblock(self._dirblocks, key[0], 1,

1005

cache=self._split_path_cache)

1006

# _right returns one-past-where-key is so we have to subtract

1007

# one to use it. we use _right here because there are two

1008

# '' blocks - the root, and the contents of root

1009

# we always have a minimum of 2 in self._dirblocks: root and

1010

# root-contents, and for '', we get 2 back, so this is

1011

# simple and correct:

1012

present = (block_index < len(self._dirblocks) and

1013

self._dirblocks[block_index][0] == key[0])

1014

return block_index, present

1015

1016

def _find_entry_index(self, key, block):

1017

"""Find the entry index for a key in a block.

1018

1019

:return: The entry index, True if the entry for the key is present.

1020

"""

1021

entry_index = bisect.bisect_left(block, (key, []))

1022

present = (entry_index < len(block) and

1023

block[entry_index][0] == key)

1024

return entry_index, present
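The lookup above relies on tuple ordering: `(key, [])` sorts before any real entry with the same key. A minimal standalone sketch, with a hypothetical function name and made-up block contents:

```python
import bisect


def find_entry_index(key, block):
    """Return (index, present) for key in a sorted block of (key, details)."""
    # An empty details list compares less than any non-empty one, so
    # bisect_left lands on the entry for key if it exists, or on the
    # position where it would be inserted if it does not.
    entry_index = bisect.bisect_left(block, (key, []))
    present = entry_index < len(block) and block[entry_index][0] == key
    return entry_index, present


# Made-up block for the root directory, sorted by key.
block = [
    (('', 'a', 'a-id'), [('f', 'sha-a', 1, False, 'stat-a')]),
    (('', 'c', 'c-id'), [('f', 'sha-c', 3, False, 'stat-c')]),
]
```

When `present` is false, the returned index is still useful as the insertion point that keeps the block sorted.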

1025

1026

@staticmethod

1027

def from_tree(tree, dir_state_filename):

1028

"""Create a dirstate from a bzr Tree.

1029

1030

:param tree: The tree which should provide parent information and

1031

inventory ids.

1032

:return: a DirState object which is currently locked for writing.

1033

(it was locked by DirState.initialize)

1034

"""

1035

result = DirState.initialize(dir_state_filename)

1036

try:

1037

tree.lock_read()

1038

try:

1039

parent_ids = tree.get_parent_ids()

1040

num_parents = len(parent_ids)

1041

parent_trees = []

1042

for parent_id in parent_ids:

1043

parent_tree = tree.branch.repository.revision_tree(parent_id)

1044

parent_trees.append((parent_id, parent_tree))

1045

parent_tree.lock_read()

1046

result.set_parent_trees(parent_trees, [])

1047

result.set_state_from_inventory(tree.inventory)

1048

finally:

1049

for revid, parent_tree in parent_trees:

1050

parent_tree.unlock()

1051

tree.unlock()

1052

except:

1053

# The caller won't have a chance to unlock this, so make sure we

1054

# cleanup ourselves

1055

result.unlock()

1056

raise

1057

return result

1058

1059

def update_entry(self, entry, abspath, stat_value=None):

1060

"""Update the entry based on what is actually on disk.

1061

1062

:param entry: This is the dirblock entry for the file in question.

1063

:param abspath: The path on disk for this file.

1064

:param stat_value: (optional) if we already have done a stat on the

1065

file, re-use it.

1066

:return: The sha1 hexdigest of the file (40 bytes) or link target of a

1067

symlink.

1068

"""

1069

# This code assumes that the entry passed in is directly held in one of

1070

# the internal _dirblocks. So the dirblock state must have already been

1071

# read.

1072

assert self._dirblock_state != DirState.NOT_IN_MEMORY

1073

if stat_value is None:

1074

try:

1075

# We could inline os.lstat but the common case is that

1076

# stat_value will be passed in, not read here.

1077

stat_value = self._lstat(abspath, entry)

1078

except (OSError, IOError), e:

1079

if e.errno in (errno.ENOENT, errno.EACCES,

1080

errno.EPERM):

1081

# The entry is missing, consider it gone

1082

return None

1083

raise

1084

1085

kind = osutils.file_kind_from_stat_mode(stat_value.st_mode)

1086

try:

1087

minikind = DirState._kind_to_minikind[kind]

1088

except KeyError: # Unknown kind

1089

return None

1090

packed_stat = pack_stat(stat_value)

1091

(saved_minikind, saved_link_or_sha1, saved_file_size,

1092

saved_executable, saved_packed_stat) = entry[1][0]

1093

1094

if (minikind == saved_minikind

1095

and packed_stat == saved_packed_stat

1096

# size should also be in packed_stat

1097

and saved_file_size == stat_value.st_size):

1098

# The stat hasn't changed since we saved, so we can potentially

1099

# re-use the saved sha hash.

1100

if minikind == 'd':

1101

return None

1102

1103

if self._cutoff_time is None:

1104

self._sha_cutoff_time()

1105

1106

if (stat_value.st_mtime < self._cutoff_time

1107

and stat_value.st_ctime < self._cutoff_time):

1108

# Return the existing fingerprint

1109

return saved_link_or_sha1

1110

1111

# If we have gotten this far, that means that we need to actually

1112

# process this entry.

1113

link_or_sha1 = None

1114

if minikind == 'f':

1115

link_or_sha1 = self._sha1_file(abspath, entry)

1116

executable = self._is_executable(stat_value.st_mode,

1117

saved_executable)

1118

entry[1][0] = ('f', link_or_sha1, stat_value.st_size,

1119

executable, packed_stat)

1120

elif minikind == 'd':

1121

link_or_sha1 = None

1122

entry[1][0] = ('d', '', 0, False, packed_stat)

1123

if saved_minikind != 'd':

1124

# This changed from something into a directory. Make sure we

1125

# have a directory block for it. This doesn't happen very

1126

# often, so this doesn't have to be super fast.

1127

block_index, entry_index, dir_present, file_present = \

1128

self._get_block_entry_index(entry[0][0], entry[0][1], 0)

1129

self._ensure_block(block_index, entry_index,

1130

osutils.pathjoin(entry[0][0], entry[0][1]))

1131

elif minikind == 'l':

1132

link_or_sha1 = self._read_link(abspath, saved_link_or_sha1)

1133

entry[1][0] = ('l', link_or_sha1, stat_value.st_size,

1134

False, packed_stat)

1135

self._dirblock_state = DirState.IN_MEMORY_MODIFIED

1136

return link_or_sha1

1137

1138

def _sha_cutoff_time(self):

1139

"""Return cutoff time.

1140

1141

Files modified more recently than this time are at risk of being

1142

undetectably modified and so can't be cached.

1143

"""

1144

# Cache the cutoff time as long as we hold a lock.

1145

# time.time() isn't super expensive (approx 3.38us), but

1146

# when you call it 50,000 times it adds up.

1147

# For comparison, os.lstat() costs 7.2us if it is hot.

1148

self._cutoff_time = int(time.time()) - 3

1149

return self._cutoff_time
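The cutoff check used by `update_entry` can be sketched on its own. This is a minimal sketch under stated assumptions: both function names are hypothetical, and the `now` parameter is added only so the example is deterministic.

```python
import time


def sha_cutoff_time(now=None):
    """Return the cutoff; files touched at or after it are too fresh to trust."""
    if now is None:
        now = time.time()
    # Truncate to whole seconds, then step back 3 seconds as in the source.
    return int(now) - 3


def can_reuse_fingerprint(st_mtime, st_ctime, cutoff):
    """True only when both timestamps are strictly older than the cutoff."""
    return st_mtime < cutoff and st_ctime < cutoff
```

A file whose mtime or ctime falls inside the window could be modified again within the same timestamp tick without its stat changing, which is why its cached fingerprint is not reused.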

1150

1151

def _lstat(self, abspath, entry):

1152

"""Return the os.lstat value for this path."""

1153

return os.lstat(abspath)

1154

1155

def _sha1_file(self, abspath, entry):

1156

"""Calculate the SHA1 of a file by reading the full text"""

1157

f = file(abspath, 'rb', buffering=65000)

1158

try:

1159

return osutils.sha_file(f)

1160

finally:

1161

f.close()

1162

1163

def _is_executable(self, mode, old_executable):

1164

"""Is this file executable?"""

1165

return bool(S_IEXEC & mode)

1166

1167

def _is_executable_win32(self, mode, old_executable):

1168

"""On win32 the executable bit is stored in the dirstate."""

1169

return old_executable

1170

1171

if sys.platform == 'win32':

1172

_is_executable = _is_executable_win32

1173

1174

def _read_link(self, abspath, old_link):

1175

"""Read the target of a symlink"""

1176

# TODO: jam 200700301 On Win32, this could just return the value

1177

# already in memory. However, this really needs to be done at a

1178

# higher level, because there either won't be anything on disk,

1179

# or the thing on disk will be a file.

1180

return os.readlink(abspath)

1181

1182

def get_ghosts(self):

1183

"""Return a list of the parent tree revision ids that are ghosts."""

1184

self._read_header_if_needed()

1185

return self._ghosts

1186

1187

def get_lines(self):

1188

"""Serialise the entire dirstate to a sequence of lines."""

1189

if (self._header_state == DirState.IN_MEMORY_UNMODIFIED and

1190

self._dirblock_state == DirState.IN_MEMORY_UNMODIFIED):

1191

# read what's on disk.

1192

self._state_file.seek(0)

1193

return self._state_file.readlines()

1194

lines = []

1195

lines.append(self._get_parents_line(self.get_parent_ids()))

1196

lines.append(self._get_ghosts_line(self._ghosts))

1197

# append the root line which is special cased

1198

lines.extend(map(self._entry_to_line, self._iter_entries()))

1199

return self._get_output_lines(lines)

1200

1201

def _get_ghosts_line(self, ghost_ids):

1202

"""Create a line for the state file for ghost information."""

1203

return '\0'.join([str(len(ghost_ids))] + ghost_ids)

1204

1205

def _get_parents_line(self, parent_ids):

1206

"""Create a line for the state file for parents information."""

1207

return '\0'.join([str(len(parent_ids))] + parent_ids)

1208

1209

def _get_fields_to_entry(self):

1210

"""Get a function which converts entry fields into an entry record.

1211

1212

This handles size and executable, as well as parent records.

1213

1214

:return: A function which takes a list of fields, and returns an

1215

appropriate record for storing in memory.

1216

"""

1217

# This is intentionally unrolled for performance

1218

num_present_parents = self._num_present_parents()

1219

if num_present_parents == 0:

1220

def fields_to_entry_0_parents(fields, _int=int):

1221

path_name_file_id_key = (fields[0], fields[1], fields[2])

1222

return (path_name_file_id_key, [

1223

( # Current tree

1224

fields[3], # minikind

1225

fields[4], # fingerprint

1226

_int(fields[5]), # size

1227

fields[6] == 'y', # executable

1228

fields[7], # packed_stat or revision_id

1229

)])

1230

return fields_to_entry_0_parents

1231

elif num_present_parents == 1:

1232

def fields_to_entry_1_parent(fields, _int=int):

1233

path_name_file_id_key = (fields[0], fields[1], fields[2])

1234

return (path_name_file_id_key, [

1235

( # Current tree

1236

fields[3], # minikind

1237

fields[4], # fingerprint

1238

_int(fields[5]), # size

1239

fields[6] == 'y', # executable

1240

fields[7], # packed_stat or revision_id

1241

),

1242

( # Parent 1

1243

fields[8], # minikind

1244

fields[9], # fingerprint

1245

_int(fields[10]), # size

1246

fields[11] == 'y', # executable

1247

fields[12], # packed_stat or revision_id

1248

),

1249

])

1250

return fields_to_entry_1_parent

1251

elif num_present_parents == 2:

1252

def fields_to_entry_2_parents(fields, _int=int):

1253

path_name_file_id_key = (fields[0], fields[1], fields[2])

1254

return (path_name_file_id_key, [

1255

( # Current tree

1256

fields[3], # minikind

1257

fields[4], # fingerprint

1258

_int(fields[5]), # size

1259

fields[6] == 'y', # executable

1260

fields[7], # packed_stat or revision_id

1261

),

1262

( # Parent 1

1263

fields[8], # minikind

1264

fields[9], # fingerprint

1265

_int(fields[10]), # size

1266

fields[11] == 'y', # executable

1267

fields[12], # packed_stat or revision_id

1268

),

1269

( # Parent 2

1270

fields[13], # minikind

1271

fields[14], # fingerprint

1272

_int(fields[15]), # size

1273

fields[16] == 'y', # executable

1274

fields[17], # packed_stat or revision_id

1275

),

1276

])

1277

return fields_to_entry_2_parents

1278

else:

1279

def fields_to_entry_n_parents(fields, _int=int):

1280

path_name_file_id_key = (fields[0], fields[1], fields[2])

1281

trees = [(fields[cur], # minikind

1282

fields[cur+1], # fingerprint

1283

_int(fields[cur+2]), # size

1284

fields[cur+3] == 'y', # executable

1285

fields[cur+4], # stat or revision_id

1286

) for cur in xrange(3, len(fields)-1, 5)]

1287

return path_name_file_id_key, trees

1288

return fields_to_entry_n_parents
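The unrolled variants above all produce the same shape of record as the generic n-parent case. A minimal sketch of that generic conversion, assuming the row still carries its trailing newline field; the function name and the sample row are hypothetical.

```python
def fields_to_entry(fields):
    """Convert one row's fields (including the trailing newline field)."""
    key = (fields[0], fields[1], fields[2])  # dirname, basename, file_id
    trees = [(fields[cur],             # minikind
              fields[cur + 1],         # fingerprint
              int(fields[cur + 2]),    # size
              fields[cur + 3] == 'y',  # executable
              fields[cur + 4],         # packed_stat or revision_id
              ) for cur in range(3, len(fields) - 1, 5)]
    return key, trees


# Made-up row with the working tree and one present parent.
row = ['', 'a', 'a-id',
       'f', 'sha-new', '10', 'y', 'stat',
       'f', 'sha-old', '8', 'n', 'rev-1',
       '\n']
key, trees = fields_to_entry(row)
```

The unrolled functions trade this loop for fixed indexes, which the source comment says is intentional for performance.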

1289

1290

def get_parent_ids(self):

1291

"""Return a list of the parent tree ids for the directory state."""

1292

self._read_header_if_needed()

1293

return list(self._parents)

1294

1295

def _get_block_entry_index(self, dirname, basename, tree_index):

1296

"""Get the coordinates for a path in the state structure.

1297

1298

:param dirname: The utf8 dirname to lookup.

1299

:param basename: The utf8 basename to lookup.

1300

:param tree_index: The index of the tree for which this lookup should

1301

be attempted.

1302

:return: A tuple describing where the path is located, or should be

1303

inserted. The tuple contains four fields: the block index, the row

1304

index, and two booleans which are True when the directory is present, and

1305

when the entire path is present. There is no guarantee that either

1306

coordinate is currently reachable unless the found field for it is

1307

True. For instance, a directory not present in the searched tree

1308

may be returned with a value one greater than the current highest

1309

block offset. The directory present field will always be True when

1310

the path present field is True. The directory present field does

1311

NOT indicate that the directory is present in the searched tree,

1312

rather it indicates that there are at least some files in some

1313

tree present there.

1314

"""

1315

self._read_dirblocks_if_needed()

1316

key = dirname, basename, ''

1317

block_index, present = self._find_block_index_from_key(key)

1318

if not present:

1319

# no such directory - return the dir index and 0 for the row.

1320

return block_index, 0, False, False

1321

block = self._dirblocks[block_index][1] # access the entries only

1322

entry_index, present = self._find_entry_index(key, block)

1323

# linear search through present entries at this path to find the one

1324

# requested.

1325

while entry_index < len(block) and block[entry_index][0][1] == basename:

1326

if block[entry_index][1][tree_index][0] not in \

1327

('a', 'r'): # absent, relocated

1328

return block_index, entry_index, True, True

1329

entry_index += 1

1330

return block_index, entry_index, True, False

1331

1332

def _get_entry(self, tree_index, fileid_utf8=None, path_utf8=None):

1333

"""Get the dirstate entry for path in tree tree_index

1334

1335

If either file_id or path is supplied, it is used as the key to lookup.

1336

If both are supplied, the fastest lookup is used, and an error is

1337

raised if they do not both point at the same row.

1338

1339

:param tree_index: The index of the tree we wish to locate this path

1340

in. If the path is present in that tree, the entry containing its

1341

details is returned, otherwise (None, None) is returned

1342

0 is the working tree, higher indexes are successive parent

1343

trees.

1344

:param fileid_utf8: A utf8 file_id to look up.

1345

:param path_utf8: A utf8 path to be looked up.

1346

:return: The dirstate entry tuple for path, or (None, None)

1347

"""

1348

self._read_dirblocks_if_needed()

1349

if path_utf8 is not None:

1350

assert path_utf8.__class__ == str, 'path_utf8 is not a str: %s %s' % (type(path_utf8), path_utf8)

1351

# path lookups are faster

1352

dirname, basename = osutils.split(path_utf8)

1353

block_index, entry_index, dir_present, file_present = \

1354

self._get_block_entry_index(dirname, basename, tree_index)

1355

if not file_present:

1356

return None, None

1357

entry = self._dirblocks[block_index][1][entry_index]

1358

assert entry[0][2] and entry[1][tree_index][0] not in ('a', 'r'), 'unversioned entry?!?!'

1359

if fileid_utf8:

1360

if entry[0][2] != fileid_utf8:

1361

raise errors.BzrError('integrity error ? : mismatching'

1362

' tree_index, file_id and path')

1363

return entry

1364

else:

1365

assert fileid_utf8 is not None

1366

possible_keys = self._get_id_index().get(fileid_utf8, None)

1367

if not possible_keys:

1368

return None, None

1369

for key in possible_keys:

1370

block_index, present = \

1371

self._find_block_index_from_key(key)

1372

# strange, probably indicates an out of date

1373

# id index - for now, allow this.

1374

if not present:

1375

continue

1376

# WARNING: Do not change this code to use _get_block_entry_index

1377

# as that function is not suitable: it does not use the key

1378

# to lookup, and thus the wrong coordinates are returned.

1379

block = self._dirblocks[block_index][1]

1380

entry_index, present = self._find_entry_index(key, block)

1381

if present:

1382

entry = self._dirblocks[block_index][1][entry_index]

1383

if entry[1][tree_index][0] in 'fdlt':

1384

# this is the result we are looking for: the

1385

# real home of this file_id in this tree.

1386

return entry

1387

if entry[1][tree_index][0] == 'a':

1388

# there is no home for this entry in this tree

1389

return None, None

1390

assert entry[1][tree_index][0] == 'r', \

1391

"entry %r has invalid minikind %r for tree %r" \

1392

% (entry,

1393

entry[1][tree_index][0],

1394

tree_index)

1395

real_path = entry[1][tree_index][1]

1396

return self._get_entry(tree_index, fileid_utf8=fileid_utf8,

1397

path_utf8=real_path)

1398

return None, None

1399

1400

@classmethod

1401

def initialize(cls, path):

1402

"""Create a new dirstate on path.

1403

1404

The new dirstate will be an empty tree - that is it has no parents,

1405

and only a root node - which has id ROOT_ID.

1406

1407

The object will be write locked when returned to the caller,

1408

unless there was an exception in the writing, in which case it

1409

will be unlocked.

1410

1411

:param path: The name of the file for the dirstate.

1412

:return: A DirState object.

1413

"""

1414

# This constructs a new DirState object on a path, sets the _state_file

1415

# to a new empty file for that path. It then calls _set_data() with our

1416

# stock empty dirstate information - a root with ROOT_ID, no children,

1417

# and no parents. Finally it calls save() to ensure that this data will

1418

# persist.

1419

result = cls(path)

1420

# root dir and root dir contents with no children.

1421

empty_tree_dirblocks = [('', []), ('', [])]

1422

# a new root directory, with a NULLSTAT.

1423

empty_tree_dirblocks[0][1].append(

1424

(('', '', inventory.ROOT_ID), [

1425

('d', '', 0, False, DirState.NULLSTAT),

1426

]))

1427

result.lock_write()

1428

try:

1429

result._set_data([], empty_tree_dirblocks)

1430

result.save()

1431

except:

1432

result.unlock()

1433

raise

1434

return result

1435

1436

def _inv_entry_to_details(self, inv_entry):

1437

"""Convert an inventory entry (from a revision tree) to state details.

1438

1439

:param inv_entry: An inventory entry whose sha1 and link targets can be

1440

relied upon, and which has a revision set.

1441

:return: A details tuple - the details for a single tree at a path +

1442

id.

1443

"""

1444

kind = inv_entry.kind

1445

minikind = DirState._kind_to_minikind[kind]

1446

tree_data = inv_entry.revision

1447

assert len(tree_data) > 0, 'empty revision for the inv_entry.'

1448

if kind == 'directory':

1449

fingerprint = ''

1450

size = 0

1451

executable = False

1452

elif kind == 'symlink':

1453

fingerprint = inv_entry.symlink_target or ''

1454

size = 0

1455

executable = False

1456

elif kind == 'file':

1457

fingerprint = inv_entry.text_sha1 or ''

1458

size = inv_entry.text_size or 0

1459

executable = inv_entry.executable

1460

elif kind == 'tree-reference':

1461

fingerprint = inv_entry.reference_revision or ''

1462

size = 0

1463

executable = False

1464

else:

1465

raise Exception("can't pack %s" % inv_entry)

1466

return (minikind, fingerprint, size, executable, tree_data)

1467

1468

def _iter_entries(self):

1469

"""Iterate over all the entries in the dirstate.

1470

1471

Each yielded item is an entry in the standard format described in the

1472

docstring of bzrlib.dirstate.

1473

"""

1474

self._read_dirblocks_if_needed()

1475

for directory in self._dirblocks:

1476

for entry in directory[1]:

1477

yield entry

1478

1479

def _get_id_index(self):

1480

"""Get an id index of self._dirblocks."""

1481

if self._id_index is None:

1482

id_index = {}

1483

for key, tree_details in self._iter_entries():

1484

id_index.setdefault(key[2], set()).add(key)

1485

self._id_index = id_index

1486

return self._id_index

1487

1488

def _get_output_lines(self, lines):

1489

"""Format lines for final output.

1490

1491

:param lines: A sequence of lines containing the parents list and the

1492

path lines.

1493

"""

1494

output_lines = [DirState.HEADER_FORMAT_3]

1495

lines.append('') # a final newline

1496

inventory_text = '\0\n\0'.join(lines)

1497

output_lines.append('crc32: %s\n' % (zlib.crc32(inventory_text),))

1498

# -3, 1 for num parents, 1 for ghosts, 1 for final newline

1499

num_entries = len(lines)-3

1500

output_lines.append('num_entries: %s\n' % (num_entries,))

1501

output_lines.append(inventory_text)

1502

return output_lines
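The output framing can be sketched as a standalone function. This is a minimal sketch, not the bzrlib code: it uses Python 3 bytes, takes the header as a parameter because the value of `DirState.HEADER_FORMAT_3` is not shown here, and copies `lines` instead of appending to the caller's list.

```python
import zlib


def get_output_lines(lines, header):
    """Wrap parent, ghost and entry lines with a checksum and an entry count."""
    lines = list(lines) + [b'']  # the trailing element yields a final newline
    inventory_text = b'\0\n\0'.join(lines)
    output = [header]
    output.append(b'crc32: %d\n' % zlib.crc32(inventory_text))
    # Subtract the parents line, the ghosts line and the trailing blank.
    output.append(b'num_entries: %d\n' % (len(lines) - 3))
    output.append(inventory_text)
    return output


# Made-up input: no parents, no ghosts, and a single root entry.
parents_line = b'0'
ghosts_line = b'0'
root_entry = b'\x00'.join([b'', b'', b'root-id', b'd', b'', b'0', b'n', b'stat'])
output = get_output_lines([parents_line, ghosts_line, root_entry],
                          b'#hypothetical header\n')
```

Note that `zlib.crc32` always returns an unsigned value in Python 3, whereas Python 2 could return a negative one, which is consistent with the optional `-` in the grammar's `full checksum` rule.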

1503

1504

def _make_deleted_row(self, fileid_utf8, parents):

1505

"""Return a deleted row for fileid_utf8."""

1506

return ('/', 'RECYCLED.BIN', 'file', fileid_utf8, 0, DirState.NULLSTAT,

1507

''), parents

1508

1509

def _num_present_parents(self):

1510

"""The number of parent entries in each record row."""

1511

return len(self._parents) - len(self._ghosts)

1512

1513

@staticmethod

1514

def on_file(path):

1515

"""Construct a DirState on the file at path path.

1516

1517

:return: An unlocked DirState object, associated with the given path.

1518

"""

1519

result = DirState(path)

1520

return result

1521

1522

def _read_dirblocks_if_needed(self):

1523

"""Read in all the dirblocks from the file if they are not in memory.

1524

1525

This populates self._dirblocks, and sets self._dirblock_state to

1526

IN_MEMORY_UNMODIFIED. It is not currently ready for incremental block

1527

loading.

1528

"""

1529

self._read_header_if_needed()

1530

if self._dirblock_state == DirState.NOT_IN_MEMORY:

1531

# move the _state_file pointer to after the header (in case bisect

1532

# has been called in the meantime)

1533

self._state_file.seek(self._end_of_header)

1534

text = self._state_file.read()

1535

# TODO: check the crc checksums. crc_measured = zlib.crc32(text)

1536

1537

fields = text.split('\0')

1538

# Remove the last blank entry

1539

trailing = fields.pop()

1540

assert trailing == ''

1541

# consider turning fields into a tuple.

1542

1543

# skip the first field which is the trailing null from the header.

1544

cur = 1

1545

# Each line now has an extra '\n' field which is not used

1546

# so we just skip over it

1547

# entry size:

1548

# 3 fields for the key

1549

# + number of fields per tree_data (5) * tree count

1550

# + newline

1551

num_present_parents = self._num_present_parents()

1552

tree_count = 1 + num_present_parents

1553

entry_size = self._fields_per_entry()

1554

expected_field_count = entry_size * self._num_entries

1555

field_count = len(fields)

1556

# this checks our adjustment, and also catches file too short.

1557

assert field_count - cur == expected_field_count, \

1558

'field count incorrect %s != %s, entry_size=%s, '\

1559

'num_entries=%s fields=%r' % (

1560

field_count - cur, expected_field_count, entry_size,

1561

self._num_entries, fields)

1562

1563

if num_present_parents == 1:

1564

# Bind external functions to local names

1565

_int = int

1566

# We access all fields in order, so we can just iterate over

1567

# them. Grab an straight iterator over the fields. (We use an

1568

# iterator because we don't want to do a lot of additions, nor

1569

# do we want to do a lot of slicing)

1570

next = iter(fields).next

1571

# Move the iterator to the current position

1572

for x in xrange(cur):

1573

next()

1574

# The two blocks here are deliberate: the root block and the

1575

# contents-of-root block.

1576

self._dirblocks = [('', []), ('', [])]

1577

current_block = self._dirblocks[0][1]

1578

current_dirname = ''

1579

append_entry = current_block.append

1580

for count in xrange(self._num_entries):

1581

dirname = next()

1582

name = next()

1583

file_id = next()

1584

if dirname != current_dirname:

1585

# new block - different dirname

1586

current_block = []

1587

current_dirname = dirname

1588

self._dirblocks.append((current_dirname, current_block))

1589

append_entry = current_block.append

1590

# we know current_dirname == dirname, so re-use it to avoid

1591

# creating new strings

1592

entry = ((current_dirname, name, file_id),

1593

[(# Current Tree

1594

next(), # minikind

1595

next(), # fingerprint

1596

_int(next()), # size

1597

next() == 'y', # executable

1598

next(), # packed_stat or revision_id

1599

),

1600

( # Parent 1

1601

next(), # minikind

1602

next(), # fingerprint

1603

_int(next()), # size

1604

next() == 'y', # executable

1605

next(), # packed_stat or revision_id

1606

),

1607

])

1608

trailing = next()

1609

assert trailing == '\n'

1610

# append the entry to the current block

1611

append_entry(entry)

1612

self._split_root_dirblock_into_contents()

1613

else:

1614

fields_to_entry = self._get_fields_to_entry()

1615

entries = [fields_to_entry(fields[pos:pos+entry_size])

1616

for pos in xrange(cur, field_count, entry_size)]

1617

self._entries_to_current_state(entries)

1618

# To convert from format 2 => format 3

1619

# self._dirblocks = sorted(self._dirblocks,

1620

# key=lambda blk:blk[0].split('/'))

1621

# To convert from format 3 => format 2

1622

# self._dirblocks = sorted(self._dirblocks)

1623

self._dirblock_state = DirState.IN_MEMORY_UNMODIFIED
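
# Illustration only (not part of bzrlib): a minimal standalone sketch of the
# field arithmetic described in the comments above, assuming each entry is
# 3 key fields + 5 fields per tree + 1 newline field. The name
# fields_to_entries and the sample data are hypothetical; the real parsing is
# done by _read_dirblocks_if_needed and its helpers.

```python
def fields_to_entries(fields, tree_count):
    """Group a flat list of fields into (key, [tree_details, ...]) entries."""
    entry_size = 3 + 5 * tree_count + 1
    if len(fields) % entry_size:
        raise ValueError('field count is not a multiple of the entry size')
    entries = []
    for pos in range(0, len(fields), entry_size):
        chunk = fields[pos:pos + entry_size]
        key = tuple(chunk[0:3])  # (dirname, basename, file_id)
        trees = []
        for tree in range(tree_count):
            start = 3 + 5 * tree
            minikind, fingerprint, size, executable, tail = chunk[start:start + 5]
            trees.append((minikind, fingerprint, int(size),
                          executable == 'y', tail))
        if chunk[-1] != '\n':
            raise ValueError('entry does not end with a newline field')
        entries.append((key, trees))
    return entries


# Two entries, one tree each, joined by NULs as in the state file body.
text = '\0'.join([
    '', 'README', 'readme-id', 'f', 'sha', '12', 'n', 'stat', '\n',
    'bin', 'run', 'run-id', 'f', 'sha2', '7', 'y', 'stat2', '\n',
]) + '\0'
fields = text.split('\0')
fields.pop()  # drop the blank field after the final NUL
entries = fields_to_entries(fields, tree_count=1)
```

# With one tree the entry size is 9, so the 18 fields above yield two entries.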

    def _read_header(self):
        """This reads in the metadata header, and the parent ids.

        After reading in, the file should be positioned at the null
        just before the start of the first record in the file.

        :return: (expected crc checksum, number of entries, parent list)
        """
        self._read_prelude()
        parent_line = self._state_file.readline()
        info = parent_line.split('\0')
        num_parents = int(info[0])
        assert num_parents == len(info)-2, 'incorrect parent info line'
        self._parents = info[1:-1]

        ghost_line = self._state_file.readline()
        info = ghost_line.split('\0')
        num_ghosts = int(info[1])
        assert num_ghosts == len(info)-3, 'incorrect ghost info line'
        self._ghosts = info[2:-1]
        self._header_state = DirState.IN_MEMORY_UNMODIFIED
        self._end_of_header = self._state_file.tell()

    def _read_header_if_needed(self):
        """Read the header of the dirstate file if needed."""
        # inline this as it will be called a lot
        if not self._lock_token:
            raise errors.ObjectNotLocked(self)
        if self._header_state == DirState.NOT_IN_MEMORY:
            self._read_header()

    def _read_prelude(self):
        """Read in the prelude header of the dirstate file

        This only reads in the stuff that is not connected to the crc
        checksum. The position will be correct to read in the rest of
        the file and check the checksum after this point.
        The next entry in the file should be the number of parents,
        and their ids. Followed by a newline.
        """
        header = self._state_file.readline()
        assert header == DirState.HEADER_FORMAT_3, \
            'invalid header line: %r' % (header,)
        crc_line = self._state_file.readline()
        assert crc_line.startswith('crc32: '), 'missing crc32 checksum'
        self.crc_expected = int(crc_line[len('crc32: '):-1])
        num_entries_line = self._state_file.readline()
        assert num_entries_line.startswith('num_entries: '), 'missing num_entries line'
        self._num_entries = int(num_entries_line[len('num_entries: '):-1])
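
# Illustration only (not part of bzrlib): a standalone sketch of the three
# prelude lines read above, run against an in-memory file. The HEADER value,
# the read_prelude name and the sample text are assumptions made for the
# example; the real constant is DirState.HEADER_FORMAT_3.

```python
import io

HEADER = '#bazaar dirstate flat format 3\n'  # assumed header text


def read_prelude(state_file):
    """Return (crc_expected, num_entries) from the first three lines."""
    header = state_file.readline()
    if header != HEADER:
        raise ValueError('invalid header line: %r' % (header,))
    crc_line = state_file.readline()
    if not crc_line.startswith('crc32: '):
        raise ValueError('missing crc32 checksum')
    # The checksum may be negative, so int() must accept a leading '-'.
    crc_expected = int(crc_line[len('crc32: '):-1])
    num_entries_line = state_file.readline()
    if not num_entries_line.startswith('num_entries: '):
        raise ValueError('missing num_entries line')
    num_entries = int(num_entries_line[len('num_entries: '):-1])
    return crc_expected, num_entries


sample = io.StringIO(HEADER + u'crc32: -12345\nnum_entries: 2\n0\x00\n')
crc_expected, num_entries = read_prelude(sample)
# The file position is now at the start of the parent line.
rest = sample.read()
```

# The sketch raises ValueError rather than asserting, so the checks still run
# under "python -O"; that is a choice made here, not what the method does.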

    def save(self):
        """Save any pending changes created during this session.

        We reuse the existing file, because that prevents race conditions with
        file creation, and use oslocks on it to prevent concurrent modification
        and reads - because dirstate's incremental data aggregation is not
        compatible with reading a modified file, and replacing a file in use by
        another process is impossible on Windows.

        A dirstate in read only mode should be smart enough though to validate
        that the file has not changed, and otherwise discard its cache and
        start over, to allow for fine grained read lock duration, so 'status'
        won't block 'commit' - for example.
        """
        if (self._header_state == DirState.IN_MEMORY_MODIFIED or
            self._dirblock_state == DirState.IN_MEMORY_MODIFIED):

            if self._lock_state == 'w':
                out_file = self._state_file
                wlock = None
            else:
                # Try to grab a write lock so that we can update the file.
                try:
                    wlock = lock.WriteLock(self._filename)
                except (errors.LockError, errors.LockContention), e:
                    # We couldn't grab the lock, so just leave things dirty in
                    # memory.
                    return
                except IOError, e:
                    # This may be a read-only tree, or someone else may have a
                    # ReadLock, so handle the case when we cannot grab a write
                    # lock
                    if e.errno in (errno.ENOENT, errno.EPERM, errno.EACCES,
                                   errno.EAGAIN):
                        # Ignore these errors and just don't save anything
                        return
                    raise
                out_file = wlock.f
            try:
                out_file.seek(0)
                out_file.writelines(self.get_lines())
                out_file.truncate()
                out_file.flush()
                self._header_state = DirState.IN_MEMORY_UNMODIFIED
                self._dirblock_state = DirState.IN_MEMORY_UNMODIFIED
            finally:
                if wlock is not None:
                    wlock.unlock()

    def _set_data(self, parent_ids, dirblocks):
        """Set the full dirstate data in memory.

        This is an internal function used to completely replace the object's
        in-memory state. It puts the dirstate into state 'full-dirty'.

        :param parent_ids: A list of parent tree revision ids.
        :param dirblocks: A list containing one tuple for each directory in the
            tree. Each tuple contains the directory path and a list of entries
            found in that directory.
        """
        # our memory copy is now authoritative.
        self._dirblocks = dirblocks
        self._header_state = DirState.IN_MEMORY_MODIFIED
        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
        self._parents = list(parent_ids)
        self._id_index = None

    def set_path_id(self, path, new_id):
        """Change the id of path to new_id in the current working tree.

        :param path: The path inside the tree to set - '' is the root, 'foo'
            is the path foo in the root.
        :param new_id: The new id to assign to the path. This must be a utf8
            file id (not unicode, and not None).
        """
        assert new_id.__class__ == str, \
            "path_id %r is not a plain string" % (new_id,)
        self._read_dirblocks_if_needed()
        if len(path):
            # logic not written
            raise NotImplementedError(self.set_path_id)
        # TODO: check new id is unique
        entry = self._get_entry(0, path_utf8=path)
        if entry[0][2] == new_id:
            # Nothing to change.
            return
        # mark the old path absent, and insert a new root path
        self._make_absent(entry)
        self.update_minimal(('', '', new_id), 'd',
            path_utf8='', packed_stat=entry[1][0][4])
        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
        if self._id_index is not None:
            self._id_index.setdefault(new_id, set()).add(entry[0])

    def set_parent_trees(self, trees, ghosts):
        """Set the parent trees for the dirstate.

        :param trees: A list of revision_id, tree tuples. tree must be provided
            even if the revision_id refers to a ghost: supply an empty tree in
            this case.
        :param ghosts: A list of the revision_ids that are ghosts at the time
            of setting.
        """
        self._validate()
        # TODO: generate a list of parent indexes to preserve to save
        # processing specific parent trees. In the common case one tree will
        # be preserved - the left most parent.
        # TODO: if the parent tree is a dirstate, we might want to walk them
        # all by path in parallel for 'optimal' common-case performance.
        # generate new root row.
        self._read_dirblocks_if_needed()
        # TODO future sketch: Examine the existing parents to generate a change
        # map and then walk the new parent trees only, mapping them into the
        # dirstate. Walk the dirstate at the same time to remove unreferenced
        # entries.
        # for now:
        # sketch: loop over all entries in the dirstate, cherry picking
        # entries from the parent trees, if they are not ghost trees.
        # after we finish walking the dirstate, all entries not in the dirstate
        # are deletes, so we want to append them to the end as per the design
        # discussions. So do a set difference on ids with the parents to
        # get deletes, and add them to the end.
        # During the update process we need to answer the following questions:
        # - find other keys containing a fileid in order to create cross-path
        #   links. We don't trivially use the inventory from other trees
        #   because this leads to either double touching, or to accessing
        #   missing keys,
        # - find other keys containing a path
        # We accumulate each entry via this dictionary, including the root
        by_path = {}
        id_index = {}
        # we could do parallel iterators, but because file id data may be
        # scattered throughout, we don't save on index overhead: we have to look
        # at everything anyway. We can probably save cycles by reusing parent
        # data and doing an incremental update when adding an additional
        # parent, but for now the common cases are adding a new parent (merge),
        # and replacing completely (commit), and commit is more common: so
        # optimise merge later.

        # ---- start generation of full tree mapping data
        # what trees should we use?
        parent_trees = [tree for rev_id, tree in trees if rev_id not in ghosts]
        # how many trees do we end up with
        parent_count = len(parent_trees)

        # one: the current tree
        for entry in self._iter_entries():
            # skip entries not in the current tree
            if entry[1][0][0] in ('a', 'r'): # absent, relocated
                continue
            by_path[entry[0]] = [entry[1][0]] + \
                [DirState.NULL_PARENT_DETAILS] * parent_count
            id_index[entry[0][2]] = set([entry[0]])

        # now the parent trees:
        for tree_index, tree in enumerate(parent_trees):
            # the index is off by one, adjust it.
            tree_index = tree_index + 1
            # when we add new locations for a fileid we need these ranges for
            # any fileid in this tree as we set the by_path[id] to:
            # already_processed_tree_details + new_details + new_location_suffix
            # the suffix is from tree_index+1:parent_count+1.
            new_location_suffix = [DirState.NULL_PARENT_DETAILS] * (parent_count - tree_index)
            # now stitch in all the entries from this tree
            for path, entry in tree.inventory.iter_entries_by_dir():
                # here we process each tree's details for each item in the tree.
                # we first update any existing entries for the id at other paths,
                # then we either create or update the entry for the id at the
                # right path, and finally we add (if needed) a mapping from
                # file_id to this path. We do it in this order to allow us to
                # avoid checking all known paths for the id when generating a
                # new entry at this path: by adding the id->path mapping last,
                # all the mappings are valid and have correct relocation
                # records where needed.
                file_id = entry.file_id
                path_utf8 = path.encode('utf8')
                dirname, basename = osutils.split(path_utf8)
                new_entry_key = (dirname, basename, file_id)
                # tree index consistency: All other paths for this id in this tree
                # index must point to the correct path.
                for entry_key in id_index.setdefault(file_id, set()):
                    # TODO:PROFILING: It might be faster to just update
                    # rather than checking if we need to, and then overwrite
                    # the one we are located at.
                    if entry_key != new_entry_key:
                        # this file id is at a different path in one of the
                        # other trees, so put absent pointers there
                        # This is the vertical axis in the matrix, all pointing
                        # to the real path.
                        by_path[entry_key][tree_index] = ('r', path_utf8, 0, False, '')
                # by path consistency: Insert into an existing path record (trivial), or
                # add a new one with relocation pointers for the other tree indexes.
                if new_entry_key in id_index[file_id]:
                    # there is already an entry where this data belongs, just insert it.
                    by_path[new_entry_key][tree_index] = \
                        self._inv_entry_to_details(entry)
                else:
                    # add relocated entries to the horizontal axis - this row
                    # mapping from path,id. We need to look up the correct path
                    # for the indexes from 0 to tree_index -1
                    new_details = []
                    for lookup_index in xrange(tree_index):
                        # boundary case: this is the first occurrence of file_id
                        # so there are no id_index entries, possibly take this
                        # out of the loop?
                        if not len(id_index[file_id]):
                            new_details.append(DirState.NULL_PARENT_DETAILS)
                        else:
                            # grab any one entry, use it to find the right path.
                            # TODO: optimise this to reduce memory use in highly
                            # fragmented situations by reusing the relocation
                            # records.
                            a_key = iter(id_index[file_id]).next()
                            if by_path[a_key][lookup_index][0] in ('r', 'a'):
                                # it's a pointer or missing statement, use it as is.
                                new_details.append(by_path[a_key][lookup_index])
                            else:
                                # we have the right key, make a pointer to it.
                                real_path = ('/'.join(a_key[0:2])).strip('/')
                                new_details.append(('r', real_path, 0, False, ''))
                    new_details.append(self._inv_entry_to_details(entry))
                    new_details.extend(new_location_suffix)
                    by_path[new_entry_key] = new_details
                    id_index[file_id].add(new_entry_key)
        # --- end generation of full tree mappings

        # sort and output all the entries
        new_entries = self._sort_entries(by_path.items())
        self._entries_to_current_state(new_entries)
        self._parents = [rev_id for rev_id, tree in trees]
        self._ghosts = list(ghosts)
        self._header_state = DirState.IN_MEMORY_MODIFIED
        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
        self._id_index = id_index
        self._validate()

    def _sort_entries(self, entry_list):
        """Given a list of entries, sort them into the right order.

        This is done when constructing a new dirstate from trees - normally we
        try to keep everything in sorted blocks all the time, but sometimes
        it's easier to sort after the fact.
        """
        # TODO: Might be faster to do a Schwartzian transform?
        def _key(entry):
            # sort by: directory parts, file name, file id
            return entry[0][0].split('/'), entry[0][1], entry[0][2]
        return sorted(entry_list, key=_key)

    def set_state_from_inventory(self, new_inv):
        """Set new_inv as the current state.

        This API is called by tree transform, and will usually occur with
        existing parent trees.

        :param new_inv: The inventory object to set current state from.
        """
        self._read_dirblocks_if_needed()
        # sketch:
        # incremental algorithm:
        # two iterators: current data and new data, both in dirblock order.
        new_iterator = new_inv.iter_entries_by_dir()
        # we will be modifying the dirstate, so we need a stable iterator. In
        # future we might write one, for now we just clone the state into a
        # list - which is a shallow copy, so each
        old_iterator = iter(list(self._iter_entries()))
        # both must have roots so this is safe:
        current_new = new_iterator.next()
        current_old = old_iterator.next()
        def advance(iterator):
            try:
                return iterator.next()
            except StopIteration:
                return None
        while current_new or current_old:
            # skip entries in old that are not really there
            if current_old and current_old[1][0][0] in ('r', 'a'):
                # relocated or absent
                current_old = advance(old_iterator)
                continue
            if current_new:
                # convert new into dirblock style
                new_path_utf8 = current_new[0].encode('utf8')
                new_dirname, new_basename = osutils.split(new_path_utf8)
                new_id = current_new[1].file_id
                new_entry_key = (new_dirname, new_basename, new_id)
                current_new_minikind = \
                    DirState._kind_to_minikind[current_new[1].kind]
                if current_new_minikind == 't':
                    fingerprint = current_new[1].reference_revision
                else:
                    fingerprint = ''
            else:
                # for safety disable variables
                new_path_utf8 = new_dirname = new_basename = new_id = new_entry_key = None
            # 5 cases, we don't have a value that is strictly greater than everything, so
            # we make both end conditions explicit
            if not current_old:
                # old is finished: insert current_new into the state.
                self.update_minimal(new_entry_key, current_new_minikind,
                    executable=current_new[1].executable,
                    path_utf8=new_path_utf8, fingerprint=fingerprint)
                current_new = advance(new_iterator)
            elif not current_new:
                # new is finished
                self._make_absent(current_old)
                current_old = advance(old_iterator)
            elif new_entry_key == current_old[0]:
                # same - common case
                # TODO: update the record if anything significant has changed.
                # the minimal required trigger is if the execute bit or cached
                # kind has changed.
                if (current_old[1][0][3] != current_new[1].executable or
                    current_old[1][0][0] != current_new_minikind):
                    self.update_minimal(current_old[0], current_new_minikind,
                        executable=current_new[1].executable,
                        path_utf8=new_path_utf8, fingerprint=fingerprint)
                # both sides are dealt with, move on
                current_old = advance(old_iterator)
                current_new = advance(new_iterator)
            elif new_entry_key < current_old[0]:
                # new comes before:
                # add an entry for this and advance new
                self.update_minimal(new_entry_key, current_new_minikind,
                    executable=current_new[1].executable,
                    path_utf8=new_path_utf8, fingerprint=fingerprint)
                current_new = advance(new_iterator)
            else:
                # old comes before:
                self._make_absent(current_old)
                current_old = advance(old_iterator)
        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
        self._id_index = None
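
# Illustration only (not part of bzrlib): the five-case walk above, reduced to
# a standalone sketch over two sorted key lists. The name merge_walk and the
# 'add'/'remove'/'keep' labels are hypothetical; in the method itself 'add'
# corresponds to update_minimal and 'remove' to _make_absent.

```python
def merge_walk(old_keys, new_keys):
    """Yield (action, key) pairs for two sequences sorted the same way."""
    old_iter, new_iter = iter(old_keys), iter(new_keys)
    current_old = next(old_iter, None)
    current_new = next(new_iter, None)
    while current_old is not None or current_new is not None:
        if current_old is None:
            # old is finished: everything left in new is an addition
            yield 'add', current_new
            current_new = next(new_iter, None)
        elif current_new is None:
            # new is finished: everything left in old is a removal
            yield 'remove', current_old
            current_old = next(old_iter, None)
        elif current_new == current_old:
            # same key on both sides, advance both
            yield 'keep', current_old
            current_old = next(old_iter, None)
            current_new = next(new_iter, None)
        elif current_new < current_old:
            # new comes before old
            yield 'add', current_new
            current_new = next(new_iter, None)
        else:
            # old comes before new
            yield 'remove', current_old
            current_old = next(old_iter, None)


old = [('', 'a', 'a-id'), ('', 'c', 'c-id')]
new = [('', 'a', 'a-id'), ('', 'b', 'b-id')]
actions = list(merge_walk(old, new))
```

# Both end conditions are explicit because there is no sentinel key that
# compares greater than every real key.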

    def _make_absent(self, current_old):
        """Mark current_old - an entry - as absent for tree 0.

        :return: True if this was the last details entry for the entry key:
            that is, if the underlying block has had the entry removed, thus
            shrinking in length.
        """
        # build up paths that this id will be left at after the change is made,
        # so we can update their cross references in tree 0
        all_remaining_keys = set()
        # Don't check the working tree, because it's going away.
        for details in current_old[1][1:]:
            if details[0] not in ('a', 'r'): # absent, relocated
                all_remaining_keys.add(current_old[0])
            elif details[0] == 'r': # relocated
                # record the key for the real path.
                all_remaining_keys.add(tuple(osutils.split(details[1])) + (current_old[0][2],))
            # absent rows are not present at any path.
        last_reference = current_old[0] not in all_remaining_keys
        if last_reference:
            # the current row consists entirely of the current item (being marked
            # absent), and relocated or absent entries for the other trees:
            # Remove it, it's meaningless.
            block = self._find_block(current_old[0])
            entry_index, present = self._find_entry_index(current_old[0], block[1])
            assert present, 'could not find entry for %s' % (current_old,)
            block[1].pop(entry_index)
            # if we have an id_index in use, remove this key from it for this id.
            if self._id_index is not None:
                self._id_index[current_old[0][2]].remove(current_old[0])
        # update all remaining keys for this id to record it as absent. The
        # existing details may either be the record we are marking as deleted
        # (if there were other trees with the id present at this path), or may
        # be relocations.
        for update_key in all_remaining_keys:
            update_block_index, present = \
                self._find_block_index_from_key(update_key)
            assert present, 'could not find block for %s' % (update_key,)
            update_entry_index, present = \
                self._find_entry_index(update_key, self._dirblocks[update_block_index][1])
            assert present, 'could not find entry for %s' % (update_key,)
            update_tree_details = self._dirblocks[update_block_index][1][update_entry_index][1]
            # it must not be absent at the moment
            assert update_tree_details[0][0] != 'a' # absent
            update_tree_details[0] = DirState.NULL_PARENT_DETAILS
        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
        return last_reference

    def update_minimal(self, key, minikind, executable=False, fingerprint='',
                       packed_stat=None, size=0, path_utf8=None):
        """Update an entry to the state in tree 0.

        This will either create a new entry at 'key' or update an existing one.
        It also makes sure that any other records which might mention this are
        updated as well.

        :param key: (dir, name, file_id) for the new entry
        :param minikind: The type for the entry ('f' == 'file', 'd' ==
            'directory'), etc.
        :param executable: Should the executable bit be set?
        :param fingerprint: Simple fingerprint for new entry.
        :param packed_stat: packed stat value for new entry.
        :param size: Size information for new entry
        :param path_utf8: key[0] + '/' + key[1], just passed in to avoid doing
            extra computation.
        """
        block = self._find_block(key)[1]
        if packed_stat is None:
            packed_stat = DirState.NULLSTAT
        entry_index, present = self._find_entry_index(key, block)
        new_details = (minikind, fingerprint, size, executable, packed_stat)
        id_index = self._get_id_index()
        if not present:
            # new entry, synthesise cross references here,
            existing_keys = id_index.setdefault(key[2], set())
            if not existing_keys:
                # not currently in the state, simplest case
                new_entry = key, [new_details] + self._empty_parent_info()
            else:
                # present at one or more existing other paths.
                # grab one of them and use it to generate parent
                # relocation/absent entries.
                new_entry = key, [new_details]
                for other_key in existing_keys:
                    # change the record at other to be a pointer to this new
                    # record. The loop looks similar to the change to
                    # relocations when updating an existing record but it's not:
                    # the test for existing kinds is different: this can be
                    # factored out to a helper though.
                    other_block_index, present = self._find_block_index_from_key(other_key)
                    assert present, 'could not find block for %s' % (other_key,)
                    other_entry_index, present = self._find_entry_index(other_key,
                                            self._dirblocks[other_block_index][1])
                    assert present, 'could not find entry for %s' % (other_key,)
                    assert path_utf8 is not None
                    self._dirblocks[other_block_index][1][other_entry_index][1][0] = \
                        ('r', path_utf8, 0, False, '')

                num_present_parents = self._num_present_parents()
                for lookup_index in xrange(1, num_present_parents + 1):
                    # grab any one entry, use it to find the right path.
                    # TODO: optimise this to reduce memory use in highly
                    # fragmented situations by reusing the relocation
                    # records.
                    update_block_index, present = \
                        self._find_block_index_from_key(other_key)
                    assert present, 'could not find block for %s' % (other_key,)
                    update_entry_index, present = \
                        self._find_entry_index(other_key, self._dirblocks[update_block_index][1])
                    assert present, 'could not find entry for %s' % (other_key,)
                    update_details = self._dirblocks[update_block_index][1][update_entry_index][1][lookup_index]
                    if update_details[0] in ('r', 'a'): # relocated, absent
                        # it's a pointer or absent in lookup_index's tree, use
                        # it as is.
                        new_entry[1].append(update_details)
                    else:
                        # we have the right key, make a pointer to it.
                        pointer_path = osutils.pathjoin(*other_key[0:2])
                        new_entry[1].append(('r', pointer_path, 0, False, ''))
            block.insert(entry_index, new_entry)
            existing_keys.add(key)
        else:
            # Does the new state matter?
            block[entry_index][1][0] = new_details
            # parents cannot be affected by what we do.
            # other occurrences of this id can be found
            # from the id index.
            # ---
            # tree index consistency: All other paths for this id in this tree
            # index must point to the correct path. We have to loop here because
            # we may have passed entries in the state with this file id already
            # that were absent - where parent entries are - and they need to be
            # converted to relocated.
            assert path_utf8 is not None
            for entry_key in id_index.setdefault(key[2], set()):
                # TODO:PROFILING: It might be faster to just update
                # rather than checking if we need to, and then overwrite
                # the one we are located at.
                if entry_key != key:
                    # this file id is at a different path in one of the
                    # other trees, so put absent pointers there
                    # This is the vertical axis in the matrix, all pointing
                    # to the real path.
                    block_index, present = self._find_block_index_from_key(entry_key)
                    assert present
                    entry_index, present = self._find_entry_index(entry_key, self._dirblocks[block_index][1])
                    assert present
                    self._dirblocks[block_index][1][entry_index][1][0] = \
                        ('r', path_utf8, 0, False, '')
        # add a containing dirblock if needed.
        if new_details[0] == 'd':
            subdir_key = (osutils.pathjoin(*key[0:2]), '', '')
            block_index, present = self._find_block_index_from_key(subdir_key)
            if not present:
                self._dirblocks.insert(block_index, (subdir_key[0], []))

        self._dirblock_state = DirState.IN_MEMORY_MODIFIED

    def _validate(self):
        """Check that invariants on the dirblock are correct.

        This can be useful in debugging; it shouldn't be necessary in
        normal code.
        """
        from pprint import pformat
        if len(self._dirblocks) > 0:
            assert self._dirblocks[0][0] == '', \
                    "dirblocks don't start with root block:\n" + \
                    pformat(self._dirblocks)
        if len(self._dirblocks) > 1:
            assert self._dirblocks[1][0] == '', \
                    "dirblocks missing root directory:\n" + \
                    pformat(self._dirblocks)
        # the dirblocks are sorted by their path components, name, and dir id
        dir_names = [d[0].split('/')
                for d in self._dirblocks[1:]]
        if dir_names != sorted(dir_names):
            raise AssertionError(
                "dir names are not in sorted order:\n" +
                pformat(self._dirblocks) +
                "\nkeys:\n" +
                pformat(dir_names))
        for dirblock in self._dirblocks:
            # within each dirblock, the entries are sorted by filename and
            # then by id.
            assert dirblock[1] == sorted(dirblock[1]), \
                "dirblock for %r is not sorted:\n%s" % \
                (dirblock[0], pformat(dirblock))

    def _wipe_state(self):
        """Forget all state information about the dirstate."""
        self._header_state = DirState.NOT_IN_MEMORY
        self._dirblock_state = DirState.NOT_IN_MEMORY
        self._parents = []
        self._ghosts = []
        self._dirblocks = []
        self._id_index = None
        self._end_of_header = None
        self._cutoff_time = None
        self._split_path_cache = {}

    def lock_read(self):
        """Acquire a read lock on the dirstate"""
        if self._lock_token is not None:
            raise errors.LockContention(self._lock_token)
        # TODO: jam 20070301 Rather than wiping completely, if the blocks are
        #       already in memory, we could read just the header and check for
        #       any modification. If not modified, we can just leave things
        #       alone
        self._lock_token = lock.ReadLock(self._filename)
        self._lock_state = 'r'
        self._state_file = self._lock_token.f
        self._wipe_state()

    def lock_write(self):
        """Acquire a write lock on the dirstate"""
        if self._lock_token is not None:
            raise errors.LockContention(self._lock_token)
        # TODO: jam 20070301 Rather than wiping completely, if the blocks are
        #       already in memory, we could read just the header and check for
        #       any modification. If not modified, we can just leave things
        #       alone
        self._lock_token = lock.WriteLock(self._filename)
        self._lock_state = 'w'
        self._state_file = self._lock_token.f
        self._wipe_state()

    def unlock(self):
        """Drop any locks held on the dirstate"""
        if self._lock_token is None:
            raise errors.LockNotHeld(self)
        # TODO: jam 20070301 Rather than wiping completely, if the blocks are
        #       already in memory, we could read just the header and check for
        #       any modification. If not modified, we can just leave things
        #       alone
        self._state_file = None
        self._lock_state = None
        self._lock_token.unlock()
        self._lock_token = None
        self._split_path_cache = {}

    def _requires_lock(self):
        """Checks that a lock is currently held by someone on the dirstate"""
        if not self._lock_token:
            raise errors.ObjectNotLocked(self)


def bisect_dirblock(dirblocks, dirname, lo=0, hi=None, cache={}):
    """Return the index where to insert dirname into the dirblocks.

    The return value idx is such that all directory blocks in dirblocks[:idx]
    have names < dirname, and all blocks in dirblocks[idx:] have names >=
    dirname. Names are compared as lists of path components (split on '/'),
    not as plain strings.

    Optional args lo (default 0) and hi (default len(dirblocks)) bound the
    slice of dirblocks to be searched. cache maps a directory name to its
    split form; the default dict is shared between calls.
    """
    if hi is None:
        hi = len(dirblocks)
    try:
        dirname_split = cache[dirname]
    except KeyError:
        dirname_split = dirname.split('/')
        cache[dirname] = dirname_split
    while lo < hi:
        mid = (lo+hi)//2
        # Grab the dirname for the current dirblock
        cur = dirblocks[mid][0]
        try:
            cur_split = cache[cur]
        except KeyError:
            cur_split = cur.split('/')
            cache[cur] = cur_split
        if cur_split < dirname_split:
            lo = mid+1
        else:
            hi = mid
    return lo
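
# Illustration only (not part of bzrlib): a standalone sketch of why the
# search compares split path components rather than raw strings. The name
# bisect_by_components and the sample directory names are hypothetical, and
# the sketch omits the lo/hi/cache arguments of bisect_dirblock.

```python
def bisect_by_components(dirblocks, dirname):
    """Leftmost insertion index for dirname, comparing names split on '/'."""
    target = dirname.split('/')
    lo, hi = 0, len(dirblocks)
    while lo < hi:
        mid = (lo + hi) // 2
        if dirblocks[mid][0].split('/') < target:
            lo = mid + 1
        else:
            hi = mid
    return lo


names = ['', 'a', 'a/b', 'a-b', 'b']
# Plain string order puts 'a-b' before 'a/b' because '-' sorts before '/'.
string_order = sorted(names)
# Component order keeps a directory's children next to it.
blocks = [(name, []) for name in sorted(names, key=lambda n: n.split('/'))]
component_order = [name for name, _entries in blocks]

index_existing = bisect_by_components(blocks, 'a-b')
index_new = bisect_by_components(blocks, 'a/c')
```

# Here 'a/c' would be inserted directly after 'a/b' and before 'a-b', which a
# search over plain strings would not guarantee.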


def pack_stat(st, _encode=base64.encodestring, _pack=struct.pack):
    """Convert stat values into a packed representation."""
    # jam 20060614 it isn't really worth removing more entries if we
    # are going to leave it in packed form.
    # With only st_mtime and st_mode filesize is 5.5M and read time is 275ms
    # With all entries filesize is 5.9M and read time is maybe 280ms
    # well within the noise margin

    # base64.encodestring always adds a final newline, so strip it off
    return _encode(_pack('>llllll',
                         st.st_size, int(st.st_mtime), int(st.st_ctime),
                         st.st_dev, st.st_ino, st.st_mode))[:-1]
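
# Illustration only (not part of bzrlib): a standalone round trip of the same
# six-field '>llllll' layout. FakeStat, pack_stat_sketch and
# unpack_stat_sketch are hypothetical names, and base64.b64encode is swapped
# in for encodestring (it adds no trailing newline, so nothing is stripped).

```python
import base64
import struct
from collections import namedtuple

# Stand-in for an os.stat() result with only the fields that get packed.
FakeStat = namedtuple(
    'FakeStat', 'st_size st_mtime st_ctime st_dev st_ino st_mode')


def pack_stat_sketch(st):
    """Pack six stat fields as big-endian 32-bit signed ints, then base64."""
    packed = struct.pack('>llllll',
                         st.st_size, int(st.st_mtime), int(st.st_ctime),
                         st.st_dev, st.st_ino, st.st_mode)
    return base64.b64encode(packed)


def unpack_stat_sketch(text):
    """Inverse of pack_stat_sketch; the fractional times are not recovered."""
    return struct.unpack('>llllll', base64.b64decode(text))


st = FakeStat(1024, 1172700000.5, 1172700001.9, 2049, 131073, 0o100644)
packed = pack_stat_sketch(st)
fields = unpack_stat_sketch(packed)
```

# Six 4-byte fields give 24 bytes, which base64 encodes to 32 characters. A
# value outside the signed 32-bit range makes struct.pack raise struct.error.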