[cvsnt] Re: check CVS repository integrity

Michael Wojcik Michael.Wojcik at microfocus.com
Fri May 19 19:57:27 BST 2006


> From: cvsnt-bounces at cvsnt.org 
> [mailto:cvsnt-bounces at cvsnt.org] On Behalf Of Tony Hoyle
> Sent: Friday, 19 May, 2006 13:25
> 
> Michael Wojcik wrote:
> > I believe that's highly dubious.  CVS has to rewrite the whole file
> > for most operations, since the RCS file format is plain-text (and
> 
> Actually for the most part it's simply a fast copy routine.  
> It's designed to be *very* fast.

It still has to do just as much disk I/O.

> At no time is the entire file or anything resembling it in 
> memory.

Nor would it need to be in order to checksum it.

Look, however the copy happens, it's going to map every page of the new
file at some point.  Doesn't matter whether you're using private-buffer
I/O with kernel copies (conventional read(2)/write(2) I/O), or
memory-mapping and copying.  Running a cumulative checksum over those
pages will take very little additional time; you're already pulling them
into cache.  It's not like CVS can DMA from one file to another.

> Checksumming the individual revisions won't work due to keyword
> expansion

I wasn't suggesting that.  I'd run a checksum over the whole file, then
append the result as another RCS section.
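To be concrete about what I mean by "another RCS section": the RCS file
grammar has an extension mechanism ("newphrase" lines of the form
id word* ;), so the stored result could look something like the line
below.  The keyword name, value format, and placement are all invented
here, purely to illustrate the idea:

    checksum    @crc32:1a2b3c4d@;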

> The only way you could checksum would be to do it to binary revisions,
> and even then you'd need a size threshold - the calculation is very
> CPU intensive compared to everything else.

Sorry, Tony, I don't believe that.  Calculating any reasonable checksum
is going to have negligible CPU costs, particularly compared to the disk
and network I/O time.

If a checksum takes noticeable time, that's an algorithm problem.  On
standard hardware, CVSNT should have no trouble sustaining checksum
throughput on the order of tens of megabytes per second (or better).
The cost wouldn't be noticeable.
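A throwaway benchmark would show this quickly.  Something along these
lines - my own code, nothing to do with CVSNT's internals, again using
zlib's crc32() as the example algorithm - prints the checksum rate for
an in-memory buffer:

/* Rough micro-benchmark sketch: time crc32() over a 64 MB buffer. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <zlib.h>

int main(void)
{
    const size_t size = 64 * 1024 * 1024;
    unsigned char *buf = malloc(size);
    unsigned long crc = crc32(0L, Z_NULL, 0);
    clock_t start, end;
    double secs;

    if (buf == NULL)
        return 1;
    memset(buf, 0xA5, size);            /* arbitrary contents */

    start = clock();
    crc = crc32(crc, buf, (uInt)size);
    end = clock();

    secs = (double)(end - start) / CLOCKS_PER_SEC;
    printf("crc32 = %08lx, %.1f MB/s\n", crc,
           (size / (1024.0 * 1024.0)) / (secs > 0.0 ? secs : 1e-9));
    free(buf);
    return 0;
}

If the number that comes out is anywhere near disk or network speeds,
the "very CPU intensive" claim needs revisiting.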

I'm tempted now to implement it just to prove the point.

> Really it's not worth it.  The only thing that could corrupt an RCS
> file is actual hardware failure - and then you routinely recover from
> backups anyway..

*If* you detect it.  A checksum is cheap insurance.  Also, recovering
the entire repository from backup is potentially more intrusive than
recovering only those files that have been corrupted.

> I wouldn't trust a file stored on such a device checksummed or not.

You have to trust something at some point.  My inclination would be to
let CVS administrators decide on their own threat models.

Now, it's perfectly reasonable for March Hare to say "checksums are not
a priority for us", whether that's because you don't believe they're
very useful or because there isn't significant demand; I have no
argument with that.  But I'm very suspicious of performance arguments
without data to back them up.

-- 
Michael Wojcik
Principal Software Systems Developer, Micro Focus


