Short file checksum/hash?

Posted: Thu Aug 20, 2015 1:28 pm
by Spirit
I am increasingly frustrated with Quaddicted and am thinking about redoing the whole file archive. Problem is that there are files that fit in many categories (eg SP maps with DM settings and vice-versa) and files that have colliding names. So one cannot split everything into categories nor put everything into one directory.

My solution would be to (brace yourselves) put every file into its own directory which would be named by a hash or checksum of the file, uniquely identifying it. I do not want to simply use a counter. It's <100k files I think, no idea what kind of collision-free "space" would be good.

Is there a hash or checksum that is short and appropriate for this? It would need to be URL compatible without escaping. I would be fine with sacrificing compatibility with case-insensitive file systems though. 6 to 8 characters would rock.
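
One way to get a short, URL-compatible name is to take a longer digest and keep only the first few characters of its URL-safe base64 form. The sketch below assumes SHA-1 as the underlying hash and uses a hypothetical helper name, short_id; neither comes from the thread. Each base64 character carries 6 bits, so 8 characters give 48 bits. The alphabet is case-sensitive, which matches the stated willingness to give up case-insensitive file systems.

```python
import base64
import hashlib


def short_id(data: bytes, length: int = 8) -> str:
    """Return the first `length` characters of the URL-safe base64 SHA-1 digest.

    `short_id` is a hypothetical helper for illustration only.
    """
    digest = hashlib.sha1(data).digest()
    # urlsafe_b64encode only emits A-Z, a-z, 0-9, '-' and '_' (plus '=' padding
    # at the very end, which the slice below never reaches for length <= 27).
    encoded = base64.urlsafe_b64encode(digest).decode("ascii")
    return encoded[:length]


if __name__ == "__main__":
    print(short_id(b"example file contents"))
```

One caveat with this alphabet: a name may start with '-', which some command-line tools will mistake for an option.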

Re: Short file checksum/hash?

Posted: Thu Aug 20, 2015 3:29 pm
by Spike
interesting post considering that's your 999th post (and I'm a brit)...

at the end of the day, you need to accept that collisions are going to happen eventually.
with that in mind, the hash you use doesn't really need to be all that long (if they're going to happen, you might as well use a weak hash so you can test it).
(if your hash is for security then there's probably better ways to do it - ones that support multiple different hashes).

each file in its own directory sounds a bit excessive to me, but then I'm thinking about windows and its inability to store too many items in a single directory.
if all else fails, you could just take something like sha1 and fold the bits over each other with xor, resulting in a 32bit / 8-char hash. quakeworld had a habit of doing that with md4 hashes.
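
A minimal sketch of the folding idea, assuming the 20-byte SHA-1 digest is split into five big-endian 32-bit words that are XORed together. The function name folded_sha1 is made up for illustration, and this is not necessarily the exact routine QuakeWorld used for MD4.

```python
import hashlib


def folded_sha1(data: bytes) -> str:
    """Fold a SHA-1 digest down to 32 bits by XORing its 32-bit words.

    Hypothetical helper; the word size and byte order are assumptions.
    """
    digest = hashlib.sha1(data).digest()  # 20 bytes = five 32-bit words
    folded = 0
    for i in range(0, len(digest), 4):
        folded ^= int.from_bytes(digest[i:i + 4], "big")
    # Zero-pad so the result is always exactly 8 hex characters.
    return f"{folded:08x}"


if __name__ == "__main__":
    print(folded_sha1(b"example file contents"))
```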

if you just want some weak hash that is present in every quake engine, CRC-16-CCITT is your friend, which should give a nice short 16bit 4-char hash.
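
For reference, a bitwise sketch of a CRC-16-CCITT computation. It assumes the common variant with polynomial 0x1021, initial value 0xFFFF and no final XOR; several variants go by the CCITT name, so check the parameters against the engine source before relying on it. The function names are illustrative.

```python
def crc16_ccitt(data: bytes, crc: int = 0xFFFF) -> int:
    """Bitwise CRC-16 with polynomial 0x1021, MSB first.

    The initial value 0xFFFF is an assumption about the variant in use.
    """
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def crc16_hex(data: bytes) -> str:
    """Format the CRC as exactly 4 hex characters."""
    return f"{crc16_ccitt(data):04x}"


if __name__ == "__main__":
    print(crc16_hex(b"example file contents"))
```

With only 65,536 possible values, an archive of this size is guaranteed to have many files sharing a CRC, so this is only useful where collisions are expected and handled.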

frankly, the more important thing is how you're going to get the quake injector to cope with all of this, you're gonna force everyone to update. :(
one of these days I may get the motivation to make an in-engine version of the quake injector...

Re: Short file checksum/hash?

Posted: Thu Aug 20, 2015 10:10 pm
by frag.machine
Just curious: any reason for not using the simple counter approach?
If you are using some sort of database to keep the aggregate metadata (like author name, reviews, screenshots, etc) it's very likely you already have an integer primary key tied to the file.

Re: Short file checksum/hash?

Posted: Sat Aug 22, 2015 6:32 pm
by Spirit
If using the hash, you don't need a secondary lookup. The hash would be the unique identifier of a file anyways. If it was a counter, whatever wants information about the file would need to do several steps for finding out what it is.

I like to dream big (and complicated). Imagine telling an engine "install this.zip". It could calculate the hash and look up the instructions (Quaddicted or locally). If it used anything else, there would be more steps involved.
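
A rough sketch of that lookup, under several assumptions: the ID is the first 8 hex characters of the file's SHA-1, and the instructions live in a plain dictionary keyed by that ID. The names file_id, lookup_instructions and the index layout are all hypothetical; no such Quaddicted API is described in this thread.

```python
import hashlib
from pathlib import Path
from typing import Dict, Optional


def file_id(path: Path, length: int = 8) -> str:
    """Hash a file in chunks and return the first `length` hex characters."""
    sha1 = hashlib.sha1()
    with path.open("rb") as f:
        # Read in 64 KiB chunks so large archives are not loaded into memory.
        for chunk in iter(lambda: f.read(65536), b""):
            sha1.update(chunk)
    return sha1.hexdigest()[:length]


def lookup_instructions(path: Path, index: Dict[str, dict]) -> Optional[dict]:
    """Return the install instructions for `path`, or None if it is unknown.

    `index` is a hypothetical local mapping from file ID to metadata.
    """
    return index.get(file_id(path))
```

The point of the design is that the file itself is the key: nothing has to translate a filename or counter into an ID first.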

CRC-16-CCITT is too small. I think I will go for 8 characters so the sha1sum idea sounds good. Is there any benefit from mangling with the bits instead of just using the first 8 characters though?

Quake Injector will get the same data as before, it would require a lot more changes for more content if I actually do what I plan... I would also redirect/keep the current file URLs up.

Re: Short file checksum/hash?

Posted: Sun Aug 23, 2015 1:45 am
by frag.machine
Well, by definition a hash isn't unique. You may reduce the chance of a collision by using larger values, but there's always a small chance of one. OTOH it works well enough for BitTorrent, so it may be worth checking their solution.
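
To put rough numbers on that, here is a sketch of the usual birthday-bound approximation. It assumes the hash output behaves like a uniformly random value; the function name is illustrative, and the 100,000 figure is the upper estimate given earlier in the thread.

```python
import math


def collision_probability(n_files: int, bits: int) -> float:
    """Approximate chance that at least two of n_files share a `bits`-bit hash.

    Uses the birthday approximation 1 - exp(-n(n-1) / (2 * 2**bits)),
    assuming hash values are uniformly distributed.
    """
    space = 2 ** bits
    exponent = -n_files * (n_files - 1) / (2 * space)
    # expm1 keeps precision when the probability is very small.
    return -math.expm1(exponent)


if __name__ == "__main__":
    for bits in (16, 32, 48, 64):
        print(bits, collision_probability(100_000, bits))
```

Under this approximation, 100,000 files in a 32-bit space (8 hex characters) have roughly a 69% chance of at least one collision, while 48 bits (8 base64 characters) brings it down to around 2 in 100,000.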