Re: g_str_hash




Tim Janik <timj@gtk.org> writes:

> On 11 Feb 2000, Havoc Pennington wrote:
> 
> > Of course this function can just be:
> > 
> > {
> >  const gchar* s = (char*)v;
> >  guint result = 0;
> >  while (*s)
> >   {
> >     result += (result << 3) + *s;
> >     ++s;
> >   }
> >  return result;
> > }
> > 
> > And it's even microscopically but measurably faster that way. Don't
> > ask me what's up with the structure of the original version's loop.
> 
> the tcl hash function is not so good as a general purpose string hash
> function in that it multiplies the hash value by 9 instead of an odd
> number marginally bigger than 26, e.g. 33 like the perl one does.
> i discussed this stuff to some extend with owen back in 1998, basically
> he was arguing the perl function to be faster because it onlly did
> a shift and two additions, while i was claiming that you get less lossage
> with the ASU one (which is in glib) because the higher bits are constantly
> fed back into the lower bits of the hash.
> 
> it of course depends on what kind of string you want to hash, for
> strings where you mostly care about the tails, you're certainly willing
> to trade some of the higher bits and thus allow hash collisions for
> strings that only differ in the beginning characters.
> 
> for a general purpose string hasher, i'm not so sure, maybe we should
> just leave the decision to the hash table user by providing:
> 
> guint
> g_str_hash (gconstpointer string)
> {
>   register const gchar *p = string;
>   register guint h = *p;
> 
>   for (p = string + 1; *p; p++)
>     h += (h << 4) + *p;
> 
>   return h;
> }
> 
> and for high-bit feedback:
> 
> guint
> g_str_hash_long (gconstpointer string)
> {
>   register const gchar *p = string;
>   register guint h = *p;
> 
>   for (p = string + 1; *p; p++)
>     {
>       register guint g;
> 
>       h = (h << 4) + *p;
>       g = h & 0xf0000000;
>       if (g)
>         h ^= g >> 25;
>     }
> 
>   return h;
> }
> 
> which will give hash speed ups for all the places where
> g_str_hash() is currently used, and hopefully also improve
> hashtable performance, based on the assumption that for
> most cases it'll be the string tail that matters.

I'm not sure this is a good idea: one point of glib is that people can
use it without having to understand things to the level of detail
of having to understand tradeoffs between different hash functions.
 
Note that the Tcl version does _not_ discard the low bits
of the string, since it is h += (h << 4). 

The perl version effectively uses '<< 5' instead of '<< 4',
and also differs by having 

 return (h >> 5) + h;

instead of 'return h'; something I don't completely understand.

It would take a bit of experimentation to see how the Tcl/Perl
versions perform on long strings that differ only in their tail, but I
suspect that

 a) Inserting such strings into a hash table is rare (and
    slow anyways)
 b) They'll probably do OK anyways

Regards,
                                        Owen



[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]