Probing Hash Tables?

David Rusenko asks: "I've been taking a data structures class at CMU as part of a summer CS program. One of the structures we have gone over is hash tables. After going through many different probing methods (linear, quadratic), multiple hash functions, and double hashing, I was all too curious to know whether these are the best methods currently known. Some other interesting ideas came up, such as using the Fibonacci numbers for probing, but I haven't had time to test them yet. Any comments?"
  • Actually, I don't think I've ever seen a real world library with a hash table that uses any sort of probing. Most "generic" hash tables use buckets, AFAIK.

    Probing has the problem that eventually, you run out of space, whereas with buckets you don't (although performance will degrade as the average bucket size increases). Of course, a decent implementation will resize the hash table, so this point isn't that important.

    I have seen probing used in some special case hash tables, in which case I think it was only linear or quadratic, nothing special (and the reason I saw this hash table was because I was fixing a bug where the hash table eventually overflowed, and wasn't getting resized).

    I would be very interested to hear whether other people have ever used a probing hash table, and why.
    • I've always been fond of using buckets instead of probing, and using Red Black trees for the buckets.
    • Higher performance applications call for probing. It's a little harder to program, though. One of my CS professors worked on speech recognition for IBM, and they used hash tables for n-gram language modeling, probed using double hashing, I believe. The largest problem with probing is removal, but their program didn't actually remove anything from the table.
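
      For reference, a minimal sketch of the three probe sequences being discussed, for a table of m slots; h1 and h2 stand for the key's two hash values, i is the attempt number, and the sample values in main() are made up:

      #include <cstddef>
      #include <cstdio>

      // Slot to try on attempt i, starting from the key's primary hash h1.
      static std::size_t linear_probe(std::size_t h1, std::size_t i, std::size_t m) {
          return (h1 + i) % m;
      }
      static std::size_t quadratic_probe(std::size_t h1, std::size_t i, std::size_t m) {
          return (h1 + i * i) % m;
      }
      // Double hashing: the step comes from a second hash and is kept in 1..m-1.
      static std::size_t double_probe(std::size_t h1, std::size_t h2, std::size_t i, std::size_t m) {
          return (h1 + i * (1 + h2 % (m - 1))) % m;
      }

      int main() {
          const std::size_t m = 13, h1 = 4, h2 = 9;
          for (std::size_t i = 0; i < 5; ++i)
              std::printf("attempt %zu -> linear %zu, quadratic %zu, double %zu\n",
                          i, linear_probe(h1, i, m), quadratic_probe(h1, i, m),
                          double_probe(h1, h2, i, m));
          return 0;
      }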
  • Read the Bible (Score:2, Informative)

    by Woodblock ( 22276 )
    Do yourself a favour and read the bible [amazon.com] on it. The information might be a bit dated, but I doubt hashing has changed that much since the last edition.
    • Re:Read the Bible (Score:4, Interesting)

      by Tom7 ( 102298 ) on Monday July 29, 2002 @12:58AM (#3970016) Homepage Journal
      Yes, it certainly has changed... real-world considerations are much different from the memory-poor days when Knuth invented MIX and all of his insane bit-twiddling techniques. The theoretical analysis of basically any sort of sensible hashing will wind up resulting in constant-time access... so it's really a question of addressing the constant factors on modern machines (as well as the ease of implementing these different schemes!). Knuth is quite outdated in that respect.
      • And in some cases it's still relevant, e.g. constrained embedded systems, low-cost devices, etc.

        As for MIX being outdated, it's being set aside for an updated version (MMIX).

        I don't think AoCP will ever truly become outdated until Knuth dies and isn't around any more to rewrite it.

        But even then, algorithms don't change; perhaps their relevance changes, but there's still value in learning the landscape.
  • What about a linked list with an array?

    What about another hash table that holds the conflicts (kind of like double hash functions but cooler)?
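
    In case it helps, here is a minimal separate-chaining sketch along the lines of the first suggestion: an array of buckets, each a linked list of the keys that hashed there. The class name and bucket count are made up, std::hash stands in for whatever hash function you would actually use, and resizing is omitted:

    #include <cstddef>
    #include <functional>
    #include <list>
    #include <string>
    #include <vector>

    class ChainedSet {
    public:
        explicit ChainedSet(std::size_t buckets = 101) : buckets_(buckets) {}

        void insert(const std::string &key) {
            std::list<std::string> &b = bucket_for(key);
            for (const std::string &k : b)
                if (k == key) return;        // already present
            b.push_back(key);                // a collision just grows the chain
        }
        bool contains(const std::string &key) {
            for (const std::string &k : bucket_for(key))
                if (k == key) return true;
            return false;
        }

    private:
        std::list<std::string> &bucket_for(const std::string &key) {
            return buckets_[std::hash<std::string>()(key) % buckets_.size()];
        }
        std::vector<std::list<std::string> > buckets_;
    };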
  • Fibonacci. (Score:3, Informative)

    by OnyxRaven ( 9906 ) on Monday July 29, 2002 @12:57AM (#3970010) Homepage
    As I dig out my Algorithms textbook, I see mention that Fibonacci heaps give fast running times in Dijkstra's and Prim's algorithms, for single-source shortest paths and minimum spanning trees respectively: O(m + n log n) in both cases. This is very fast for dense graphs.

    Shrug.
  • I like open addressing because it's simple, and double hashing seems to be good for probing.

    I have to store strings in a hashtable; to do so I separate the data from the hashtable.
    I create a string stack which stores the actual data, and a hashtable that contains the entry number in the stack.

    e.g.
    stack \0one\0two\0three\0
    ht 1, 0, 1, 1

    Note that in the stack the 0th entry is equivalent to a NULL and is where unassigned ht entries refer to; that's how you know the ht slot is empty.

    By separating the content from the ht it's easier to resize the hashtable: you simply create a new hashtable and reinsert the contents referred to by the old hashtable.

    I only need 2 pointers!

    I have code if you want a look.
    • Unfortunately, by reinventing the wheel, all you've succeeded in doing is confusing the hell out of anyone who has to maintain your code in the future.

      I hope you commented real-world uses of this profusely, otherwise you may start receiving suspicious packages in the mail from those who take over for you some day.

      - A.P.
      • >Unfortunately, by reinventing the wheel, all you've succeeded in doing is confusing the hell out of anyone who has to maintain your code in the future.

        Last I checked, this was one of the benefits of being a coder. I have been known to document my code with DFWI-IM (let's just say that -IM stands for "it's magic").

        As if you would ever code something in a recursive manner because 'that was the best way.' BAH! You write recursive code to see who on your support team are software engineers, and who are the 'self-taught wannabes'.

        Glonoinha
    • Maybe I'm missing something fundamental here, but it looks like what you've created isn't really a hash table, but an array with a poorly designed external index indicating which portions of the array contain data.

      Additionally, it would seem to me that obtaining the Nth element of this "hash" would be an O(N) operation, which provides, let me calculate... yes, zero gain over using a standard array.

      The only real advantage I see is a short-term increase in job security, due to the likelihood that people will look at that, say 'What the fuck is he doing this for?', and think you're far more clever than you are.

      • I guess it was futile trying to give an example.

        The "index" is the hashtable; the hashtable points to the real data in the stack.

        The reasons I separate the data from the hashtable are:
        1) Easy to resize
        2) Memory efficient, not using heaps of pointers.
        3) Multiple hashtables can refer to the same data.

        Data is inserted using double-hashing open addressing; it is a real hashtable. (A rough sketch of this layout is below.)

        Good code doesn't need comments; it speaks for itself!
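
        Since code was offered, here is a rough independent sketch of the layout being described - a packed string "stack" plus a table of offsets into it, with offset 0 (the leading NUL) meaning "empty slot". The names are invented, resizing is left out, and the table size is assumed to stay prime so the double-hash step can reach every slot:

        #include <cstddef>
        #include <cstring>
        #include <string>
        #include <vector>

        class OffsetHashSet {
        public:
            explicit OffsetHashSet(std::size_t buckets = 127)   // keep this prime
                : table_(buckets, 0), arena_(1, '\0') {}        // offset 0 = "empty" sentinel

            bool insert(const std::string &s) {
                std::size_t i = find_slot(s);
                if (table_[i] != 0) return false;               // already stored
                table_[i] = arena_.size();                      // where the new string starts
                arena_.insert(arena_.end(), s.begin(), s.end());
                arena_.push_back('\0');
                return true;
            }
            bool contains(const std::string &s) const { return table_[find_slot(s)] != 0; }

        private:
            // Double hashing: step in 1..m-1 from a second hash. No resizing here,
            // so the table must never be allowed to fill up completely.
            std::size_t find_slot(const std::string &s) const {
                const std::size_t m = table_.size();
                std::size_t i = hash1(s) % m;
                const std::size_t step = 1 + hash2(s) % (m - 1);
                while (table_[i] != 0 && std::strcmp(&arena_[table_[i]], s.c_str()) != 0)
                    i = (i + step) % m;
                return i;                                       // empty slot, or the matching entry
            }
            static std::size_t hash1(const std::string &s) {
                std::size_t h = 0;
                for (std::size_t k = 0; k < s.size(); ++k) h = 31 * h + (unsigned char)s[k];
                return h;
            }
            static std::size_t hash2(const std::string &s) {
                std::size_t h = 0;
                for (std::size_t k = 0; k < s.size(); ++k) h = 131 * h + (unsigned char)s[k];
                return h;
            }

            std::vector<std::size_t> table_;   // the hashtable: offsets into arena_, 0 = empty
            std::vector<char> arena_;          // the "stack": '\0' "one" '\0' "two" '\0' ...
        };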
  • Who cares? (Score:5, Insightful)

    by Screaming Lunatic ( 526975 ) on Monday July 29, 2002 @02:02AM (#3970156) Homepage
    It's a table. Which means it'll be faster than a list. That's all you have to worry about.

    What's that saying about a little bit of knowledge being a dangerous thing? In school you'll beat every algorithm under the sun to death. In the real world you'll link to the STL and use a hash map.

    Write well-designed, clean, maintainable code. Then profile it. If your table lookup blips on the profiler, then *THINK* about optimizing it. After you've *THOUGHT* about optimizing it, then decide if it makes sense to squeeze the time and effort into the schedule.

    Random quotes:

    Premature optimization is the root of all evil. -- Knuth

    There is never a best solution, only tradeoffs to consider. --Eberly

    The best optimizer is between your ears. --Abrash

    The Ferraris are only gravy, honest! --Carmack

    All right, I threw the last one in for shits and giggles. Don't blame me, I can't get through a Sunday night without drinking a poop load of beer. Anyway, the point is that there is a difference between *FAST* and *FAST ENOUGH*. If it's fast enough, who cares?

    [CYNICISM]

    Unless you're an academic and you're looking for funding.

    [/CYNICISM]

    • The hash map in STL takes a hashing function, so that doesn't let you off the hook...

      Anyway, one of the more useful hash maps I use has strings for the key. To calculate the hash of the string, I use this functor:

      // A simple multiplicative string hash: mixes each character in as h = 5*h + c.
      struct stringhash {
          size_t operator() (const string &str) const {
              const char *s = str.c_str();
              unsigned long h = 0;
              for ( ; *s; ++s)
                  h = 5*h + *s;      // scale the running hash and add the next character
              return size_t(h);
          }
      };
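
      For what it's worth, a quick usage sketch: the standard library's unordered_map is used here purely for illustration (an SGI-style hash_map takes the functor the same way, as its third template parameter), and the stringhash struct above is assumed to be in scope along with a using-declaration for string:

      #include <iostream>
      #include <string>
      #include <unordered_map>

      int main() {
          std::unordered_map<std::string, int, stringhash> wordcount;
          ++wordcount["lear"];
          ++wordcount["lear"];
          std::cout << wordcount["lear"] << "\n";   // prints 2
          return 0;
      }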

      Is there a more efficient way? My strings are usually short, but suppose I want to run this function on the full text of "King Lear" or something...

      • I wouldn't worry about King Lear unless you really think it's possible that it would occur in practice. Making your hash function "robust", in that it deals with ridiculously large keys efficiently, will slow down everything else. On the other hand, the situation is at least better in C++ than in C, because in C++ std::string remembers its size, so it doesn't have to be recalculated. You could test the length of a string and use that to decide between two possible hash functions. Whether it is worth the trouble only profiling will show (I doubt it, unless your input is somewhat strange).

        On the other hand, I wonder where you get the 5 from. This 5 means you need 11 characters to fill all of h. Are your keys really that long on average? Maybe you would get a better hash distribution if you increased this number somewhat (e.g. 26, as that's the number of letters in the alphabet; it will still give you 5 or 6 "significant" characters). Or you could simplify it by using 32, since that implies a bit shift, although that doesn't make much difference on today's pipelined computers. Again, only profiling will show whether your hash function can be improved or not. (A shift-based variant is sketched below.)

        • I actually don't remember where I got the hashing algorithm from. Probably one of the books on my shelf. Maybe one day I'll sit down and take a really good look at it.
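
        To make the bit-shift idea above concrete, the multiply-by-32 variant would look something like this (whether it actually distributes better is, as noted, something only profiling will show):

        #include <cstddef>
        #include <string>

        struct stringhash32 {
            std::size_t operator()(const std::string &str) const {
                unsigned long h = 0;
                for (const char *s = str.c_str(); *s; ++s)
                    h = (h << 5) + static_cast<unsigned char>(*s);   // h = 32*h + *s
                return static_cast<std::size_t>(h);
            }
        };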
    • You forgot the two rules of optimization:
      1) Don't do it.
      2) (For experts only) Don't do it yet.
    • It's a table. Which means it'll be faster than a list.
      Sequentially searching through a hash table is slower than sequentially searching through a list. Similarly, finding a given element in a 10-element list is much faster than finding an element in a 10-element hash table. The point being: hash tables are (much) faster for certain types of access patterns, but not so for others.
      Write well-designed, clean, maintainable code.
      How is choosing an effective hash algorithm not "writing well-designed, clean, maintainable" code? I would include choosing good algorithms and data structures as part of "good design" -- and an application that uses a single, efficient hash function throughout is much more maintainable than one that uses a myriad of hash functions chosen at random throughout the code.
  • by splorf ( 569185 ) on Monday July 29, 2002 @04:38AM (#3970412)
    Even Knuth vol. 3 doesn't go into that much detail about probing strategy. If you're getting collisions, make the table bigger.

    That said, the most thorough treatment I remember about probing was from David Gries' book Compiler Construction for Digital Computers, published in 1971 when memory wasn't so cheap. It's probably long out of print by now, because that stuff isn't really important any more.

    Similarly, a lot of the stuff in Knuth vol. 3 is about sorting data on magtape, which was important 20 years ago but nobody cares about now. In the introduction to the second edition, Knuth says he left the material in just because it's mathematically beautiful and because tape-like media may make a comeback, but it's possible that he'll remove it from future editions.

    • Correction (Score:3, Insightful)

      by splorf ( 569185 )
      I'd better add, before someone makes a tragic mistake: enlarge the table only if you're getting enough collisions to actually slow your program down enough to matter. If you have a lot of entries you'll get a few collisions even with an enormous table that's mostly empty, because of the birthday paradox. Don't worry about it until you're getting collisions on a significant fraction of your lookups.
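
      To put a rough number on the birthday-paradox point (the slot and key counts here are arbitrary): the expected number of occupied slots after inserting n uniformly hashed keys into m slots is m*(1 - (1 - 1/m)^n), so the expected number of colliding inserts is about n minus that - nonzero even at a tiny load factor.

      #include <cmath>
      #include <cstdio>

      int main() {
          const double m = 1048576;   // about a million slots, mostly empty
          const double n = 50000;     // 50k keys, under 5% load
          const double occupied = m * (1.0 - std::pow(1.0 - 1.0 / m, n));
          std::printf("expected colliding inserts: about %.0f of %.0f\n", n - occupied, n);
          return 0;
      }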
  • Remain simple... (Score:3, Insightful)

    by joto ( 134244 ) on Monday July 29, 2002 @04:40AM (#3970417)
    Remember that every complication to the algorithm makes it slower. When you move from linear to quadratic probing, you lose memory locality (think cache). When you move from quadratic probing to double hashing you do another function call (or at least some calculation). A Fibonacci scheme would be similar to quadratic probing, except that it requires one more addition; I can't immediately see any benefit to that, but maybe I'm not thinking hard enough about it...

    In the worst case, double hashing is better than quadratic probing, and quadratic probing is better than linear probing. But the trick for finding the "right" method is to ensure that the worst case never happens. Then you can avoid expensive advanced stuff, and keep it simple and lean.

    If you don't know much about your data, or your keys, or your insertion/removal pattern, then a separate chaining scheme might be best. If you know everything, then you can use a perfect hash function generator (but even that doesn't necessarily guarantee the best performance). The reason they teach you several different methods is that there is no single "best" answer.

    • Another thing to worry about beyond memory cache efficiency is TLB efficiency. If your array is allocated in a single page of virtual memory it could have a smaller load on the TLB than nodes in a chained hash table that are spread over many pages.

      In practice don't worry unless profiling shows the cost is significant.

  • Don't let collisions happen in the first place - that's much more important than what kind of probing you do after a collision has happened. Hence the distribution of the first hash function is most critical - and it has to be checked against the real input distribution.

    Second best thing is to keep your data together (to have it in the CPU cache when it is needed).

    Therefore I would go for primitive sequential collision algorithms - within memory buckets of the size (and at the boundary) of the CPU cache, with a few spare slots at the end of each bucket for collisions. Simple coding, fast access.

    You see, everybody has his own taste.
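
    One cheap way to do the "check against real input distribution" step is to feed a representative sample of keys through the first hash and histogram the buckets; std::hash and the bucket count below are just stand-ins for whatever you actually use:

    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <iostream>
    #include <string>
    #include <vector>

    int main() {
        const std::size_t buckets = 1024;
        std::vector<std::size_t> histogram(buckets, 0);
        std::hash<std::string> h;               // stand-in for the real first hash
        std::string key;
        while (std::getline(std::cin, key))     // one representative key per line
            ++histogram[h(key) % buckets];
        std::size_t worst = 0;
        for (std::size_t i = 0; i < buckets; ++i)
            worst = std::max(worst, histogram[i]);
        std::cout << "worst bucket holds " << worst << " keys\n";
        return 0;
    }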
  • Probing Jennifer's hash table was a smooth and efficient process, as we together established a very efficient insertion/removal pattern.

    Ok, am I the only person who saw this article's title as being a bit sexually suggestive?

    Ugh. I must have spent far too much time on the computer again this weekend... :)

  • CAM table (Score:2, Informative)

    by frantzen ( 137260 )
    CAM tables, if you're gluing together your own hardware: kind of like a hash table in hardware that never gets collisions until it is full.

    The insertion/deletion maintenance isn't free, so you've got to be careful with your search/modify ratio.
  • It's been a little while since I last really looked into this, but these are the basic principles of dealing with a hash table; there are some factors (such as table size, particularly the primality of the table size) that will defeat your probing strategies unless you are careful to avoid them:

    1) The size of the table should be a sufficiently large prime number. Why? Well, this digs into number theory, but the basic idea is that the set of integers modulo p for some prime p is a field and behaves well --- for one thing, multiplying by any nonzero number is equivalent to permuting the ordered list {1, 2, ..., p-1}, and these permutations are guaranteed to be unique for each congruence class. This also allows you to take advantage of things called generators when you design your hashing functions (this borrows an idea from crypto). It also avoids some nasty problems, for example: you have a hash table of size 2n, where every even index is filled, and are now trying to insert a string for which H(x) is 2. Think about it --- a lot of collision strategies won't handle this. (A tiny demo of the coprimality issue is below.)

    2) Your hashing function should give a uniform distribution over the length of your table; i.e. each number {0, 1, ..., p-1} should be equally likely regardless of the input. This will be the trickier part. You will want to play with this function on representative input to make sure that it behaves as expected.

    3) Make sure a hashtable is what you're wanting. There are very nice data structures called prefix lists and tries which let you do insertions, lookups, and deletions in time that is linear in the *length* of the string, *not* the number of strings. If speed is an absolute must, you may be interested in exploring this structure as an alternative.

    This said, remember: perfection is the enemy of good enough, and it is good enough that sells. Don't put any more effort into this than you expect to see reasonable results from.
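
    Here is the tiny demo promised in (1), with arbitrary sizes and step: with an even table size and an even probe step, the probe sequence can only ever reach a fraction of the slots, while a prime table size lets any nonzero step cycle through all of them.

    #include <cstddef>
    #include <iostream>
    #include <set>

    // How many distinct slots a fixed stride visits in a table of the given size.
    static std::size_t slots_reachable(std::size_t table_size, std::size_t step) {
        std::set<std::size_t> seen;
        std::size_t pos = 0;
        for (std::size_t i = 0; i < table_size; ++i) {
            seen.insert(pos);
            pos = (pos + step) % table_size;
        }
        return seen.size();
    }

    int main() {
        std::cout << "size 16, step 6: " << slots_reachable(16, 6) << " slots reachable\n";  // 8
        std::cout << "size 17, step 6: " << slots_reachable(17, 6) << " slots reachable\n";  // 17
        return 0;
    }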
  • It's all in the fish (Score:3, Interesting)

    by MrIcee ( 550834 ) on Monday July 29, 2002 @01:37PM (#3972683) Homepage
    It is difficult to ask the question you are asking... because efficient hashing depends totally on the TYPE of data you are hashing. In general, TYPE comes down to issues such as length of data... and what the data can be composed of.

    For example, if your data is composed primarily of uppercase sequences of A T G C (e.g., genetic code), you would tend to have long elements of highly repetitive letters.

    If, on the other hand, you were memorizing, say, a binary image for uniqueness (such as in a virus scanner) you would have large files with binary data.

    Thus... each data type possible can be *tuned* to be more efficient when hashed, based on what you know of the incoming data. To ask for a *generic* algorithm that works well on all data will automatically result in less efficiency.

    The real secret to good hashing is to allow two things: first, allow the algorithm (usually hash length and table size) to be modifiable by the code. Second, allow simple statistics to be kept on collision rate and maximum child length - these two statistics can be calculated very quickly at the key-add phase of the hashing, and only add a few instructions to the process.

    Now... throw a good deal of data at your hash table - all sorts of data that represent the type you EXPECT to get. Tune the hash table algorithm (using the exposed algorithm hooks) until your statistics show a collision count of between 2 and 5 for all positions in the table (in general our hashers, and we use LOTS of hashers, rarely go beyond 3 to 5 children).

    The tradeoff is obviously memory (table size) versus efficiency of the hash algorithm.

    One of my favorite hash algorithms for trivial hashing (say, hashing a label or variable name) is simply incremental add-and-XOR. Very quick, and it usually generates a good spread (this is only for simple hashing).

    MORE IMPORTANTLY (to me at least) is how you STORE your hashing and data info. We *always* store the data as a (void *) pointer. By doing so we can store ANY type of element, be it a structure, a function pointer, a pointer to text data, a LONG or an INT or a DOUBLE - it doesn't matter, because we always store the POINTER to the data, not the data itself. This is an extremely powerful concept.

    One last thought... one of the more interesting things you can do with hash tables is to use them as on-the-fly indexed arrays that can grow to any length. In this case, the hash code becomes the index you want to store at. This is an interesting concept because it means you can create an array that grows in real time. For example, you can store in hash_array[1] and hash_array[5923] and it will not take 5924 positions, it will only take two positions. And reading the two items only takes 2 instances of the loop, not 5924 instances. The array grows at will, taking only as much memory as you require. Obviously, for this to work, you are hashing a small structure that contains the real data AND the real index, so during a collision you then compare against the real index down that child's tree. But this is extremely quick and low overhead, and it handles an "infinite" array with gusto.

    Aloha
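
    The "on-the-fly indexed array" idea at the end is easy to see with any off-the-shelf hash map keyed by the index; std::unordered_map is used here purely for illustration:

    #include <iostream>
    #include <unordered_map>

    int main() {
        std::unordered_map<long, double> hash_array;    // index -> value
        hash_array[1] = 3.14;
        hash_array[5923] = 2.71;
        std::cout << hash_array.size() << " slots actually stored\n";   // 2, not 5924
        std::cout << hash_array[5923] << "\n";
        return 0;
    }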

  • by epine ( 68316 ) on Monday July 29, 2002 @09:57PM (#3975772)
    I hate to be churlish about this, but none of the early posts have addressed the core issues. Knuth's treatment is rather narrow. Buckets have very little to do with it.

    The question to address is your key structure. Ideally you have a notion of your keys in canonical representation. In object-oriented contexts, the byte representation of your objects is not necessarily canonical.

    The next step is to analyze the size and distribution of your key space in canonical representation, as a function of some N which represents the scale of the problem instance.

    At this point, if the size of your key space is a weak function of N, life is easy. Weak functions are where the average bit size of your canonical keys is logarithmic in N or at worst sqrt(N). This represents the order of key entropy extraction.

    A best case scenario is where all your keys are eight byte fully randomized GUIDs. Entropy extraction from your keys can be handled in just a few machine instructions. Worst case: you have to gather your key entropy by traversing deeply linked object hierarchies, in which case the efficiency of bucket access is swamped by the cost of constructing the bucket index.

    When life is ugly, the games begin. There are no end of variations on how you can arrange to collect enough entropy (most of the time) for each key (most of the time) quickly enough (most of the time).

    An extreme measure involves memoizing your key entropy, with all the hassle that entails of making sure every modification to the key structure maintains the key-hash correspondence. If key modification is rare, you can really brute-force this. If all your keys are always in the table, then keys can't change without rehashing (so one term starts to absorb the other). On the other hand, if you have many keys to check, but few keys to check against, you might get sucked into finding clever and/or complicated methods for maintaining your key entropy memos. (A bare-bones sketch of this memoization appears at the end of this comment.)

    If you have concocted a model with known degeneracies, who pays on the degenerate case? This is a game of hot potato, which again presents endless options, most of which are difficult to formally analyze (supposing you even have a sufficient model of your key space).

    There are human hazards when you enter into this terrain, not to be taken lightly. For some unknown reason a rather large slice of the population cannot comprehend any event more certain than 99% or less certain than 1%. pow(2,-6) is rounded up by these people to 1%, which in Murphy's calculus is as good as sunk. Nothing you can do about it. Sooner or later the pointy-haired what-ifers will grind you down. Especially if they studied arithmetic for a couple of years as an undergraduate thinking it was mathematics.

    Let's suppose now that you've done the dirty deed and concocted some method to extract at least log K bits of uniform key entropy (most of the time), where K is somewhere between the total number of keys you might need to check and the total number of keys you might need to store. Only now does bucket management begin to matter.

    Probably the best thing to do at this point is to slice the world into rough orders of magnitude: less than 10 CPU clock cycles to on-chip L1/L2 caches, 10-100 CPU clock cycles to on chip L3/off chip L3 cache, 100-1000 CPU cycles to external memory, 1000-10,000 cycles via interprocess communication/message passing, 10,000+ cycles for data structures not necessarily memory resident.

    With a GHz class CPU, 1000 cycles still allows you to check one million keys per second. Do you really need to hone this down another 2.5 orders of magnitude? Sometimes you can if you really want to.

    I could go on for a long time yet, but I've covered most of the ground that the rest of this thread has largely ignored.

    On a more theoretical level, it is even possible to construct perfect hashing schemes with space efficiency near the Shannon limit for data sets where the average key entropy is less than one bit. Did someone mention Markov models for speech recognition? Forget Knuth volume 3. It has a lot more to do with Knuth volume 4.
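
    A bare-bones sketch of the key-entropy memoization mentioned above: cache the hash next to an expensive key so it is computed once, and make every mutation path refresh the memo. std::hash over a string stands in for a costly traversal of a deeply linked object hierarchy, and the class name is invented:

    #include <cstddef>
    #include <functional>
    #include <string>

    class MemoizedKey {
    public:
        explicit MemoizedKey(const std::string &s)
            : text_(s), hash_(std::hash<std::string>()(text_)) {}

        void set(const std::string &s) {            // every modification refreshes the memo
            text_ = s;
            hash_ = std::hash<std::string>()(text_);
        }
        std::size_t hash() const { return hash_; }  // O(1): no re-walk of the key structure
        const std::string &text() const { return text_; }

    private:
        std::string text_;
        std::size_t hash_;
    };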

"If I do not want others to quote me, I do not speak." -- Phil Wayne

Working...