|
# Hash # |
The time associated with information retrieval can be critical for some applications. Disk access can be slow causing dissatisfaction. A possible solution is to load the regularly used data to memory to obtain faster response. Depending on the size of the data, the search time can go up resulting in long delays that are not acceptable by the application.
A typical application is that for a point of sale; the longer the search for an item being sold, the longer the customer has to wait. Hashing methods can offer a quick and efficient data retrieval time. The idea is to store an item in a location based on its key value.
Ideally, each key should map to a unique position in the array, but that is not true in most situations; for example, if the store has 1000 items; and each is of seven digit number supplied by the manufacturer. A formula needs to be applied on the item number to drive a location to store the item in the table of size 1000.
Hash Function
the formula
key
the item number
hash key
the derived array location
hash table
the array
For example , when we apply the following hash function , H(key) = key mod 1000 , an item numbered 9456001 would be stored in location 001, and an item numbered 9670782 would be stored in location 782. As long as the number of items does not exceeds the table size there will be enough space to store all items.
Retrieving an item is similar to storing it; the function is applied on the key and the location is determined. In the best case, the table access time is reduced to a direct access that cannot be achieved by any searching algorithm.
The example shows the use of module arithmetic on the key to retrieve a table location. The formula would return best results if the three digits on the right were different for each key, and any similarities would result in the generation of the same hash key for two or more keys, a situation referred to as hash clash.The main properties of a good hash function:
It must be easy to compute in that it should not consume a lot of processing time
Simple and easily understood
Produces minimum number of hash clashes
|
Common Hash Functions |
1
Module Arithmetic
the record key is converted to an integer that is then divided by the hash. table size and the remainder is used as an index. The hash function is h(key) = key mod TableSize.
2
Selecting Digits
certain digits from the key are selected and used as the table index. For example, the 2nd, 4th, and 7th digits of telephone numbers may result in an even distribution of records in the table.
3
Folding
folding in which all numbers are-added. For example, h(34927) = 3+4+9+2+7 = 25.
The drawback in this method is the limited number of locations that can be generated ranging from 0 to 45 for a five-digit key.
The key would also be split into parts that are then added, for example, 34+9+27=70 or 34+927= 961.
If the resulting hash key is larger than the table size
The mod operator could be applied
or only a suitable number of digits used.
For example, if the size of the table is 100 ; 961 is divided by 100 and the remainder is used as the hash key, which definitely maps to a location in the table.
|
Hash Clashes and Their Resolution |
Large keys in the problem domain suggest their large number, and selecting the perfect hash function to map the range of possible keys each to a unique position in the table is almost impossible. Therefore, collisions are bound to take place and a method to resolve them must be thought of.
A collision is the act of trying to store a record in a location that is occupied by another. For example, if module arithmetic is used as a hash function. The keys 954765 and 863765 would be mapped to the same location in an array of size 1000, since they are both divided by 1000 and the remainder is used as the hash key.
There are two methods to resolve collisions (fixed array size)
Open Addressing
When a clash occurs, an alternative location in the table is used to store the new record.
Chaining
Each location is a pointer that links to a record of information to be stored. All data that map to the same index are stored at that position as a linked list.
Open Addressing Methods ( Re-hashing )
Linear Probing
Whenever a collision occurs the array is probed to find an unoccupied location to store the new record.
The process is repeated to retrieve the data. The hash function is applied on the key and the location is examined sequentially until the item searched for is found.
The array must be circular so that all elements are examined.
The drawback with this approach is the clustering of data at certain locations resulting in lengthy searches through the array.
This is referred to as primary clustering.
When the array is empty the probability of an item being indexed to any of the locations is 1/n.
A location containing data forwards this probability to the location that follows and so on.
The probability of an empty location following a filled location to be filled is 2/n and so on. The more filled locations preceding a site, the more its probability to be used next.
Quadratic Probing
If a collision occurs at location i; locations, i+1, i+22, i+32, i+42 and so on are tested until an empty location is found. When more than one item hashes to the same location, the method probes through the same sequence of locations resulting secondary clustering.
Item-Dependent probe distance
For example to use the last digit in a phone number as an increment. A random numb generator could also be used to determine an increment, but the same seed must used every time to insure that the same sequence of numbers are generated when inserting and searching for a key.
Double Hashing
Two hash functions are used. The primary hash function is applied on the key to determine the location, and if found busy, the second hash function is applied on the same key to find an increment that is unique to it.
The process is repeated until an empty location is found to store the new item.For example using the following hash functions:
h1(key) = key mod TableSize
h2(key) = 4 - key mod 3rh(key,i)=( h1(key) + ih2(key)) mod TableSize
Examples:
Consider a hash table of size 10 and modulo arithmetic for hashing .Show the state of the table after inserting the following sequence of inputs : 89,18,49,58,59 using the following collision resolution:
(1) Linear probing
(2) Quadratic hashing
(3) Double hashing ( Choose an appropriate re-hashing function )
Empty
after 89
after 18
after 49
after 58
after 59
0
49
49
49
1
58
58
2
59
3
4
5
6
7
8
18
18
18
18
9
89
89
89
89
89
Linear probing
Empty
after 89
after 18
after 49
after 58
after 59
0
49
49
49
1
2
58
58
3
59
4
5
6
7
8
18
18
18
18
9
89
89
89
89
89
Quadratic hashing
h2(key) = 7 - key % 7
49 collides , so h2(49) = 7 - 49%7 = 7
rh(49) = ( 9 + 7 ) %10 = 6
58 collides , so h2(58) = 7 - 58%7= 5
rh(58) = ( 8 + 5 ) % 10 = 3
59 collides , so h2(59) = 7 - 59%7= 4
rh(59) = ( 9 + 4) %10 = 3 ( Which is occupied !)
rh(59) = ( 9 + 2(4))%10 = 7
Empty
after 89
after 18
after 49
after 58
after 59
0
1
2
3
58
58
4
5
6
49
49
49
7
59
8
18
18
18
18
9
89
89
89
89
89
Double Hashing