Hashing in Data Structures

Title Hashing in Data Structures
Author shovon sikder
Course Discrete mathematics
Institution Green University of Bangladesh

Summary

Lecture on hash table...


Description

What is Hashing? 

Hashing is the process of mapping a large amount of data to a smaller table with the help of a hashing function.



Hashing is also known as a hashing algorithm or message digest function.



It is a technique to convert a range of key values into a range of indexes of an array.



It provides a next-level searching method, faster than linear or binary search.



Hashing allows us to update and retrieve any data entry in constant time, O(1).



Constant time O(1) means the operation does not depend on the size of the data.



Hashing is used with databases to enable items to be retrieved more quickly.



It is used in the creation and verification of digital signatures.
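As a small illustration (not from the lecture), Python's built-in dict is itself a hash table, so a key lookup does not scan the whole collection:

```python
# Minimal sketch: Python's dict is a hash table, so lookups take
# expected O(1) time regardless of how many entries it holds.
# The record names and values here are made up for illustration.
records = {"alice": 101, "bob": 102, "carol": 103}

# The key is hashed straight to its bucket; no linear scan is needed.
print(records["bob"])  # 102
```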

What is Hash Function? 

A fixed process that converts a key into a hash key is known as a hash function.



This function takes a key and maps it to a value of a certain length, which is called a hash value, or simply a hash.



The hash value represents the original string of characters, but it is normally smaller than the original.



The sender hashes the message, attaches the digital signature, and sends both the hash value and the signature to the receiver. The receiver applies the same hash function to generate a hash value and compares it with the one received with the message.



If the hash values are the same, the message was transmitted without errors.
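A minimal sketch of this hash-and-compare check, using Python's standard hashlib (SHA-256 is an arbitrary choice here; the lecture does not name a specific hash function):

```python
import hashlib

def digest(message: bytes) -> str:
    """Return a fixed-length message digest of the input."""
    return hashlib.sha256(message).hexdigest()

# Sender computes the digest and sends it along with the message.
message = b"transfer 100 units"   # example payload, made up
sent_digest = digest(message)

# Receiver recomputes the digest over what arrived and compares.
received = b"transfer 100 units"
print(digest(received) == sent_digest)  # True: message arrived intact
```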

What is Hash Table? 

Hash table or hash map is a data structure used to store key-value pairs.



It is a collection of items stored to make it easy to find them later.



It uses a hash function to compute an index into an array of buckets or slots from which the desired value can be found.



It is an array of lists, where each list is known as a bucket.



Each bucket holds values located by their keys.



In Java, for example, the Hashtable class implements the Map interface and extends the Dictionary class.



Java's Hashtable is synchronized and contains only unique keys.



The above figure shows a hash table of size n = 10. Each position of the hash table is called a slot. The table has n slots, named {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}: slot 0, slot 1, slot 2, and so on. The table contains no items yet, so every slot is empty.



As we know, the mapping between an item and the slot where that item belongs in the hash table is called the hash function. The hash function takes any item in the collection and returns an integer in the range of slot names, 0 to n-1.



Suppose we have the integer items {26, 70, 18, 31, 54, 93}. One common method of determining a hash key is the division method of hashing, with the formula: Hash Key = Key Value % Number of Slots in the Table



The division (or remainder) method takes an item, divides it by the table size, and returns the remainder as its hash value.

Data Item    Value % No. of Slots    Hash Value
26           26 % 10 = 6             6
70           70 % 10 = 0             0
18           18 % 10 = 8             8
31           31 % 10 = 1             1
54           54 % 10 = 4             4
93           93 % 10 = 3             3
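The division method can be sketched in Python (a minimal illustration; the table size of 10 matches the lecture's example):

```python
def division_hash(key: int, table_size: int) -> int:
    """Division (remainder) method: hash key = key value % number of slots."""
    return key % table_size

table_size = 10
table = [None] * table_size  # ten empty slots, named 0..9

# Insert each item at the slot given by its hash value.
for item in (26, 70, 18, 31, 54, 93):
    table[division_hash(item, table_size)] = item

print(table)  # [70, 31, None, 93, 54, None, 26, None, 18, None]
```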



After computing the hash values, we can insert each item into the hash table at its designated position, as shown in the above figure. Here 6 of the 10 slots are occupied. This ratio is referred to as the load factor, denoted λ = No. of items / table size. In this example, λ = 6/10 = 0.6.



Searching for an item is easy: the hash function computes the slot name for the item, and we check that slot of the hash table to see whether the item is present.



Only a constant amount of time, O(1), is required to compute the hash value and index into the hash table at that location.

Linear Probing 

Continuing the above example: if we insert the next item, 40, into our collection, it has a hash value of 0 (40 % 10 = 0). But 70 also had a hash value of 0, and that becomes a problem. This problem is called a collision, or clash. Collisions are a fundamental problem for any hashing technique.



Linear probing is used to resolve collisions in hash tables, which are data structures for maintaining a collection of key-value pairs.



Linear probing was invented by Gene Amdahl, Elaine M. McGraw and Arthur Samuel in 1954 and analyzed by Donald Knuth in 1963.



It is a form of open addressing, a scheme for using a hash table to solve the dictionary problem.



The simplest method is called linear probing. The formula for computing the next probe position is: P = (P + 1) % table_size

For example,

If we insert the next item, 40, into our collection, it has a hash value of 0 (40 % 10 = 0). But 70 also had a hash value of 0, so there is a collision. Linear probing solves this problem: P = H(40) = 40 % 10 = 0. Position 0 is occupied by 70, so we look elsewhere for a position to store 40. Using linear probing: P = (P + 1) % table_size = (0 + 1) % 10 = 1. But position 1 is occupied by 31, so we look elsewhere again. Using linear probing, we try the next position: (1 + 1) % 10 = 2.

Position 2 is empty, so 40 is inserted there.
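The probe sequence above can be sketched as follows (a minimal illustration; it assumes the table never fills up completely):

```python
def insert_linear_probe(table, key):
    """Insert an integer key using linear probing.

    Assumes at least one slot is free, otherwise the loop never ends.
    """
    size = len(table)
    p = key % size                 # initial hash position
    while table[p] is not None:    # slot occupied: probe the next one
        p = (p + 1) % size         # wrap around at the end of the table
    table[p] = key
    return p

# The lecture's table after inserting {26, 70, 18, 31, 54, 93}:
table = [70, 31, None, 93, 54, None, 26, None, 18, None]
print(insert_linear_probe(table, 40))  # 2 -- slots 0 and 1 were taken
```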

What are Hash Tables in Data Structures? 

Hash tables are made up of two distinct parts: 
1. An array (something which we're already familiar with) 
2. A hash function

Assume we have to store these strings in a hash table: {'Ada', 'Bea', 'Sam', 'Mia'}. We know that internally the hash table will use an array to store them. But before storing the values directly in the array, we pass each string, one by one, to the hash function.

What is a hash function? Think of the term hash as used in cooking, like a hash brown: we take raw ingredients, chop them, and mix them all together. Our raw ingredients are data, like the strings in our example, or something like a password, or maybe an entire object. We simply feed this data to a hash function, which, in the simplest terms, grinds all the data up and gives us back a small, simple integer number.

There are a few things that should be noted about hashing here:
1) The hashing we will be doing in our example is not encryption.
2) Hash functions are not reversible or invertible. When we hash any information, we lose information about our raw ingredients. This is intentional: nobody should be able to take a hash value and turn it back into the raw ingredients. Imagine you have cooked a hash brown and want your raw potatoes and onions back from it.

Coming back to our example, the hash function takes a string value and gives us back an integer number based on some calculation.
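One calculation of this kind sums the ASCII codes of the characters and takes the result modulo the array size. A sketch (the array size of 13 is an assumption; the lecture's figure does not state it, but 13 reproduces 'Ada' landing at index 2):

```python
def ascii_sum_hash(s: str, size: int) -> int:
    """Index = sum of the ASCII codes of the characters, modulo array size."""
    return sum(ord(ch) for ch in s) % size

# Array size 13 is an assumed value chosen for illustration.
size = 13
for name in ("Ada", "Bea", "Sam", "Mia"):
    print(name, "->", ascii_sum_hash(name, size))
# 'Ada' -> (65 + 100 + 97) % 13 = 262 % 13 = 2
```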

This integer number is where we insert the value in the array. The calculation used is:

Index Number = Sum of ASCII Codes % (Modulo) Size of the Array

For example, our string 'Ada' gets stored in the array at index number 2. Using the same calculation, we can store all the strings in the hash table; the array is then fully populated, this time at index numbers given by the calculation.

Retrieving a string from the hash table: Now let's try to retrieve the string 'Sam' from this array. We perform the same calculation again:
1: Sum the ASCII code of each character
2: Perform the division
3: Get the remainder

And just like that, the remainder gives us the array index. We have found our angel, which will always tell us what our index number is. Thanks to hash tables, the complexity of searching for an element drops to O(1).

Associative array hash functions: Just like the example above, an associative array uses a hash function too, but the index number is generated from the 'keys'. Consider how a key-value pair gets stored in the hash table: keys like 'MH' are given to the hash function to generate array indexes.

Hash Collisions:

The above only happens in fairy tales, or in a happily-ever-after Bollywood movie. Real life is not so fair. Assume we have to store the strings {"Mia", "Tim", "Bea"} in the hash table using the same sum-of-ASCII-codes-modulo-array-size calculation. Wait: don't we already have the string 'Mia' at array position 0? How can we put 'Bea' in the same position? The keys have collided, something we call a hash collision.

Collision Resolution Techniques: There are many ways in which we can solve this collision thing-y.

Approach No 1: Linear Probing (also fondly known as open addressing or closed hashing)

An extremely simple solution: if a position of the array is occupied, love thy neighbour, i.e. try to put the element in the next position. So, if we can't put 'Bea' in position 0, we try position 1. But our string 'Tim' already occupies position 1. Now what? Love thy neighbour's neighbour: try position 2. Phew! We finally found an unoccupied spot and can store 'Bea' there.

Placing an item in an unoccupied position of the array in case of collision is known as open addressing: every position is open to any item of the array. It is also called linear probing because we perform a linear search for unoccupied space in case of collisions. If linear probing gets to the end of the array and still can't find an empty position, it cycles around to the beginning of the array and continues searching from there.

Finding an element in collision-affected hash tables: Let's say we want to find the string 'Bea' in our hash table. According to our index calculation, 'Bea' should have been at position 0. But, thanks to collisions, it had to take up a slot at position 2.

So, to search for the string 'Bea', we first go to the index calculated by the hash function, check whether the element is there, and if not, perform a linear search to see where the element actually resides.

Load Factor: We were dealing with only 3 elements and we already encountered a collision; imagine what will happen when we are dealing with thousands of items. One way to deal with this is to make the hash table bigger than the total amount of data we are expecting, so that, say, only 70% of the hash table is ever occupied. The ratio between the number of items stored and the size of the array is known as the load factor. Remember, when we were talking about arrays, we studied a special type known as dynamic or resizable arrays. If we implement hash tables with dynamic arrays, the table can be made to resize automatically when the load factor reaches some threshold.

Fairy Tale vs Real World: In the fairy tale, all the elements are stored at their respective indexes without any collision. This best-case scenario gives us O(1) complexity. But in the worst case we are still looking at O(n) complexity, where collisions force us to perform a linear search.

Approach No 2: Separate Chaining (also fondly known as closed addressing or open hashing)

There is another way to deal with collisions, known as separate chaining. Let's try to populate our hash table again with the strings {"Mia", "Tim", "Bea"}. If we calculate the index based on our sum of ASCII codes modulo array size, we get index 0 for 'Mia'. But instead of storing the string 'Mia' at position 0 directly, position 0 holds a pointer to the first node of a linked list.

What is a Linked List?

We are going to study linked lists in great detail soon, but for now think of one as a bunch of nodes that point to each other in one direction: node A points to B, and B points to C.

Now let's store the string 'Tim'. With the same calculation, the index comes out to be 1, so position 1 again points to a linked list. Next, let's store the string 'Bea'. The index calculated is 0, so 'Bea' is added to the linked list at position 0, and node 'Mia' points to it.

Finding an element in collision-affected hash tables: Say we want to find the string 'Bea'. Our index calculation tells us 'Bea' should be at position 0. We reach position 0 and then follow a simple linked-list traversal to find 'Bea'. At least with this technique, the index numbers given by the hash function remain unchanged, so lookup of an element is faster. Of course, traversing a linked list also comes at some cost: the worst-case complexity of traversing a linked list is O(n).

So, what do we do? Hash tables were supposed to solve our beloved array-search problem, yet in the worst case we are still looking at O(n) complexity.

The power is all in the function: if you want a powerful hash table, all you need is a good hash function. The hash function we used above, the sum of ASCII codes modulo array size, was a bad one: whenever the set contains permutations of the same letters, they produce the same sum, hence the same key, which leads to disastrous collisions. Secondly, we need a good table size, preferably a prime number.

How do we pick a good hash function? Picking a "good" hash function is absolutely necessary for successfully implementing a hash table. Storing and retrieving data in O(1) time comes down to answering this question.
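A minimal separate-chaining sketch (Python lists stand in for the linked lists; the table size of 3 is an assumption chosen so that 'Mia' and 'Bea' collide at position 0, as in the text):

```python
class ChainedHashTable:
    """Minimal separate-chaining hash table for strings."""

    def __init__(self, size):
        # One chain per slot; a list stands in for a linked list here.
        self.buckets = [[] for _ in range(size)]

    def _index(self, s):
        # Sum-of-ASCII-codes hash, as in the running example.
        return sum(ord(ch) for ch in s) % len(self.buckets)

    def insert(self, s):
        # A collision simply extends the chain at that slot.
        self.buckets[self._index(s)].append(s)

    def find(self, s):
        # Hash to the slot, then traverse the chain.
        return s in self.buckets[self._index(s)]

t = ChainedHashTable(size=3)
for name in ("Mia", "Tim", "Bea"):
    t.insert(name)
print(t.find("Bea"))  # True: slot 0 chains 'Mia' then 'Bea'
```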

There are certain things which make a hash function goody-goody. 

The most obvious one: it should avoid collisions as much as possible.



There should be a uniform distribution of values across the hash table.



The function must be easy to compute.
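One common improvement over a plain ASCII-sum hash, sketched below, is to weight each character's code by its position in the string and reduce modulo a prime (29 here is an arbitrary example); this separates anagrams, which a plain sum cannot do:

```python
def weighted_hash(s: str, prime: int = 29) -> int:
    """Sum of ASCII codes, each multiplied by its 1-based position,
    taken modulo a prime. The prime 29 is an arbitrary example."""
    return sum((i + 1) * ord(ch) for i, ch in enumerate(s)) % prime

def plain_sum(s: str) -> int:
    """The naive hash: plain sum of ASCII codes."""
    return sum(ord(ch) for ch in s)

# Anagrams collide under the plain sum but not under the weighted hash:
print(plain_sum("abc") == plain_sum("cba"))          # True: collision
print(weighted_hash("abc") == weighted_hash("cba"))  # False: separated
```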

Finding a "good" Hash Function: A perfect hash function, that is, one with no collisions, is an illusion, but better hash functions are not. Let's try a different one: the index for a string will be the sum of the ASCII values of its characters, each multiplied by its respective order in the string, taken modulo some prime number such as 29. Here each string is more likely to lead to a unique number; with a prime modulus we may still have collisions, but probably fewer than with a naive hash function like the plain sum of the characters.

Supporting Hashing: I know you might be wondering, do I have to go off and write such complicated hash functions? The answer is no: most programming languages give us a perfectly acceptable built-in hash function that does all this heavy work for us.

Applications of Hash Tables: 

Associative arrays go by different names in various programming languages; they are implemented using hash tables.



Hash functions are used in various algorithms to make updating and storing data faster.



Hash tables are used as disk-based data structures and for database indexing.



They can be used to implement caches, which are mainly used to speed up access to data.



Several dynamic programming languages like Python, JavaScript, and Ruby use hash tables to implement objects.



Hashing is not just used in data structures; it is also widely used in security, cryptography, graphics, audio, and more.

