Chapter 8: Dictionaries

Like a priority queue, a dictionary is a container of key-element pairs. However, while a priority queue always requires a total order relation on the keys, a dictionary does not. The simplest form of a dictionary assumes only that we can determine whether two keys are equal. When a total order relation on the keys is defined, we speak of an ordered dictionary and specify additional ADT functions that refer to the ordering of the keys.

8.1 The Dictionary Abstract Data Type

A dictionary ADT stores key-element pairs (k,e) which we call items, where k is the key and e is the element.
In an unordered dictionary we can use an equality tester object to test whether two keys, k₁ and k₂, are equal with function isEqualTo(k₁, k₂).

8.1.1 The Dictionary ADT

HashTables.pdf 2
As an ADT, a dictionary D supports the following functions:

Function	Input	Output	Description
size()	-	Integer	Return the number of items in D.
isEmpty()	-	Boolean	Test whether D is empty.
elements()	-	Iterator of objects (elements)	Return the elements stored in D.
keys()	-	Iterator of objects (keys)	Return the keys stored in D.
find(k)	Object (key)	Position	If D contains an item with key equal to k, then return the position of such an item. If not, a null position is returned.
findAll(k)	Object (key)	Iterator of Positions	Return an iterator of positions for all items whose key equals k.
insertItem(k,e)	Objects k (key) and e (element)	-	Insert an item with element e and key k into D.
removeElement(k)	Object (key)	-	Remove an item with key equal to k from D. An error condition occurs if D has no such item.
removeAllElements(k)	Object (key)	-	Remove the items with key equal to k from D.

Remarks: The way the items of a dictionary are stored is implementation dependent. The notation p(x) indicates the position of the item storing element x.

Operation	Output	Dictionary
insertItem(5,A)	-	{(5,A)}
insertItem(7,B)	-	{(5,A),(7,B)}
insertItem(2,C)	-	{(5,A),(7,B),(2,C)}
insertItem(8,D)	-	{(5,A),(7,B),(2,C),(8,D)}
insertItem(2,E)	-	{(5,A),(7,B),(2,C),(8,D),(2,E)}
find(7)	p(B)	{(5,A),(7,B),(2,C),(8,D),(2,E)}
find(4)	"null"	{(5,A),(7,B),(2,C),(8,D),(2,E)}
find(2)	p(C) or p(E)	{(5,A),(7,B),(2,C),(8,D),(2,E)}
findAll(2)	p(C),p(E)	{(5,A),(7,B),(2,C),(8,D),(2,E)}
size()	5	{(5,A),(7,B),(2,C),(8,D),(2,E)}
removeElement(5)	-	{(7,B),(2,C),(8,D),(2,E)}
removeElement(5)	"error"	{(7,B),(2,C),(8,D),(2,E)}
removeAllElements(2)	-	{(7,B),(8,D)}
find(2)	"null"	{(7,B),(8,D)}
findAll(2)	"empty iterator"	{(7,B),(8,D)}

Position class provides:

Operation	Input	Output	Description
element()	-	Object (element)	Return a reference to the element of the associated item.
key()	-	Object (key)	Return a constant reference to the key of the associated item.
isNull()	-	Boolean	Determine if this is a null position.

8.1.2 Log Files

A simple way of realizing a dictionary is to use an unordered vector, list, or general sequence to store the key-element pairs. Such an implementation is called a log file.
HashTables.pdf 3

Unordered Sequence Implementation
HashTables.pdf 3
The sequence S used for the log file is implemented with either a vector or a doubly linked list.
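A minimal sketch of a log file, assuming integer keys, string elements, and a std::vector as the sequence S. The class name LogFile and the use of a plain index in place of a Position object are assumptions made for illustration, not the course's implementation.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical log-file dictionary: items are kept in arrival order.
class LogFile {
public:
    typedef std::pair<int, std::string> Item;   // (key, element)

    std::size_t size() const { return S.size(); }
    bool isEmpty() const { return S.empty(); }

    // Append at the end of the sequence (amortized O(1) for a vector).
    void insertItem(int k, const std::string& e) { S.push_back(Item(k, e)); }

    // O(n): scan for the first item with key k.
    // Returns size() when no such item exists (stands in for a null position).
    std::size_t find(int k) const {
        for (std::size_t i = 0; i < S.size(); ++i)
            if (S[i].first == k) return i;
        return S.size();
    }

    // O(n): remove one item with key k, or signal an error if there is none.
    void removeElement(int k) {
        std::size_t i = find(k);
        if (i == S.size()) throw std::runtime_error("no item with this key");
        S.erase(S.begin() + static_cast<std::ptrdiff_t>(i));
    }

    // Access the item stored at index i (throws if i is out of range).
    const Item& at(std::size_t i) const { return S.at(i); }

private:
    std::vector<Item> S;                        // the unordered sequence
};
```

Insertion is cheap because nothing is searched or reordered; find and removeElement pay for that with a linear scan.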

Analysis of the Log File Data Structure
HashTables.pdf 3

Applications for Log Files
HashTables.pdf 3

8.2 Hash Tables

One of the most efficient ways to implement a dictionary is to use a hash table. Although hash tables have high worst-case running times for dictionary ADT operations, we will see that their expected-case running times are excellent. Letting n denote the number of items, the worst-case running times are O(n), but the expected-case times are only O(1).

8.2.1 Bucket Arrays

A bucket array for a hash table is an array A of size N, where each cell of A is thought of as a "bucket" (that is, a container of key-element pairs) and the integer N denotes the capacity of the array. If the keys are integers well distributed in the range [0,N-1], this bucket array is all that is needed: an element e with a key k is simply inserted into the bucket A[k].
If keys are not unique, then two different elements may be mapped to the same bucket in A. In this case, we say that a collision has occurred.

Analysis of the Bucket Array Structure

All functions run in O(1) time.
Space is Theta(N), which is wasteful when N is large relative to n.
Keys must be integers in the range [0, N-1].
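A minimal sketch of a bucket array under the assumptions above: unique integer keys in [0, N-1], with string elements. The class name BucketArray and the occupancy flags are illustrative choices, not part of the notes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical bucket array for unique integer keys in [0, N-1].
class BucketArray {
public:
    explicit BucketArray(std::size_t N) : A(N), used(N, false), n(0) {}

    // O(1): the key itself is the index of its bucket.
    void insertItem(std::size_t k, const std::string& e) {
        assert(k < A.size());                  // key must lie in [0, N-1]
        if (!used[k]) ++n;
        A[k] = e;
        used[k] = true;
    }

    // O(1): returns a pointer to the element, or a null pointer if absent.
    const std::string* find(std::size_t k) const {
        return (k < A.size() && used[k]) ? &A[k] : nullptr;
    }

    // O(1): empty the bucket for key k if it is occupied.
    void removeElement(std::size_t k) {
        if (k < A.size() && used[k]) { used[k] = false; --n; }
    }

    std::size_t size() const { return n; }               // number of items n
    std::size_t capacity() const { return A.size(); }    // N, hence Theta(N) space

private:
    std::vector<std::string> A;    // one bucket per possible key
    std::vector<bool> used;        // which buckets currently hold an item
    std::size_t n;                 // number of stored items
};
```

The array is allocated for all N possible keys up front, which is where the Theta(N) space cost comes from even when only a few items are stored.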

8.2.2 Hash Functions

HashTables.pdf 4-6
A hash function is "good" if it maps the keys in our dictionary so as to minimize collisions as much as possible.
It should also be fast and easy to compute.

8.2.3 Hash Codes

The integer assigned to a key k is called the hash code or hash value for k.

Hash Codes in C++
HashTables.pdf 7

Casting to an Integer
Take an integer interpretation of the bits of data type X as a hash code for X.
HashTables.pdf 7
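As a sketch of the casting idea, the bits of a float can be copied into an integer of the same size. The function name hashCodeBits is hypothetical, and the code assumes that float is 32 bits wide, which holds on common platforms but is not guaranteed by the language.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: reinterpret the 32 bits of a float as an integer.
// Assumes sizeof(float) == sizeof(std::uint32_t).
std::uint32_t hashCodeBits(float x) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "float must be 32 bits for this sketch");
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);   // copy the bit pattern unchanged
    return bits;
}
```

One caveat of this approach: 0.0f and -0.0f compare equal but have different bit patterns, so they receive different hash codes.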

Summing Components
HashTables.pdf 7

A Small C++ Example
Hashing a 64-bit integer when the hash function produces a 32-bit integer:

int hashCode(int x)
{ return x; }

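int hashCode(long x)                 // assumes long is 64 bits
{  typedef unsigned long ulong;
   return hashCode(int(ulong(x)>>32)+int(x));   // add the high and low 32-bit halves
}

The same function written with C++-style casts (an alternative to the version above; only one of the two definitions should be compiled):

int hashCode(long x)                 // assumes long is 64 bits
{  typedef unsigned long ulong;
   return hashCode(static_cast<int>(static_cast<ulong>(x) >> 32) 
          + static_cast<int>(x));
}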

Polynomial Hash Codes
HashTables.pdf 8
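A polynomial hash code treats the characters of a string as coefficients of a polynomial in a constant a, so that, unlike a plain sum of the characters, the order of the characters affects the result. The sketch below evaluates the polynomial with Horner's rule. The function name polyHashCode and the default a = 33 are assumptions made for illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical polynomial hash code for a string x0 x1 ... x(k-1):
//   x0*a^(k-1) + x1*a^(k-2) + ... + x(k-1)
// computed with Horner's rule. Unsigned overflow wraps around,
// which is well defined in C++.
unsigned int polyHashCode(const std::string& s, unsigned int a = 33) {
    unsigned int h = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
        h = h * a + static_cast<unsigned char>(s[i]);   // multiply, then add next character
    return h;
}
```

With a = 33, the strings "ab" and "ba" hash to 97*33 + 98 = 3299 and 98*33 + 97 = 3331 respectively, whereas summing their characters would give the same value for both.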

Cyclic Shift Hash Codes

int hashCode(const char* p, int len) // hash a character array
{ unsigned int h = 0;
  for (int i=0; i<len; i++)
  { h = (h<<5)|(h>>27);               // 5-bit cyclic shift
    h += (unsigned int)p[i];         // add in next character
  }
  return hashCode(int(h));
}

Experimental Results
Collisions observed when the cyclic shift hash code is applied to 25000 English words, for different shift amounts:

Shift	Collisions Total	Collisions Max
0	23739	86
1	10517	21
5	4	2
6	6	2
11	453	4

Hashing Floating-Point Quantities

int hashCode(const double& x)       // hash a double
{ int len = sizeof(x);
  const char* p = reinterpret_cast<const char *>(&x);
  return hashCode(p, len);
}

8.2.4 Compression Maps

The Division Method
HashTables.pdf 9

The MAD Method
HashTables.pdf 9
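A sketch of the two compression maps, assuming non-negative hash codes: the division method computes k mod N, and the MAD (multiply, add and divide) method computes (a*k + b) mod N. The function names and the parameters a and b are illustrative; a is assumed not to be a multiple of N, and N is usually chosen prime.

```cpp
#include <cassert>

// Division method: map hash code k into [0, N-1].
unsigned long compressDivision(unsigned long k, unsigned long N) {
    assert(N > 0);
    return k % N;
}

// MAD method: (a*k + b) mod N.
// a and b are parameters of this sketch; a must not be a multiple of N.
unsigned long compressMAD(unsigned long k, unsigned long N,
                          unsigned long a, unsigned long b) {
    assert(N > 0 && a % N != 0);
    // Reduce each operand modulo N first to keep intermediate values small.
    return ((a % N) * (k % N) + b % N) % N;
}
```

For example, with N = 7 the hash code 25 compresses to 25 mod 7 = 4 under the division method, and to (3*25 + 5) mod 7 = 80 mod 7 = 3 under MAD with a = 3 and b = 5.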

8.2.5 Collision-Handling Schemes

HashTables.pdf 10

Separate Chaining

Open Addressing

Linear Probing
HashTables.pdf 11-13
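A minimal sketch of linear probing, assuming non-negative integer keys, the compression map k mod N, and a table that is never completely full. The names LinearProbeSet and EMPTY_CELL are hypothetical. Removal is omitted because it needs a special marker for vacated cells.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const int EMPTY_CELL = -1;   // marks an unused cell; real keys are >= 0

// Hypothetical open-addressing table using linear probing (no removal).
class LinearProbeSet {
public:
    explicit LinearProbeSet(std::size_t N) : A(N, EMPTY_CELL), n(0) {}

    // Probe A[h], A[h+1], ... (mod N) until the key or an empty cell is found.
    bool insert(int k) {
        assert(k >= 0 && n < A.size());        // at least one cell must be empty
        std::size_t i = home(k);
        while (A[i] != EMPTY_CELL) {
            if (A[i] == k) return false;       // key already present
            i = (i + 1) % A.size();
        }
        A[i] = k;
        ++n;
        return true;
    }

    bool contains(int k) const {
        assert(k >= 0);
        std::size_t i = home(k);
        for (std::size_t probes = 0; probes < A.size(); ++probes) {
            if (A[i] == EMPTY_CELL) return false;  // an empty cell ends the search
            if (A[i] == k) return true;
            i = (i + 1) % A.size();
        }
        return false;                          // scanned the whole table
    }

    int cell(std::size_t i) const { return A.at(i); }   // inspect a cell

private:
    std::size_t home(int k) const { return static_cast<std::size_t>(k) % A.size(); }

    std::vector<int> A;   // the bucket array, one key per cell
    std::size_t n;        // number of stored keys
};
```

With N = 7, the keys 3, 10 and 17 all have home cell 3, so they end up in cells 3, 4 and 5: each collision pushes the new key to the next free cell.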

Quadratic Probing

Double Hashing
HashTables.pdf 14-15
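A sketch of the probe sequence used by double hashing, assuming non-negative integer keys, the primary hash h(k) = k mod N, and the secondary hash h'(k) = q - (k mod q), which is one common choice. The function name doubleHashProbe and the parameters N and q (both assumed prime, with q < N) are illustrative.

```cpp
#include <cassert>
#include <cstddef>

// j-th cell probed for key k: (h(k) + j * h'(k)) mod N,
// with h(k) = k mod N and h'(k) = q - (k mod q).
std::size_t doubleHashProbe(std::size_t k, std::size_t j,
                            std::size_t N, std::size_t q) {
    assert(N > 0 && q > 0 && q < N);
    std::size_t h  = k % N;          // primary hash
    std::size_t h2 = q - (k % q);    // secondary hash, always in [1, q]
    return (h + j * h2) % N;
}
```

With N = 13 and q = 7, the keys 18 and 31 share the home cell 5 but then diverge: key 18 has step 3 and probes cells 5, 8, 11, 1, while key 31 has step 4 and probes cells 5, 9, and so on. Because the step is never zero, the sequence always moves on from the home cell.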

8.2.6 Load Factors and Rehashing

Rehashing into a New Table
HashTables.pdf 16
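A minimal sketch of rehashing, shown on a separate-chaining table of integer keys. When an insertion would push the load factor n/N above a threshold, a bucket array of twice the size is allocated and every key is reinserted. The class name ChainedTable, the initial capacity 4, and the threshold 0.75 are assumptions made for illustration.

```cpp
#include <cstddef>
#include <list>
#include <vector>

// Hypothetical separate-chaining table that doubles its bucket array
// when the load factor n/N would exceed maxLoad. Duplicate keys are allowed.
class ChainedTable {
public:
    explicit ChainedTable(std::size_t N = 4, double maxLoadFactor = 0.75)
        : buckets(N), n(0), maxLoad(maxLoadFactor) {}

    void insert(int k) {
        if (static_cast<double>(n + 1) / buckets.size() > maxLoad)
            rehash(2 * buckets.size());
        buckets[index(k, buckets.size())].push_back(k);
        ++n;
    }

    bool contains(int k) const {
        const std::list<int>& chain = buckets[index(k, buckets.size())];
        for (std::list<int>::const_iterator it = chain.begin(); it != chain.end(); ++it)
            if (*it == k) return true;
        return false;
    }

    std::size_t size() const { return n; }
    std::size_t capacity() const { return buckets.size(); }

private:
    // Compression by the division method on the unsigned value of k.
    static std::size_t index(int k, std::size_t N) {
        return static_cast<std::size_t>(static_cast<unsigned int>(k)) % N;
    }

    // Allocate a new bucket array and reinsert every key under the new N.
    void rehash(std::size_t newN) {
        std::vector<std::list<int> > fresh(newN);
        for (std::size_t b = 0; b < buckets.size(); ++b)
            for (std::list<int>::const_iterator it = buckets[b].begin();
                 it != buckets[b].end(); ++it)
                fresh[index(*it, newN)].push_back(*it);
        buckets.swap(fresh);
    }

    std::vector<std::list<int> > buckets;   // one chain per bucket
    std::size_t n;                          // number of stored keys
    double maxLoad;                         // load factor threshold
};
```

Every key has to be reinserted because its bucket index depends on N, so a single rehash costs O(n + N) time. Doubling the capacity keeps such rehashes infrequent.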

8.2.7 A C++ Hash Table Implementation

html-8.1 (HashEntry)
html-8.2 (Position)
html-8.3 (Hash1)
html-8.4 (Hash2)

hash.cpp

8.3 Ordered Dictionaries

In an ordered dictionary, we wish to perform the usual dictionary operations, but also maintain an order relation for the keys in our dictionary.

8.3.1 The Ordered Dictionary ADT

An ordered dictionary supports the following functions beyond those included in the general dictionary ADT (8.1.1):

closestBefore(k) - Return a position of an item with the largest key less than or equal to k.
closestAfter(k) - Return a position of an item with the smallest key greater than or equal to k.
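
A sketch of these two functions on a sorted std::vector of integer keys, using the standard algorithms std::upper_bound and std::lower_bound. Returning an index, with -1 standing in for a null position, is a simplification made for illustration.

```cpp
#include <algorithm>
#include <vector>

// closestBefore: index of the largest key <= k, or -1 if there is none.
// The vector keys is assumed to be sorted in ascending order.
int closestBefore(const std::vector<int>& keys, int k) {
    // upper_bound finds the first key > k; the key just before it is <= k.
    std::vector<int>::const_iterator it =
        std::upper_bound(keys.begin(), keys.end(), k);
    return (it == keys.begin()) ? -1 : static_cast<int>(it - keys.begin()) - 1;
}

// closestAfter: index of the smallest key >= k, or -1 if there is none.
int closestAfter(const std::vector<int>& keys, int k) {
    // lower_bound finds the first key >= k.
    std::vector<int>::const_iterator it =
        std::lower_bound(keys.begin(), keys.end(), k);
    return (it == keys.end()) ? -1 : static_cast<int>(it - keys.begin());
}
```

For the keys 2, 5, 7, 8, closestBefore(6) yields the index of 5 and closestAfter(6) yields the index of 7. When k itself is present, both functions return a position holding k.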

8.3.2 Look-Up Tables

Dictionary.pdf 6

8.3.3 Binary Search

Dictionary.pdf 5

bsearch.cpp

Analysis of Binary Search
The running time is proportional to the number m of recursive calls, since each call does a constant amount of work. Each recursive call reduces the number of remaining candidates by at least one half, so after m calls at most n/2^m candidates remain. In the worst case (an unsuccessful search), the recursion stops when no candidates remain; setting n/2^m = 1 gives m = log n (base 2), so the running time is O(log n).
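
A sketch of recursive binary search on a sorted vector of integers, with a counter added to make the number of recursive calls visible. The function name, the signature, and the calls parameter are assumptions for illustration and need not match bsearch.cpp.

```cpp
#include <vector>

// Recursive binary search on a sorted vector; returns the index of k or -1.
// calls is incremented once per invocation (added here for illustration).
int binarySearch(const std::vector<int>& S, int k, int low, int high, int& calls) {
    ++calls;
    if (low > high) return -1;               // no candidates left
    int mid = low + (high - low) / 2;        // middle of the candidate range
    if (S[mid] == k) return mid;
    if (k < S[mid])
        return binarySearch(S, k, low, mid - 1, calls);   // search the left half
    return binarySearch(S, k, mid + 1, high, calls);      // search the right half
}
```

For a vector of n = 16 keys, an unsuccessful search for a key larger than all stored keys makes 6 calls: 5 that each halve the candidate range and a final call that finds the range empty. This is consistent with the O(log n) bound.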

Using Look-Up Tables as Ordered Dictionaries
Dictionary.pdf 6

Comparing Simple Ordered Dictionary Implementations

Function	Log File	Look-Up Table
size(), isEmpty()	O(1)	O(1)
keys(), elements()	O(n)	O(n)
find(key)	O(n)	O(log n)
findAll(key)	Theta(n)	O(log n + s)
insertItem(key, element)	O(1)	O(n)
removeElement(key)	O(n)	O(n)
removeAllElements(key)	Theta(n)	O(n)