Hash table vs Binary search tree

The hash table and the binary search tree are two fundamental data structures in computer science. When would you want to use one over the other? What is the difference? This is one of the interview questions I have been asked most frequently.

Well, it is not easy to answer in one or two sentences. Hash tables and trees differ in two main respects: their implementation details, and their behavior and performance under different circumstances.

Let me start with implementation. A hash table uses a hash function to compute an index for each input key and stores the data in an array slot at that index. If the size of the hash table is 100, the hash function maps each key to an index in the range 0–99 and the data is stored in the corresponding slot. In theory, the time complexity of insertion and lookup is constant (O(1)).
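As a minimal sketch in Python (class and method names are mine, and collisions are deliberately ignored for now), the idea looks like this:

```python
# Naive hash table sketch: a hash function maps a key to an index
# in a fixed-size array; insert and lookup both touch one slot.
class NaiveHashTable:
    def __init__(self, size=100):
        self.size = size
        self.slots = [None] * size

    def _index(self, key):
        # hash() is Python's built-in; modulo keeps the index in 0..size-1
        return hash(key) % self.size

    def insert(self, key, value):
        self.slots[self._index(key)] = (key, value)   # O(1)

    def lookup(self, key):
        entry = self.slots[self._index(key)]
        return entry[1] if entry and entry[0] == key else None  # O(1)
```

Note that this sketch silently overwrites on collision, which is exactly the problem discussed below.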

A binary search tree is organized by the rule that all keys in a node's left subtree are less than the node's key, while all keys in its right subtree are greater. If the tree is balanced, it takes O(log n) time to insert a new node or look one up. O(log n) is not as fast as constant time, but it is still fast. Here n is the total number of nodes in the tree, and log n is roughly the depth of the tree. Notice that I said if the tree is balanced; fortunately, there are well-known self-balancing trees (e.g., red-black trees) that keep the tree balanced.
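Here is a minimal (unbalanced) sketch of that rule in Python; the `Node`, `insert`, and `search` names are mine:

```python
# Unbalanced BST sketch: left subtree < node < right subtree.
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    # O(log n) on a balanced tree, O(n) worst case (e.g., sorted input)
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root

def search(root, key):
    # Walk down, going left or right by comparison, until found or None
    while root is not None and root.key != key:
        root = root.left if key < root.key else root.right
    return root is not None
```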

It seems we should always prefer a hash table over a tree, but that is not the case. Hash tables have significant drawbacks:

1. As more data comes in, collisions become increasingly likely (the hash function maps different keys to the same index). There are two common ways to handle collisions. The first is separate chaining, which implements the hash table as an array of linked lists; in the worst case, where all keys map to the same index, insertion, retrieval, and deletion degrade to O(n). Chaining also needs more space than the input data itself. The second is open addressing (e.g., linear probing), which stores entries directly in the array and consumes no extra space, but in the worst case insertion and retrieval are still O(n), far slower than constant time.

2. You have to know the approximate size of the input before initializing the hash table. Otherwise you need to resize it, which is a very time-consuming operation. For example, suppose your hash table has size 100 and you want to insert the 101st element. Not only must the table be enlarged (say, to 150), but every element already in it has to be rehashed. That insertion takes O(n).

3. The elements stored in a hash table are unsorted. In some situations we want data kept in sorted order, like the contacts on a cell phone.
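Drawbacks 1 and 2 can be sketched together in Python (names are mine; the growth policy is an arbitrary illustration, not a tuned one): separate chaining keeps colliding keys in per-slot buckets, and resizing must rehash every stored element because the modulus changes.

```python
class ChainedHashTable:
    def __init__(self, size=4):
        # each slot holds a list ("bucket") of (key, value) pairs,
        # so colliding keys coexist instead of overwriting each other
        self.buckets = [[] for _ in range(size)]
        self.count = 0

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)      # update an existing key
                return
        bucket.append((key, value))           # O(n) worst case: all keys in one bucket
        self.count += 1
        if self.count > len(self.buckets):    # table "full": grow and rehash
            self._resize(2 * len(self.buckets))

    def _resize(self, new_size):
        old_buckets = self.buckets
        self.buckets = [[] for _ in range(new_size)]
        for bucket in old_buckets:            # O(n): every element is rehashed,
            for key, value in bucket:         # because the modulus has changed
                self._bucket(key).append((key, value))

    def lookup(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return None
```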

A binary search tree, however, holds up well against a hash table:

1. A binary search tree never suffers collisions, so a balanced binary search tree can guarantee that insertion, retrieval, and deletion run in O(log n), which is far faster than linear time. Moreover, the space a tree needs is exactly proportional to the size of the input.

2. You do not need to know the size of the input in advance.

3. All elements in the tree are kept sorted, so an in-order traversal produces them in sorted order in O(n) time.
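Assuming a BST node with `key`, `left`, and `right` fields (hypothetical names), the in-order traversal looks like this:

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def inorder(root):
    # Left subtree, then the node, then right subtree:
    # by the BST rule this visits keys in ascending order, in O(n).
    if root is None:
        return []
    return inorder(root.left) + [root.key] + inorder(root.right)
```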

OK, let me make a summary.

If you know how much data you need to maintain, have enough space for the hash table, and do not need the data sorted, a hash table is always a good choice, because it provides constant-time insertion, retrieval, and deletion. On the other hand, if items will be added continually, a binary search tree's O(log n) operations are acceptable compared with rehashing at runtime.

Also, if you do not know the size of the input, but most operations after insertion are lookups, a hash table is preferred for its constant retrieval time. However, if items are continuously added and removed, the tree's O(log n) insertion and deletion times are the better fit.

In short, there is no single answer to whether a hash table or a tree is better. What we need to know is the pros and cons of each under different conditions; the best decision comes from understanding the benefits and trade-offs of these two structures.

Topics on semaphore

A couple of days ago, I had a phone interview with Cisco. During the interview, I was asked several questions about multithreaded programming, which embarrassed me quite a bit. After reviewing my operating systems knowledge, I want to write down several basic concepts about semaphores.

1. What is semaphore?

Wikipedia defines a semaphore as a variable or abstract data type that provides a simple but useful abstraction for controlling access to a common resource by multiple processes or threads in a concurrent programming environment.

We cannot talk about semaphores without mentioning mutexes. So,

2. What is mutex?

Mutex is short for mutual exclusion, referring to the problem of ensuring that no two processes or threads are in their critical sections at the same time. Here, a critical section is a stretch of code during which a process accesses a shared resource, such as shared memory.

In a word, a mutex is a key that only one thread can hold: that thread can unlock the door, while the other threads wait for it to leave and release the key.

Now let us look at an example: the ATM problem.

A husband, Bob, and his wife, Ann, each withdraw $100 from their joint account at different ATMs at the same time. The thread for the first ATM deducts $100 from the couple's account, but is switched out after executing this line:

newBalance = userBalance - amount;  // amount is $100 in this scenario

userBalance = newBalance;

Processor control then switches to the thread for Ann's ATM, which also deducts $100. When that thread deducts $100, the account balance is still $500 because the variable userBalance has not yet been updated. Ann's thread runs to completion and updates userBalance to $400. Then control switches back to Bob's transaction. Bob's thread still has the value $400 in newBalance, so it simply assigns this value to userBalance and returns. Thus, Bob and Ann have withdrawn $200 in total, but their balance shows $400, a net withdrawal of only $100. That is a great feature for Bob and Ann, but a big problem for the bank.

So, here is a solution using a mutex:


// only the thread holding the mutex can proceed
synchronized bool deposit() { /* ... */ }

synchronized bool withdraw() { /* ... */ }
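The same fix can be sketched in runnable Python with `threading.Lock` (the `Account` class and its names are mine): the lock makes the read-modify-write sequence atomic, so the lost-update scenario above cannot happen.

```python
import threading

class Account:
    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()      # the "key" only one thread holds

    def withdraw(self, amount):
        with self.lock:                   # read-modify-write is now atomic
            new_balance = self.balance - amount
            self.balance = new_balance

account = Account(500)
threads = [threading.Thread(target=account.withdraw, args=(100,))
           for _ in range(2)]             # Bob and Ann withdraw concurrently
for t in threads:
    t.start()
for t in threads:
    t.join()
# both withdrawals are applied: the balance drops from 500 to 300
```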

OK. Now that we know what a mutex is, it is natural to ask:

3.  What is difference between semaphore and mutex?

A mutex is a binary semaphore, but a semaphore is not necessarily a mutex. A mutex is always used for mutual exclusion, e.g., protecting a critical section, while a semaphore can do more. A semaphore is a more general concept that limits the number of threads accessing a shared resource concurrently.

Let me borrow a metaphor from Stack Overflow. You can use water to wash your dishes, and I can use water to wash mine, but the total supply of water is limited. The semaphore is the number of people who can wash dishes at the same time.

4. What can we do with semaphore?

First, let me introduce the syntax for semaphores. To keep it simple, I use pseudocode in this article.

The syntax for creating and initializing one is

fred = semaphore(1)

Semaphore operations are

fred.signal();
fred.wait();

The value of the semaphore increases by 1 when signal() is called and decreases by 1 when wait() is called. A thread that calls wait() blocks whenever the resulting value is negative, and stays blocked until another thread calls signal() to raise the value and wake it.

Let us look at the uses of semaphores.

(1). Mutual exclusion, which is mentioned above.

(2). Signaling. A semaphore is a way for processes/threads to communicate.

Assume we have a semaphore named sem with initial value 0, and that Thread A and Thread B can access it.

//Thread A
statement a1;
sem.signal();
//Thread B
sem.wait();
statement b1;

Each statement represents an arbitrary program statement. To make the example concrete, assume a1 reads a line from a file and b1 displays that line on the screen. The semaphore in this program guarantees that Thread A completes a1 before Thread B begins b1.
Here is how it works. If Thread B arrives first, it calls wait(), which decreases the value of sem to -1, so Thread B blocks. When Thread A arrives and completes a1, it calls signal(), which increases the value of sem; Thread B is then unblocked and proceeds to statement b1.

Similarly, if Thread A arrives first, it increases the value of sem to 1, so when Thread B arrives it proceeds immediately. Either way, statement a1 is guaranteed to precede statement b1.
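The same signaling pattern can be run in Python with `threading.Semaphore` (note one difference from the pseudocode: Python's semaphore blocks at zero rather than going negative, but the effect is the same):

```python
import threading

sem = threading.Semaphore(0)   # initial value 0, as in the pseudocode
events = []

def thread_a():
    events.append("a1")        # statement a1
    sem.release()              # signal()

def thread_b():
    sem.acquire()              # wait(): blocks until A has signaled
    events.append("b1")        # statement b1

# start B first to show that it really waits for A
b = threading.Thread(target=thread_b)
a = threading.Thread(target=thread_a)
b.start()
a.start()
a.join()
b.join()
# events is ["a1", "b1"] regardless of scheduling order
```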

Signaling is a basic use of semaphores; more uses will be introduced later.

A classic multithreading problem, producer–consumer, can also be solved with semaphores.

Q: In one common pattern, some threads are producers and some are consumers. Producers create items of some kind and add them to a data structure; consumers remove the items and process them. Event-driven programs are a good example. An “event” is something that happens that requires the program to respond: the user presses a key or moves the mouse, a block of data arrives from the disk, a packet arrives from the network, a pending operation completes. Whenever an event occurs, a producer thread creates an event object and adds it to the event buffer. Concurrently, consumer threads take events out of the buffer and process them. In this case, the consumers are called “event handlers.”

A: There are several synchronization constraints we need to enforce for the system to work correctly:

  • Access to the buffer is mutually exclusive among the threads.
  • A consumer thread blocks when the buffer is empty.
  • A producer thread blocks when the buffer is full.

OK, here is my pseudocode for the producer/consumer problem.

// initialization
mutex = semaphore(1);
notEmpty = semaphore(0);
notFull = semaphore(buffer.size());

// code for Thread producer
notFull.wait();
mutex.wait();
producer statement;
mutex.signal();
notEmpty.signal();

//code for Thread consumer
notEmpty.wait();
mutex.wait();
consumer statement;
mutex.signal();
notFull.signal();
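The pseudocode above can be turned into a runnable Python sketch with `threading.Semaphore` (buffer capacity, item counts, and helper names are mine, chosen only for the demo):

```python
import threading
from collections import deque

CAPACITY = 3
buffer = deque()
mutex = threading.Semaphore(1)            # mutual exclusion on the buffer
not_empty = threading.Semaphore(0)        # counts items in the buffer
not_full = threading.Semaphore(CAPACITY)  # counts free slots
results = []

def producer(items):
    for item in items:
        not_full.acquire()                # wait for a free slot
        mutex.acquire()
        buffer.append(item)               # producer statement
        mutex.release()
        not_empty.release()               # announce a new item

def consumer(n):
    for _ in range(n):
        not_empty.acquire()               # wait for an item
        mutex.acquire()
        results.append(buffer.popleft())  # consumer statement
        mutex.release()
        not_full.release()                # announce a free slot

p = threading.Thread(target=producer, args=(range(10),))
c = threading.Thread(target=consumer, args=(10,))
c.start()
p.start()
p.join()
c.join()
# all 10 items pass through the bounded buffer, in order
```

Note the ordering in each thread: waiting on `not_full`/`not_empty` happens before taking the mutex; doing it the other way around can deadlock, because a thread could block on the counting semaphore while still holding the mutex.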