A Few Notes on SQL

There are a couple of important concepts in SQL that confused me for a long time. After an afternoon of reading in the library, I finally understood a few of them, so I decided to write them down as a snapshot for future reference.

1. Index

An index is the primary mechanism for getting improved performance out of a relational database design, especially for speeding up queries. Indexes are persistent data structures stored in the database, and they are created either automatically or explicitly by the programmer.

Let us see how an index works. Suppose we have the table T below:

A B
cat 2
dog 5
cow 1
dog 9
cat 2
cat 8
cow 6

An index on column A can be created with a statement like "CREATE INDEX idx_A ON T(A)", and conceptually the table turns into

index A B
1 cat 2
2 dog 5
3 cow 1
4 dog 9
5 cat 2
6 cat 8
7 cow 6

If the query T.A = 'cow' is issued, the rows (cow, 1) and (cow, 6) will be returned. Note that users do not access indexes directly; they are used underneath by the query execution engine.

The mechanism indexes the tuples of a table to speed up retrieval time (in some sense like a hash table). So it is natural to ask which data structure an index is based on. Two popular data structures, balanced trees and hash tables, are the usual candidates. A balanced tree provides O(log n) lookups, which is not bad even for large amounts of data, while a hash table provides O(1) lookups if no collisions happen. That seems to suggest the hash table outweighs the balanced tree. However, data stored in a balanced tree are ordered, which is efficient for inequality queries such as T.A < 'd'. Hence there is a trade-off between these two data structures, and the appropriate one should be chosen based on the type of query.
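To make the trade-off concrete, here is a minimal C++ sketch (not how a real database engine implements its indexes): std::multimap, a balanced tree, supports both the equality lookup T.A = 'cow' and the range lookup T.A < 'd', while std::unordered_multimap, a hash table, only supports equality lookups efficiently. The data is the small table T from above.

#include <iostream>
#include <map>
#include <string>
#include <unordered_map>

int main() {
    // Tree-based "index" on column A: keys are kept in sorted order.
    std::multimap<std::string, int> treeIndex = {
        {"cat", 2}, {"dog", 5}, {"cow", 1}, {"dog", 9},
        {"cat", 2}, {"cat", 8}, {"cow", 6}};

    // Equality query: T.A = 'cow'
    auto range = treeIndex.equal_range("cow");
    for (auto it = range.first; it != range.second; ++it)
        std::cout << it->first << " " << it->second << "\n";

    // Inequality query: T.A < 'd' -- only efficient because keys are ordered.
    for (auto it = treeIndex.begin(); it != treeIndex.lower_bound("d"); ++it)
        std::cout << it->first << " " << it->second << "\n";

    // Hash-based "index": O(1) equality lookup, but no ordering,
    // so an inequality query would have to scan every bucket.
    std::unordered_multimap<std::string, int> hashIndex(treeIndex.begin(),
                                                        treeIndex.end());
    auto h = hashIndex.equal_range("cow");
    for (auto it = h.first; it != h.second; ++it)
        std::cout << it->first << " " << it->second << "\n";
    return 0;
}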

Downsides of indexes. Obviously, indexes are employed to improve query performance, but everything has two sides, and indexes are no exception. Their main drawbacks are listed below.

(1) Indexes consume extra space. A new index structure is stored alongside the table when an index is created, and indexes can take up a lot of space when the table consists of a large number of rows.

(2) Index creation consumes time. The creation time may seem trivial compared with the query time on a large database, but that is not the case when the database is relatively small.

(3) Index maintenance can become a significant burden for designers. Every update, insertion, or deletion of tuples must also be reflected in the index to keep it consistent with the data, and the burden of index maintenance may even offset the benefit of the index itself.

As discussed above, indexes have both benefits and downsides for database design. Generally, whether to use an index depends on:

(1) Size of the table. Indexes bring little benefit for a small table, considering both the time and the space cost.

(2) Data distribution. A table with lots of duplicate values is often better off without an index, because the query speedup is small while the space and time costs remain.

(3) Query load vs. update load. If the database mainly serves queries, indexes will clearly benefit performance. However, if the operations are mostly updates, index maintenance becomes a burden so heavy that the benefit of the index can be offset.

2. Referential Integrity

Referential integrity is a mechanism for keeping data consistent between tables, and it is implemented with primary keys and foreign keys.

Let us show this with an example. Suppose we have two tables, Student(sID, sName, GPA, HS) and Apply(sID, cName, major, dec).

Student
sID sName GPA HS
123 Mary 3.8
Apply
sID cName major dec
123 Stanford CS y

In this example, Student.sID is the primary key while Apply.sID is a foreign key.

Definition: Referential integrity from R.A to S.B means that each value in column A of table R must appear in column B of table S. (The set of foreign key values is a subset of the referenced key values.)

(1) A is called the foreign key. (Apply.sID)

(2) B is usually required to be the primary key of table S, or at least unique. (Student.sID)

Whenever the referenced key is updated, or rows are inserted or deleted, the foreign key values must be kept valid (for example, by rejecting the change or cascading it); enforcing this constraint is the mechanism of referential integrity.

 

Hash table vs Binary search tree

Hash tables and binary search trees are two fundamental data structures in CS. When would you want to use one over the other? What is the difference? This is one of the interview questions I have been asked most frequently.

Well, it is not easy to answer in one or two sentences. The main differences between hash tables and trees lie in two aspects: implementation details, and behavior and performance under different circumstances.

Let me start with implementation. A hash table uses a hash function to assign an index to each input key and stores the data in an array under the corresponding index. If the size of the hash table is 100, the hash function maps input keys to indexes 0-99, and the items are stored in those slots. Theoretically, the time complexity of insertion and lookup is constant (O(1)).
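As a minimal C++ sketch of that mapping (the table size of 100 and the key "cat" are just examples, not part of any particular hash table implementation):

#include <functional>
#include <iostream>
#include <string>

int main() {
    const std::size_t tableSize = 100;  // an array of 100 slots
    std::string key = "cat";
    // The hash function maps the key to some index in 0..99.
    std::size_t slot = std::hash<std::string>{}(key) % tableSize;
    std::cout << "key '" << key << "' goes to slot " << slot << "\n";
    return 0;
}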

A binary search tree follows the rule that all keys in the left subtree are less than the root, while all keys in the right subtree are greater than it. If the tree is balanced, it takes O(log n) time to insert a new node or look one up, where n is the total number of nodes and log n is roughly the depth of the tree. O(log n) is not as fast as constant time, but it is still fast. Notice that I said "if the tree is balanced"; there are well-known self-balancing schemes (e.g., red-black trees) that keep the tree balanced.

It may seem that we should prefer the hash table over the tree, but that is not always the case. Hash tables have significant drawbacks:

1. As more data comes in, there is a high probability that collisions show up (the hash function maps different keys to the same index). There are two common ways to handle collisions. The first is separate chaining, which implements the hash table as an array of linked lists (see the sketch after this list); in the worst case, when all keys map to the same index, insertion, lookup, and deletion take O(n), and the table also needs more slots than the number of stored items. The second is open addressing (linear probing is one example), which stores the items directly in the array rather than in separate lists, but in the worst case insertion and lookup are still O(n), far slower than constant time.

2. You have to know the approximate size of the input before initializing the hash table. Otherwise you need to resize it, which is a very time-consuming operation. For example, suppose your hash table has 100 slots and you insert the 101st element: the table has to be enlarged, say to 150 slots, and every element already in it has to be rehashed into the new array, so this insertion takes O(n).

3. The elements stored in a hash table are unsorted. In some circumstances we want the data kept in sorted order, like the contacts in a cell phone.
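Here is a minimal C++ sketch of the separate-chaining idea from point 1, assuming a fixed bucket count; the class name FixedHashTable and the bucket count of 8 are my own choices for illustration.

#include <forward_list>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

// A tiny fixed-size hash table using separate chaining:
// each bucket is a linked list of the keys that hash to that index.
class FixedHashTable {
public:
    explicit FixedHashTable(std::size_t buckets) : buckets_(buckets) {}

    void insert(const std::string& key) {
        buckets_[index(key)].push_front(key);  // duplicates are not checked
    }

    bool contains(const std::string& key) const {
        for (const auto& k : buckets_[index(key)])  // walk the chain
            if (k == key) return true;
        return false;
    }

private:
    std::size_t index(const std::string& key) const {
        return std::hash<std::string>{}(key) % buckets_.size();
    }
    std::vector<std::forward_list<std::string>> buckets_;
};

int main() {
    FixedHashTable table(8);
    table.insert("cat");
    table.insert("dog");
    std::cout << std::boolalpha << table.contains("cat") << "\n";  // true
    std::cout << table.contains("cow") << "\n";                    // false
    return 0;
}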

A binary search tree, however, holds up well against a hash table:

1. A binary search tree never has collisions, so a balanced tree can guarantee that insertion, lookup, and deletion run in O(log n), much better than linear time. Besides, the space needed by the tree is proportional to the number of items stored.

2. You do not need to know the size of the input in advance.

3. All elements in the tree are kept sorted, and an in-order traversal visits them in order in O(n) time (a small sketch follows this list).
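A minimal sketch of these three points, using std::map (usually implemented as a red-black tree) as a stand-in for a balanced binary search tree; the contact names and numbers are made up.

#include <iostream>
#include <map>
#include <string>

int main() {
    // No size needs to be declared up front; the tree grows as items arrive.
    std::map<std::string, std::string> contacts;
    contacts["Carol"] = "555-0103";
    contacts["Alice"] = "555-0101";
    contacts["Bob"] = "555-0102";

    // In-order traversal is O(n) and visits the names in sorted order,
    // like the contact list on a cell phone.
    for (const auto& entry : contacts)
        std::cout << entry.first << ": " << entry.second << "\n";
    return 0;
}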

OK, let me make a summary.

If you know roughly how much data you will maintain, have enough space to store the hash table, and do not need the data to be sorted, a hash table is usually a good choice, because it provides constant-time insertion, lookup, and deletion. On the other hand, if items will keep being added, a binary search tree's O(log n) operations are acceptable compared with rehashing at run time.

Besides, if you do not know the number of input items in advance but most operations after insertion are lookups, a hash table is preferred because of its constant lookup time. However, if items are continuously added or removed, the tree's O(log n) insertion and deletion times are more suitable.

In a word, there is no single answer as to whether a hash table or a tree is better. What we need to know is the pros and cons of each in different conditions; the best decision can be made with knowledge of the benefits and trade-offs of these two structures.

Topics on semaphores

A couple of days ago, I had a phone interview with Cisco. During the interview, I was asked several questions about multi-threaded programming, which embarrassed me a lot. After reviewing my operating systems knowledge, I want to write down several basic concepts about semaphores.

1. What is a semaphore?

On Wikipedia, a semaphore is defined as a variable or data structure that provides a simple but useful abstraction for controlling access to a common resource by multiple processes in a parallel programming environment.

We cannot talk about semaphores without mentioning mutexes. So,

2. What is a mutex?

Mutex is short for mutual exclusion, which refers to the problem of ensuring that no two processes or threads are in their critical sections at the same time. Here, a critical section refers to a period of time during which a process accesses a shared resource, such as shared memory.

In a word, a mutex is like a key to a door that only one thread can hold at a time; the other threads wait for that thread to leave and release the key.

Now let us see an example, the ATM problem.

A husband, Bob, and his wife, Ann, each withdraw $100 from their joint account at different ATMs at the same time. The thread for Bob's ATM starts deducting $100 from the couple's account, but it is switched out after executing the first of these two lines:

newBalance = userBalance - amount;  // amount is $100 in this scenario

userBalance = newBalance;  // not yet executed when the switch happens

Processor control then switches to the thread for Ann's ATM, which also deducts $100. When that thread reads the balance, it is still $500, because the variable userBalance has not yet been updated. Ann's thread runs to completion and updates userBalance to $400. Then control switches back to Bob's transaction. Bob's thread still has the value $400 in newBalance, so it simply assigns this value to userBalance and returns. Thus Bob and Ann have deducted $200 in total from their account, but the balance still shows $400, i.e., only a net $100 withdrawal. This is a great feature for Bob and Ann, but a big problem for the bank.

So, here is the solution sketched with a mutex (Java-like pseudocode, where a synchronized method can be entered by only one thread at a time):


// only the thread holding the mutex can proceed
synchronized bool deposit() {}

synchronized bool withdraw() {}
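For a more concrete sketch, here is the same idea in C++ with std::mutex; the Account class, its field names, and the use of $500 as the starting balance are my own framing of the ATM example, not code from it. The critical section makes the read-modify-write atomic, so the lost update above can no longer happen.

#include <iostream>
#include <mutex>
#include <thread>

class Account {
public:
    explicit Account(int balance) : balance_(balance) {}

    void withdraw(int amount) {
        std::lock_guard<std::mutex> lock(mutex_);  // only one thread at a time
        int newBalance = balance_ - amount;        // the read-modify-write
        balance_ = newBalance;                     // can no longer be interleaved
    }

    int balance() const { return balance_; }       // read after both threads join

private:
    std::mutex mutex_;
    int balance_;
};

int main() {
    Account joint(500);
    std::thread bob([&] { joint.withdraw(100); });
    std::thread ann([&] { joint.withdraw(100); });
    bob.join();
    ann.join();
    std::cout << joint.balance() << "\n";  // always 300 now
    return 0;
}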

OK. Since we know what a mutex is, it is very natural to ask the question:

3. What is the difference between a semaphore and a mutex?

A mutex is essentially a binary semaphore, but a semaphore is not necessarily a mutex. A mutex is always used for mutual exclusion, e.g., protecting shared content, while a semaphore can do more than that: it is a more general mechanism that limits the number of threads accessing a shared resource concurrently.

Let me borrow a metaphor from Stack Overflow: you can use water to wash your dishes and I can use water to wash mine, but the total amount of water is limited. The semaphore is the number of people who can use the water to wash dishes at the same time.
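A minimal C++20 sketch of this metaphor, assuming at most 3 sinks are available; the number 3, the function name washDishes, and the sleep are all just for illustration.

#include <chrono>
#include <iostream>
#include <semaphore>
#include <string>
#include <thread>
#include <vector>

// At most 3 threads may "wash dishes" at the same time.
std::counting_semaphore<3> sinks(3);

void washDishes(int person) {
    sinks.acquire();  // wait until one of the 3 sinks is free
    std::cout << ("person " + std::to_string(person) + " is washing\n");
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    sinks.release();  // give the sink back
}

int main() {
    std::vector<std::thread> people;
    for (int i = 0; i < 10; ++i)
        people.emplace_back(washDishes, i);
    for (auto& t : people)
        t.join();
    return 0;
}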

4. What can we do with semaphores?

First, let me introduce the syntax of semaphores. To keep it simple, I use pseudocode for semaphores in this article.

The syntax for creating and initializing one is

fred = semaphore(1)

Semaphore operations are

fred.signal();
fred.wait();

The value of the semaphore is increased by 1 when signal() is called and decreased by 1 when wait() is called. If a thread calls wait() and the value becomes negative, the thread blocks until another thread calls signal(), which increases the value and unblocks a waiting thread.

Let us see some uses of semaphores.

(1) Mutual exclusion, as mentioned above.

(2) Signaling. A semaphore is also a way for processes/threads to communicate with each other.

Assume we have a semaphore named sem with an initial value of 0, and that Thread A and Thread B can both access it.

// Thread A
statement a1;
sem.signal();

// Thread B
sem.wait();
statement b1;

The statements represent arbitrary program statements. To make the example concrete, assume a1 reads a line from a file and b1 displays that line on the screen. The semaphore in this program guarantees that Thread A completes a1 before Thread B begins b1.
Here is how it works: if Thread B arrives first, it calls wait(), which decreases the value of sem to -1, so Thread B blocks until Thread A arrives and calls signal(), which increases the value of sem. Thread B is then unblocked and resumes to execute statement b1.

Similarly, if Thread A arrives first, it increases the value of sem to 1, and when Thread B arrives it proceeds immediately. Either way, statement a1 is guaranteed to precede statement b1.
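Here is a runnable C++20 version of the same signaling pattern, with std::binary_semaphore standing in for the pseudocode sem; the printed strings stand in for statements a1 and b1.

#include <iostream>
#include <semaphore>
#include <thread>

std::binary_semaphore sem(0);  // initial value 0, like sem = semaphore(0)

void threadA() {
    std::cout << "a1: read a line from the file\n";  // statement a1
    sem.release();                                   // sem.signal()
}

void threadB() {
    sem.acquire();                          // sem.wait(): blocks until A signals
    std::cout << "b1: display the line\n";  // statement b1
}

int main() {
    std::thread b(threadB);
    std::thread a(threadA);
    a.join();
    b.join();
    return 0;
}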

Signaling is a basic use of semaphores; more uses will be introduced later.

There is a classic multi-threading problem, the producer-consumer problem, that can be solved with semaphores.

Q: In one common pattern, some threads are producers and some are consumers. Producers create items of some kind and add them to a data structure; consumers remove the items and process them. Event-driven programs are a good example. An “event” is something that happens that requires the program to respond: the user presses a key or moves the mouse, a block of data arrives from the disk, a packet arrives from the network, a pending operation completes. Whenever an event occurs, a producer thread creates an event object and adds it to the event buffer. Concurrently, consumer threads take events out of the buffer and process them. In this case, the consumers are called “event handlers.”

A: There are several synchronization constraints that we need to enforce to make the system work correctly:

  • Access to the buffer is mutually exclusive among the threads.
  • A consumer thread blocks when the buffer is empty.
  • A producer thread blocks when the buffer is full.

OK, here is my pseudocode for the producer/consumer problem.

// initialization
mutex = semaphore(1);
notEmpty = semaphore(0);
notFull = semaphore(buffer.size());

// code for Thread producer
notFull.wait();
mutex.wait();
producer statement;
mutex.signal();
notEmpty.signal();

//code for Thread consumer
notEmpty.wait();
mutex.wait();
consumer statement;
mutex.signal();
notFull.signal();
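And here is a runnable C++20 sketch of the same structure, using std::counting_semaphore and std::mutex, with a std::queue as the buffer; the capacity of 4 and the loop counts of 10 are arbitrary choices of mine.

#include <iostream>
#include <mutex>
#include <queue>
#include <semaphore>
#include <thread>

constexpr int kCapacity = 4;

std::queue<int> buffer;
std::mutex mtx;                                          // mutex = semaphore(1)
std::counting_semaphore<kCapacity> notFull(kCapacity);   // counts free slots
std::counting_semaphore<kCapacity> notEmpty(0);          // counts filled slots

void producer() {
    for (int item = 0; item < 10; ++item) {
        notFull.acquire();                          // notFull.wait()
        {
            std::lock_guard<std::mutex> lock(mtx);  // mutex.wait() ... mutex.signal()
            buffer.push(item);                      // producer statement
        }
        notEmpty.release();                         // notEmpty.signal()
    }
}

void consumer() {
    for (int i = 0; i < 10; ++i) {
        notEmpty.acquire();                         // notEmpty.wait()
        int item;
        {
            std::lock_guard<std::mutex> lock(mtx);  // mutex.wait() ... mutex.signal()
            item = buffer.front();                  // consumer statement
            buffer.pop();
        }
        notFull.release();                          // notFull.signal()
        std::cout << "consumed " << item << "\n";
    }
}

int main() {
    std::thread p(producer), c(consumer);
    p.join();
    c.join();
    return 0;
}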

Begin to code

I need to think about what programming language I will use before writing code.

If I pick C, I need to consider how much memory I will need and allocate it up front; it is dangerous to use memory that has not been allocated.

If I pick C++/Java (I do not know them that well, though), I should write the code in C++ style (containers, methods). Mixing C and C++ styles is not a good habit.

This is a piece of mixed C/C++ code (dirty code):

string replaceStr(string str)
{
    string strNew;
    string::size_type nx = 0;
    for (string::size_type ox = 0; ox != str.size(); ++ox)
    {
        if (str[ox] == ' ')  // replace space with "%20"
        {
            strNew[nx++] = '%';  // error: strNew is empty, so writing to strNew[nx] is undefined behavior!
            strNew[nx++] = '2';
            strNew[nx++] = '0';
        }
        else
            strNew[nx++] = str[ox];  // same problem here
    }

    return strNew;
}

The code below is written in C++ style (letting the container manage its own storage). This piece of code is totally OK:

string replaceStr(string str)
{
    string strNew;
    for (string::iterator iter = str.begin(); iter != str.end(); ++iter)
    {
        if (*iter == ' ')
            strNew += "%20";
        else
            strNew += *iter;
    }

    return strNew;
}
 