Sunday, September 29, 2013

Sorting Quickly

Merge sort and heap sort both run in O(n lg n), which is the lower bound for comparison-based sorting algorithms.

Merge sort is intuitive and fairly straightforward. Heap sort has the advantage that the heap underneath it also makes an efficient priority queue.

Then there is quick sort. Quick sort also runs in O(n lg n), but only in its average and best cases. In the worst case, quick sort runs in O(n^2). Even so, quick sort is usually preferred over merge sort and heap sort. Why? Because outside the worst case, quick sort runs faster in practice thanks to its tighter inner loops. But what about the worst case? The good news is that it only happens when we pick an extreme element (say, the largest) as the pivot at every single step. If we randomize the pivot choice, the chance of picking the largest element every time is 1/n * 1/(n-1) * ... * 1/1 = 1/n!, which is vanishingly small. Hence, in practice, it's worth the risk of hitting the worst case.
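
To make that concrete, here's a minimal sketch of the randomized variant (the gists below have my full version with tests):

# Randomized quick sort: choosing the pivot uniformly at random makes
# the O(n^2) worst case astronomically unlikely on any fixed input.
def quicksort(array, low = 0, high = array.length - 1)
  return array if low >= high
  pivot_index = partition(array, low, high)
  quicksort(array, low, pivot_index - 1)
  quicksort(array, pivot_index + 1, high)
  array
end

# Lomuto partition with a random element swapped into the pivot slot.
def partition(array, low, high)
  random = rand(low..high)
  array[random], array[high] = array[high], array[random]
  pivot = array[high]
  boundary = low
  (low...high).each do |i|
    if array[i] <= pivot
      array[i], array[boundary] = array[boundary], array[i]
      boundary += 1
    end
  end
  array[boundary], array[high] = array[high], array[boundary]
  boundary
end

p quicksort([5, 2, 9, 1, 5, 6])  # => [1, 2, 5, 5, 6, 9]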

Here's my code for quick sort and a test file.
https://gist.github.com/wmwong/6758508#file-quicksort-rb
https://gist.github.com/wmwong/6758508#file-quicksort_test-rb

Saturday, September 28, 2013

Fun with Heaps!

It's been a while since my undergraduate days, and I'd forgotten how much fun my Computer Science degree was.

I'm re-reading Introduction to Algorithms and having quite a lot of fun with it.

The latest chapter I went through covers heaps, so I decided to implement some of its data structures and algorithms.

Here is my code for creating heaps (both max and min-heaps) and heap sort.
https://gist.github.com/wmwong/6747122#file-heap-rb
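
For a quick flavor, here's heap sort boiled down to its core, a sketch only (the gist is the full version):

# Heap sort in miniature: build a max-heap bottom-up, then repeatedly
# swap the max to the end and sift the new root back down.
def max_heapify(array, heap_size, i)
  left = 2 * i + 1
  right = 2 * i + 2
  largest = i
  largest = left if left < heap_size && array[left] > array[largest]
  largest = right if right < heap_size && array[right] > array[largest]
  if largest != i
    array[i], array[largest] = array[largest], array[i]
    max_heapify(array, heap_size, largest)
  end
end

def heap_sort(array)
  (array.length / 2 - 1).downto(0) { |i| max_heapify(array, array.length, i) }
  (array.length - 1).downto(1) do |last|
    array[0], array[last] = array[last], array[0]
    max_heapify(array, last, 0)
  end
  array
end

p heap_sort([4, 10, 3, 5, 1])  # => [1, 3, 4, 5, 10]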

I also created a test file.
https://gist.github.com/wmwong/6747122#file-heap_test-rb

Priority queues are also introduced alongside heaps, as the two go really well together. Here are my implementations of a max- and a min-priority queue.
https://gist.github.com/wmwong/6747122#file-priority_queue-rb

And of course, tests!
https://gist.github.com/wmwong/6747122#file-priority_queue_test-rb
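
To round things off, here's a max-priority queue stripped to its bones, just a sketch built on the same sift-up/sift-down ideas (the gist versions are the complete ones):

# Minimal max-priority queue backed by a binary heap stored in an array.
class MaxPriorityQueue
  def initialize
    @heap = []
  end

  # Append, then sift up while the new value beats its parent.
  def push(value)
    @heap << value
    i = @heap.length - 1
    while i > 0 && @heap[(i - 1) / 2] < @heap[i]
      parent = (i - 1) / 2
      @heap[i], @heap[parent] = @heap[parent], @heap[i]
      i = parent
    end
  end

  # Remove the root, move the last element up, and sift it down.
  def pop
    max = @heap[0]
    last = @heap.pop
    unless @heap.empty?
      @heap[0] = last
      i = 0
      loop do
        largest = i
        left = 2 * i + 1
        right = 2 * i + 2
        largest = left if left < @heap.length && @heap[left] > @heap[largest]
        largest = right if right < @heap.length && @heap[right] > @heap[largest]
        break if largest == i
        @heap[i], @heap[largest] = @heap[largest], @heap[i]
        i = largest
      end
    end
    max
  end
end

q = MaxPriorityQueue.new
[3, 10, 1].each { |v| q.push(v) }
p q.pop  # => 10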

Friday, September 27, 2013

Code Kata 6: Anagrams

For this one, we get to play with anagrams! Anagrams are words made up of exactly the same letters, rearranged. An example is "read" and "dear".

The task is to take a list of words and find all the anagrams in the list.
http://codekata.pragprog.com/2007/01/kata_six_anagra.html

To be honest, I originally misread the task. I thought I was supposed to generate all possible anagrams for each word in the list; instead, you're supposed to find all words within the list that are anagrams of each other. More on my folly later.

For this task, the link to the word list is broken. I happened to find it after Googling, so I included it in my Gist.
https://gist.github.com/wmwong/6724712#file-wordlist-txt

My solution was fairly simple. I used a hash function that took each word and sorted its letters into alphabetical order. I then used the sorted word as the key into my hash and stored the original words under it as values. Because of the nature of anagrams, all words that are anagrams of each other generate the same sorted word.
https://gist.github.com/wmwong/6724712#file-6_anagrams-rb
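
Boiled down, the whole approach is essentially this (a condensed sketch; the gist is the full version):

# Group words by their letters sorted alphabetically; anagrams collide.
def group_anagrams(words)
  words.group_by { |word| word.downcase.chars.sort.join }.values
end

p group_anagrams(%w[read dear dare apple])
# => [["read", "dear", "dare"], ["apple"]]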

There is one issue with this solution though: it's not optimal, because we have to sort every word. If we let n be the number of words and m be the average number of letters per word, the running time is O(nm log m) if we sort with merge sort.

I cannot take credit for coming up with a better solution, but I will share it here. The fundamental theorem of arithmetic tells us two things: every integer greater than 1 can be represented as a product of primes, and that factorization is unique (although you can order the factors any way you like).

How does this help us? Well, it means we can assign each letter of the alphabet a unique prime number. When we get a word, we map each letter to its prime and multiply them all together. Two words produce the same product exactly when they multiply the same primes, which means they contain the same letters, which means we found an anagram! What's more, this hash function takes O(m) time, beating the O(m log m) of the sort-based one.

With this new hash function, the algorithm runs in O(nm), an improvement over O(nm log m). In reality though, m can probably be treated as a constant, since the average length of a word doesn't really change, so either solution would do just as well.

I didn't write full example code for this since it's straightforward, but a quick sketch of the idea (assuming lowercase a-z words) might look like this:
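
# Map each letter to a prime; by the fundamental theorem of arithmetic,
# two words multiply to the same product exactly when they're anagrams.
# Assumes lowercase a-z input; Ruby's bignums mean no overflow worries.
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]

def anagram_key(word)
  word.each_char.reduce(1) { |product, c| product * PRIMES[c.ord - 'a'.ord] }
end

words = %w[dear read dare bread beard]
p words.group_by { |w| anagram_key(w) }.values
# => [["dear", "read", "dare"], ["bread", "beard"]]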

Now, let me dive into what I originally thought the task was.

After reading the task, I thought we were given a list of words and, for each word, had to find all anagrams of that word. This is actually harder than the real task, since it means generating candidate anagrams of a word and then checking them against a dictionary.

Here is my solution to this task. I stopped once I noticed I was working on the wrong task, so it's not fully thought through. My first attempt tries to reduce the number of permutations checked by disregarding duplicate letters. Further optimizations could be made by memoizing each set of letters we come across. It also doesn't massage words down to alphanumeric characters only, so it may not be 100% correct.
https://gist.github.com/wmwong/6724712#file-6_anagram-rb
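
In miniature, the brute-force idea looks like this (just a sketch; the gist tries to prune the permutations further):

require 'set'

# Enumerate the (deduplicated) letter permutations and keep the ones
# found in the dictionary. This blows up factorially with word length,
# which is exactly why pruning matters.
def anagrams_of(word, dictionary)
  word.chars.permutation.map(&:join).uniq.select { |w| dictionary.include?(w) } - [word]
end

dictionary = Set.new(%w[dear dare read])
p anagrams_of('read', dictionary).sort  # => ["dare", "dear"]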

And here is the test file I used to check the solution.
https://gist.github.com/wmwong/6724712#file-6_anagram_test-rb

Lastly, here is the file I used to run the small example given in Code Kata 6. This is where I found out my list looked different from his, and that I was actually solving the wrong task.
https://gist.github.com/wmwong/6724712#file-6_anagram_generator-rb

Monday, September 23, 2013

Code Kata 4: Data Munging

This one is about reading data from a file and pulling out the relevant pieces.
http://codekata.pragprog.com/2007/01/kata_four_data_.html

Solutions

Kata Questions

  • Deciding to use regular expressions made the refactoring much easier (see the sketch after this list).
  • When I wrote the second program, I definitely followed many of the things I did in the first program. This made refactoring much easier too.
  • Ripping out common code is usually a good practice. It helps maintainability: when the shared logic changes, you change it in one place and test it once instead of hunting down duplicates. The shared piece also becomes a candidate for reuse the next time the same scenario comes up. Done right, it helps readability too, because it breaks the code into smaller chunks, so you hold less in your head. There is, however, one scenario where refactoring out common code becomes a downfall: when future requirements introduce slight differences into the "common" code. We can't know every case we'll ever handle at the time we refactor, so it's hard to account for this, and there have been several times where later changes diverged just enough that the earlier refactoring turned out to be a bad idea.
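
Since the regular expressions did the heavy lifting, here's a condensed sketch of part one (finding the day with the smallest temperature spread), assuming the kata's weather.dat layout of day number, max temp, and min temp in the first three columns:

# Find the label of the row with the smallest spread between the two
# numeric columns. The regex skips headers/footers and tolerates the
# stray '*' marks the data file attaches to some values.
def min_spread_label(path)
  best_label, best_spread = nil, nil
  File.readlines(path).each do |line|
    next unless line =~ /^\s*(\S+)\s+(\d+)\*?\s+(\d+)\*?/
    label, high, low = $1, $2.to_i, $3.to_i
    spread = (high - low).abs
    if best_spread.nil? || spread < best_spread
      best_label, best_spread = label, spread
    end
  end
  best_label
end

# puts min_spread_label('weather.dat')  # prints the day with the smallest spread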

Code Kata 3: How Big, How Fast?

This one helps you quickly estimate sizes and speed.
http://codekata.pragprog.com/2007/01/kata_three_how_.html

How Big?

How many bits?

For this one, the way I thought about it is that 10 bits gives you roughly 1,000 values (2^10 = 1024).
11 bits gives you roughly 2,000 (2^11).
A pattern emerges: 2,000 = 1,000 * 2 ~ 2^10 * 2^1 = 2^(10+1).
Hence, you can quickly get a rough approximation by counting how many factors of 1,000 there are and then figuring out the leftovers.
  • 1,000 ~ 2^10
  • 1,000,000 ~ 2^20
  • 1,000,000,000 ~ 2^30
  • 1,000,000,000,000 ~ 2^40
  • 8,000,000,000,000 = 1,000,000,000,000 * 8 ~ 2^40 * 2^3 = 2^43
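
A quick sanity check in Ruby shows the estimates line up with the exact bit counts:

# Exact bits needed (ceiling of log2) vs. the multiples-of-1,000 estimate.
{ 1_000 => 10, 1_000_000 => 20, 1_000_000_000 => 30,
  1_000_000_000_000 => 40, 8_000_000_000_000 => 43 }.each do |n, estimate|
  puts "#{n}: exact #{Math.log2(n).ceil} bits, estimate #{estimate} bits"
end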

How much space?

For a town of 20,000 residences, storing names, addresses, and phone numbers at about 200 characters per residence gives 20,000 * 200 = 4,000,000 characters. If each character takes up a byte, we end up with roughly 4MB of data.

Binary trees!

Storing 1,000,000 integers in a binary tree requires 1,000,000 nodes. In the best case (a balanced tree), the number of levels is log2(1,000,000), which by the approximation technique above is roughly 20. Anything worse than balanced means more levels, up to 1,000,000 in the fully degenerate case. To work out the space, notice what each node holds: its own value, a pointer to the left child, and a pointer to the right child. If each of those takes 32 bits (4 bytes), then each node requires 12 bytes. That totals 12,000,000 bytes, or 12MB of data.

How fast?

How long?

Assume the book has roughly 1,200 pages with roughly 2,000 characters per page, and that each character takes up a byte. That gives us roughly 2,400,000 bytes to send. A 56kb/s modem translates to 7,000B/s (8 bits per byte). So it would take roughly 343 seconds, which is roughly 6 minutes.
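
Checking the arithmetic (the page count and characters per page are my rough assumptions for a paperback):

bytes   = 1_200 * 2_000    # pages * characters per page
seconds = bytes / 7_000.0  # 56,000 bits/s at 8 bits per byte = 7,000 B/s
puts seconds / 60          # => ~5.7 minutes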

Binary search!

It takes 4.5ms to search 10,000 entries and 6ms to search 100,000: ten times as many entries for only 1.5ms more, which is binary search's logarithmic growth at work. Each further tenfold increase should add roughly another 1.5ms, and 10,000,000 is two tenfold steps past 100,000, so it should take roughly 6 + 1.5 + 1.5 = 9ms.
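
Or, fitting the two measurements to time = a + b * log2(n) and extrapolating in Ruby:

# Solve for the log-law through (10_000, 4.5) and (100_000, 6.0),
# then predict the time for 10,000,000 entries.
b = (6.0 - 4.5) / (Math.log2(100_000) - Math.log2(10_000))
a = 6.0 - b * Math.log2(100_000)
puts (a + b * Math.log2(10_000_000)).round(2)  # => 9.0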

Brute force

There are 96^16 possible passwords.
A rough estimate: 96 ~ 100, so 100^16 = (10^2)^16 = 10^32.
Since we can do 1,000 guesses per second, it would take 10^32 / 10^3 = 10^29 seconds to try them all.
There are 86,400 seconds in a day, which is roughly 100,000. This gives us 10^24 days.
Dividing by 365 is a little hard to do in your head, but dividing by 1,000 is "close enough". This gives us 10^21 years.
Brute force is definitely not a viable approach.
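
Ruby's bignums make it easy to check the exact figure:

# Exact version of the estimate above: 96^16 passwords at 1,000
# guesses per second, converted to years.
seconds = 96**16 / 1_000
years = seconds / 86_400 / 365
puts years  # => a 22-digit number, roughly 1.6 * 10^21 years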

Sunday, September 22, 2013

Code Kata 2: Karate Chop

This kata challenges you to implement binary search in five different ways and to document the things you had trouble with, like off-by-one and fencepost errors.
http://codekata.pragprog.com/2007/01/kata_two_karate.html

Here are my attempts in Ruby.
https://gist.github.com/wmwong/6662429

Got any other ideas on how else to solve this? Or feedback for my code? Let me know down below in the comments!