Trail of Coding: Code Kata 6: Anagrams

For this one, we get to play with anagrams! Anagrams are words with the same letters. An example is "read" and "dear".

The task is to take a list of words and find all the anagrams in the list.
http://codekata.pragprog.com/2007/01/kata_six_anagra.html

Being honest, I have to tell you that I originally misread the task. I thought that I was supposed to generate all possible anagrams for each word in the list. Instead, you're supposed to find all words that are anagrams within the list. More on my folly later.

For this task, the link to the word list is broken. I happened to find it after Googling, so I included it in my Gist.
https://gist.github.com/wmwong/6724712#file-wordlist-txt

My solution was fairly simple. I used a hash function which basically took each word and sorted the letters in alphabetical order. I then used the sorted word as the key to my hash and stored the original word as the value. Because of the nature of anagrams, all words that are anagrams of each other will generate the same sorted word.
https://gist.github.com/wmwong/6724712#file-6_anagrams-rb

There is one issue with this solution though. It is not optimal. We need to sort each word. If we let n be the number of words, and m be the average number of letters in a word, our running time is O(nmlogm) if we use merge sort.

I cannot take credit for coming up with a better solution, but I will share it here. There is a theorem called the Fundamental theorem of arithmetic. This gives rise to two things: that all integers greater than 1 can be represented as a product of primes and that those primes are always the same (although you can order them any way you like).

How does this help us? Well, that means we can assign each letter in the alphabet a unique prime number. When we get a word, we take each letter, map it to its prime number, and multiply it all together. This generates a unique integer. In order to get the same integer, you must multiply the same set of primes, which maps to the same set of letters, which means we found an anagram! What's more important is that this hash function takes O(m), which was better than our hash function based on sorting which was O(mlogm).

With this new hash function, that brings the algorithm to O(nm) which is an improvement over O(nmlogm). In reality though, m could probably be considered a constant because the average length of a word doesn't really change and so either solution would be just as good.

I didn't write example code for this as the code is straight forward.

Instead, I'll dive into what I thought was the task.

After reading the task, I thought that we were given a list of words, and for each word, we were to find all anagrams of that word. This task was actually harder than the original task as I had to generate possible anagrams of a word and then check them against a dictionary.

Here is my solution to this task. I stopped after I noticed I was working on the wrong task so it's not fully thought through. My first attempt tries to reduce the number of permutations checked by disregarding duplicate letters. Further optimizations could be made by memoizing each set of letters we come across. This also does not massage the word to only contain alphanumeric characters, so this solution may not be 100% correct.
https://gist.github.com/wmwong/6724712#file-6_anagram-rb

And here is the test file I used to check the solution.
https://gist.github.com/wmwong/6724712#file-6_anagram_test-rb

Lastly, here is the file I used to run the small example given in Code Kata 6. This is where I found out my list looked different than his, and that I was actually solving the wrong task.
https://gist.github.com/wmwong/6724712#file-6_anagram_generator-rb

Trail of Coding

Friday, September 27, 2013

Code Kata 6: Anagrams

1 comment: