Coding katas Clojure -- Trigrams

Kata 14 is a seemingly simple one that is concerned with, as Dave Thomas puts it, “the heuristics of processing” texts, using trigrams to produce (more or less) random new texts.

Trigrams are not a new concept for me. Although the underlying concept is simple, they can be used for many interesting applications. Trigrams are a special case of N-grams, where N=3 turns out to be especially useful (as in “giving better results as other values for N”) for natural language processing, at least for western languages. Nearly a decade ago, I had the pleasure to collaborate with some rather smart people who used trigrams to identify “matching” text snippets between dictionary entries. The idea was similar to what is described in this article on using trigram vectors and nearest neighborhood calculation for document classification.As I’m generally interested in NLP and not only in doing coding katas, I will mainly focus on the trigram aspects in this kata, not so much on the random text generation part.

If you followed the link to the Wikipedia article, it’s clear from the kata description that we need word-level trigrams, not character-level trigrams. The kata description also already reveals the data structure to use for solving the task, a HashMap and the algorithm is also described in enough detail to be straight-forward.

Let’s augment the kata a little and decompose the tasks:

  1. split some string into n-grams with default n=3, where we might want to apply different criteria to apply on where we can split the string (e.g. after each character or after each word)
  2. parse a file into n-grams, where we need to consider sentence boundaries
  3. parse a collection of files into n-grams concurrently (just to speed up parsing of a larger file collection and also to introduce another possibility to learn a little more about Clojure’s specific tools to handle concurrency)
  4. do some analysis on the trigrams found in the recommended Tom Swift and his aircraft text
  5. modify the n-gram computation to yield the “first two words as key / list of all third words value” map described in the kata
  6. build a lazy-seq version of the text generation algorithm (because, as the example in the description already shows, there might be circles which could lead to infinite results)
  7. maybe implement the nearest neighborhood classification scheme described in the paper linked just for fun

But first things first: let’s parse some string into trigrams. This, first of all, requires tokenization. As a first obvious naive idea, we start out with simple string splitting, using clojure.string. First let’s split on all whitespace #"\s", using the first sentence in the Tom Swift text:

kata14-trigrams.core> (str/split "Are you all ready, Tom?" #"\s")
["Are" "you" "all" "ready," "Tom?"]

This already shows the issues surrounding punctuation that Dave Thomas mentions in the kata description. Basically, we have to consider what we want to do with sentence boundaries. Fortunately, we’re ultimately using Java’s Pattern class, so we can also match (or split) on punctuation, although probably not on all punctuation, but only on those which signify a sentence boundary (i.e. the charset [.!?] followed by either whitespace #"\s+" or end of line $):

kata14-trigrams.core> (str/split "Are you all ready - Tom?" #"\s*\p{Punct}\s*")
["Are you all ready" "Tom"]
kata14-trigrams.core> (str/split "Are you all ready, Tom?" #"\s*[.?!](\s+|$")
["Are you all ready, Tom"]
kata14-trigrams.core> (map #(str/split %1 #"\s+")
                      (str/split "Are you all ready, Tom? I want to go."
                        #"\s*[.!?](\s+|$)"))
(["Are" "you" "all" "ready," "Tom"] ["I" "want" "to" "go"])

This still leaves the question open of what we want to do with the comma or any other interleaving punctuation. It’s clear that we want to get rid of it somehow, but it’s not too clear whether we would like to see “Tom” as a valid consecutive element in the text generation part. Probably not, so an idea here would be to try to make the remaining punctuation elements visible as separate tokens.

Let’s put this issue aside and move on to the actual n-gram generation. Quite obviously, “computing” an n-gram is really just a simple sequence operation: you move through the sequence, always taking n elements as needed, until you’re done. This is completely straight-forward to accomplish with a simple accumulator (acc) to collect the results that we take while looping through the sequence. (Code is on github, as always.)

(defn ngram
   "Given a sequence sq and a number n, returns a sequence of new contiguous sequences of n items that appear in sq."
   ([squence n]
     (ngram squence n []))
   ([squence n acc]
         (if-let [sq (seq squence)]
             (recur (rest sq) n (conj aux (take n sq)))
             acc)))

Given that we might want to run this on longer strings (texts, books), it makes sense to make this lazy by wrapping the call to the accumulator version in a lazy sequence.

 kata14-trigrams.core> (ngram [1 2 3 4 5 6] 3)
 ((1 2 3) (2 3 4) (3 4 5) (4 5 6) (5 6) (6))
 kata14-trigrams.core> (realized? (ngram [1 2 3 4 5 6] 3))
 false
 kata14-trigrams.core> (take 2 (ngram [1 2 3 4 5 6] 3))
 ((1 2 3) (2 3 4)

Okay, let’s combine this with our clojure.string/split experiments:

kata14-trigrams.core> (map #(str/split %1 #"\s+")
                      (str/split "Are you all ready, Tom? I want to go."
                        #"\s*[.!?](\s+|$)"))
(["Are" "you" "all" "ready," "Tom"] ["I" "want" "to" "go"])
kata14-trigrams.core> (map #(ngram %1 3) *1)
((("Are" "you" "all") ("you" "all" "ready,") ("all" "ready," "Tom") ("ready," "Tom") ("Tom"))
 (("I" "want" "to") ("want" "to" "go") ("to" "go") ("go")))

Okay, this looks like we’ve basically have everything we need in hand, now let’s make it a little bit more formal. First of all, the tokenization step. So far, we have basically done two things in one step, sentence boundary detection and in-sentence tokenization. I already hinted at the need to do further work on the in-sentence tokenization wrt. punctuation and there might be other steps that we might want to add in the future, for instance, stemming or further morphological analysis. I’ll not go in the direction of a more thorough tokenization method, which would require to go beyond regular expressions for many languages, but at least let’s communicate the intention of how the tokenize method works clearly.

(fact "Tokenize an input string, splitting sentences along the way"
      (tokenize "Are you ready, Tom? I want to go.") => '(("Are" "you" "ready" "," "Tom")
                                                                  ("I" "want" "to" "go")))

(defn tokenize
  "Tokenize a string"
  [string]
  (-> string
     (split-sentences)
     (tokenize-sentences)))

This is basically the top-level function for tokenizing an incoming string, threading the result of splitting sentences into a tokenization function. Let’s take a look at the details, which shouldn’t be surprising at all. First we have split-sentences, which is basically str/split on sentence end markers. Then we have split-on-whitespace, which we’ve also already seen. split-off-punctuation is basically handling all punctuation not used up during sentence boundary detection, which we will want to keep. And then we have two wrappers tokenize-sentence(s) which do nothing more than handling mapping over the various bits and pieces. This concludes the tokenization step, phew.

(fact "Split sentences in a string"
      (split-sentences "Are you ready, Tom? I want to go.") => ["Are you ready, Tom" "I want to go"])

 (defn split-sentences
   "Split a string into a sequence of sentences"
   [string]
   (str/split string #"\s*[.!?](\s+|$)"))

    (fact "Splitting on whitespace"
     (split-on-whitespace "Are you  ready") => ["Are" "you" "ready"]
     (split-on-whitespace "Are") => ["Are"])

(defn split-on-whitespace
  "Take a string and split it's content on whitespace, removing the whitespace"
  [string]
      (str/split string #"\s+"))

(fact "Splitting but keeping punctuation if any"
     (split-off-punctuation "ready,") => ["ready" ","]
     (split-off-punctuation "ready") => ["ready"]
     (split-off-punctuation "!+#?") => [""])

(defn split-off-punctuation
   "Take a string and split it's content, keeping punctuation as new tokens"
  [string]
  (let [match (re-find #"(\w+)(\p{Punct})?" string)
            result (rest (keep identity match))]
     (if (seq result)
    result
    (vector ""))))

(fact "Tokenize a sentence"
      (tokenize-sentence "Are you ready, Tom?") => '("Are" "you" "ready" "," "Tom" "?"))

(defn tokenize-sentence
  "Take a single sentence and return a sequence of tokens for it"
  [sentence]
  (flatten (map split-off-punctuation
               (split-on-whitespace sentence))))

(fact "Tokenize some sentences"
       (tokenize-sentences ["Are you ready, Tom?" "I want to go."]) => '(("Are" "you" "ready" "," "Tom" "?")
                                                                         ("I" "want" "to" "go" ".")))

(defn tokenize-sentences
   "Take a sequence of sentences and return a sequence of tokens for  each sentence"
  [sentences]
  (map tokenize-sentence sentences))

When combining this with the ngram function, this is already pretty close to what we’ll need to solve the original kata, although we will need some further adjustment to the data structure, which I’m going to tackle later.

kata14-trigrams.core> (map #(ngram %1 3)
                       (tokenize "Are you ready, Tom? I want to go."))
((("Are" "you" "ready") ("you" "ready" ",") ("ready" "," "Tom") ("," "Tom") ("Tom"))
(("I" "want" "to") ("want" "to" "go") ("to" "go") ("go")))

So, let’s move to the next part which is generating ngrams for an entire file. First of all, I think that again we better do this in a lazy fashion, no need to do lots processing of huge files that might not be completely necessary. Looking back we can see that for any given string tokenize processes the same string at least thrice: we’re first splitting on sentence boundaries, then handle punctuation and finally split on whitespace. If you think about reading files, it’s quite obvious that the readLine method of java.io.BufferedReader which is behind Clojure’s line-seq is also processing buffers quite similarly looking for line ends to split on. Maybe, we can combine some of the work? Let’s start out with figuring out how to process a file char by char lazily. An answer to a stackoverflow question on processing files per character in Clojure strictly follows line-seq:

(defn char-seq
  [^java.io.Reader rdr]
   (when-let [chr (.read rdr)]
      (if (>= chr 0)
          (cons (char chr) (lazy-seq (char-seq rdr))))))

This is a start but not too helpful as discussed in this other stackoverflow thread on processing large text files, as the result of line-seq and char-seq is a cons and the lazy part of it doesn’t help you much when you’re not processing the file right away. Instead one might want to return a lazy sequence of results, closing the file only afterwards. This could like this:

; cf. https://stackoverflow.com/questions/4118123/read-a-very-large-text-file-into-a-list-in-clojure/10462159#10462159
(defn lazy-file-chars [file]
   (letfn [(lfl-helper [rdr]
              (lazy-seq
               (if-let [chr (.read rdr)]
                      (when (> chr 0)
                          (cons (char chr) (lfl-helper rdr)))
                      (do (.close rdr) nil))))]
      (lfl-helper (clojure.java.io/reader file))))

When you look at this simple piece of code, besides reading characters from disc and building up a lazy-seq, it’s also a) doing a sanity check on the input and b) building up a particular structure to return. Sounds exactly like the hooks we might want to consider for parsing sentences on read. Let’s rip the code apart and combine it with the guts of split-sentences (matching explicitly on characters instead of using regular expression character classes):

 (defn read-next-sentence [rdr aux]
    (if-let [chr (.read rdr)]
       (let [character (char chr)]
         (cond (= \. character) aux
                       (= \? character) aux
                       (= \! character) aux
                       (= \tab character) (recur rdr (conj aux \ ))
          :else (recur rdr (conj aux character))))
    aux))

(defn file-sentences [file]
     (letfn [(lfs-helper [rdr]
                 (lazy-seq
               (if-let [sentence (seq (read-next-sentence rdr (vector)))]
                 (cons (apply str sentence) (lfs-helper rdr))
                 (do (.close rdr) nil))))]
                (lfs-helper (clojure.java.io/reader file))))

read-next-sentence has some obvious deficiencies: it now splits sentences on every occurrence of .?!, not only on those occurrences which are followed by whitespace. Second, it should handle (only) multiple occurrences of \return\newline characters (CRLF) as sentence delimiters, too. Solving both of these issues requires to go in the direction of real parsers where we would have to see aux as a stack of previously read characters. And we might not only want to deal with tabs specially (turning them into a space), e.g. we might want to replace multiple spaces/tabs into a single space etc. I’ll just draw a sketch here that we might want to elaborate further:

(fact "Test for sentence end"
      (sentence-end-p \space   [\g \o \.])                    => true
      (sentence-end-p \space   [\r \e \a \d \y \?])           => true
      (sentence-end-p \newline [\y \return \newline \return]) => true
      (sentence-end-p \newline [\y \newline])                 => true
      (sentence-end-p "B"      [\.])                          => false
      (sentence-end-p \newline [\y \return])                  => false
      (sentence-end-p \newline [\y])                          => false)

   (fact "Parse result for characters depends on previous reads"
      (next-char-result \space   [\g \o \space])  => [\g \o \space]
      (next-char-result \tab     [\g \o])         => [\g \o \space]
      (next-char-result \tab     [\g \o \space])  => [\g \o \space]
      (next-char-result \tab     [\g \o \tab])    => [\g \o \space]
      (next-char-result \return  [\g \o])         => [\g \o]
      (next-char-result \newline [\g \o \return]) => [\g \o \space]
      (next-char-result \newline [\g \o])         => [\g \o \space])

(defn sentence-end-p [character charstack]
    (cond (and (= character \space)
                       (some (partial = (peek charstack)) [\. \? \!])) true
                (and (= character \return)
                       (some (partial = (peek charstack)) [\. \? \!])) true
                (and (= character \newline)
                       (some (partial = (peek charstack)) [\. \? \!])) true
                (and (= character \newline)
                       (or (= (peek charstack) \newline)
                           (and (= (peek charstack) \return)
                                (= (peek (pop charstack)) \newline)))) true
                :else false))

(defn next-char-result [character charstack]
     (cond (and (empty? charstack)
                (or (= character \space)
                                (= character \tab)
                                (= character \newline)
                                (= character \return))) charstack
                 (and (= character \space)
          (= (peek charstack) \space))  charstack
         (and (= character \tab)
          (= (peek charstack) \space))  charstack
             (and (= character \tab)
              (= (peek charstack) \tab))  (conj (pop charstack) \space) ; should never happen
          (= character \tab) (conj charstack \space)
                      (= character \return) charstack
                 (and (= character \newline)
                      (= (peek charstack) \space)) charstack
                 (and (= character \newline)
                      (= (peek charstack) \return)) (conj (pop charstack) \space) ; should never happen
                      (= character \newline) (conj charstack \space)
                 :else (conj charstack character)))

I’ll leave it at that, although it’s clear that we can and probably should extend it in many different ways. Here are the adapted functions to use these:

(defn read-next-sentence
   ([rdr]
       (read-next-sentence rdr (vector) (vector)))
   ([rdr seen result]
       (let [chr (.read rdr)]
           (if (and chr
                    (>= chr 0))
               (let [character (char chr)]
                   (if (sentence-end-p character seen)
                    result
                        (recur rdr (conj seen character)
                           (next-char-result character result))))
               result))))

(defn read-sentences [x]
    (letfn [(lfs-helper [rdr]
                (lazy-seq
                (if-let [sentence (read-next-sentence rdr)]
                    (cons (apply str sentence) (lfs-helper rdr))
                    (do (.close rdr) nil))))]
        (lfs-helper (clojure.java.io/reader x))))

The result is here that we now have a read-next-sentence function which just reads (non-lazily) and a (local) helper function which uses it to build up lazy sequence of sentences. Let’s test it briefly:

kata14-trigrams.core> (pprint 
                     (map #(ngram %1 3) 
                  (tokenize-sentences 
                  (take 2 
                        (read-sentences test-file)))))
((("The" "Project" "Gutenberg")
 ("Project" "Gutenberg" "EBook")
 ("Gutenberg" "EBook" "of")
 ("EBook" "of" "Tom")
 ("of" "Tom" "Swift")
 ("Tom" "Swift" "and")
     ("Swift" "and" "his")
     ("and" "his" "Airship")
  ...

Although one would probably now integrate more functionality from tokenize-sentences into read-next-sentence, I’ll won’t elaborate this now and see task 2 as solved. As a side note, this looks as if it’s only restricted to files now, but it really isn’t, as clojure.java.io/reader will happily accept StringReader arguments:

kata14-trigrams.core> (import java.io.StringReader)
java.io.StringReader
kata14-trigrams.core> (take 2 (read-sentences (StringReader. "This is a sentence. And another one")))
("This is a sentence." "And another one")

Coding katas Clojure -- Anagrams

Kata 6 is concerned with anagrams. An anagram is a word that consists of characters which, when combined in a different order, form a different word. Now, when I started out with this kata, I was sitting on a train without internet connection, so I just went ahead with what I remembered from a quick glance over the kata description I had done the week before. So I thought that all that needed to be solved was to determine whether two words were anagrams of each other (the complete code, btw. can be found in my github repository for the Clojure codekatas). My initial idea how to solve this is to generate the sets of the characters of both words and compare those:

(defn remove-blanks [word]
         (str/replace word " " ""))

(defn anagram-set? [word1 word2]
         (let [w1 (remove-blanks word1)
            w2 (remove-blanks word2)] 
       (= (set w1) (set w2))))

This time, I opted for using midje for running the tests, in particular due to the possibility to run tests continually via lein-midje. Midje takes a slightly different approach / syntax to writing tests, adding the notion of facts that are then verified. I.e. tests with midje look like this:

(facts "Testing the set implementation for checking anagrams"
      (fact "Set anagram can find anagrams"
         (anagram-set? "the law" "wealth") => true)
     (fact "Set anagram is too simplistic"
         (anagram-set? "the lalalaw" "wealth") => true))

You can already see from the latter fact what is wrong with the initial solution: it’s too simplistic with regard to handling the number of occurences of some character. (Some might say, the introduction of remove-blanks also is too complicated, but I wanted to handle Anne Clark’s “The law is an anagram of wealth”.

Traditional solution

When I finally had some more time to read the kata description more carefully, I recognized that the task actually is to find all anagrams of a given word, checking back against a given wordlist. So that means that the kata consists of two tasks: generate all possible combinations for a given character sequence and check in this wordlist whether some candidate character sequence amounts to a known word. Now, if you take a step back, it’s easy to see that anagrams are nothing else than permutations of the elements of a given (character) sequence, with the additional restriction that all such permutations must be (known) words again. So, we end up with a skeleton which looks like this:

(defn generate-anagrams [word]
   "Generate all anagrams of word"
   (generate-permutations word))

(defn find-anagrams [word words]
   "Finds all anagrams of word in (the sequence of) words"
   (let [anagrams (generate-anagrams word)
              wordset (set words)]
     (loop [candidates anagrams
            result []]
        (if (empty? candidates)
            result
            (recur (rest candidates)
                 (if (and (not (= (first candidates) word))
                     (contains? wordset (first candidates)))
                (concat result (list (first candidates)))
                result))))))

Which now, of course leaves us with the task to implement a permutation algorithm. I must admit I had a pretty hard time to come up with something on my own without resorting to looking at other people’s code. Given that the task of code katas is not primarily to invent algorithms on the fly, but to practice coding, I finally read the wikipedia paragraph on computing permutations in lexicographic order, which has a blue print of an algorithm which is attributed to Naranya Pandita, who invented it in the 14th century already. I took a very verbatim and top-down approach this time and ended up with this as the next piece of code:

(defn next-permutation [squence]
   (when-let [k (find-largest-index-with-bigger-successor squence)]
        (let [l (find-largest-index-bigger-value squence k)
              swapped (swap-positions squence k l)
              current-perm (reverse-tail swapped (inc k))]
          current-perm)))

(defn generate-permutations [squence]
   (let [start-perm (sort squence)]
      (loop [permutation (next-permutation start-perm)
             result (list start-perm)]
         (if (or (not permutation)
             (empty? permutation))
         result
         (recur (next-permutation permutation)
                (concat result (list permutation)))))))

I went back to writing tests:

   (fact "finding the largest index with bigger successor"
         (find-largest-index-with-bigger-successor [1 2]) => 0
         (find-largest-index-with-bigger-successor [1 2 3 4]) => 2
         (find-largest-index-with-bigger-successor [1 2 4 3]) => 1
         (find-largest-index-with-bigger-successor [1 3 4 2]) => 1
         (find-largest-index-with-bigger-successor [1 4 3 2]) => 0)

The test results are taken straight out of the wikipedia article. Implementing this is pretty straight-forward: we just iterate through the list, keeping track of the current position and check whether the following element is bigger than the current element. If so, we keep the current position, otherwise we keep what we had so far as the result. When we reach the end of the sequence (or there is no subsequent to compare to), we have found the largest position (index) that has a successor with a bigger value. One thing is worth pointing out: the usage of (comp pos? compare) is necessary because > does only work on numbers, but no on characters (or keywords). Why Clojure does not follow Python (which provides a general purpose operators, which use something like compare under the hood which you can override for your data types) in this aspect is beyond me.

(defn find-largest-index-with-bigger-successor [squence]
     (loop [restsq (seq squence)
           curpos 0
                   curresult nil]
        (cond (or (empty? restsq)
                  (empty? (rest restsq)))
                curresult
             ((comp pos? compare) (second restsq) (first restsq))
               (recur (rest restsq) (inc curpos) curpos)
                        :else
               (recur (rest restsq) (inc curpos)     curresult))))

The next step is finding the position of some value that is bigger than the position that we just determined. Again this is straight-forward:

   (fact "finding the largest index that has a bigger value than some other position"
         (find-largest-index-bigger-value [1 2 3 4] 2) => 3
         (find-largest-index-bigger-value [1 2 4 3] 1) => 3
         (find-largest-index-bigger-value [1 3 4 2] 1) => 2
         (find-largest-index-bigger-value [1 4 3 2] 0) => 3)

   (defn find-largest-index-bigger-value [squence index]
     (let [compval (nth (vec squence) index)]
    (loop [restsq (seq squence)
             curpos 0
             curresult nil]
            (cond (empty? restsq)
            curresult
            ((comp pos? compare) (first restsq) compval)
               (recur (rest restsq)  (inc curpos)  curpos)
                  :else
              (recur (rest restsq) (inc curpos) curresult)))))

We now have to swap these two elements which is easy enough to do with vectors:

   (fact "swapping two positions in a sequence"
         (swap-positions [1 2 3 4] 2 3) => [1 2 4 3]
         (swap-positions [1 2 4 3] 1 3) => [1 3 4 2]
         (swap-positions [1 3 4 2] 1 2) => [1 4 3 2]
         (swap-positions [1 4 3 2] 0 3) => [2 4 3 1])

   (defn swap-positions [squence k l]
    (let [seqvec (vec squence)]
      (assoc (assoc seqvec k (nth seqvec l))
         l (nth seqvec k))))

I first fiddled around with take and drop to avoid converting the input sequence to a vector but this makes the code much more complex. Why there is no general-purpose position-based replace is, again, beyond me — there are a number of discussions around (the lack of) a general-purpose subsequence function which point out issues with complexity (code-and performance-wise), but I doubt that most manually-crafted workarounds lead to any better solutions. Maybe I’m missing something obvious here.

Next, we need to reverse the rest of the sequence behind the position which we just swapped. The example in the wikipedia article is not entirely clear for longer remainders, but some tests revealed that the right position is really the one we just used, like this:

   (fact "reverse the tail of a sequence"
         (reverse-tail [1 2 4 3] 2) => [1 2 3 4]
         (reverse-tail [1 3 4 2] 1) => [1 2 4 3]
         (reverse-tail [1 4 3 2] 0) => [2 3 4 1])

   (defn reverse-tail [squence tail-position]
    (let [prefix (take tail-position squence)
       tail (drop tail-position squence)
       revtail (reverse tail)]
    (concat prefix revtail)))

So, with this we now have all pieces in our hands and can test the entire algorithm:

   (fact "finding the next permutation"
         (next-permutation [1 2 3 4]) => [1 2 4 3]
         (next-permutation [1 2 4 3]) => [1 3 2 4])

Which will, surprise, surprise, give the expected results. So, with this we are able to generate all 24 permutations of [1 2 3 4] and we can go back to our anagram task. Turns out that the tests would fail: I hadn’t thought about the fact that the destructuring of the character sequence (i.e. the word) would require subsequent combination of the permutation results. That’s easy enough to correct by applying str to all permutation results.

(defn generate-anagrams [word]
    (map (partial apply str) (generate-permutations word)))

Now, when you run this code with the test data given in the original kata:

   (facts "Testing the anagram implementation"
       (fact "Generating all anagrams"
         (generate-anagrams "ftw") => '("ftw" "fwt" "tfw" "twf" "wft" "wtf"))
          (let [words (split-lines (slurp "wordlist.txt"))]
            (find-anagrams "kinship" words) => '("pinkish")
            (find-anagrams "enlist" words) => '("inlets" "listen" "silent")
            (find-anagrams "boaster" words) => '("boaters" "borates")
            (find-anagrams "sinks" words) => '("skins")
            (find-anagrams "knits" words) => '("stink")
            (find-anagrams "rots" words) => '("sort")
            (find-anagrams "thelaw" words) => '("wealth")))

I ran into a StackOverflowException for “boaster” though. Looking at the code, it’s immediately obvious that there the only possible cause for this can be in generate-permutations which generates the result eagerly. So, let’s change that to a lazy variant.

(defn- gen-perms [squenze]
    (lazy-seq
        (when-let [permutation (next-permutation squenze)]
            (cons permutation (gen-perms permutation)))))

(defn generate-permutations [squence]
    (let [start-perm (sort squence)]
        (cons start-perm (gen-perms start-perm))))

I use an external helper here because we need to add the start permutation to the final result up-front and that doesn’t lend itself to a self-recursive function. Anyway, this concludes the first solution using a rather traditional algorithm.

Declarative solution

For the next solution, I intended to use something else. Last year, I had the chance to hear David Nolen talk about core.logic which reminded me a lot of the old days in which I was using Prolog for computational linguistics and logic programming. In particular I was thinking of a permutation implementation in Prolog described in Richard O’Keefe’s Craft of Prolog, which I briefly discuss below:

permutation(Xs, Ys) :-
    permutation(Xs, Ys, Ys).

permutation([],[],[]).
permutation([X|Xs], Ys1, [_|Bound]) :-
    permutation(Xs, Ys, Bound),
    insert(Ys, X, Ys1).

insert(L, X, [X|L]).
insert([H|T], X, [H|L]) :-
    insert(T,X,L).

If you would want to generate all permutations for a list [1,2,3], you would call permutation([1,2,3],Q) and your Prolog interpreter of choice (e.g. SWI-Prolog) would generate the first possible result for Q and via backtracking generate all other possible permutations.

?- permutation([1,2,3],Q).
Q = [1, 2, 3] ;
Q = [2, 1, 3] ;
Q = [2, 3, 1] ;
Q = [1, 3, 2] ;
Q = [3, 1, 2] ;
Q = [3, 2, 1].

Let’s briefly discuss the Prolog solution, this will make it easier to discuss some issues when translating this to core.logic later on. Prolog uses facts and rules to prove some query. E.g., permutation([],[],[]). is a fact asserting that the permutation of an empty list is the empty list. Anything involving :- is a rule. Prolog uses unification — hang on, you’ll see in a second what this is. Second, you see all those [X|Xs] constructions. These are basically list (de-)construction operations: they split off the first element or add an element (head) and some rest (tail) to form a new list. The point here is that if you’re calling permutation([1,2,3],Q,Q) Prolog will try to unify [1,2,3] with [X|Xs] which is possible when X=1 and Xs=[2,3]; i.e. Prolog automatically tries argument unification. The _ construct means “ignore”, “don’t care”. If we consider only the insert fact (i.e. the first statement), this fact can be used by Prolog via unification to answer queries about any value of the predicate:

?- insert([2,3],1,Q).
Q = [1, 2, 3] 
?- insert([2,3],Q,[1,2,3]).
Q = 1 
?- insert(Q,1,[1,2,3]).
Q = [2, 3] 

The key to understand how permutation works is considering how insert works: the insert rule will deconstruct the first argument (assuming it’s a list) and insert the second argument to it. This way, X will be inserted in all possible positions of the list:

?- insert([2,3],1,Q).
Q = [1, 2, 3] ;
Q = [2, 1, 3] ;
Q = [2, 3, 1].

Now, if you take a closer look at the permutation/3 rule, you’ll recognize that it first of all contains a recursive call to itself. This will basically decompose the first argument (if given) until it reaches the permutation fact governing the base case, i.e. the empty list. It will then insert the elements according to the behavior discussed above. You can think of all comma , as and including a notion of order, i.e. the insert clause will only be used after having processed the recursive call to permutation on each level, respectively. This basically implies a depth-first search — i.e. for generating the multiple values for Q, Prolog will try to find different possible combinations by retrying parts of the computation. This will in particular trigger the computation of the different results of insert/3.

Now let’s come back to Clojure’s core.logic which provides an implementation of many useful things for logic programming based on miniKanren. However, as an add-on to a functional programming language, we will have to use some special operators to translate the Prolog code. The first thing needed is the declaration of the query variables (e.g. Q) within the call to run*, without it you would never see any results (besides run* causing the inference machinery to, well, run). The next operator is == which is used for unification, which is used just as = would be in Prolog inside some rule. Sometimes you need temporary logic variables which you can introduce with fresh. There is also an explicit operator conde (similar to cond) which can be thought of as providing disjunction (or). You need this to be able to mimick Prolog’s multiple facts/rules with the same predicate, e.g. having a simple fact and a rule for permutation/3. There are also further predicates, e.g. conso which can be used to splice/construct lists. This is all nicely explained in the core.logic Primer. I actually started out trying to convert the Prolog code with not much else, like this:

(defn insert-broken [x l nl]
  (conde
    [(conso x l nl)]
    [(fresh [h t]
           (conso h t l)
           (insert-broken x t l)
           (conso h l nl))]))

You’ll note that I exchanged the position of single argument and list in order to match it with the usual argument positions of conso (or conj). Otherwise this looks like a pretty straight translation of the Prolog rules above: it’s either we can directly (via conso) (de)construct the list or we recurse. This version is broken in multiple ways, though. First of all, when you test this version, the recursive call to insert is not constrained enough wrt. the value of l, which will trigger an infinite recursion. You need to put the recursive call behind the second call to conso (cf. the discussion of my inquiry on StackOverflow). However, there is another issue lurking which you can see when comparing the results:

(defn insert-still-broken [x l nl]
  (conde
    [(conso x l nl)]
    [(fresh [h t]
          (conso h t l)
          (conso h l nl)
          (insert-still-broken x t l))]))

FAIL "Checking insert - Simple insert" at (logic_test.clj:11)
Expected: ((1 2) (2 1))
Actual: ((1 2))

FAIL "Checking insert - Simple insert" at (logic_test.clj:12)
Expected: ((1 2 3) (2 1 3) (2 3 1))
Actual: ((1 2 3))

As you can see, this version generates only a single result, inserting the element just in the first position, not in the other positions of l. The reason for this is that we are constraining the solution too much: by using l in the recursive call, we’re constraining the “result” (the value of the third argument) to the initial value of l . This is not what we are doing in the Prolog version, there l is just a temporary value generated in the recursive call. I.e. I fooled myself by basically running into a variable capture problem. So, the correct version of insert looks like this, introducing another fresh variable l1.

(defn insert [x l nl]
  (conde
    [(conso x l nl)]
    [(fresh [h t l1]
          (conso h t l)
          (conso h l1 nl)
          (insert x t l1))]))

However, the discussion on StackOverflow also pointed me to the matching predicates which are also shown, but not explained at all in the examples section of the core.logic wiki. In particular, core.logic offers a defne macro which basically provides a pattern matching facility which is remarkably close to what Prolog provides wrt. argument matching. Consider the following version of the same predicate using defne:

(defne inserto [L X NL]
    ([L X (X . L)])
    ([(H . T) X (H . L1)]
        (inserto T X L1)))

defne will basically expand into a set of conde expressions, but will also generate fresh variables and matching/unify expressions as appropriate. If you compare this version with the Prolog version, it’s easy to see the parallels: in the second rule, the given arguments to the parameter list L X NL will be tried to unify with [(H . T) X (H . L1)] (note that inserto uses the same parameter order as the Prolog version), thereby decomposing any sequence given as L into head H and tail T — this is basically the same as (conso H T L).

Having covered all those nitty-gritty details of insert, understanding permuto/3 should be straight-forward:

(defne permuto3 [I O L*]
  ([nil nil nil])
  ([() () ()])
  ([(X . Xs) Ys1 (_ . Bound)]
      (fresh [Ys]
                 (permuto3 Xs Ys Bound)
                 (inserto Ys X Ys1))))

We have two (empty) base cases and a recursive clause again. We’re decomposing the input I into (X . Xs) and unify O (typically the query variable) to Ys1. Using a fresh new variable we recurse with the sublist Xs down to produce permutations of the sublist, eventually inserting X into them. For reference, this is what the non-matching version looks like which makes the argument unification and the value decomposition much more obvious:

(defn permutation
    ([xs ys] (permutation xs ys ys))
    ([xl yl res]
    (conde 
       [(== xl '()) (== yl '()) (== res '())]
       [(== xl nil) (== yl nil) (== res nil)]
       [(fresh [_ x xs ys bound]
        (conso x xs xl)
        (permutation xs ys bound)
        (conso _ bound res)
        (insert x ys yl))])))

We can finally wrap this permutation into the same surrounding code we used for the traditional solution to this anagram kata to compute nearly the same results (the order will differ).

Wrapping up, this kata was actually quite hard to solve and took quite a while. I spend too much time trying to find the traditional solution myself before focussing on translating it to Clojure. And then it took me also way more time than I had imagined getting into core.logic, which could use quite a bit more documentation besides the primer on the basics. Anyway, core.logic looks like a very nice addition to the Clojure universe.


Coding katas Clojure -- Bloom filters

Kata5 is an interesting one to do in Clojure as Bloom filters, which are a probalistic data-structure for determining set membership, are all about spending as few bits as possible (or required) to store data. When bit manipulations are required, not many programmers would jump to Lisp or Java and indeed most descriptions are about implementations in C or one of it’s derivates. This probably does not come as a surprise, but as we will see is not entirely justified (wrt. to Lisp or Java).

But before discussing this in detail, let’s dive in with this kata. The description of the kata already tells us pretty exactly what we’re supposed to build: a bunch of hash functions and an array of bits which are then set or checked. We’ll start with the hash functions first. Hash functions are dime-a-dozen, Java provides one, Clojure, too. Still it is interesting to go beyond the readily provided functions and to implement some hashing functions. The goal here is not to build perfect hash functions, but to get a feel how an implementation of one looks like in Clojure (as this blog post series is about Clojure, not about Computer Science — the code is also available in my github repository for Clojure codekatas).

As the task is here to build a Bloom filter for strings, all hash functions basically boil down to iterating over a sequence of characters, converting each character into a numerical value (i.e. applying int) and then using this in an accumulating computation of the total hash value. We start out with more or less the simplest approach possible: we simply sum up the integer values of the characters (also known as the Kernighan & Ritchie “lose-lose” hash algorithms).

  (defn sum-chars 
    "Sum up the chars of a given string"
    [charseq]
    (reduce + (map int charseq)))

Nothing interesting to see but a straight-forward map/reduce, so let’s move on and take a stab at what Dan Bernstein (djb) suggested. In its most innocent version this iooks pretty similar, but uses bit-shifting and has some magic numbers thrown in for good measure.

(defn djb-string-hash
     "Use djb's method for hashing a string"
     [charseq]
     (reduce (fn [curhash charval]
               (+ 
                 (+ (bit-shift-left curhash 5) curhash) 
               charval))
         (cons 5381 (map int charseq))))

This is already were it got interesting, because if you run this with the simple string “foobar”, you will run directly into an overflow. The clojure documentation on unchecked-add could have told me so directly, of course. I had to use a quite a bit of websearch-fu, wildly testing around and knocking my head on the table to come up with this version:

    (defn djb-string-hash
      "Use djb's method for hashing a string"
      [charseq]
  (reduce (fn [curhash charval]
               ^long (unchecked-add 
                       (unchecked-add 
                             (bit-shift-left curhash 5) 
                             curhash) 
               charval))
    (cons 5381 (map int charseq))))

This looks quite good, but is still not bulletproof, as testing it with a longer string (> 11 characters) will show. The real issue is actually due to Java: unchecked-add uses internally a data structure that can (and will) result in a Java IntegerOverflowException. Java does not have unsigned numeric types and also defaults to throwing exceptions on overflow and Clojure, leveraging the JVM, is directly affected by that issue.

      kata5-bloom-filters.core> (djb-string-hash "foobar")
      6953516687550
      kata5-bloom-filters.core> (djb-string-hash "foobarfoobar")
      ArithmeticException integer overflow  
      clojure.lang.Numbers.throwIntOverflow (Numbers.java:1388)

Compare for instance the behavior of Common Lisp: the straight-forward translation of the exact same naive implementation will not overflow (unless aggressively optimizing against safety or promising to the compiler that only certain results will occur) due to automated boxing.

(defun djb-string-hash (charseq)
  "Use djb's method for hashing a string"
  (reduce #'(lambda (curhash charval)
          (+ (ash curhash 5) curhash charval))
       (cons 5381 (map 'list #'char-code charseq))))

Clojure does provide a similar behavior with e.g. the BigInteger type, however this is of no use as Clojure’s bit-operations don’t know how to handle BigInteger data — they are only defined for the primitive data types that Java provides. Worse, converting back from a BigInteger to a primitive data type (e.g. long) will not — as one might naively expect — simply truncate the data but yield -2:

kata5-bloom-filters.core> (bit-shift-left (bigint 20) 2)
IllegalArgumentException bit operation not supported for: 
       class clojure.lang.BigInt  
       clojure.lang.Numbers.bitOpsCast (Numbers.java:1008)
kata5-bloom-filters.core> (* 2 (bigint Integer/MAX_VALUE))
4294967294N
kata5-bloom-filters.core> (int (* 2 (bigint Integer/MAX_VALUE)))
IllegalArgumentException Value out of range for int: 4294967294
       clojure.lang.RT.intCast (RT.java:1115)
kata5-bloom-filters.core> (unchecked-int (* 2 (bigint Integer/MAX_VALUE)))
-2

Similar issues arose with all of the other hash functions (i.e. sdbm and fnv), cf. the code on github — nothing interesting here, so let’s move on to the bloom filter itself.

Multiple options come to mind when thinking about the data structure to use. Let’s start out with the simplest possible option: just use a simple number. This defaults to Java long, i.e. a 64-bit number. This implies of course that we already have a rather arbitrary upper limit on the size of the bloom filter, which influences the number of possible entries and the number of false positives, cf. this overview article about the garden variety of bloom filters.

(defn bloom-add [bloom charseq & {:keys [hashfns] 
                                      :or {hashfns *hash-functions*}}]
  (reduce #(bit-set %1 %2) 
          (conj (map #(% charseq) hashfns)
                bloom)))

(defn bloom-contains? [bloom charseq & {:keys [hashfns] 
                                    :or {hashfns *hash-functions*}}]
  (every? #(bit-test bloom %) 
      (map #(% charseq) hashfns)))

(defn build-bloom [wordfile & {:keys [hashfns] 
                       :or {hashfns *hash-functions*}}]
  (reduce #(bloom-add %1 %2 :hashfns hashfns)
        (cons 0 (string/split-lines (slurp wordfile)))))

The code here is pretty straight-forward, maybe with the possible exception that we’re mapping over a list of functions in bloom-add and bloom-contains?. We could extract this part to a simple function which makes the code a little more readable.

     (defn hash-string [charseq & {:keys [hashfns] 
                                   :or {hashfns *hash-functions*}}]
        (map #(% charseq) hashfns))

This very naive implementation will run into problems right away: The hash functions will yield hash values that are itself 64 bit in size whereas the biggest bit that can be set is 63. A straight-forward fix for that is to consider the size of the ‘bloom filter’ (i.e. 64 bit) and to truncate the hash values accordingly via the modulo operation. I.e., instead of calling (bit-set bloom value) we do (bit-set bloom (mod value 64)).

Now, if you think about it, using a simple number is probably not the optimal data structure: for one, we just limited us to bit arrays that are 64 bits in size (which for instance implies that with the /usr/share/dict/words file you’ll end up with Integer/MAX_VALUE, i.e. all bits set to 1) but due to the immutable nature of the mathematical operations, we actually require a lot more space than just one long, thereby very much defeating an important characteristic of Bloom filters.

So let’s use a completely different idea and use one of Java’s mutable data structures: BitSets. This leads to the following naive, non-thread-safe implementation:

(defn bloom-add [bloom charseq & {:keys [hashfns] 
                                      :or {hashfns *hash-functions*}}]
        (let [size (.size bloom)]
             (doseq [hashval (hash-string charseq :hashfns hashfns)]
               (.set bloom (Math/abs (mod hashval size)) true))
     bloom))

(defn bloom-contains? [bloom charseq & {:keys [hashfns] 
                                            :or {hashfns *hash-functions*}}]
    (let [size (.size bloom)
          hashvals (hash-string charseq :hashfns hashfns)]
           (every? #(= (.get bloom (Math/abs (mod % size))) true) hashvals)))

(defn build-bloom [wordfile & {:keys [bloom-filter size hashfns]
                                   :or {size 1024
                                        hashfns *hash-functions*}}]
    (let [bloom (or bloom-filter (BitSet. size))]
           (reduce #(bloom-add %1 %2 :hashfns hashfns)
              (cons bloom (string/split-lines (slurp wordfile))))
        bloom))

We basically just exchanged the bit-set/bit-test functions with the respective BitSet methods and use a dynamic size. This hints at a possible generalization: we could consider multiple bloom filter implementations (types, if you want to) that need to support some sort of bit-setting and getting operation plus size. This would be the internal protocol, while bloom-add/bloom-contains? (and maybe build-bloom) form the external API.

Now, of course, we would like to fix the problem that this code is not thread-safe. As is made pretty clear in Fogus etal. book “Joy of Clojure”, Clojure’s reference types are of no use here:

“Wrapping a mutable object in a Clojure reference type provides absolutely no guarantees for safe concurrent modification. Doing this will at best explode immediately or, worse, provide inaccurate results.”

The advice Fogus etal. offer is to use the locking macro. If we combine the above idea of using an internal protocol, we can at least apply it where it is necessary, i.e. around the calls to .get/.set.

(defprotocol BloomFilterImpl
    (bloom-size [filter])
    (bloom-bit-get [filter position])
    (bloom-bit-set [filter position value]))

(extend-type BitSet
    BloomFilterImpl
    (bloom-size [filter]
        (.size filter))
    (bloom-bit-get [filter position]
        (locking filter
            (.get filter position)))
    (bloom-bit-set [filter position value]
        (if (< position (bloom-size filter))
            (locking filter
                (.set filter position value))
            (throw (IllegalArgumentException. 
                                    "position outside of bloom filter size")))))

If you wonder why the protocol does not have the add or contains? functions, this is because these operations would be part of some dictionary protocol or some such (although it is somewhat debatable if dictionaries should guarantee the absence of false-positives).

Let’s dig some more into the concurrency issue: it’s surprisingly hard to come up with a scenario where the mutability of the BitSet could be problematic. For one, we are always only adding entries by manipulating a single bit and do that in an atomic fashion that does not rely on the previous value of the BitSet in any way. For another, we don’t have any delete operation, so we can’t possibly run into the situation where some bit / some dictionary entry goes missing — assuming, of course, that all modifications to the bloom filter happen through the functions we supplied only and not by some other means directly on the BitSet outside our control. The only scenario that comes to my mind would be where one wants to keep the state of the bloom filter fixed in one thread, i.e. for some time we want to be able to deny having seen some value / word (which another thread just tried to sneak in while we were not looking). I can’t imagine a real world usage for this scenario, but that probably says more about my creativity than about anything else.

Let’s briefly discuss the options to account for this scenario: the locking scenario above is not enough as the locking occurs as part of the getting/setting operations — there is no way in which one thread could prohibit modifications to the Bloom filter during a specified amount of time with this. Of course, as locks nest you could add locking outside the calls to add new elements to the bloom filter. The other option would be to use one of Clojure’s reference types. But as discussed above, these are not useful for protecting mutable data structures, so we would need to go back to using one of Clojures persistent data structures. So, let’s briefly step aside and compare the speed of Java arrays generated via Clojure and using Clojure vectors, both on booleans:

(defn make-random-boolean-array [size]
  (boolean-array
   (take size (repeatedly #(rand-nth [true false])))))

(defn make-random-boolean-vector [size]
  (into [] (take size (repeatedly #(rand-nth [true false])))))

(defn print-flipped-boolean-array [ba]
  (let [size (count ba)]
    (loop [indx 0
           result ""]
      (if (= indx size)
        result
        (do
          (aset ba indx (not (aget ba indx)))
          (recur (inc indx)
                 (string/join
                  [result
                   (if (aget ba indx) 1 0)])))))))

(defn print-flipped-boolean-vector [bv]
  (let [size (count bv)]
    (loop [indx 0
           vec bv
           result ""]
      (if (= indx size)
        result
        (let [newvec (assoc vec indx (not (get vec indx)))]
          (recur (inc indx)
                 newvec
                 (string/join
                  [result
                   (if (get newvec indx) 1 0)])))))))

kata5-bloom-filters.core> (time (do 
                                      (print-boolean-array 
                                        (make-random-boolean-array 10000)) 
                                      true))
"Elapsed time: 587.247368 msecs"
true
kata5-bloom-filters.core> (time (do 
                                      (print-boolean-array 
                                        (make-random-boolean-array 10000)) 
                                      true))
"Elapsed time: 592.598888 msecs"
true
kata5-bloom-filters.core> (time (do 
                                      (print-boolean-vector 
                                        (make-random-boolean-vector 10000)) 
                                      true))
"Elapsed time: 76.657272 msecs"
true
kata5-bloom-filters.core> (time (do 
                                      (print-boolean-vector 
                                        (make-random-boolean-vector 10000)) 
                                      true))
"Elapsed time: 69.666769 msecs"
true
kata5-bloom-filters.core> (time (do 
                                      (print-boolean-vector 
                                        (make-random-boolean-vector 10000)) 
                                      true))
"Elapsed time: 71.897087 msecs"
true

Now if I run this a reasonable number of times, it appears that for simple element access the boolean vector is outperforming the boolean array, even if I’m basically doing building up a new partial copy of the vector all the time / for all elements. That was a welcome surprise for me. So, using a vector of booleans would be a viable next possibility. The road ahead, however, is hindered by the fact that Clojure does not derive a complex type for vectors, not even if you especially declare the vector to be of a certain type. You’ll always end up with a vector or Vec:

kata5-bloom-filters.core> (into (vector-of :boolean) [true false])
[true false]
kata5-bloom-filters.core> (type *1)
clojure.core.Vec

This is problematic as we don’t want to just extend-type the general type. The way out of it is the use of reify which is just like extend-type for objects (that’s actually quite misleading, if you look at the documentation of reify, go check yourself — I would also recommend reading up about this in “Joy of Clojure”).

(defn make-bloom-vector [size]
  (let [bloom-vect
        (ref (into (vector-of :boolean)
                   (take size (repeatedly #(identity false)))))]
    (reify BloomFilterImpl
      (bloom-size [_]
        size)
      (bloom-bit-get [_ position]
        (nth @bloom-vect position))
      (bloom-bit-set [_ position value]
        (alter bloom-vect assoc position value)))))

As you can see, I’m using a closure to store the vector of booleans in a ref. We then deref the vector on get and alter it on set. This is especially set up such that any transaction handling (i.e. dosync) occurs outside of the implementation details, to account for the scenario discussed above. Of course, whether refs are really best suited for handling any concurrency situation is likely to depend ultimately on the application specific context or concurrency requirements.

I will leave it at that. This time, we have seen quite some more features that are pretty much unique to Clojure: explicit protocols, reification of instances and references, i.e. using Clojure’s STM. There was also quite a bit of discussion of some consequences of Clojure’s decision to leverage the JVM, some bad (overflows), some nice (plugging in a readily available data structure). Overall, this kata was quite an interesting exercise.


Coding katas Clojure -- Data munging

Kata4 is concerned with data munging, basically reading some file, pulling out some data and comparing the data in order to determine some result. Actually, it is even simpler than that as in both cases we are asked to determine the minimum of the data set. Given a weather task and a soccer task, we are asked to fuse the resulting solutions and extract the commonalitites and minimize both solutions, basically allowing for code-reuse as much as possible. Unfortunately, I had not read the opening sentences of the kata closely, instead I read over the entire description. Hence I missed the call to solve each part separately and not to read ahead and directly started with this idea of code re-use in mind. However, as we will see, I didn’t really notice one aspect of code reuse until I solved the second task (soccer).

The first thing I did was take a closer look at the data file. Probably as much as everybody else who had some exposure to Perl, I was initially tempted to approach the data extraction part with regular expressions. But as JWZ said, “some people when faced with a problem say “I know, I’ll use regular expressions. Now they have two problems”, and indeed this is the case here. This is not to say that the particular problem, fetching the date, minimum and maximum temperature from the provided data file would not be solvable with regular expressions, but more that the data file is really more of a fixed width nature. Unfortunately, there are some irregular elements, e.g. the missing values in the WxType row or the post-fix ‘*’ added to exactly one value in both MnT and MxT rows, likely to mark the absolute minimum and maximum temperature of the month, respectively, which causes the data in the corresponding “cell” to hang-over into the white-space pre-fix area of the next row. The last column is also of interest, as it starts out with “mo” in contrast to the numeric values for the days of the month. It also has rational values for some rows (e.g. for the temperature rows) where all other values have integer values. This looks like a computed result line, showing the averages of the respective values for the entire month and should hence not be part of the computation. And finally, of course, the data file has some completely irrelevant lines, which we need to skip over.

Having come to some idea of how the data file was structured, I came to the conclusion to parse each line using exact positions, e.g. to determine the MnT value by extracting the substring of each line from position 9 to 14. So, the top-down approach is like this: we will open a file and for each line in it and we will try to parse it according to some pattern specification where the pattern specification would specify the positions of some part and a parsing function.

Clojures approach to file handling is a little bit surprising for somebody coming from Common Lisp, which provides three main concepts you need to grasp: filenames aka pathnames, files and streams. Typically, you open a file with with-open-file which will guarantee that the file will be closed after you leave the block. Clojures with-open macro abstracts this idea of safe file handling to the next level in that it is not restricted to files. However, with-open does not simply work with a filename as an argument, you need to pass in a resource which follows the open/close protocoll which a simple filename string does not. That Clojure leverages the Java IO library for this is not surprising, but that it leaks this Java dependency to the users is. I assume this implies that (with-open (clojure.java.io/reader "/some/filename.txt")) will not work on ClojureCLR (apparently slurp does). read-line is also a false friend for a Common Lisp programmer, as it can only be used for reading a line from the REPL but not for reading from some stream like in CL. While I appreciate the possibility to call Java methods from Clojure, I prefer using any abstractions that Clojure provides, so I went with line-seq instead of calling .read. Finally, as Clojure does not allow for re-assignment, so we need a recursive approach with an accumulator to hold intermediate results while looping over the lines. So, the skeleton looks something like this in Clojure:

 (defn read-some-file
   "Skeleton for reading some file"
   [filename]
   (with-open [rdr (io/reader filename)]
      (loop [lines (line-seq rdr) result]
        (if (empty? lines)
             result
      ; todo for the weather task: 
      ; compare the result of parsing a line with the current 'best' 
      ; i.e. minimum spread value so far and recur, possibly with different data
              (recur (next lines) result)))))))

This is more or less the equivalent of Perl’s -n command-line switch or while (<>) {...} construct.

The comparison mentioned in the code’s comment is actually very easy and what needs to be put in place for the result is also obvious. But in order to do that we have to parse the lines first. Following up on the idea of how to parse a line, the relevant function is subs which returns a substring from start to end. Now, we basically need to parse different substrings for the various data fields, quite often returning just an integer. Again, the simplest way of parsing a string to an integer is provided by a thin layer on top of a Java library. Neither the re-find nor the exception handling would be necessary if we could guarantee that the data would always only consist of correct data, this way we just silently skip over invalid data.

(defn string-to-int
  "Parses a consecutive set of numbers into an integer or return nil"
  [string]
  (try
    (Integer/parseInt (re-find #"\d+" string))
    (catch Exception e nil)))

parse-line is fairly straight-forward: the basic idea is that we have some specification of how a line is structured. This specification is a map of start and end positions plus parsing functions. So for the weather data, the pattern looks like this:

(def day-pattern
  ;this pattern is not complete and could be extended
  (hash-map :day [1 4 #(first-word %)]
            :MxT [5 8 #(string-to-int %)]
            :MnT [9 14 #(string-to-int %)]
            :AvT [15 20 #(string-to-int %)]))

first-word is another small helper function which basically just retrieves the first continous non-whitespace characters of a string:

(defn first-word
  "Returns first consecutive non-whitespace chars from string"
  [string]
  (re-find #"\S+" string))

parse-line than just loops over all parts of a pattern, extracts the substrings and calls the parsing function. It recursively conjures up a hash-map with the extracted data or returns nil if some parsing error occurs.

(defn parse-line [line pattern]
  "Parse a line with data in fixed positions using pattern.
Pattern should be a map consisting of a key for the data to return,
a start and end position and a parsing function for each data element.

Returns a map with all extracted data or nil for unparsable lines."
  ; loop solution with accumulator for results
  (loop [remkeys (keys pattern) linemap {}]
    (if (empty? remkeys)
      linemap
      (let [key (first remkeys)
            [start end parsefn] (get pattern key)
            value (parsefn (try
                             (subs line start end)
                             (catch Exception e nil)))]
        (if value
          (recur (rest remkeys)
                 (conj linemap
                       (hash-map key value)))
          ; silently skip any parsing errors
          nil)))))

This then allows us to put things together: we only need to compare the difference between MxT and MnT of the current line with the previous smallest temperature spread. We will use destructuring of the result of hte parse-day results to retrieve the needed data, sprinkle in some sanity checks and are done.

(defn parse-day
  "Parse a day from a line"
  [line]
  (parse-line line day-pattern))

(defn find-lowest-temperature
  "Return day in weatherfile with the smallest temperature spread"
  [weatherfile]
  (with-open [rdr (io/reader weatherfile)]
    (loop [lines (line-seq rdr) minday 0 minspread 0]
      (if (empty? lines)
        minday
        (let [{mnt :MnT mxt :MxT curday :day} (parse-day (first lines))            
              curspread (when (and mnt mxt) (- mxt mnt))]
          (if (and curday curspread
                   (or (= minspread 0)
                       (< curspread minspread)))
            (recur (next lines) curday curspread)
            (recur (next lines) minday minspread)))))))

When I started working on the second task, solving the soccer issue, I did a simple copy and paste of the find-lowest-temperature, added a new pattern for extracting the data and made the small changes to adapt to the different fields. I also understand the comparison requirement to look at the absolute difference. This leads to the following functions:

(defn abs 
  "Returns the absolute value of x" 
  [x]
  (if (pos? x) x (- x)))

(def soccer-team-pattern
  ; this pattern is not complete
  (hash-map :pos [1 5 #(first-word %)]
            :team [7 22 #(first-word %)]
            :fval [43 45 #(string-to-int %)]
            :aval [50 52 #(string-to-int %)]))

(defn parse-soccer-team
  "Parse a soccer-team from a line"
  [line]
  (parse-line line soccer-team-pattern))

(defn find-minimum-goal-difference 
  "Return team in soccerfile with the smallest difference in for and against goals"
  [soccerfile]
  (with-open [rdr (io/reader soccerfile)]
    (loop [lines (line-seq rdr) minteam 0 mindiff 0]
      (if (empty? lines)
        minteam
        (let [{aval :aval fval :fval curteam :team} 
              (parse-soccer-team (first lines))            
              curdiff (when (and aval fval) (abs (- fval aval)))]
          (if (and curteam curdiff
                   (or (= mindiff 0)
                       (< curdiff mindiff)))
            (recur (next lines) curteam curdiff)
            (recur (next lines) minteam mindiff)))))))

This, of course, led straight to the insight that it should be simple to extract the slight differences and make them parameters to some find-*-difference function. The following things are differently: the parsing pattern, the extraction function for the result value and the function used to compute the difference between values. If you would want to it would also be possible to make the comparison function configurable.

(defn find-some-difference 
  "Return some result from a data file which has some lowest difference"
  [filename parse-pattern resultkey diffn]
  (with-open [rdr (io/reader filename)]
    (loop [lines (line-seq rdr)
           result nil
           mindiff 0]
      (if (empty? lines)
        result
        (let [data-map (parse-line-map (first lines) parse-pattern)
              curresult (get data-map resultkey)
              curdiff (diffn data-map)]
          (if (and curresult curdiff
                   (or (= mindiff 0)
                       (< curdiff mindiff)))
            (recur (next lines) curresult curdiff)
            (recur (next lines) result mindiff)))))))

(defn find-mingoal-diff-fusion
  "Return team in soccerfile with the smallest goal difference, using the fusion fn."
  [soccerfile]
  (find-some-difference soccerfile soccer-team-pattern :team
                        (fn [{aval :aval fval :fval curteam :team}]
                          (when (and aval fval)
                            (abs (- fval aval))))))

There there was another itch I wanted to scratch: the parse-line function has some ugliness to it. For starters, it is handling possible exceptions from subs directly. It is also checking return values for nil. Both cases are what Common Lisp would see as conditions rather than real exceptions — it is rather unfortunate that Clojure opted for the more simple, although more traditional exception concept from Java. To remedy the uglyness of parse-line we can simply replace the direct call to subs with a small handcrafted call which manages any exceptions and also change the behavior of parse-line to simply return nil for all unparsable elements. But there is more that makes parse-line ugly: I dislike the recursive nature of the solution and the linear result handover in the let declaration (well, this handover was intentional to not have a functional train-wreck of calls). I wanted to see whether I couldn’t come up with a more elegant map/reduce solution. Here you go:

(defn substring
  "Returns substring from start to end from string or nil"
  [string start end]
  (try
    (subs string start end)
    (catch Exception e "")))

(defn parse-line-reduce [line pattern]
  "Parse a line with data in fixed positions using pattern.
Pattern should be a map consisting of a key for the data to return,
a start and end position and a parsing function for each data element.

Returns a map with all extracted data which maybe empty."
  ; map-reduce version
  (reduce #(conj %1 %2)
          (concat [{}]
                (map
                 (fn [[key [start end parsefn]]]
                   {key (parsefn (substring line start end))})
                 (seq pattern)))))

We are simply mapping over the entire pattern and use argument destructuring again to extract the relevant parts of it, but this time, due to the call to seq a pattern part will be a sequence, not a map. The map call with the anonymous function will produce a sequence of hashmaps with key and parsing results, which then gets reduced to a single map. In order to use reduce, you have to provide a function taking two arguments: the first will consume the intermediate result, the second will be the next value of the sequence to reduce. This is the reason why we have this ugly concat [()] ... in front of the call to map: we need to provide the initial value for reduce which in this case is an empty hashmap. An even more concise version replaces the call to reduce with into, resulting in a version which looks pretty idiomatic to me and is also way easier to understand then the lengthy recursive version above.

(defn parse-line-map [line pattern]
  "Parse a line with data in fixed positions using pattern.
Pattern should be a map consisting of a key for the data to return,
a start and end position and a parsing function for each data element.

Returns a map with all extracted data which maybe empty."
  (into {}
        (map
         (fn [[key [start end parsefn]]]
           {key (parsefn (substring line start end))})
         (seq pattern))))

Of course, we can use a similar approach for the find-*-difference function. We could also take a slightly different approach and sort the parsing results and then treat the minimal value as the result. If we combine this with slurp the resulting code also becomes way more compact. In order to filter out only partially parseable lines, we need to supply a list of keys that the parsing result must have values for.

(defn sort-diff-map
  "Return some result from a data file which has some lowest difference"
  [filename parse-pattern desiredkeys diffn]  
  ; map-filter version
  (sort-by diffn 
           (filter #(every? (partial get %) desiredkeys)
                   (map #(parse-line-map % parse-pattern)
                        (string/split-lines (slurp filename))))))


(defn find-mingoal-map
  "Return team in soccerfile with the smallest goal difference, using the sort-map fn."
  [soccer-file]
  (get (take 1 
             (sort-diff-map soccer-file soccer-team-pattern 
                            [:team :fval :aval]
                            (fn [{aval :aval fval :fval curteam :team}]
                              (when (and aval fval)
                                (abs (- fval aval))))))
       :team))

Summing up, what have we seen during this kata? Clojure’s platform dependent approach to reading files, usage of regular expressions, destructuring (again), anonymous functions (again) and map/reduce. Overall not very exciting. I know it’s not a fair comparison, but I would always opt for solving such tasks with Perl, especially if they are so trivial as in this case in which you can solve each task with a one-liner, basically. There is room for using more elaborate languages (e.g. Python, Ruby, Clojure) if parsing and processing become so elaborate that it makes sense to have more structure in the code. But for a task of the size of this kata, the amount of code required is usually not worth the effort.


Coding katas Clojure -- introduction and overview

I wanted to learn Clojure for a while but never made much progress beyond the level of where I had a running slime connection to Clojure and had a simple initial leiningen project file for a web project with ring. I just got distracted by other stuff but more importantly, I had no goal for a web project, so I already failed with thinking about what I would build. Consequently, I stopped reading Fogus and Housers nice book Joy of Clojure. After quite a while (close to a year), I stumbled upon some posts reminding me of the code katas on the pragmatic programmers web site (cf. code katas). Somehow this triggered the idea that doing these katas might be a good way to learn Clojure finally.

In the linked blog posts I’ll describe the various solutions I worked out. This post here will serve as the main overview and get updates as I move along.

The code (as well as the original markdown for these articles) can be found on github.

Kata 14 concludes my little journey into the original katas. It took far too much time writing these, as I didn’t had enough time to work on them continuously. Besides, many of the katas I’ve left out seemed rather uninteresting and some are thought experiments.


Coding katas Clojure -- setup

While not being a kata, setup of the environment in which it’s possible to do the programming for them is a task that needs to be fulfilled anyway. I hence see that as some sort of a separate kata, to familiarize oneself with various development environments.

These are the requirements for setting up a Clojure development environment on Windows, using a portable apps approach. In more detail these are the exact requirements:

  • All portable applications are available on drive J: This is a USB stick in my case.
  • We’ll need git, emacs and clojure of course.
  • We’ll also need leiningen at some later stage.
  • Configuration of the application will be stored on the external drive (e.g. J:) as well where possible.
  • We’ll use nrepl instead of slime/swank. I never got the latter set up to work correctly as portable apps.

This describes the setup I’m currently using. First of all, the downloads. For the stuff that has repositories on github, I would suggest using git clone <repositoryURI> (after you’ve installed git in the first place, of course).

  • Emacs (24.2 at the time of writing) can be downloaded from the FSF Emacs server
  • git (1.7.6 at the time of writing) can be downloaded from msys
  • leiningen requires wget, which can be installed e.g. from here. Another option would be to install MinGW with git and wget, cf. MinGW
  • clojure (1.5.0 at the time of writing) can be downloaded from Clojure.org, of course.
  • leiningen version 2 can be downloaded from leininigen github repo
  • clojure-mode version 2 can be installed via Marmalade/ELPA or manually from it’s github repository
  • nrepl.el can be also be installed via Marmalade/ELPA or manually from it’s github repository
  • I’ll throw in magit for smooth Emacs interaction with git, to be fetched via Marmalade/ELPA or manually from magit’s repository

I’ll use the following directory layout: All applications are stored under J:\\progs\, e.g. Emacs 24.2 will end up as J:\\progs\emacs-24.2\. I put the clojure.jar into J:\\progs\\clojure\ and will put lein.bat along with it into the clojure directory. The following shows the resulting directory as shown by dired:

  J:
  insgesamt 5124
  drwx------  3 schauer schauer   16384 Nov  1  2011 emacs
  drwx------ 12 schauer schauer   16384 Okt 16  2012 progs

  J:\\progs:
  drwx------  8 schauer schauer     16384 Nov  1  2011 clojure
  drwx------  8 schauer schauer     16384 Okt  7  2012 emacs-24.2
  drwx------ 11 schauer schauer     16384 Nov  1  2011 git
  drwx------  8 schauer schauer     16384 Nov  1  2011 wget

My Emacs configuration resides in a separate directory on J:, namely in J:\\emacs\. As I already have quite a lot of emacs configuration, I’m going to put all configuration options into separate files, which are placed in J:\\emacs\elisp\config\. Code from other people will go in separate directories as well, with J:\\emacs\elisp\others\ as the top-level folder. clojure-mode hence goes to J:\\emacs\elisp\others\clojure-mode. nrepl.el is a mode but a single file and goes straight into J:\\emacs\elisp\others\. Emacs looks for default.el or site-start.el during startup to look for personal or site-wide configuration. Both files can be placed in the site-lisp directory, i.e. in J:\\progs\emacs-24.2\site-lisp\

  J:\\emacs:
  drwx------  6 schauer schauer   16384 Nov  1  2011 elisp

  J:\\emacs\elisp:
  drwx------ 3 schauer schauer 16384 Nov  1  2011 config
  drwx------ 4 schauer schauer 16384 Nov  1  2011 development
  drwx------ 6 schauer schauer 16384 Nov  1  2011 others

Next we need to adopt the load-path, i.e. where Emacs looks for libraries. This means we need to put some content in J:\\progs\emacs-24.2\site-lisp\default.el that takes care of figuring out the drive letter and sets paths correctly:

(defun get-drive-from-filename (filename)
  "Returns a windows drive letter if filename contains a drive letter."
  (if (string-match "^\\(.:\\)/" filename)
      (match-string 1 filename)))

(defun get-drive-for-emacspath ()
  "Returns windows drive letter for the drive emacs can be found on."
  (get-drive-from-filename (getenv "EMACSPATH")))

(let ((emacsdrive (get-drive-for-emacspath))
       loadpath-additions)
  (dolist (dirname
       '("/emacs/elisp/"
         "/emacs/elisp/config/" 
         "/emacs/elisp/others/"
         "/emacs/elisp/others/clojure-mode/"))
    (setq loadpath-additions
      (cons (concat emacsdrive dirname) loadpath-additions)))
  (setq load-path
    (append loadpath-additions load-path)))

(require 'nrepl)        
(require 'clojure-mode)
(setq clojure-mode-inf-lisp-command 
      (concat (get-drive-for-emacspath)
           "/progs/clojure/lein.bat repl"))


(require 'magit)
(setq magit-git-executable
      (concat (get-drive-for-emacspath)
           "/progs/git/bin/git"))

The next step is to install leiningen. There are two ways: either downloading lein.bat and running it from cmd or downloading lein, the shell script and running it via the git bash prompt. I chose the latter. You will probably need to adjust your path to where you put the lein shell script, e.g. (bash syntax):

export PATH=$PATH:/j/progs/clojure/

To install leiningen locally (i.e. not in your %HOME%), you have to set the LEIN_HOME environment variable, i.e. like this (bash syntax):

export LEIN_HOME=/j/progs/clojure

Remember to always set this variable afterwards before running leiningen commands. Point your classpath to where you installed clojure:

export CLASSPATH=/j/progs/clojure/clojure-1.5.0/clojure-1.5.0

If you don’t want to set all these variables all the time, you can put them either in a .profile file in your %HOME% or in the global profile file that comes with git which resides in /j/progs/git/etc/. I added the following lines:

CLOJUREPATH=/j/progs/clojure 
if test -x $CLOJUREPATH
then 
     export PATH=$PATH:$CLOJUREPATH
     export LEIN_HOME=$CLOJUREPATH
     export CLASSPATH=$CLOJUREPATH/clojure-1.5.0/clojure-1.5.0
else
     echo "Can not access /j/progs/clojure"
     exit 1
fi

To figure out how to get rid of the hardcoded drive letter in bash is left as an exercise to the reader.

If you also want to keep the files / jars which leiningen retrieves in a local, non-standard maven repository, you need to set a variable in your $LEIN_HOME/profiles.clj file, like this:

{:user {:local-repo "j://progs/clojure/.m2/"
        :repositories  {"local" {:url "file://j/progs/clojure/.m2"
                                  :releases {:checksum :ignore}}}
        :plugins [[lein-localrepo "0.5.2"]]}}

Then run lein self-install. Afterwards, a lein repl should give you a Clojure read-eval-print-loop.

Now if you want to use nrepl and would like to use the support for nrepl/inferior-lisp which comes with clojure-mode you need to add a corresponding dependency to your project.clj for each project, cf. nrepl installation


Page 1 of 1, totaling 6 entries