Nov 16

In a blog post on dependency inversion in Clojure I’ve discussed what this DI principle actually is about and the solutions Clojure offers to support it. There is one aspect that bugged me a little: For me, a fundamental challenge with DI in a language like Clojure is that you often have simple functions depending on simple other functions (simple here in contrast to protocols). In the article linked to above I discussed one way of resolving function dependencies by using an indirection over a service locator. However, in practice writing service locator lookups by hand was getting tiresome soon. So instead I decided to have some fun throwing together some macros that handling function signatures, which resulted in a small library.

Enter funsig

funsig shoots lower than Clojure protocols: it provides dependency management on a per-function level. What this means is simply that you can define a function signature with defsig and then provide implementations with defimpl. Implementations will depend on the signature. Let’s say we have some application code that depends on a printer function:

(ns my.onion)

(defn printer [string]
    (println string))

(defn print-account-multiplied [account multiplier]
    (let [result (* account multiplier)]
        (printer result)))

One might want more flexibility on how and where to print. In other words, one might want the application code (print-account-multiplied) to depend on an abstraction (printer) only and not on the concrete implementation as in this example.

Funsig allows you to inverse the dependency of the printer implementation. You would define the signature with defsig and have the appplication code depend on the signature like this:

(ns my.onion
    (:require [de.find-method.funsig :as di :refer [defsig defimpl]]))

(defsig printer [string])

(defn print-account-multiplied [account multiplier]
    (let [result (* account multiplier)]
        (printer result)))

You can then provide the implementation with defimpl:

(ns my.onion.simle-printer
    (:require [de.find.method.funsig :as di :refer [defimpl]]
              [my.onion :as mo :refer [printer]]))

(defimpl printer [string]
    (println string))

Note that the implementation has a dependency on the signature, not the other way around. Also, application code (print-account-multiplied) simply depends on the signature — here the signature is in the same file, but reference to the var in another namespace (i.e. using require\:refer) also works normally. For application code, this looks like dependency injection.

funsig will also allow you to have multiple implementations and then select the one you want.

Have fun!

You’ll find all the gory details in the README and / or in the intro document. The latter also explains the relationship to the service locator pattern mentioned above.

Feedback welcome!

Posted by Holger Schauer

Defined tags for this entry:
Oct 8

Technical debt assumes that you have an existing system and you know already about the areas where the duct tape is becoming thin. Let’s discuss the problem, the conflict around technical debt a little. It’s basically always the same battle: the guys maintaining the system probably have good reasons why they want to pay back on the accumulated technical debt, whereas the product owner believes more functionality is more important.

This is a typical clash of interests between people from different tribes and usually from different departments: the developers report to a technical manager, the product owner to a business guy. The IT manager is typically under pressure to minimize costs, in particular costs for support and maintenance, so he’s interested in paying back on technical debt during projects. The business manager, however, is under pressure to convince new clients with new features. So, that there is a difference in how technical and non-technical people set priorities between technical debt and new features doesn’t come as a surprise: naturally, to the business side this is all about technical issues which they don’t consider to be their concern. It’s only natural that they believe that tackling such technical issues should be solved by the technical folks “somehow”. After all, it were these guys which build the system with all these problems in the first place, and now they want to spend more time and money on this? Although I’ve exaggerated here a little maybe and this line of argument is way too simplistic (given that often debt is accumulated over time), it’s still quite popular. Such product owners seem to believe that they are only responsible for developing new functionality. Corporate culture can contribute to this: some companies have a culture where every new glitter is taken for a star and gets way more attention than the cash cow functionality that keeps existing clients happy.

So some product owners will see technical debt as a separate issue which needs to go on a maintenance budget, a budget that somebody else is responsible for, e.g. the IT department. The problem with that idea is that in a culture where building new features is the only thing that counts, typically no one ever gets around to clean up the technical debt. All what will happen is that you’ll have some people that fix bugs with this maintenance budget. So you don’t actually ever spend the money on technical debt but only on the results of it which of course doesn’t address the root of the technical issues. Also, if you finally do get an approval for a refactoring project, these are really the most boring and horrible types of projects to work on. So it’s not likely to see a lot of happy, motivated people on such projects — and sure enough the most competent people are probably assigned to more valuable projects anyway. As a result and because there is no business pressure on such refactoring projects, as soon as something else comes up, such refactoring initiatives are abandoned or at least down-sized to a minimum. Peter Seibel describes in a recent, lovely written article the need for an Engineering Effectiveness team at Twitter in which he lays out in some details how hard it is to really put up with technical debt.

Let’s drill down on technical debt from the point of view of requirements. Wait a minute, this doesn’t sound right — how could technical debt be a requirement? Obviously, it isn’t, but there is a close relation. Technical debt can come in different flavors: e.g. you really should install a second machine and a load balancer so you’re prepared for failure or we really need to rewrite module foobar.clx to finally get rid of all the spaghetti code that’s slowing us down to oh, we really need to support responsive design, we’re getting more mobile users every day. Now, if you take a look at these three simple examples, they are related to quality requirements: the first is about availability, the second about maintainability and the last about usability. Take a look at the ISO 25000 standard for software product quality page listing all the quality attributes one might want to consider. Technical debt is always tied to some quality that your system should offer, but doesn’t.

A reason for this is that quality attributes are often not made explicit during requirements gathering, regardless of whether you’re following an agile approach or not. The less technical the product owner, the more often they seem to assume that performance, scalability, security and the other -ilities will come out just right by magic. Technicians, on the other hand, know perfectly well that they will not. It’s worse if they don’t that: they will build something that might fulfill the functional requirements but not the non-functional requirements. “Works on my machine” might be fine for a naive developer, but nobody will be happy if takes 5 minutes to load the page for the gazillions of parallel requests in production. There is a reason why software architecture is mostly concerned with quality, if you don’t plan and build for scalability, it’s unlikely you end up with a highly performing system. Take a look at the picture: this bridge wasn’t planned to cope with the amount of traffic it sees nowadays, some time in the 80s somebody decided it would be okay to have three lanes instead of two and didn’t think through the long term consequences.

Rheinbrücke A40 Duisburg-Neuenkamp
The bridge of the highway A40 over the Rhein in Duisburg. It is damaged so bad that trucks are no longer allowed to cross it.
Picture made by kaʁstn Disk/Catstitched by Daniel Schwen, licensed under CC BY-SA 3.0 de over Wikimedia Commons

Of course, technical debt can also accumulate over time, one might rightly point out: sure, one machine was enough for the requirements at the start, but nobody followed up on the increase in users and nobody cared (or had time) to clean up the code in module foobar.clx back then (nor in the following six years of quick bug fixes and minor one-off patches that have grown on it like leeches). This is a sign that nobody actively had an eye on how the world in and around the systems changes and to take action early on. For code quality, Uncle Bob points to the Boy Scout rule which says that you should always clean up (regardless of who messed it up) — another way of trying to ease maintainability by paying back on technical debt every day.

The over-aching point here is that you, as a technician and you, too, as a product owner need to think about the quality requirements of your system and to do that over the entire life cycle. What was a fitting solution at a time might turn into technical debt over time. This means means your system now doesn’t hold up to the quality requirements you and your clients needs and no matter what was the cause for it, you’re better of fixing it now than accumulate even more technical debt.

Isaac Sacolick describes in his article on How to get an agile product owner to pay for technical debt how to address the common problem of technical debt heads on, mostly from a managers perspective via process and by making people (mostly the technical lead) responsible. However, I think it makes a lot more sense to try to ensure a common understanding between the developers and the product owner. One way to go about this is to use quality scenarios: as a member of the dev team, ask your product owner how many users the system needs to handle. Ask her if she thinks it’s okay if somebody might find a security issue earlier than you. Or if she thinks it’s okay when a seemingly small change will take three weeks because the code is an intangible mess that nobody understands since Dieter left? These sort of questions hopefully open up to discussions based on business value, ROI and cost of delay. You are then using the language product owners understand and you might also learn something along the line (no we expect so many users to do X in the future, so investing in module Z doesn’t make sense).

Now, granted that doesn’t always work. An old colleague of mine always denounced the old system he worked on as being exactly the way it was ordered. Isaac’s article might offer some advice here, of which the most important one is probably that paying back technical debt is way easier if you have a CIO or someone with similar power supporting you.

Posted by Holger Schauer

Defined tags for this entry: , ,
Jul 9

I’m happy that there are apparently still people writing articles about Scrum that are not ranting and venting about how agile sucks: Colin Higgins explains that Scrum is not our saviour. It basically boils down to the old no silver bullet saying:

No, SCRUM, or any of the other agile methodologies will not just magically fix your throughput problems after you adopt every facet of them religiously. These methodologies will just show you how badly your development practices suck.

I agree completely. But there are some agile development practices or tools which can help you. Go look up extreme programming (XP) and what it urged developers to do. CI/CD, TDD and pair programming are by now widely accepted development practises which can help, because they aim at bringing up problems way before they end up in production.

However, there is a boundary to what these tools alone can offer. A lack of knowledge of the technologies or a general lack of problem solving skills are fundamental issues which you can overcome only by learning and gaining experience, if at all. Another quote from the linked article:

Agile process alone won’t fix your development problems, you need technical excellence for successful projects. The two are necessary for a continually successful organization.

I would agree in principle, but there are two important issues here:

1) “A fool with a tool is still a fool” and “technical excellence” are the two extreme ends of a large scale. It’s the same as with car driving: most developers like to see themselves among the top 5% of drivers developers, but as we all know there is not enough room for everyone within the 5%. Still, too many developers would shrug the call to technical excellence away, because they believe they are so great or they have no idea what technical excellence actually means. For these developers, there is always someone else to blame.

2) Requiring the proverbial “rock star developer” is — for reasons discussed above — in my opinion not a viable option. In other words, while Scrum might not be a saviour, we’re not doing anybody a favor if we instead declare “technical excellence” as the silver bullet that can lead to success. Yes, we should strive for technical excellence and do our best to improve our capabilities (technical and otherwise). This is what the article rightfully demands of developers: do the best you can and try to continuously improve. But there still needs to be a middle ground, where it’s possible to run projects / products successfully with the people you currently have. Agile helps with that because it makes clear what is realistically possible and it aims for continuous improvement, e.g. with retrospectives. To repeat the quote from above: “These methodologies will just show you how badly your development practices suck.” That’s a great starting point. It’s a learning opportunity. Go use it.

Note that this is important not only on a personal level, but also on an organizational level: Scrum Masters should try to change the company culture to value learning. This requires room for experiments and for failures, where room translates to time and safety (i.e. support). On the sprint level, this might translate to using spikes, but on an organizational level it might involve things like prototypes, split testing and other ideas from the lean startup camp.

Posted by Holger Schauer

Defined tags for this entry:
May 14

For a recent project, I need to extract data from a sqlite3 database. Writing the Clojure code to retrieve data was very straight-forward with and java-jdbc/dsl. Naturally, I wanted to have some tests for this code as well. In a previous Python project, I had a lot of fun using sqlite’s in-memory feature to run very speedy database tests, so of course I wanted this for my current Clojure project, too. This turned out not to be so easy as I had expected though, so I’m documenting it here for the next naive soul. My initial attempt with, java-jdbc/dsl and midje looked basically like this:

    (def testdbspec
      {:subprotocol "sqlite"
       :subname ":memory:"})

    (defn make-bookmark-table []
      (jdbc/with-transaction [db testdbspec]
         (jdbc/db-do-prepared db
           (ddl/create-table :bookmarks
                      [:id :int :primary :key]
                      [:type :int]
                      [:title "longvarchar"]))))

    (defn add-bookmark []
      (jdbc/with-transaction [db testdbspec]
         (jdbc/db-do-prepared db
              "INSERT INTO bookmarks (id, type, title) "
              "VALUES ('12453', '2', 'a bookmark')"))))

    (defn setup-database [db]

    (facts "Testing database access to bookmarks"
       (with-state-changes [(before :facts (setup-database))]
            (fact "We can retrieve a list of bookmarks"
                (fetch-tags :dataspec testdb) => [{:title "a bookmark"}]))))

This will fail quite early, because basically as soon as the with-transaction in make-bookmark-table has finished its work, the connection to the database will be closed. As a result, when the next with-transaction or jdbc\query is run, you’ll connect to a fresh in-memory database which doesn’t have the tables you just created. My old Python test code didn’t have this problem, because the setUp method of the TestCase would create the database connection (via sqlalchemy’s create_engine) and would keep it alive until the TestCase tearDown method would run.

I tried giving back the database connection from make-bookmark-table, but this just results in a “connection closed” error. Unfortunately, doesn’t support opening and closing the connection yourself. Sure, you can use get-connection, but you can’t feed this into either with-transaction or query. query uses with-open internally, which will conveniently close the connection for you. In a post on the perils of dynamic scope Stuart Sierra calls this the Dynamically-Scoped Singleton Resource and files it under ‘anti-pattern’. I gotten bitten pretty exactly by what Stuart describes: when dealing with sqlite’s in-memory feature, we would like to manage the connection ourselves, but we can’t.

After banging my head against this for a while, the only option I could come up with resorts to extract the relevant with-transaction from the setup code. Instead you have to wrap the tests with the transaction and then call the setup code, like this:

    (defn make-bookmark-table [db]
      (jdbc/db-do-prepared db
           (ddl/create-table :bookmarks
                      [:id :int :primary :key]
                      [:type :int]
                      [:title "longvarchar"])))

    (defn setup-tables [db]
       (make-bookmark-table db))

    (defn add-bookmark [db]
       (jdbc/db-do-prepared db
              "INSERT INTO bookmarks (id, type, title) "
              "VALUES ('12453', '2', 'a bookmark' )")))

    (defn remove-bookmark [db]
       (jdbc/db-do-prepared db
            (str "DELETE FROM bookmarks WHERE id = '12453")))

    (facts "Testing database access to bookmarks"
       (jdbc/with-db-transaction [db testdbspec]
            (setup-tables db)
            (with-state-changes [(before :facts (add-boomark db))
                                          (after :facts (remove-boomark db))]
                 (fact "We can retrieve a list of bookmarks"
                     (fetch-tags :dataspec db) => [{:title "a bookmark"}]))))

This works as expected.

Posted by Holger Schauer

Defined tags for this entry: ,
Dec 2

The problem

I was recently reading a nice German book on Effective Software Archictecture by Gernot Starke and stumbled upon a discussion of the dependency inversion principle, which got me thinking. Gernot Starke first discusses the problem with an allusion to traditional procedural programming (translation mine):

Classical designs for procedural languages show a very characteristic structure of dependencies. As (the following figure) illustrates, these dependencies typically start from a central point, for instance the main program or the central module of an application. At this high level, these modules typically implement abstract processes. However, they are directly depending on the implementation of the concrete units, i.e. the very functions. These dependencies are causing big problems in practice. They inhibit changing the implementation of concrete functions without causing impacts on the overall system.

Classical dependencies in procedural systems He then goes on to introduce the idea of programming against abstractions and introduces the idea of the dependency inversion principle, first coined in Bob Martin’s DIP article (see also another thorough discussion in Brett Schucherts article on DIP). Basically, the idea is that the integrating process refers only to abstractions (i.e. interfaces) which are then implemented by concrete elements (classes), cf. the next figure.

Integrate with abstractions When I take a look at some of my recent Clojure code or at some older code I’ve written in Common Lisp, I immediately recognize dependencies that correspond to those in a classical procedural system. Let’s go for an example and take a look at one specific function in kata 4, data munging:

(ns kata4-data-munging.core
    (:require [kata4-data-munging.parse :refer [parse-day]]
              [ :as 'io]))

(defn find-lowest-temperature
    "Return day in weatherfile with the smallest temperature spread"
    (with-open [rdr (io/reader weatherfile)]
         (loop [lines (line-seq rdr) minday 0 minspread 0]
        (if (empty? lines)
            (let [{mnt :MnT mxt :MxT curday :day} (parse-day (first lines)) ;<-- dependency!
              curspread (when (and mnt mxt) (- mxt mnt))]
            (if (and curday curspread
                  (or (= minspread 0)
                  (< curspread minspread)))
               (recur (next lines) curday curspread)
               (recur (next lines) minday minspread)))))))

The dependency here is on the concrete implementation of parse-day, you can basically ignore the rest for the argument here. Given that this was a small coding kata, this is not unreasonable (and in the course of the kata, the code changes to be more general), but the issues here are obvious:

  • if we would like to parse a weather-file with a different structure, we have to change find-lowest-temperature to call out to a different function,
  • if the result of the new function differs, again we have to change the implementation of find-lowest-temperature,
  • we also have to change the namespace declaration, i.e. we probably want to require a different module.

Clojure’s built-in solutions

The application of the dependency inversion principle is typically shown in the context of object-oriented programming languages, like Java where you use interfaces and classes implementing those interfaces for breaking the dependency on concrete implementations, cf. the figure above again. But as we’ll see the principle can be applied independently of object-orientation. I’ll discuss higher-order functions, protocols and multimethods as potential solutions.

Higher order functions

For starters and probably painfully obvious is to make use of the fact that Clojure treats functions as first-class objects and supports higher-order functions. This simply means that we can pass the parsing function as an argument to find-lowest-temperature.

(defn find-lowest-temperature
    "Return day in weatherfile with the smallest temperature spread"
    [weatherfile parsefn] ; <-- function as parameter
    (with-open [rdr (io/reader weatherfile)]
         (loop [lines (line-seq rdr) minday 0 minspread 0]
        (if (empty? lines)
            (let [{mnt :MnT mxt :MxT curday :day}  (parsefn (first lines))
              curspread (when (and mnt mxt) (- mxt mnt))]
            (if (and curday curspread
                  (or (= minspread 0)
                  (< curspread minspread)))
               (recur (next lines) curday curspread)
               (recur (next lines) minday minspread)))))))

This way, we can simply call (find-lowest-temperature "myweatherfile" parse-day) and freely substitute whatever file format and accompanying parse function we need. What does this buy us?

  • We no longer have to modify find-lowest-temperature when we want to use a different parse-day function.
  • The namespace containing find-lowest-temperature also no longer requires the (namespace containing the) parse function.

But there is also a down-side: find-lowest-temperature assumes that all parsing functions it will get fed adhere to a signature that is entirely implicit: parsefn needs to take exactly one line and needs to return a map with given key-names. Higher-order functions don’t provide a solution for this per-se, so in order to solve the implicit signature issue we need to look elsewhere. This is nothing Clojure specific: Assuming you’ve passed in an object either as a method parameter or via Setter-Methods or Constructor-Injection (cf. dependency injection), Python’s or Ruby’s duck-typing basically works the same way: the caller of a method simply assumes that the callee offers a method with the right signature. It is the responsibility of the caller (of find-lowest-temperature) to provide a matching function for parse-fn.

However, this actually amounts to just move the problem from one level to the next: now some other level has to decide which concrete parse function needs to be used. This next level will have again the exact same problems: it will depend on both concrete implementations of find-lowest-temperature and parse-day (or any other parse function). If you think this through, it’s obvious that in general at one point or another, you have code that determines which function to call and which parameters to use. The question is only if we can use abstractions or whether we have to use concrete implementations. We’ll return to this issue, that now at some other level you need to handle the problem, later.

Continue reading "Dependency inversion in Clojure"

Posted by Holger Schauer

Defined tags for this entry:
Sep 12

Kata 14 is a seemingly simple one that is concerned with, as Dave Thomas puts it, “the heuristics of processing” texts, using trigrams to produce (more or less) random new texts.

Trigrams are not a new concept for me. Although the underlying concept is simple, they can be used for many interesting applications. Trigrams are a special case of N-grams, where N=3 turns out to be especially useful (as in “giving better results as other values for N”) for natural language processing, at least for western languages. Nearly a decade ago, I had the pleasure to collaborate with some rather smart people who used trigrams to identify “matching” text snippets between dictionary entries. The idea was similar to what is described in this article on using trigram vectors and nearest neighborhood calculation for document classification.As I’m generally interested in NLP and not only in doing coding katas, I will mainly focus on the trigram aspects in this kata, not so much on the random text generation part.

If you followed the link to the Wikipedia article, it’s clear from the kata description that we need word-level trigrams, not character-level trigrams. The kata description also already reveals the data structure to use for solving the task, a HashMap and the algorithm is also described in enough detail to be straight-forward.

Let’s augment the kata a little and decompose the tasks:

  1. split some string into n-grams with default n=3, where we might want to apply different criteria to apply on where we can split the string (e.g. after each character or after each word)
  2. parse a file into n-grams, where we need to consider sentence boundaries
  3. parse a collection of files into n-grams concurrently (just to speed up parsing of a larger file collection and also to introduce another possibility to learn a little more about Clojure’s specific tools to handle concurrency)
  4. do some analysis on the trigrams found in the recommended Tom Swift and his aircraft text
  5. modify the n-gram computation to yield the “first two words as key / list of all third words value” map described in the kata
  6. build a lazy-seq version of the text generation algorithm (because, as the example in the description already shows, there might be circles which could lead to infinite results)
  7. maybe implement the nearest neighborhood classification scheme described in the paper linked just for fun

But first things first: let’s parse some string into trigrams. This, first of all, requires tokenization. As a first obvious naive idea, we start out with simple string splitting, using clojure.string. First let’s split on all whitespace #"\s", using the first sentence in the Tom Swift text:

kata14-trigrams.core> (str/split "Are you all ready, Tom?" #"\s")
["Are" "you" "all" "ready," "Tom?"]

This already shows the issues surrounding punctuation that Dave Thomas mentions in the kata description. Basically, we have to consider what we want to do with sentence boundaries. Fortunately, we’re ultimately using Java’s Pattern class, so we can also match (or split) on punctuation, although probably not on all punctuation, but only on those which signify a sentence boundary (i.e. the charset [.!?] followed by either whitespace #"\s+" or end of line $):

kata14-trigrams.core> (str/split "Are you all ready - Tom?" #"\s*\p{Punct}\s*")
["Are you all ready" "Tom"]
kata14-trigrams.core> (str/split "Are you all ready, Tom?" #"\s*[.?!](\s+|$")
["Are you all ready, Tom"]
kata14-trigrams.core> (map #(str/split %1 #"\s+")
                      (str/split "Are you all ready, Tom? I want to go."
(["Are" "you" "all" "ready," "Tom"] ["I" "want" "to" "go"])

This still leaves the question open of what we want to do with the comma or any other interleaving punctuation. It’s clear that we want to get rid of it somehow, but it’s not too clear whether we would like to see “Tom” as a valid consecutive element in the text generation part. Probably not, so an idea here would be to try to make the remaining punctuation elements visible as separate tokens.

Let’s put this issue aside and move on to the actual n-gram generation. Quite obviously, “computing” an n-gram is really just a simple sequence operation: you move through the sequence, always taking n elements as needed, until you’re done. This is completely straight-forward to accomplish with a simple accumulator (acc) to collect the results that we take while looping through the sequence. (Code is on github, as always.)

(defn ngram
   "Given a sequence sq and a number n, returns a sequence of new contiguous sequences of n items that appear in sq."
   ([squence n]
     (ngram squence n []))
   ([squence n acc]
         (if-let [sq (seq squence)]
             (recur (rest sq) n (conj aux (take n sq)))

Given that we might want to run this on longer strings (texts, books), it makes sense to make this lazy by wrapping the call to the accumulator version in a lazy sequence.

 kata14-trigrams.core> (ngram [1 2 3 4 5 6] 3)
 ((1 2 3) (2 3 4) (3 4 5) (4 5 6) (5 6) (6))
 kata14-trigrams.core> (realized? (ngram [1 2 3 4 5 6] 3))
 kata14-trigrams.core> (take 2 (ngram [1 2 3 4 5 6] 3))
 ((1 2 3) (2 3 4)

Okay, let’s combine this with our clojure.string/split experiments:

kata14-trigrams.core> (map #(str/split %1 #"\s+")
                      (str/split "Are you all ready, Tom? I want to go."
(["Are" "you" "all" "ready," "Tom"] ["I" "want" "to" "go"])
kata14-trigrams.core> (map #(ngram %1 3) *1)
((("Are" "you" "all") ("you" "all" "ready,") ("all" "ready," "Tom") ("ready," "Tom") ("Tom"))
 (("I" "want" "to") ("want" "to" "go") ("to" "go") ("go")))

Okay, this looks like we’ve basically have everything we need in hand, now let’s make it a little bit more formal. First of all, the tokenization step. So far, we have basically done two things in one step, sentence boundary detection and in-sentence tokenization. I already hinted at the need to do further work on the in-sentence tokenization wrt. punctuation and there might be other steps that we might want to add in the future, for instance, stemming or further morphological analysis. I’ll not go in the direction of a more thorough tokenization method, which would require to go beyond regular expressions for many languages, but at least let’s communicate the intention of how the tokenize method works clearly.

(fact "Tokenize an input string, splitting sentences along the way"
      (tokenize "Are you ready, Tom? I want to go.") => '(("Are" "you" "ready" "," "Tom")
                                                                  ("I" "want" "to" "go")))

(defn tokenize
  "Tokenize a string"
  (-> string

This is basically the top-level function for tokenizing an incoming string, threading the result of splitting sentences into a tokenization function. Let’s take a look at the details, which shouldn’t be surprising at all. First we have split-sentences, which is basically str/split on sentence end markers. Then we have split-on-whitespace, which we’ve also already seen. split-off-punctuation is basically handling all punctuation not used up during sentence boundary detection, which we will want to keep. And then we have two wrappers tokenize-sentence(s) which do nothing more than handling mapping over the various bits and pieces. This concludes the tokenization step, phew.

(fact "Split sentences in a string"
      (split-sentences "Are you ready, Tom? I want to go.") => ["Are you ready, Tom" "I want to go"])

 (defn split-sentences
   "Split a string into a sequence of sentences"
   (str/split string #"\s*[.!?](\s+|$)"))

    (fact "Splitting on whitespace"
     (split-on-whitespace "Are you  ready") => ["Are" "you" "ready"]
     (split-on-whitespace "Are") => ["Are"])

(defn split-on-whitespace
  "Take a string and split it's content on whitespace, removing the whitespace"
      (str/split string #"\s+"))

(fact "Splitting but keeping punctuation if any"
     (split-off-punctuation "ready,") => ["ready" ","]
     (split-off-punctuation "ready") => ["ready"]
     (split-off-punctuation "!+#?") => [""])

(defn split-off-punctuation
   "Take a string and split it's content, keeping punctuation as new tokens"
  (let [match (re-find #"(\w+)(\p{Punct})?" string)
            result (rest (keep identity match))]
     (if (seq result)
    (vector ""))))

(fact "Tokenize a sentence"
      (tokenize-sentence "Are you ready, Tom?") => '("Are" "you" "ready" "," "Tom" "?"))

(defn tokenize-sentence
  "Take a single sentence and return a sequence of tokens for it"
  (flatten (map split-off-punctuation
               (split-on-whitespace sentence))))

(fact "Tokenize some sentences"
       (tokenize-sentences ["Are you ready, Tom?" "I want to go."]) => '(("Are" "you" "ready" "," "Tom" "?")
                                                                         ("I" "want" "to" "go" ".")))

(defn tokenize-sentences
   "Take a sequence of sentences and return a sequence of tokens for  each sentence"
  (map tokenize-sentence sentences))

When combining this with the ngram function, this is already pretty close to what we’ll need to solve the original kata, although we will need some further adjustment to the data structure, which I’m going to tackle later.

kata14-trigrams.core> (map #(ngram %1 3)
                       (tokenize "Are you ready, Tom? I want to go."))
((("Are" "you" "ready") ("you" "ready" ",") ("ready" "," "Tom") ("," "Tom") ("Tom"))
(("I" "want" "to") ("want" "to" "go") ("to" "go") ("go")))

So, let’s move to the next part which is generating ngrams for an entire file. First of all, I think that again we better do this in a lazy fashion, no need to do lots processing of huge files that might not be completely necessary. Looking back we can see that for any given string tokenize processes the same string at least thrice: we’re first splitting on sentence boundaries, then handle punctuation and finally split on whitespace. If you think about reading files, it’s quite obvious that the readLine method of which is behind Clojure’s line-seq is also processing buffers quite similarly looking for line ends to split on. Maybe, we can combine some of the work? Let’s start out with figuring out how to process a file char by char lazily. An answer to a stackoverflow question on processing files per character in Clojure strictly follows line-seq:

(defn char-seq
  [^ rdr]
   (when-let [chr (.read rdr)]
      (if (>= chr 0)
          (cons (char chr) (lazy-seq (char-seq rdr))))))

This is a start but not too helpful as discussed in this other stackoverflow thread on processing large text files, as the result of line-seq and char-seq is a cons and the lazy part of it doesn’t help you much when you’re not processing the file right away. Instead one might want to return a lazy sequence of results, closing the file only afterwards. This could like this:

; cf.
(defn lazy-file-chars [file]
   (letfn [(lfl-helper [rdr]
               (if-let [chr (.read rdr)]
                      (when (> chr 0)
                          (cons (char chr) (lfl-helper rdr)))
                      (do (.close rdr) nil))))]
      (lfl-helper ( file))))

When you look at this simple piece of code, besides reading characters from disc and building up a lazy-seq, it’s also a) doing a sanity check on the input and b) building up a particular structure to return. Sounds exactly like the hooks we might want to consider for parsing sentences on read. Let’s rip the code apart and combine it with the guts of split-sentences (matching explicitly on characters instead of using regular expression character classes):

 (defn read-next-sentence [rdr aux]
    (if-let [chr (.read rdr)]
       (let [character (char chr)]
         (cond (= \. character) aux
                       (= \? character) aux
                       (= \! character) aux
                       (= \tab character) (recur rdr (conj aux \ ))
          :else (recur rdr (conj aux character))))

(defn file-sentences [file]
     (letfn [(lfs-helper [rdr]
               (if-let [sentence (seq (read-next-sentence rdr (vector)))]
                 (cons (apply str sentence) (lfs-helper rdr))
                 (do (.close rdr) nil))))]
                (lfs-helper ( file))))

read-next-sentence has some obvious deficiencies: it now splits sentences on every occurrence of .?!, not only on those occurrences which are followed by whitespace. Second, it should handle (only) multiple occurrences of \return\newline characters (CRLF) as sentence delimiters, too. Solving both of these issues requires to go in the direction of real parsers where we would have to see aux as a stack of previously read characters. And we might not only want to deal with tabs specially (turning them into a space), e.g. we might want to replace multiple spaces/tabs into a single space etc. I’ll just draw a sketch here that we might want to elaborate further:

(fact "Test for sentence end"
      (sentence-end-p \space   [\g \o \.])                    => true
      (sentence-end-p \space   [\r \e \a \d \y \?])           => true
      (sentence-end-p \newline [\y \return \newline \return]) => true
      (sentence-end-p \newline [\y \newline])                 => true
      (sentence-end-p "B"      [\.])                          => false
      (sentence-end-p \newline [\y \return])                  => false
      (sentence-end-p \newline [\y])                          => false)

   (fact "Parse result for characters depends on previous reads"
      (next-char-result \space   [\g \o \space])  => [\g \o \space]
      (next-char-result \tab     [\g \o])         => [\g \o \space]
      (next-char-result \tab     [\g \o \space])  => [\g \o \space]
      (next-char-result \tab     [\g \o \tab])    => [\g \o \space]
      (next-char-result \return  [\g \o])         => [\g \o]
      (next-char-result \newline [\g \o \return]) => [\g \o \space]
      (next-char-result \newline [\g \o])         => [\g \o \space])

(defn sentence-end-p [character charstack]
    (cond (and (= character \space)
                       (some (partial = (peek charstack)) [\. \? \!])) true
                (and (= character \return)
                       (some (partial = (peek charstack)) [\. \? \!])) true
                (and (= character \newline)
                       (some (partial = (peek charstack)) [\. \? \!])) true
                (and (= character \newline)
                       (or (= (peek charstack) \newline)
                           (and (= (peek charstack) \return)
                                (= (peek (pop charstack)) \newline)))) true
                :else false))

(defn next-char-result [character charstack]
     (cond (and (empty? charstack)
                (or (= character \space)
                                (= character \tab)
                                (= character \newline)
                                (= character \return))) charstack
                 (and (= character \space)
          (= (peek charstack) \space))  charstack
         (and (= character \tab)
          (= (peek charstack) \space))  charstack
             (and (= character \tab)
              (= (peek charstack) \tab))  (conj (pop charstack) \space) ; should never happen
          (= character \tab) (conj charstack \space)
                      (= character \return) charstack
                 (and (= character \newline)
                      (= (peek charstack) \space)) charstack
                 (and (= character \newline)
                      (= (peek charstack) \return)) (conj (pop charstack) \space) ; should never happen
                      (= character \newline) (conj charstack \space)
                 :else (conj charstack character)))

I’ll leave it at that, although it’s clear that we can and probably should extend it in many different ways. Here are the adapted functions to use these:

(defn read-next-sentence
       (read-next-sentence rdr (vector) (vector)))
   ([rdr seen result]
       (let [chr (.read rdr)]
           (if (and chr
                    (>= chr 0))
               (let [character (char chr)]
                   (if (sentence-end-p character seen)
                        (recur rdr (conj seen character)
                           (next-char-result character result))))

(defn read-sentences [x]
    (letfn [(lfs-helper [rdr]
                (if-let [sentence (read-next-sentence rdr)]
                    (cons (apply str sentence) (lfs-helper rdr))
                    (do (.close rdr) nil))))]
        (lfs-helper ( x))))

The result is here that we now have a read-next-sentence function which just reads (non-lazily) and a (local) helper function which uses it to build up lazy sequence of sentences. Let’s test it briefly:

kata14-trigrams.core> (pprint 
                     (map #(ngram %1 3) 
                  (take 2 
                        (read-sentences test-file)))))
((("The" "Project" "Gutenberg")
 ("Project" "Gutenberg" "EBook")
 ("Gutenberg" "EBook" "of")
 ("EBook" "of" "Tom")
 ("of" "Tom" "Swift")
 ("Tom" "Swift" "and")
     ("Swift" "and" "his")
     ("and" "his" "Airship")

Although one would probably now integrate more functionality from tokenize-sentences into read-next-sentence, I’ll won’t elaborate this now and see task 2 as solved. As a side note, this looks as if it’s only restricted to files now, but it really isn’t, as will happily accept StringReader arguments:

kata14-trigrams.core> (import
kata14-trigrams.core> (take 2 (read-sentences (StringReader. "This is a sentence. And another one")))
("This is a sentence." "And another one")

Continue reading "Coding katas Clojure -- Trigrams"

Posted by Holger Schauer

Defined tags for this entry: ,
Apr 13

Christophe Rhodes’ post on http-content-negotiation and generalized specializers in CLOS (Common Lisp Object System) made an ugliness in a small Clojure web application jump right into my face. I’m using liberator to setup so-called resources (side-note: While this post assumes some familiarity with liberator, the main aspects is actually multi-method handling in Clojure — I hope it’s useful even if you don’t know or care about liberator). Resources are serving as ring handlers (typically used with compojure) and are used to deal with most aspects of request handling in a fairly declarative manner, including content negotiation. Liberator provides decision points and handlers, moving a so-called context around between the various functions that you need to associate with resources — map-like data returned from a decision function will be merged with the existing context. So far, so good. The bad part, however, was that I used a single resource definition for providing multiple media types. More exactly my code / resource definition has an anonymous handler function which uses a simple value check to serve the correct media type (we’re talking about the Accept-Header of the incoming request, cf. RfC 2616, Sec. 14.1, like this (as you can imagine, that’s a somewhat simplified version):

 (defresource users
   :available-media-types ["text/html" "application/json"]
   :method-allowed? (request-method-in :get)
   :exists? (fn [context]
                {:users (find-users)})
   :handle-ok (fn [context] 
                  (let [media-type
                       (get-in context [:representation :media-type])]
                    (condp = media-type
                          (generate-string (get context :users))
                          (usersview (:users context)))))
   :handle-not-found (fn [context]
                         (let [media-type
                              (get-in context [:representation :media-type])]
                            ;; TODO: Handle not found for HTML
                            (condp = media-type
                                   (generate-string {:error "No such user"})))))

From a functional point of view, there is not much wrong with this. It’s very close to the description in the relevant part of the liberator tutorial on content negotiation. But from an aesthetical point of view, the condp expressions to determine finally how to present the resource data is plainly ugly. To get rid of this ugliness, the inspiration I took from Christophe’s article is to rely on Clojures method dispatch (which is the simple part from Christophe’s post only). The idea is straight-forward: Instead of using a simple anonymous function which convolutes two different media-types, introduce a multi-method like users-handle-ok that dispatches on media-type: We can simply define a dispatch method (via defmulti), moving the code which determines the accepted media-type (e.g. “application/json”). This value is then used by Clojure to determine the right method to use.

(defmulti users-handle-ok 
  "Handle OK for users resource for different media-types"
  (fn [context]
  (get-in context [:representation :media-type])))
(defmethod users-handle-ok "application/json" [context]
  (generate-string (get context :users)))
(defmethod users-handle-ok "text/html" [context] 
(usersview (get context :users)))

;; some code elided here ...

(defresource users
  :available-media-types ["text/html" "application/json"]
  :method-allowed? (request-method-in :get)
  :exists? #(users-exists? %)
  :handle-ok #(users-handle-ok %)
  :handle-not-found #(users-handle-not-found %))

From a clean code perspective, this has two benefits: we now have mainly code left which does one thing at a time (SRP), which is what we should aim for and which makes unit testing also somewhat easier and more to the point. It also slims down the amount of code in the resource definition considerably. It’s now much more obvious that the resource definition is (from the application developer point of view) not much more than an integration point for different other functions.

Of course, we can use a similar approach for all of the other methods as well. Let’s assume that I have a resource that can generate HTML and JSON, but expects that all incoming POST requests contain JSON only. This will look utterly similar to the approach above, only this time we dispatch on the request-method. If we are now POST-ing to this resource with a different Content-Type, we’ll receive a “415 Unsupported media type” reply from liberator.

(defmulti known-content-type?
  "Determine known content types depending on request-method"
  (fn [context]
  (get-in context [:request :request-method])))
(defmethod known-content-type? :post [context]
  "Allow only application/json for POST requests"
  (when-let [content-type (get-in context [:request :content-type])]
    (condp = content-type
           "application/json" true
(defmethod known-content-type? :default [_]

(defresource someresource
  :available-media-types ["text/html" "application/json"]
  :method-allowed? (request-method-in :get :post)
  :known-content-type? #(known-content-type? %)
  :exists? (fn [context]
       (when-let [data (find-daa)]
             {:data data}))
  :handle-ok #(handle-ok %)
  :post! #(handle-post! %)
  :post-redirect? (fn [context] {:location (url-in-context "someurl")}))

As you might guess, this method known-content-type? is probably applicable for most resources. But how would you handle the exception to the exception? This is actually quite easy, as it turns out. In line with most examples of multi-methods I’ve seen so far, we’ve used a simple value to dispatch on. But of course a map can be a value, too. Given the need to override (specialize) the method for some resource, the idea is to define the dispatch method to return a map with the request-method and the resource. We then define the methods with appropriate values. The nice thing about this is that it’s very easy to arrange for default behavior for a request method by just leaving out the resource key — the dispatch function takes precautions not to add a superfluous :resource key in case none is added to the context by the resource.

(defmulti known-content-type?
  "Determine known content types depending on request-method"
  (fn [context]
(logging/info (str "Found resource: " (:resourceclass context)))
(logging/info (str "Method: " (get-in context [:request :request-method])))
(let [dispatchval {:request-method (get-in context [:request :request-method])}]
  (if-let [resource (:resourceclass context)]
    (assoc dispatchval :resourceclass resource)
(defmethod known-content-type? {:request-method :post} [context]
  "Allow only application/json for POST requests"
  (logging/info "Determining known content-type for :post!")
  (when-let [content-type (get-in context [:request :content-type])]
(condp = content-type
  "application/json" true
(defmethod known-content-type? :default [_]
  (logging/info "Determining known content-type for :default!") 

(defresource some-resource
  :available-media-types ["text/html" "application/json"]
  :method-allowed? (request-method-in :get :post)
  :known-content-type? #(known-content-type? %)
  :exists? (fn [context]
       (when-let [data (find-daa)]
             {:data data}))
  :handle-ok #(handle-ok %)
  :post! #(handle-post! %)
  :post-redirect? (fn [context] {:location (url-in-context "someurl")}))

(defresource special-resource
  :available-media-types ["text/html" "application/json"]
  :method-allowed? (request-method-in :get :post)
  :known-content-type? #(known-content-type? (assoc % :resource special-resource))
  :exists? (fn [context]
     (when-let [data (find-special-data)]
       {:data data}))
  :handle-ok #(special-handle-ok %)
  :post! #(special-post! %)
  :post-redirect? (fn [context] {:location (url-in-context "specials")}))

(defmethod known-content-type? {:request-method :post :resource special-resource} [context]
  (logging/info "Determining known content-type for :post and special-resources!")
  (when-let [content-type (get-in context [:request :content-type])]
(condp = content-type
  "application/json" true
  "application/x-www-form-urlencoded" true

With these definitions in place, the default for POST requests using this known-content-type? method would be to accept only application/json. However, the special-resource “overrides” this behavior to also accept regular form data. Posting to the various resources will produce output like the following:

2014-04-14 15:04:31,088 [main] INFO  utils - Found resource: liberator.core$resource$fn__3268@688dbd21
2014-04-14 15:04:31,088 [main] INFO  utils - Method: :get
2014-04-14 15:04:31,089 [main] INFO  utils - Known content-type for :default!
2014-04-14 15:27:14,974 [main] INFO  utils - Found resource: liberator.core$resource$fn__3268@688dbd21
2014-04-14 15:27:14,974 [main] INFO  utils - Method: :post
2014-04-14 15:27:14,975 [main] INFO  utils - Known content-type for :post!
2014-04-14 15:04:31,127 [main] INFO  utils - Found resource: liberator.core$resource$fn__3268@688dbd21
2014-04-14 15:04:31,127 [main] INFO  utils - Method: :post
2014-04-14 15:04:31,127 [main] INFO  utils - Known content-type for :post and special-resources!

Please note that known-content-type? has to be a known symbol (defined or at least declared) prior to be usable in the resource definition, whereas adding the more specialized method requires the special-resource to be defined — declaring it won’t be enough.

Using maps as dispatch values seems to be a nice and powerful tool to know about. There are, however, still some points where I see room for improvement:

  • We would probably like to use the same mechanism for a ton of functions, always highly similar. E.g. the methods for handle-not-found and known-content-type? look highly similar on the structural level. Also, when you have multiple resources, the dispatch function for one method type (i.e. something like handle-ok) are probably always the same, so are the dispatch arguments (i.e. the media types our web application will handle). Maybe a macro would be useful here, but I haven’t thought it through yet.
  • Handling the Accept header is actually way more complicated. Fortunately, liberator takes care already of choosing the “right” media-type (cf. again the liberator tutorial on content negotiation. However, as also discussed in the same section of said tutorial, there are more negotiable parameters which might come into play, e.g. language or encoding. This quite obviously could lead to some combinatorial explosion. While the approach using a map outlined above is a way to handle it, this approach is essentially mimicking CLOS’ dispatch on multiple arguments via a single argument dispatch.
  • I haven’t even started to think about how one would approach the more advanced problem that Christophe is solving by using his MOP trickery generalized-specializers.
  • The name of the method users-handle-ok isn’t really telling. Of course, a name like users2json or serve-users-view seem better suited to describe what the respective methods are doing, but this obviously would defeat the idea of using multi-methods and the associated benefits. Still, the name should probably not be tied so close to the resource definition. Using the name parameter of methods is one way to remedy this particular issue.
  • Finally, apparently the slots of a liberator resource expect function objects. Liberator won’t just take the name of a function and do the right thing, it’ll throw an exception. Not a big deal, given that we might need to mangle the implicit argument (the context) anyway, cf. the :known-content-type?slot of the special-resource.

Posted by Holger Schauer

Defined tags for this entry:
Mar 2

I recently did some coaching on how agile teams work, what roles there are and what are their respective responsibilities. It’s probably no surprise that one of the main topics was self-organization and what this implies for how agile teams work, also how to get there and how to solve the various puzzles (like distributed teams) occurring when working with bigger team setups. You’ll find the slides below. I was asked to talk about these issues because the team was having trouble understanding what was expected of them and how to organize themselves. In my experience, that’s a quite common situation to run into when the team members (or at least some of them) have no or only little experience working in an agile setup. But even when you already have some experience, it might happen that the team or at least important members of it lose the understanding of how to work together. And of course, that’s much more likely to occur when you interrupt the team working together, e.g. by putting some people onto different jobs, even if only for a short time. I guess this is what people outside of a team (i.e. managers) often forget: teams are usally only able to perform great over a longer time if they can focus on “it”, i.e. working on their tasks and working on their understanding of how to work together as a team.

Anyway, I started by thinking about why people are always so eager to talk about roles and responsibilities. I came to the conclusion that it’s mainly a matter of safety or lack of it. That’s why you have these discussions mainly at the beginning of a collaboration. The important point that became apparent to me is that roles and responsibilities don’t matter in a performing team (performing in the sense of Tuckman’s model). And the corollary to that is: When you don’t know what is expected of you (as a team member), your team is really just a group of people and you don’t know how to work together as a team (yet). Let me elaborate this: I believe for a group of people to work together successfully it’s most vital to have a common understanding of what needs to be done and to just do it (as the old sport company’s saying goes). Of course, you will need somebody who will just do it, but does this imply that you have to have a formal org-chart upfront that tells you who has which role? What if you happen upon things that need to be done that you did not think about when compiling your org-chart? You will need somebody to step up anyway, so why not just leave it at just do what needs to be done? Even better if all of your team have a common understanding on how to do things (cf. for example a commonly agreed upon definition of done).

My main (and very obvious) advice on solving the issue of “I don’t know what is expected of me” is to compile a list of things that need to be worked on and to get started. Whether this “list of things” is a backlog or just a todo list, whether it’s on a physical board, in an Excel file or in some other system, is not as important as actually working with this list: re-visit this compiled list every so often to update it, prioritize items, move things around, augment and transform it as needed. And of course, remove things to do from the list that are actually done. But do this in a way that it is clearly visible all the time to all team members. If you modify this list openly so that everyone can participate, everybody will learn about what is going on, you can have discussions that will help team members understand why something needs to be done first and which things are problematic. And, last but not least, people will communicate over helping each other out, give input on difficult questions and work together.

The other important ingredient for me is giving and receiving feedback. If you’re working in a Scrum-like fashion, review and retrospective meetings at the end of an iteration are the ideal places for this. I believe that the awkwardness that in particular the retrospectives have for people new to agile is an indicator of a company culture, team setup or personal attitude that does not honour open discussions and feedback. Retrospectives make people feel uncomfortable especially if they are feeling insecure. However, without feedback it’s difficult to learn, so organizations should in my opinion strive to provide safe environments for their people. A simple example of this is to have the retrospective facilitator use the “setting the stage phase” to ensure that the people feel safe (cf. this article on creating safe environments for retrospectives), maybe by having the team come up with rules for safe communication on their own or by simply reminding them of already existing rules. However, the important point for me is that team members should learn to give and take feedback in any situation. No, I don’t mean that it’s appropriate to interrupt your colleague every five minutes to complain about some weird code or that you should accept getting interrupted every so often. But, e.g., if Maria believes there are many issues with the code that Peter checks in everyday, it should be okay for both of them when she arranges a discussion between her and Peter on how to solve the problem. If Maria can provide her feedback in a way that is both personal and helpful, this will help building trust which leads to a feeling of safety wrt. working with each other. Typically, such discussions in between (as well as during retrospectives) will address many of the fears, uncertainty and doubt that trouble people about what is expected of them.

I believe that discussions about how a team should organize itself and surrounding tools for this (e.g. the RACI model) are mainly waste. Remember, during the time you’re discussing your organization, you don’t add anything of value for your clients. I would rather spend the time on making progress on the tasks that are waiting for attention and figure out how we work together while actually accomplishing something.

Finally, if you have some feedback on my slide desk, I would be eager to hear it.

Posted by Holger Schauer

Defined tags for this entry: , ,
Feb 17

Kata 6 is concerned with anagrams. An anagram is a word that consists of characters which, when combined in a different order, form a different word. Now, when I started out with this kata, I was sitting on a train without internet connection, so I just went ahead with what I remembered from a quick glance over the kata description I had done the week before. So I thought that all that needed to be solved was to determine whether two words were anagrams of each other (the complete code, btw. can be found in my github repository for the Clojure codekatas). My initial idea how to solve this is to generate the sets of the characters of both words and compare those:

(defn remove-blanks [word]
         (str/replace word " " ""))

(defn anagram-set? [word1 word2]
         (let [w1 (remove-blanks word1)
            w2 (remove-blanks word2)] 
       (= (set w1) (set w2))))

This time, I opted for using midje for running the tests, in particular due to the possibility to run tests continually via lein-midje. Midje takes a slightly different approach / syntax to writing tests, adding the notion of facts that are then verified. I.e. tests with midje look like this:

(facts "Testing the set implementation for checking anagrams"
      (fact "Set anagram can find anagrams"
         (anagram-set? "the law" "wealth") => true)
     (fact "Set anagram is too simplistic"
         (anagram-set? "the lalalaw" "wealth") => true))

You can already see from the latter fact what is wrong with the initial solution: it’s too simplistic with regard to handling the number of occurences of some character. (Some might say, the introduction of remove-blanks also is too complicated, but I wanted to handle Anne Clark’s “The law is an anagram of wealth”.

Traditional solution

When I finally had some more time to read the kata description more carefully, I recognized that the task actually is to find all anagrams of a given word, checking back against a given wordlist. So that means that the kata consists of two tasks: generate all possible combinations for a given character sequence and check in this wordlist whether some candidate character sequence amounts to a known word. Now, if you take a step back, it’s easy to see that anagrams are nothing else than permutations of the elements of a given (character) sequence, with the additional restriction that all such permutations must be (known) words again. So, we end up with a skeleton which looks like this:

(defn generate-anagrams [word]
   "Generate all anagrams of word"
   (generate-permutations word))

(defn find-anagrams [word words]
   "Finds all anagrams of word in (the sequence of) words"
   (let [anagrams (generate-anagrams word)
              wordset (set words)]
     (loop [candidates anagrams
            result []]
        (if (empty? candidates)
            (recur (rest candidates)
                 (if (and (not (= (first candidates) word))
                     (contains? wordset (first candidates)))
                (concat result (list (first candidates)))

Which now, of course leaves us with the task to implement a permutation algorithm. I must admit I had a pretty hard time to come up with something on my own without resorting to looking at other people’s code. Given that the task of code katas is not primarily to invent algorithms on the fly, but to practice coding, I finally read the wikipedia paragraph on computing permutations in lexicographic order, which has a blue print of an algorithm which is attributed to Naranya Pandita, who invented it in the 14th century already. I took a very verbatim and top-down approach this time and ended up with this as the next piece of code:

(defn next-permutation [squence]
   (when-let [k (find-largest-index-with-bigger-successor squence)]
        (let [l (find-largest-index-bigger-value squence k)
              swapped (swap-positions squence k l)
              current-perm (reverse-tail swapped (inc k))]

(defn generate-permutations [squence]
   (let [start-perm (sort squence)]
      (loop [permutation (next-permutation start-perm)
             result (list start-perm)]
         (if (or (not permutation)
             (empty? permutation))
         (recur (next-permutation permutation)
                (concat result (list permutation)))))))

I went back to writing tests:

   (fact "finding the largest index with bigger successor"
         (find-largest-index-with-bigger-successor [1 2]) => 0
         (find-largest-index-with-bigger-successor [1 2 3 4]) => 2
         (find-largest-index-with-bigger-successor [1 2 4 3]) => 1
         (find-largest-index-with-bigger-successor [1 3 4 2]) => 1
         (find-largest-index-with-bigger-successor [1 4 3 2]) => 0)

The test results are taken straight out of the wikipedia article. Implementing this is pretty straight-forward: we just iterate through the list, keeping track of the current position and check whether the following element is bigger than the current element. If so, we keep the current position, otherwise we keep what we had so far as the result. When we reach the end of the sequence (or there is no subsequent to compare to), we have found the largest position (index) that has a successor with a bigger value. One thing is worth pointing out: the usage of (comp pos? compare) is necessary because > does only work on numbers, but no on characters (or keywords). Why Clojure does not follow Python (which provides a general purpose operators, which use something like compare under the hood which you can override for your data types) in this aspect is beyond me.

(defn find-largest-index-with-bigger-successor [squence]
     (loop [restsq (seq squence)
           curpos 0
                   curresult nil]
        (cond (or (empty? restsq)
                  (empty? (rest restsq)))
             ((comp pos? compare) (second restsq) (first restsq))
               (recur (rest restsq) (inc curpos) curpos)
               (recur (rest restsq) (inc curpos)     curresult))))

The next step is finding the position of some value that is bigger than the position that we just determined. Again this is straight-forward:

   (fact "finding the largest index that has a bigger value than some other position"
         (find-largest-index-bigger-value [1 2 3 4] 2) => 3
         (find-largest-index-bigger-value [1 2 4 3] 1) => 3
         (find-largest-index-bigger-value [1 3 4 2] 1) => 2
         (find-largest-index-bigger-value [1 4 3 2] 0) => 3)

   (defn find-largest-index-bigger-value [squence index]
     (let [compval (nth (vec squence) index)]
    (loop [restsq (seq squence)
             curpos 0
             curresult nil]
            (cond (empty? restsq)
            ((comp pos? compare) (first restsq) compval)
               (recur (rest restsq)  (inc curpos)  curpos)
              (recur (rest restsq) (inc curpos) curresult)))))

We now have to swap these two elements which is easy enough to do with vectors:

   (fact "swapping two positions in a sequence"
         (swap-positions [1 2 3 4] 2 3) => [1 2 4 3]
         (swap-positions [1 2 4 3] 1 3) => [1 3 4 2]
         (swap-positions [1 3 4 2] 1 2) => [1 4 3 2]
         (swap-positions [1 4 3 2] 0 3) => [2 4 3 1])

   (defn swap-positions [squence k l]
    (let [seqvec (vec squence)]
      (assoc (assoc seqvec k (nth seqvec l))
         l (nth seqvec k))))

I first fiddled around with take and drop to avoid converting the input sequence to a vector but this makes the code much more complex. Why there is no general-purpose position-based replace is, again, beyond me — there are a number of discussions around (the lack of) a general-purpose subsequence function which point out issues with complexity (code-and performance-wise), but I doubt that most manually-crafted workarounds lead to any better solutions. Maybe I’m missing something obvious here.

Next, we need to reverse the rest of the sequence behind the position which we just swapped. The example in the wikipedia article is not entirely clear for longer remainders, but some tests revealed that the right position is really the one we just used, like this:

   (fact "reverse the tail of a sequence"
         (reverse-tail [1 2 4 3] 2) => [1 2 3 4]
         (reverse-tail [1 3 4 2] 1) => [1 2 4 3]
         (reverse-tail [1 4 3 2] 0) => [2 3 4 1])

   (defn reverse-tail [squence tail-position]
    (let [prefix (take tail-position squence)
       tail (drop tail-position squence)
       revtail (reverse tail)]
    (concat prefix revtail)))

So, with this we now have all pieces in our hands and can test the entire algorithm:

   (fact "finding the next permutation"
         (next-permutation [1 2 3 4]) => [1 2 4 3]
         (next-permutation [1 2 4 3]) => [1 3 2 4])

Which will, surprise, surprise, give the expected results. So, with this we are able to generate all 24 permutations of [1 2 3 4] and we can go back to our anagram task. Turns out that the tests would fail: I hadn’t thought about the fact that the destructuring of the character sequence (i.e. the word) would require subsequent combination of the permutation results. That’s easy enough to correct by applying str to all permutation results.

(defn generate-anagrams [word]
    (map (partial apply str) (generate-permutations word)))

Now, when you run this code with the test data given in the original kata:

   (facts "Testing the anagram implementation"
       (fact "Generating all anagrams"
         (generate-anagrams "ftw") => '("ftw" "fwt" "tfw" "twf" "wft" "wtf"))
          (let [words (split-lines (slurp "wordlist.txt"))]
            (find-anagrams "kinship" words) => '("pinkish")
            (find-anagrams "enlist" words) => '("inlets" "listen" "silent")
            (find-anagrams "boaster" words) => '("boaters" "borates")
            (find-anagrams "sinks" words) => '("skins")
            (find-anagrams "knits" words) => '("stink")
            (find-anagrams "rots" words) => '("sort")
            (find-anagrams "thelaw" words) => '("wealth")))

I ran into a StackOverflowException for “boaster” though. Looking at the code, it’s immediately obvious that there the only possible cause for this can be in generate-permutations which generates the result eagerly. So, let’s change that to a lazy variant.

(defn- gen-perms [squenze]
        (when-let [permutation (next-permutation squenze)]
            (cons permutation (gen-perms permutation)))))

(defn generate-permutations [squence]
    (let [start-perm (sort squence)]
        (cons start-perm (gen-perms start-perm))))

I use an external helper here because we need to add the start permutation to the final result up-front and that doesn’t lend itself to a self-recursive function. Anyway, this concludes the first solution using a rather traditional algorithm.

Declarative solution

For the next solution, I intended to use something else. Last year, I had the chance to hear David Nolen talk about core.logic which reminded me a lot of the old days in which I was using Prolog for computational linguistics and logic programming. In particular I was thinking of a permutation implementation in Prolog described in Richard O’Keefe’s Craft of Prolog, which I briefly discuss below:

permutation(Xs, Ys) :-
    permutation(Xs, Ys, Ys).

permutation([X|Xs], Ys1, [_|Bound]) :-
    permutation(Xs, Ys, Bound),
    insert(Ys, X, Ys1).

insert(L, X, [X|L]).
insert([H|T], X, [H|L]) :-

If you would want to generate all permutations for a list [1,2,3], you would call permutation([1,2,3],Q) and your Prolog interpreter of choice (e.g. SWI-Prolog) would generate the first possible result for Q and via backtracking generate all other possible permutations.

?- permutation([1,2,3],Q).
Q = [1, 2, 3] ;
Q = [2, 1, 3] ;
Q = [2, 3, 1] ;
Q = [1, 3, 2] ;
Q = [3, 1, 2] ;
Q = [3, 2, 1].

Let’s briefly discuss the Prolog solution, this will make it easier to discuss some issues when translating this to core.logic later on. Prolog uses facts and rules to prove some query. E.g., permutation([],[],[]). is a fact asserting that the permutation of an empty list is the empty list. Anything involving :- is a rule. Prolog uses unification — hang on, you’ll see in a second what this is. Second, you see all those [X|Xs] constructions. These are basically list (de-)construction operations: they split off the first element or add an element (head) and some rest (tail) to form a new list. The point here is that if you’re calling permutation([1,2,3],Q,Q) Prolog will try to unify [1,2,3] with [X|Xs] which is possible when X=1 and Xs=[2,3]; i.e. Prolog automatically tries argument unification. The _ construct means “ignore”, “don’t care”. If we consider only the insert fact (i.e. the first statement), this fact can be used by Prolog via unification to answer queries about any value of the predicate:

?- insert([2,3],1,Q).
Q = [1, 2, 3] 
?- insert([2,3],Q,[1,2,3]).
Q = 1 
?- insert(Q,1,[1,2,3]).
Q = [2, 3] 

The key to understand how permutation works is considering how insert works: the insert rule will deconstruct the first argument (assuming it’s a list) and insert the second argument to it. This way, X will be inserted in all possible positions of the list:

?- insert([2,3],1,Q).
Q = [1, 2, 3] ;
Q = [2, 1, 3] ;
Q = [2, 3, 1].

Now, if you take a closer look at the permutation/3 rule, you’ll recognize that it first of all contains a recursive call to itself. This will basically decompose the first argument (if given) until it reaches the permutation fact governing the base case, i.e. the empty list. It will then insert the elements according to the behavior discussed above. You can think of all comma , as and including a notion of order, i.e. the insert clause will only be used after having processed the recursive call to permutation on each level, respectively. This basically implies a depth-first search — i.e. for generating the multiple values for Q, Prolog will try to find different possible combinations by retrying parts of the computation. This will in particular trigger the computation of the different results of insert/3.

Now let’s come back to Clojure’s core.logic which provides an implementation of many useful things for logic programming based on miniKanren. However, as an add-on to a functional programming language, we will have to use some special operators to translate the Prolog code. The first thing needed is the declaration of the query variables (e.g. Q) within the call to run*, without it you would never see any results (besides run* causing the inference machinery to, well, run). The next operator is == which is used for unification, which is used just as = would be in Prolog inside some rule. Sometimes you need temporary logic variables which you can introduce with fresh. There is also an explicit operator conde (similar to cond) which can be thought of as providing disjunction (or). You need this to be able to mimick Prolog’s multiple facts/rules with the same predicate, e.g. having a simple fact and a rule for permutation/3. There are also further predicates, e.g. conso which can be used to splice/construct lists. This is all nicely explained in the core.logic Primer. I actually started out trying to convert the Prolog code with not much else, like this:

(defn insert-broken [x l nl]
    [(conso x l nl)]
    [(fresh [h t]
           (conso h t l)
           (insert-broken x t l)
           (conso h l nl))]))

You’ll note that I exchanged the position of single argument and list in order to match it with the usual argument positions of conso (or conj). Otherwise this looks like a pretty straight translation of the Prolog rules above: it’s either we can directly (via conso) (de)construct the list or we recurse. This version is broken in multiple ways, though. First of all, when you test this version, the recursive call to insert is not constrained enough wrt. the value of l, which will trigger an infinite recursion. You need to put the recursive call behind the second call to conso (cf. the discussion of my inquiry on StackOverflow). However, there is another issue lurking which you can see when comparing the results:

(defn insert-still-broken [x l nl]
    [(conso x l nl)]
    [(fresh [h t]
          (conso h t l)
          (conso h l nl)
          (insert-still-broken x t l))]))

FAIL "Checking insert - Simple insert" at (logic_test.clj:11)
Expected: ((1 2) (2 1))
Actual: ((1 2))

FAIL "Checking insert - Simple insert" at (logic_test.clj:12)
Expected: ((1 2 3) (2 1 3) (2 3 1))
Actual: ((1 2 3))

As you can see, this version generates only a single result, inserting the element just in the first position, not in the other positions of l. The reason for this is that we are constraining the solution too much: by using l in the recursive call, we’re constraining the “result” (the value of the third argument) to the initial value of l . This is not what we are doing in the Prolog version, there l is just a temporary value generated in the recursive call. I.e. I fooled myself by basically running into a variable capture problem. So, the correct version of insert looks like this, introducing another fresh variable l1.

(defn insert [x l nl]
    [(conso x l nl)]
    [(fresh [h t l1]
          (conso h t l)
          (conso h l1 nl)
          (insert x t l1))]))

However, the discussion on StackOverflow also pointed me to the matching predicates which are also shown, but not explained at all in the examples section of the core.logic wiki. In particular, core.logic offers a defne macro which basically provides a pattern matching facility which is remarkably close to what Prolog provides wrt. argument matching. Consider the following version of the same predicate using defne:

(defne inserto [L X NL]
    ([L X (X . L)])
    ([(H . T) X (H . L1)]
        (inserto T X L1)))

defne will basically expand into a set of conde expressions, but will also generate fresh variables and matching/unify expressions as appropriate. If you compare this version with the Prolog version, it’s easy to see the parallels: in the second rule, the given arguments to the parameter list L X NL will be tried to unify with [(H . T) X (H . L1)] (note that inserto uses the same parameter order as the Prolog version), thereby decomposing any sequence given as L into head H and tail T — this is basically the same as (conso H T L).

Having covered all those nitty-gritty details of insert, understanding permuto/3 should be straight-forward:

(defne permuto3 [I O L*]
  ([nil nil nil])
  ([() () ()])
  ([(X . Xs) Ys1 (_ . Bound)]
      (fresh [Ys]
                 (permuto3 Xs Ys Bound)
                 (inserto Ys X Ys1))))

We have two (empty) base cases and a recursive clause again. We’re decomposing the input I into (X . Xs) and unify O (typically the query variable) to Ys1. Using a fresh new variable we recurse with the sublist Xs down to produce permutations of the sublist, eventually inserting X into them. For reference, this is what the non-matching version looks like which makes the argument unification and the value decomposition much more obvious:

(defn permutation
    ([xs ys] (permutation xs ys ys))
    ([xl yl res]
       [(== xl '()) (== yl '()) (== res '())]
       [(== xl nil) (== yl nil) (== res nil)]
       [(fresh [_ x xs ys bound]
        (conso x xs xl)
        (permutation xs ys bound)
        (conso _ bound res)
        (insert x ys yl))])))

We can finally wrap this permutation into the same surrounding code we used for the traditional solution to this anagram kata to compute nearly the same results (the order will differ).

Wrapping up, this kata was actually quite hard to solve and took quite a while. I spend too much time trying to find the traditional solution myself before focussing on translating it to Clojure. And then it took me also way more time than I had imagined getting into core.logic, which could use quite a bit more documentation besides the primer on the basics. Anyway, core.logic looks like a very nice addition to the Clojure universe.

Posted by Holger Schauer

Defined tags for this entry: , , ,
Feb 11

Tom Boulet has recently triggered a discussion in the Agile+ community on Google+ about user stories, cf. the following quote from his inquiry:

A couple weeks ago, I heard someone say “The stories weren’t detailed enough for the development team..” I thought, Wha? Stories are meant to be high level and if you use acceptance criteria this comes after the team starts looking at them, doesn’t it? It sounded like the person felt that Stories had to be “fleshed out” somehow before development could get started. […] Also, I’ve recently seen talk of how stories should be small enough to code in a day or two (I’d call that a task) and another reference to “codable stories.” None of this sounds right to me. It sounds like a large number of Agilists are seeing stories as part of the design pre-development. Has the definition of Story shifted on me?

The resulting discussion is highly insightful. I particularly like one reply by Alan Dayley, which I quote in some parts:

It is hard to describe in brief sentences all the forces that pull organizations to keep them thinking in lots of details. A few might be: - A culture and history of Big Design Up Front that makes it hard for feature definition to happen just in time. - A continued need for long range planning requiring estimates for work that will possibly take months before it actually starts. - Skilled technical people who are accustomed to talking about technical details instead of the needs of the user. And more…

The most powerful and subtle force to cause “user stories” to grow is a continued lack of focus for the people defining features and those creating the features.

For example, if I am Product Owner of 5-12 products working with 3 teams who are all working on multiple products, I am not able to think of a feature, collaborate on it’s definition as it is built, see it created and then think about the next feature. I have to document my thinking of every feature because one hour from now I have to think about other features that also might not be built right away.

The key to being truly Agile is to finish stuff. The more inventory of ideas, features, undeployed code, etc. that we have, the less Agile we can be.

This highly resonates with my thinking. But it also got me thinking and reminded me of another discussion I participated in in one of Xing’s agile groups: for many companies, being able to keep stories high-level, have the Product Owner engage with the developement team just in time and clarify all those nitty-gritty details in short-enough time that the team can implement it, sound like the wet dream of agile heaven. I, too, have mostly seen teams in which there was substantial time invested by the Product Owner (at least), often accompanied by the Scrum Master and some development team members, to “groom the backlog” and to prepare the stories for the next sprint. This preparation, which more often than not involves adding lots of details according to some definition of ready, typically also includes story splitting. I’ve seen development teams straight out refuse to work on stories “too big” according to some arbitrarily set size limit. I guess the reasons behind this basically boil down to fighting FUD (fear, uncertainty and doubt), cf. this article listing potential benefits of small stories:

There are lots of benefits from having smaller stories.

From a user and usage perspective, reducing the story size forces me to understand users and usage better. It forces me to separate what really is valuable or useful to the users from what’s merely interesting. From a release management perspective having lots of smaller stories allows us more flexibility with planning, prioritizing, and cutting scope. From a development and detailed design perspective, a smaller story often implies we know more about what to build - there’s less unknown, less risk – less risk of the development work mushrooming into something much larger.

There are two obvious problems with this:

  • The development and release management perspective and their respective goals are not a good match for the business value that is driving a story. As a result of this, when you decompose stories into slizes too thin, the business value proposition, i.e. the “in order to ” part of Mike Cohn’s classical user story template, often becomes very awkward and only understandable in the direct context of some user interaction. Quick, tell me, what is the real business value of “As an order maker, I need to enter my credentials to access my orders” (cf. Godjo Adzic’s rant that writing “As a user” doesn’t make it a user story)?

  • The other problem is that often “understand users and usage better” is actually design work, as also asked by Tom Boulet in the discussion: “Can requirements really be separated from design? As requirements become more detailed don’t they really become design? Does this constrain the design part of development? […] And ultimately, can good design really be done without coding trial and error? Does a customer and PO really know what they want up front?” Here, the article by Mary Poppendieck on the Product Owner problem is helpful: according to her, the most important role of the development team is product design:

    The entire team needs to be part of the design decision process. Team members should have the level of knowledge of the problems and opportunities being addressed necessary for them to contribute their unique perspective to the product design. Only when decisions cannot be made by the development team would they be resolved by a product leader. The main team-facing responsibility of the product leader is to ensure the people doing the detailed design have a clear understanding of the overall product direction.

It would seem a development team with a “definition of ready” that requires lots of details and splitting stories into thin slizes is lacking this “clear understanding” and arguably refusing to take responsibility for designing the product. Again, my gut feeling is that this is a sign of the maturity of an organization, wrt. to agility: e.g., building this clear understanding typically requires a longevity of development teams which often clashes with the “project-based” approach taken by many organizations. And if development teams don’t know the market, the customers they are building the product for, where should this understanding come from?

In the discussion, Heiko Stapf also referred to the agile principle of “individuals and interactions over processes and tools” in the discussion, noting that “Having PO/Tester/Techguys sitting together finding a common understanding and language” is more akin to favor individuals and interactions, whereas “having the PO writing down requirements and passing them (in written form)” is more about process/tool. I find this to be indeed the case, where sometimes the underlying cause is that it’s difficult to arrange for enough room (time) to have the necessary interaction. But it might also be simply lack of collaboration, i.e. when developers just feel it’s not their job to work on (the formulation and details of) user stories.

I’ve also seen teams run into problems with getting the right information in time. IMHO they are either due to lack of knowledge on the Product Owner’s side or lack of empowerment. In the former case, when some detail needs to be clarified, the Product Owner might not have the necessary domain expertise required to answer the question right away and she needs to go back to some domain expert (which might not be immediately available, delays occur etc.). The later case is where the Product Owner has an opinion on some matter (e.g., whether to use this or that design) but feels she needs to go back to someone major stakeholder (potentially multiple). Also, when the team (PO+dev team) haven’t yet found the path to close collaboration, the resolution of some issue might require too much time to allow for implementation. In such scenarios, the idea of using a definition of ready with small stories is actually aimed at ensuring that some work can actually be finished in some time box at all (this is linked to ensuring a story is “feasible”, cf. Roman Pichler’s take on the “definition of ready”.

Of course, there are also legitimate reasons to split stories which are linked to other organizational issues: e.g., when the story is big it can’t possibly implemented by a single team in given small time box and hence needs to be split over multiple teams or multiple time boxes. Or when the requirement is yet more an epic than a “ready” story and when split, business value can still still be obvious for the smaller stories. And then there is also the process where you start out with a vague “feature idea” and need to understand it more before it would be ready for implementation, as I’ve described in my post on agile conception.

In summary, I would say that a large backlog with lots of tiny stories is a clear indicator of a lack of maturity of the team and / or organization. In an ideal world, all our teams would be mature enough that nobody would ever want to have too many detailed written specifications upfront, but alas in our reality we have to work hard to get there. Unfortunately, there is no easy answer to this than to try and persist: gain the knowledge and the confidence, build the tools and the trust you need.

ObTitle: Modest Mouse, from “The Moon & Antarctica”

Posted by Holger Schauer

Defined tags for this entry: ,
Jan 29

I recently started working on a web application in Clojure. This being my first contact with quite a lot of tools / libraries for web programming with Clojure, I ran into quite some stuff that was not obvious. First some background: It’s basically a very small application which will have only a very small number of users, so we don’t need a highly interactive, reactive UI. So, no need for a single-page application, a classical web application where the server serves up a small set of pages and has all the logic is just fine.

Starting out with ring, lein-ring and compojure was pretty straight-forward, going beyond the easy stuff not so much. One such problematic issue I ran into was feeding some configuration to Jetty (JNDI resources), as lein-ring does not seem to offer any way of reading an existing jetty-web.xml file. This implies that you’re basically down to using and configuring Jetty as described in the embedding Jetty examples. This snippet is what I’m using to declare my database connection (you’ll need to have depends for clj-dbcp, org.mortbay.jetty/jetty, org.mortbay.jetty/jetty-plus, org.mortbay.jetty/jetty-naming and javax.servlet in your project.clj):

  (ns myapp.jetty-config
     (:require [clj-dbcp.core :refer [make-datasource]])
     (:import ( Resource)
                 (java.util Hashtable)
                 (javax.naming InitialContext Context)))

  (defn make-my-datasource []
     {:adapter :mysql :host 'localhost :database 'mydb
      :username "myuser" :password "mypw"}))

  (defn setup-jetty-context []
     (let [ht (Hashtable.)]
       (.put ht Context/INITIAL\_CONTEXT\_FACTORY "org.mortbay.naming.InitialContextFactory")
       (.put ht Context/PROVIDER\_URL "org.mortbay.naming")
       (InitialContext. ht)
       (Resource. "java:comp/env/jdbc/etrans" (make-my-datasource))))

This being solved, during the course of the project, I aimed for running the application on Tomcat. Figuring out how to develop with Jetty and make it run on Tomcat also took quite a while. First of all, you don’t want to have the above Jetty configuration and dependencies dragged in when running lein ring war but instead rely on the “normal” persistence.xml mechanism to define persistence units. The key to the first issue is making use of leiningen’s profiles in combination with the :init keyword of the configuration options for lein ring. I.e., I removed all Jetty configuration and dependencies from the top-level ring configuration in my project.clj and added a :dev profile to it which has the required additional dependencies mentioned above plus a :ring section with an :init key pointing to a setup function which calls the required Jetty configuration functions:

    (defproject myproject "0.1.0-SNAPSHOT"
  :description "A small web application"
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [org.clojure/tools.logging "0.2.6"]
                             ;; .. other dependencies, but nothing Jetty related
  :plugins [[lein-ring "0.8.8"]]
  :ring {:handler myapp.handler/app
  :web-xml "war-resources/web.xml"}
  {:dev {:dependencies [[javax.servlet/servlet-api "2.5"]
                      [org.mortbay.jetty/jetty "6.1.23"]
                      ;; etc.
  :ring {:handler myapp.handler/app
         :nrepl {:start? true}
     :init myapp.jetty-config/setup-jetty}}})

Making use of the persistence.xml file turned out to be pretty easy: You can place “normal” Tomcat configuration files in the directory specified by the :war-resources key. E.g. you can your persistence.xml in war-resources/META-INF/. You can also specify the location of your web.xml explicitly. I did an initial lein ring war run and have it generate an initial web.xml and modified it later on as needed. Another issue was that I wanted to run the application under Tomcat in parallel with other applications, i.e. with it’s own named web application context, which is simple enough. But running under Tomcat as some non-exclusive webapp (i.e. installing not as ROOT.war) forces you to use relative links. But how do you find out the right prefix for servlet global resources? Useful info gets added to your request in the :context, :path-info, :servlet-context and :servlet-context-path, according to ring.util.servlet. I used this wrapper for figure out what I needed:

  (defn wrap-show-request-context [handler]
     (fn [request]
        (when-let [context (:context request)]
           (logging/info (str "Request with context " context)))
        (when-let [pathinfo (:path-info request)]
           (logging/info (str "Request with path-info " pathinfo)))
        (when-let [servlet-context (:servlet-context request)]
           (logging/info (str "Request with servlet-context " servlet-context)))
        (when-let [servlet-context-path (:servlet-context-path request)]
           (logging/info (str "Request with servlet-context-path " servlet-pathinfo)))
        (-> request

Finally, I had a dependency on a local library which I resolve via the localrepo extension to leiningen. However, lein-ring knows nothing about local-repo extension, hence lein ring uberwar will not include the dependencies that only exist in your ~/.m2/repository/. The workaround is to make a WEB-INF/lib directory and place all required Jars in it, although figuring out all Jars that you need is a very tedious process.

Given the small scale of the web application, I decided very early on that hiccup should fit my small requirements, although I wanted to use bootstrap, too. The one problem I ran into is a typical functional programming issue: information becomes available at some point which is required exclusively in some other function way down the chain of callers. The concrete problem I ran into was actually using the right servlet-context in the main-page layout template to request the right resources. Handing down the parameter (or the request itself) to the view seems wrong, because then I basically make use of internal knowledge about the inner workings of my pages. I finally extended the wrapper above to bind a dynamic variable:

    (def ^:dynamic \*app-context\* nil)

    (defn wrap-context [handler]
       (fn [request]
         ;; ... logging code elided ...
        (binding [\*app-context\* (str (:context request) "/")]
           (-> request

I then refer/use this variable during the page setup. I decided against using an atom or some such, because I specifically don’t want to have other requests/threads to see/modify the current value from some other thread — this is actually intended to be a request-only accessible (global) variable. Not sure whether this is the best way to go, though, I guess there are more elegant solutions.

I had seen a talk on liberator at EuroClojure 2013, which promises a pretty declarative way of sorting out how to react to web requests, which looked nice enough to give it a try. Although things can get somewhat complex rather quickly, I still like it. One such issue is that the interaction between liberator’s context argument and compojure parameters isn’t always obvious. Basically all handlers need to take a context argument (possibly ignored), as is detailed somewhat more in the documentation of the execution model. Resources (i.e. their definitions) can have arguments as well, however, which are not related to the context — they need to be provided then when your handle calls the resource. I.e. check the following resource definition:

 (defresource someresource [someid]
       :available-media-types ["text/html" "application/json"]
       :method-allowed? (request-method-in :get)
       :exists? (fn [context]
                     {:something (find-something someid)})
           :handle-ok (fn [context]
                         (myview (get context :something))))

     (defroutes myapp
        (GET "/something/:someid" [someid] (someresource someid)))

The route definition uses compojures parameter extraction to hand someid over to the resource. The context argument to the decision function :exists? and to the handler :handle-ok doesn’t contain this, but you can access it via the parameter someid of the resource. Another thing which had me scratching my head was how to do a redirect on a GET request which I wanted as a result of logging out of the application. You have to combine liberator.representation/ring-response with ring.util.response/redirect. moved-permanently? and handle-moved-permanently or handle-see-other seem not to be intended for this, at least I was not able to use them for this purpose.

(defresource logout
  :available-media-types ["text/html"]
  :method-allowed? (request-method-in :get)
  :handle-ok (fn [context]
             (str (get-in context [:request :context]) "/")))))

I also have resources which handle GET and POST/PUT requests (the same resource). You can combine this and in principle it’s as simple as it sounds, but figuring out which handler gets called when is not: for example, is handle-ok called after put! created something? Liberator does come with quite some documentation and it’s tracing facility is really helpful, but it really takes some time getting into it. For instance, I don’t think it’s documented that the list of available media-types determines the default media-type generated. I.e. if your request doesn’t specify that you prefer to accept application/json and your available media-types has text/html as the first element, you’ll get text/html. I ran into this with my unit tests: you may need to use ring.mock.response/header to set the “accept” header (and don’t get fooled by ring.mock.request/content-type). I think, in the middle to long term my biggest concern is the question how stable this API is. It’s currently at 0.10, which leaves quite a lot of numbers before implying any notion of stability.

Then, I also used friend for the authentication, which is an interesting library whose current version of 0.2 again makes me wonder about the stability of the current API. I’m currently staying very close to the bare minimum of features and so far haven’t really run into any bigger technical issues. There is one thing that is quite apparent though: while it provides role-based authorization, friend is currently missing any idea of access rights, i.e. it’s lacking a connection of roles to rights. Hence you guard functionality with calls to authorize, not with a declaration of the required rights. Friend is also seriously lacking documentation, e.g. about *identity* or current-authentication which you might want to use to determine data about the currently logged-in user. I also had issues with testing my code after introducing friend, because it’s not at all obvious how to provide the needed authorization from test code. I ended up using midje’s mocking machinery (i.e. provided) to mock authorized? which is underlying friend’s authorize macro. Note that provided assumes that the one checkable directly above is actually calling the mocked code, you can’t have provided mocking for two or more checkables or in some let construct (background pre-requisites can help with this).

Finally, I used java-jdbc for accessing the database. Again, I’m wondering how stable is this API? There were large changes with 0.3, which broke compatibility with at least one of the SQL DSLs mentioned on the project page. This includes splitting out the old SQL DSL into it’s own module java-jdbc/dsl. And alas, documentation and examples are also somewhat lacking. At least the unit tests were useful to figure things out.

So, what have I seen so far? I guess a lot of interesting technology, notable lack of satisfying documentation and APIs in widely varying degrees of stability.

ObTitle: Franz Ferdinand, from “Right thoughts, right words, right action”

Posted by Holger Schauer

Defined tags for this entry: ,
Oct 28

Two things are clear for every project and every product: there is always a start and an end. I’ll ignore the end for now and would like to take a closer look at the start. One has to wonder: where do all these projects or new products come from? Obviously, somebody finds out at one point that there is a need for something and this something is the nucleus of the perspective development. If this idea requires significant effort, it’s typically necessary to convince the people responsible for the money that is worthwhile to pursue the idea. Let’s take a closer look at what this something actually is and how it changes and needs to change before the developers can finally go make it real.

To use a more descriptive tag, I will use the term feature idea for now, although we could well talk about ideas for a new product or product line. Feature ideas in a very early stage are rarely more than a single, simple statement or even just a few descriptive keywords like “I would like a button here where I can save my settings” or “remember me functionality”. Of course, whoever came up with the idea has typically second thoughts, more or less fuzzy but these are not communicated, neither discussed nor finalized.

A feature idea
Quite often these second thoughts are related to some existing systems or solutions, even if it’s just to describe what the respective solutions do not provide (note that these solutions might not be your own). Typically feature ideas are about some specific expected behavior of some system and basically boil down to what needs to be build. This is all good: feature ideas don’t need further elaboration or lengthy descriptions in an early state.

How do you proceed from here to the nitty-gritty details that finally need to be done? In traditional settings, to get from the idea for a larger feature to development you need to jump over the budget hurdle, which usually requires detailed analysis results, including business cases, return on investment calculations, and implementation plans. The latter consist of solution outline, milestones, resource plans and cost requirements. It is important to note that to come up with all this substantial efforts need to be invested already. The bigger the need for precise information, the bigger the required investment. And then at some point you hopefully get the okay to move ahead. But in case the idea is turned down, all this investment in detailed plans was just wasted. In agile setups, we would like to avoid that waste. Still, of course, the need for getting a decision on where the business wants to invest its money does not go away. But in order to do so, does the senior management really need all those details? They don’t. There is a need for accuracy, not for precision (cf. for a description of the difference). Asking for all the details is a sure sign for lack of trust, but I have never seen any senior management trying to verify that all those precise details were actually accurate, which probably explains why the demand for documents and spreadsheets is all there is. This is then typically combined with taking all plans as promises that need to be kept and not as means to enable successful investments (command-and-control vs. plan-to-replan).

What is surely needed boils down to just one thing: understanding the value of an investment. It’s often much easier to determine what you want or can invest than to precisely estimate upfront what you need to invest (cf. this discussion of the no estimates idea which shows the wonderful relevant Dilbert strip on the same issue). On top-level, investment is essentially tied to strategy: e.g. in the Scaled Agile Framework (SAFe, cf. Leffingwell et al.), investment themes determine the relation between budget (resources) and the program portfolio of business epics. This essentially requires an understanding how each possible business epic or feature idea contributes to the overall company strategy (this is basically applying the Lean principle of “optimize the whole” to feature ideas).

We essentially danced around the big elephant in the room now: feature ideas often describe expected behavior, but we need to know about the value of an idea. But if all we have is a simple statement (and not the full-fledged analysis we would like to avoid) of some idea, how can we know its value? The essential insight is that it might not be necessary to understand the precise value yet. The earlier in the process we are, the less is it necessary to perform lengthy and detailed analysis: determining the value of some idea is only a means to further a decision on where to invest further. However, at least it should be possible to judge the value of one feature idea in relation or better said in comparison to other ideas, as Johanna Rothman discusses in her article on why cost is the wrong question for evaluation projects: at least in an early stage, it might be enough to use a very simple kind of rough value sizing, e.g. putting an “M” on one feature idea in comparison to a “L” for another. It is important to recognize that the value of ideas will change over time: in particular so when looked at in relation to other things your business needs to think about, as new ideas come up , some ideas get implemented and other discarded. Accounting for these changes over time is why planning to replan is important in agile methodologies. An important concept in agile methodologies that helps making this clear is the cost of delay: the idea is that you try to understand or predict how the value contribution of some feature will change over time, in particular what impact not pursuing the idea will have. This is most useful for prioritizing by cost of delay of what to invest in — obviously, as CoD is all about loss of value over time, as time moves on you want to reconsider your priorities (where you want to drill down further with an analysis etc.). (An alternative coming from the lean camp is to use an impact-effort matrix, but such a matrix needs an estimate about the effort required and also treats value as an absolute over time, which it usually isn’t. Cf. this article about cost of delay vs. WSJF in SAFe for a discussion why cost of delay and value considerations are not the same).

Inevitably, when you try to think about the value of some feature idea, you will think about what this simple statements like “remember me functionality” should actually mean. Very often the way in which we name or describe a potential feature is often a confusion (and at best a conflation) of problem and solution space and that can make it quite difficult to judge it’s value. As such, naive feature descriptions can quickly become a big impediment. For one thing, they are — due to being explicitly simple — usually ambiguous and / or often hard to understand. Does “remember functionality” describe that I stay logged in on the website or does it mean that the website remembers the visitor and shows specific ads related to his interests or both? This is often the reason why some people attribute completely different value to an idea. Worse, they typically don’t give any clue about the need behind the wanted behavior nor about the value this behavior brings. Just as bad is that simple statements of the feature idea in terms of expected behaviour (“I want a button that …”) limits creativity by already pointing in the direction of a possible solution: maybe there are other options which are much better suited, both from a usability point of view but also with regard to the required functionality. So the conflation of problem and solution is actually problematic for understanding the problem and for designing the solution.

So what is the process to move from a feature idea to the solution design? I think it’s useful to see feature idea statements as promises for analysis. The analysis of a feature idea should essentially focus on understanding the user and / or business needs the idea is related to. The stereotypical user story template “As a <type of persona>, I want to <perform X> in order <to achive Y>” is a useful tool in this respect as it is explicitly constructed to include and name the need (cf. Gojko Adzic rant Writing “As a user” does not make it a user story how this can go wrong when taken on too easy). Now you might not have such a story or might not have it yet and maybe it’s not clear what this simple feature idea is about. That’s the time to engage with whoever seems able to contribute, but first and foremost it should probably be the primary beneficiaries of the feature idea. Traditional requirements engineering practice would suggest to involve all relevant stakeholders (perspective users, management, marketing, legal department, …) and to employ whatever RE tool seems reasonable to use (e.g. brainstorming, workshops, open interviews, etc.). In agile setups, determining who actually is a stakeholder which needs to be heard is the responsibility of the product owner. I find that it is usually also a good idea to involve the team, because they will typically have a good understanding of the solution space and should also know about the envisioned user needs. Don’t take it too far, though: If you are still trying to analyze in order to understand the value of some idea, i.e. you are still at an early stage (inception phase as the disciplined agile delivery folks would have it) of your planned feature, you should probably try to keep it light-weight.

You will likely get a lot of input on your feature idea and many in the form of comparable simple statements to what you originally started out with. Understanding the needs requires going beyond the level of simple statements or simple suggestions of simple solutions (Laura Klein has a nice description of how difficult that can be: Sometimes users will tell you that they want a toaster in their car, when what they really mean is that they don’t have time to make breakfast in the morning.). And although the famous 5 Whys might come across somewhat strange or invasive in the course of an interview, it might take that many attempts to actually understand the problem that needs to be solved. If you’re lucky, a potential user will offer some simple insight as to the problem with simple statements such as “I would like a button here so I can save my settings, so I don’t have to adjust them again and again”. Further analysis might reveal that the real problem is that there is a usability problem and the user forgets to adjust the settings too often or that the user is a regular user which would like to spend less time with repetitive tasks. The main point here is that this process is about understanding the needs, which typically requires challenging any superficial and obvious ideas: this is both about maximizing the amount of work not done (simplicity) and about building solutions that satisfy the user.

Change in understanding over time
Change in understanding over time

This is now where the team basically takes over: the team is responsible for working out the solution, not only constructing / implementing it, but essentially designing the solution. This is not to say that the product owner or other stakeholders cannot or should not contribute ideas on how to solve it, quite to the contrary. It is quite essential to have stakeholders around to enable the knowledge transfer of needs from the users or stakeholders to the team. So while the separate paragraphs of this text might suggest a handover from product owner to development team, from needs to solutions, ideally this should be a smooth transition with lots of participation of the entire team and face-to-face communication. Tools like Roman Pichler’s product canvas are useful, but there is a reason why the agile manifesto recommends people interactions over tool usage. Again, for working out the solutions many ideas and tools can and should be exploited, from brainstorming over UI scribbles to models and prototypes. Typically when designing the solution, many alternatives come to mind and different people from different backgrounds will have different opinions and favorites. This is actually good, because you can embrace this diversity to think about the best possible solution. However, the best solution might not actually be the solution you want to implement, often the simplest thing that could possibly work is a reasonable way to move forward: this minimizes the effort required, thereby enabling faster time to market and thereby enabling earlier feedback on whether this solution actually works (this is the tension between trying to build the best thing vs. learning fast from feedback about what actually proves to work). People with lots of experience in designing solutions will typically also come up with all kind of challenging questions which might require drilling further down into the needs (e.g. questioning if the feature is targeted at the right people or what the legal department has to say to some feature). This will trigger more learning about the feature under consideration. But this should not be taken as an excuse to run endless analysis or design meetings: it is often a better approach to end a discussion with a decision and move ahead and find out later whether this decision was actually flawed and needs correction. It is often a good idea to consider the risks associated with some idea: the more risk, the closer you want to consider your options. In general, when in doubt about which option to chose, think about what data would be helpful to decide: performing light-weight experiments might be useful for minimizing these risks (e.g. technical spikes or usability tests with paper-pen sketches).

A brief terminology recap: an epic is just some requirement (feature idea) that is so “large” and / or uncertain that it needs to be broken down, i.e. nobody could take an epic and just implement it, the developers just have to come back and ask lots of questions. If you wanted to you could say after breaking down an epic, all you end up with are features but in agile circles these are usually split up into even smaller stories (cf. the image below). Of course, this is a completely arbitrary and often also highly context-dependent distinction. It’s also fairly common to use the user story template for describing epics. I used the term feature idea instead of epics to bring in some notion of time: epics start out as (feature or other) ideas. If you think about the timeline again, it’s clear that the closer you are to implementing some feature, the more details you need to know, i.e. the higher your level of understanding of the feature needs to be. And this is where the idea comes into play that you need to INVEST in your user stories, so that you can derive SMART tasks. This is obviously needed for the very next things you want to implement, and as any risk-aware tech guy or project lead will know, probably also a good idea for the next most things to do afterwards. Business cases are mainly needed when you are going to go for big investments and when it’s critical to understand the return on investment. But this brings us back to the discussion of value vs. cost and the discussion of precision vs. accuracy. Use cases, finally, are a useful tool when it comes to understanding some set of requirements in the context of a larger system. They can also be useful for driving discussions with stakeholders and are often a useful starting point if you like to model your problem / design before implementation. I see use cases as a useful tool on the feature level, to cover a broader range of more detailed stories. As such it does not matter much when to describe a use case: if it helps you drive your understanding of some problem, use it before drilling down to more fine-grained descriptions. If you already have a lot of details and you fear to lose the overview, it can also make sense to reconstruct some use case from more detailed stories.

Uncertainty vs. time
Uncertainty vs. time

Finally, let’s step back briefly and re-consider who is part of this solution team. It’s quite common that for feature ideas of different sizes, different people will make up the solution team. I.e. when your feature idea is concerned with some detail of some larger system, the team might just be the developer team (in the Scrum sense). For bigger features, projects or programs, it’s quite common that lead developers, design and usability specialists, solution and / or enterprise architects and maybe operational staff will play an important part in working out the solution. Whether these people are working in specialist teams (as e.g. SAFe assumes) or are part of the development teams is not of much importance, but having members of those teams working on the solution design is quite essential: if this is not the case, in the best case, the final developers will only have to rediscover and understand all the good reasons why you came up with the solution in the first place, in order to get it right. Worst, you might end up with frustrated developers which don’t take responsibility and don’t care about the build solution, because they feel they have no saying in its design.

Posted by Holger Schauer

Defined tags for this entry: ,
Oct 15

Kata5 is an interesting one to do in Clojure as Bloom filters, which are a probalistic data-structure for determining set membership, are all about spending as few bits as possible (or required) to store data. When bit manipulations are required, not many programmers would jump to Lisp or Java and indeed most descriptions are about implementations in C or one of it’s derivates. This probably does not come as a surprise, but as we will see is not entirely justified (wrt. to Lisp or Java).

But before discussing this in detail, let’s dive in with this kata. The description of the kata already tells us pretty exactly what we’re supposed to build: a bunch of hash functions and an array of bits which are then set or checked. We’ll start with the hash functions first. Hash functions are dime-a-dozen, Java provides one, Clojure, too. Still it is interesting to go beyond the readily provided functions and to implement some hashing functions. The goal here is not to build perfect hash functions, but to get a feel how an implementation of one looks like in Clojure (as this blog post series is about Clojure, not about Computer Science — the code is also available in my github repository for Clojure codekatas).

As the task is here to build a Bloom filter for strings, all hash functions basically boil down to iterating over a sequence of characters, converting each character into a numerical value (i.e. applying int) and then using this in an accumulating computation of the total hash value. We start out with more or less the simplest approach possible: we simply sum up the integer values of the characters (also known as the Kernighan & Ritchie “lose-lose” hash algorithms).

  (defn sum-chars 
    "Sum up the chars of a given string"
    (reduce + (map int charseq)))

Nothing interesting to see but a straight-forward map/reduce, so let’s move on and take a stab at what Dan Bernstein (djb) suggested. In its most innocent version this iooks pretty similar, but uses bit-shifting and has some magic numbers thrown in for good measure.

(defn djb-string-hash
     "Use djb's method for hashing a string"
     (reduce (fn [curhash charval]
                 (+ (bit-shift-left curhash 5) curhash) 
         (cons 5381 (map int charseq))))

This is already were it got interesting, because if you run this with the simple string “foobar”, you will run directly into an overflow. The clojure documentation on unchecked-add could have told me so directly, of course. I had to use a quite a bit of websearch-fu, wildly testing around and knocking my head on the table to come up with this version:

    (defn djb-string-hash
      "Use djb's method for hashing a string"
  (reduce (fn [curhash charval]
               ^long (unchecked-add 
                             (bit-shift-left curhash 5) 
    (cons 5381 (map int charseq))))

This looks quite good, but is still not bulletproof, as testing it with a longer string (> 11 characters) will show. The real issue is actually due to Java: unchecked-add uses internally a data structure that can (and will) result in a Java IntegerOverflowException. Java does not have unsigned numeric types and also defaults to throwing exceptions on overflow and Clojure, leveraging the JVM, is directly affected by that issue.

      kata5-bloom-filters.core> (djb-string-hash "foobar")
      kata5-bloom-filters.core> (djb-string-hash "foobarfoobar")
      ArithmeticException integer overflow  
      clojure.lang.Numbers.throwIntOverflow (

Compare for instance the behavior of Common Lisp: the straight-forward translation of the exact same naive implementation will not overflow (unless aggressively optimizing against safety or promising to the compiler that only certain results will occur) due to automated boxing.

(defun djb-string-hash (charseq)
  "Use djb's method for hashing a string"
  (reduce #'(lambda (curhash charval)
          (+ (ash curhash 5) curhash charval))
       (cons 5381 (map 'list #'char-code charseq))))

Clojure does provide a similar behavior with e.g. the BigInteger type, however this is of no use as Clojure’s bit-operations don’t know how to handle BigInteger data — they are only defined for the primitive data types that Java provides. Worse, converting back from a BigInteger to a primitive data type (e.g. long) will not — as one might naively expect — simply truncate the data but yield -2:

kata5-bloom-filters.core> (bit-shift-left (bigint 20) 2)
IllegalArgumentException bit operation not supported for: 
       class clojure.lang.BigInt  
       clojure.lang.Numbers.bitOpsCast (
kata5-bloom-filters.core> (* 2 (bigint Integer/MAX_VALUE))
kata5-bloom-filters.core> (int (* 2 (bigint Integer/MAX_VALUE)))
IllegalArgumentException Value out of range for int: 4294967294
       clojure.lang.RT.intCast (
kata5-bloom-filters.core> (unchecked-int (* 2 (bigint Integer/MAX_VALUE)))

Similar issues arose with all of the other hash functions (i.e. sdbm and fnv), cf. the code on github — nothing interesting here, so let’s move on to the bloom filter itself.

Multiple options come to mind when thinking about the data structure to use. Let’s start out with the simplest possible option: just use a simple number. This defaults to Java long, i.e. a 64-bit number. This implies of course that we already have a rather arbitrary upper limit on the size of the bloom filter, which influences the number of possible entries and the number of false positives, cf. this overview article about the garden variety of bloom filters.

(defn bloom-add [bloom charseq & {:keys [hashfns] 
                                      :or {hashfns *hash-functions*}}]
  (reduce #(bit-set %1 %2) 
          (conj (map #(% charseq) hashfns)

(defn bloom-contains? [bloom charseq & {:keys [hashfns] 
                                    :or {hashfns *hash-functions*}}]
  (every? #(bit-test bloom %) 
      (map #(% charseq) hashfns)))

(defn build-bloom [wordfile & {:keys [hashfns] 
                       :or {hashfns *hash-functions*}}]
  (reduce #(bloom-add %1 %2 :hashfns hashfns)
        (cons 0 (string/split-lines (slurp wordfile)))))

The code here is pretty straight-forward, maybe with the possible exception that we’re mapping over a list of functions in bloom-add and bloom-contains?. We could extract this part to a simple function which makes the code a little more readable.

     (defn hash-string [charseq & {:keys [hashfns] 
                                   :or {hashfns *hash-functions*}}]
        (map #(% charseq) hashfns))

This very naive implementation will run into problems right away: The hash functions will yield hash values that are itself 64 bit in size whereas the biggest bit that can be set is 63. A straight-forward fix for that is to consider the size of the ‘bloom filter’ (i.e. 64 bit) and to truncate the hash values accordingly via the modulo operation. I.e., instead of calling (bit-set bloom value) we do (bit-set bloom (mod value 64)).

Now, if you think about it, using a simple number is probably not the optimal data structure: for one, we just limited us to bit arrays that are 64 bits in size (which for instance implies that with the /usr/share/dict/words file you’ll end up with Integer/MAX_VALUE, i.e. all bits set to 1) but due to the immutable nature of the mathematical operations, we actually require a lot more space than just one long, thereby very much defeating an important characteristic of Bloom filters.

So let’s use a completely different idea and use one of Java’s mutable data structures: BitSets. This leads to the following naive, non-thread-safe implementation:

(defn bloom-add [bloom charseq & {:keys [hashfns] 
                                      :or {hashfns *hash-functions*}}]
        (let [size (.size bloom)]
             (doseq [hashval (hash-string charseq :hashfns hashfns)]
               (.set bloom (Math/abs (mod hashval size)) true))

(defn bloom-contains? [bloom charseq & {:keys [hashfns] 
                                            :or {hashfns *hash-functions*}}]
    (let [size (.size bloom)
          hashvals (hash-string charseq :hashfns hashfns)]
           (every? #(= (.get bloom (Math/abs (mod % size))) true) hashvals)))

(defn build-bloom [wordfile & {:keys [bloom-filter size hashfns]
                                   :or {size 1024
                                        hashfns *hash-functions*}}]
    (let [bloom (or bloom-filter (BitSet. size))]
           (reduce #(bloom-add %1 %2 :hashfns hashfns)
              (cons bloom (string/split-lines (slurp wordfile))))

We basically just exchanged the bit-set/bit-test functions with the respective BitSet methods and use a dynamic size. This hints at a possible generalization: we could consider multiple bloom filter implementations (types, if you want to) that need to support some sort of bit-setting and getting operation plus size. This would be the internal protocol, while bloom-add/bloom-contains? (and maybe build-bloom) form the external API.

Now, of course, we would like to fix the problem that this code is not thread-safe. As is made pretty clear in Fogus etal. book “Joy of Clojure”, Clojure’s reference types are of no use here:

“Wrapping a mutable object in a Clojure reference type provides absolutely no guarantees for safe concurrent modification. Doing this will at best explode immediately or, worse, provide inaccurate results.”

The advice Fogus etal. offer is to use the locking macro. If we combine the above idea of using an internal protocol, we can at least apply it where it is necessary, i.e. around the calls to .get/.set.

(defprotocol BloomFilterImpl
    (bloom-size [filter])
    (bloom-bit-get [filter position])
    (bloom-bit-set [filter position value]))

(extend-type BitSet
    (bloom-size [filter]
        (.size filter))
    (bloom-bit-get [filter position]
        (locking filter
            (.get filter position)))
    (bloom-bit-set [filter position value]
        (if (< position (bloom-size filter))
            (locking filter
                (.set filter position value))
            (throw (IllegalArgumentException. 
                                    "position outside of bloom filter size")))))

If you wonder why the protocol does not have the add or contains? functions, this is because these operations would be part of some dictionary protocol or some such (although it is somewhat debatable if dictionaries should guarantee the absence of false-positives).

Let’s dig some more into the concurrency issue: it’s surprisingly hard to come up with a scenario where the mutability of the BitSet could be problematic. For one, we are always only adding entries by manipulating a single bit and do that in an atomic fashion that does not rely on the previous value of the BitSet in any way. For another, we don’t have any delete operation, so we can’t possibly run into the situation where some bit / some dictionary entry goes missing — assuming, of course, that all modifications to the bloom filter happen through the functions we supplied only and not by some other means directly on the BitSet outside our control. The only scenario that comes to my mind would be where one wants to keep the state of the bloom filter fixed in one thread, i.e. for some time we want to be able to deny having seen some value / word (which another thread just tried to sneak in while we were not looking). I can’t imagine a real world usage for this scenario, but that probably says more about my creativity than about anything else.

Let’s briefly discuss the options to account for this scenario: the locking scenario above is not enough as the locking occurs as part of the getting/setting operations — there is no way in which one thread could prohibit modifications to the Bloom filter during a specified amount of time with this. Of course, as locks nest you could add locking outside the calls to add new elements to the bloom filter. The other option would be to use one of Clojure’s reference types. But as discussed above, these are not useful for protecting mutable data structures, so we would need to go back to using one of Clojures persistent data structures. So, let’s briefly step aside and compare the speed of Java arrays generated via Clojure and using Clojure vectors, both on booleans:

(defn make-random-boolean-array [size]
   (take size (repeatedly #(rand-nth [true false])))))

(defn make-random-boolean-vector [size]
  (into [] (take size (repeatedly #(rand-nth [true false])))))

(defn print-flipped-boolean-array [ba]
  (let [size (count ba)]
    (loop [indx 0
           result ""]
      (if (= indx size)
          (aset ba indx (not (aget ba indx)))
          (recur (inc indx)
                   (if (aget ba indx) 1 0)])))))))

(defn print-flipped-boolean-vector [bv]
  (let [size (count bv)]
    (loop [indx 0
           vec bv
           result ""]
      (if (= indx size)
        (let [newvec (assoc vec indx (not (get vec indx)))]
          (recur (inc indx)
                   (if (get newvec indx) 1 0)])))))))

kata5-bloom-filters.core> (time (do 
                                        (make-random-boolean-array 10000)) 
"Elapsed time: 587.247368 msecs"
kata5-bloom-filters.core> (time (do 
                                        (make-random-boolean-array 10000)) 
"Elapsed time: 592.598888 msecs"
kata5-bloom-filters.core> (time (do 
                                        (make-random-boolean-vector 10000)) 
"Elapsed time: 76.657272 msecs"
kata5-bloom-filters.core> (time (do 
                                        (make-random-boolean-vector 10000)) 
"Elapsed time: 69.666769 msecs"
kata5-bloom-filters.core> (time (do 
                                        (make-random-boolean-vector 10000)) 
"Elapsed time: 71.897087 msecs"

Now if I run this a reasonable number of times, it appears that for simple element access the boolean vector is outperforming the boolean array, even if I’m basically doing building up a new partial copy of the vector all the time / for all elements. That was a welcome surprise for me. So, using a vector of booleans would be a viable next possibility. The road ahead, however, is hindered by the fact that Clojure does not derive a complex type for vectors, not even if you especially declare the vector to be of a certain type. You’ll always end up with a vector or Vec:

kata5-bloom-filters.core> (into (vector-of :boolean) [true false])
[true false]
kata5-bloom-filters.core> (type *1)

This is problematic as we don’t want to just extend-type the general type. The way out of it is the use of reify which is just like extend-type for objects (that’s actually quite misleading, if you look at the documentation of reify, go check yourself — I would also recommend reading up about this in “Joy of Clojure”).

(defn make-bloom-vector [size]
  (let [bloom-vect
        (ref (into (vector-of :boolean)
                   (take size (repeatedly #(identity false)))))]
    (reify BloomFilterImpl
      (bloom-size [_]
      (bloom-bit-get [_ position]
        (nth @bloom-vect position))
      (bloom-bit-set [_ position value]
        (alter bloom-vect assoc position value)))))

As you can see, I’m using a closure to store the vector of booleans in a ref. We then deref the vector on get and alter it on set. This is especially set up such that any transaction handling (i.e. dosync) occurs outside of the implementation details, to account for the scenario discussed above. Of course, whether refs are really best suited for handling any concurrency situation is likely to depend ultimately on the application specific context or concurrency requirements.

I will leave it at that. This time, we have seen quite some more features that are pretty much unique to Clojure: explicit protocols, reification of instances and references, i.e. using Clojure’s STM. There was also quite a bit of discussion of some consequences of Clojure’s decision to leverage the JVM, some bad (overflows), some nice (plugging in a readily available data structure). Overall, this kata was quite an interesting exercise.

Posted by Holger Schauer

Defined tags for this entry: ,
Jul 16

Kata4 is concerned with data munging, basically reading some file, pulling out some data and comparing the data in order to determine some result. Actually, it is even simpler than that as in both cases we are asked to determine the minimum of the data set. Given a weather task and a soccer task, we are asked to fuse the resulting solutions and extract the commonalitites and minimize both solutions, basically allowing for code-reuse as much as possible. Unfortunately, I had not read the opening sentences of the kata closely, instead I read over the entire description. Hence I missed the call to solve each part separately and not to read ahead and directly started with this idea of code re-use in mind. However, as we will see, I didn’t really notice one aspect of code reuse until I solved the second task (soccer).

The first thing I did was take a closer look at the data file. Probably as much as everybody else who had some exposure to Perl, I was initially tempted to approach the data extraction part with regular expressions. But as JWZ said, “some people when faced with a problem say “I know, I’ll use regular expressions. Now they have two problems”, and indeed this is the case here. This is not to say that the particular problem, fetching the date, minimum and maximum temperature from the provided data file would not be solvable with regular expressions, but more that the data file is really more of a fixed width nature. Unfortunately, there are some irregular elements, e.g. the missing values in the WxType row or the post-fix ‘*’ added to exactly one value in both MnT and MxT rows, likely to mark the absolute minimum and maximum temperature of the month, respectively, which causes the data in the corresponding “cell” to hang-over into the white-space pre-fix area of the next row. The last column is also of interest, as it starts out with “mo” in contrast to the numeric values for the days of the month. It also has rational values for some rows (e.g. for the temperature rows) where all other values have integer values. This looks like a computed result line, showing the averages of the respective values for the entire month and should hence not be part of the computation. And finally, of course, the data file has some completely irrelevant lines, which we need to skip over.

Having come to some idea of how the data file was structured, I came to the conclusion to parse each line using exact positions, e.g. to determine the MnT value by extracting the substring of each line from position 9 to 14. So, the top-down approach is like this: we will open a file and for each line in it and we will try to parse it according to some pattern specification where the pattern specification would specify the positions of some part and a parsing function.

Clojures approach to file handling is a little bit surprising for somebody coming from Common Lisp, which provides three main concepts you need to grasp: filenames aka pathnames, files and streams. Typically, you open a file with with-open-file which will guarantee that the file will be closed after you leave the block. Clojures with-open macro abstracts this idea of safe file handling to the next level in that it is not restricted to files. However, with-open does not simply work with a filename as an argument, you need to pass in a resource which follows the open/close protocoll which a simple filename string does not. That Clojure leverages the Java IO library for this is not surprising, but that it leaks this Java dependency to the users is. I assume this implies that (with-open ( "/some/filename.txt")) will not work on ClojureCLR (apparently slurp does). read-line is also a false friend for a Common Lisp programmer, as it can only be used for reading a line from the REPL but not for reading from some stream like in CL. While I appreciate the possibility to call Java methods from Clojure, I prefer using any abstractions that Clojure provides, so I went with line-seq instead of calling .read. Finally, as Clojure does not allow for re-assignment, so we need a recursive approach with an accumulator to hold intermediate results while looping over the lines. So, the skeleton looks something like this in Clojure:

 (defn read-some-file
   "Skeleton for reading some file"
   (with-open [rdr (io/reader filename)]
      (loop [lines (line-seq rdr) result]
        (if (empty? lines)
      ; todo for the weather task: 
      ; compare the result of parsing a line with the current 'best' 
      ; i.e. minimum spread value so far and recur, possibly with different data
              (recur (next lines) result)))))))

This is more or less the equivalent of Perl’s -n command-line switch or while (<>) {...} construct.

The comparison mentioned in the code’s comment is actually very easy and what needs to be put in place for the result is also obvious. But in order to do that we have to parse the lines first. Following up on the idea of how to parse a line, the relevant function is subs which returns a substring from start to end. Now, we basically need to parse different substrings for the various data fields, quite often returning just an integer. Again, the simplest way of parsing a string to an integer is provided by a thin layer on top of a Java library. Neither the re-find nor the exception handling would be necessary if we could guarantee that the data would always only consist of correct data, this way we just silently skip over invalid data.

(defn string-to-int
  "Parses a consecutive set of numbers into an integer or return nil"
    (Integer/parseInt (re-find #"\d+" string))
    (catch Exception e nil)))

parse-line is fairly straight-forward: the basic idea is that we have some specification of how a line is structured. This specification is a map of start and end positions plus parsing functions. So for the weather data, the pattern looks like this:

(def day-pattern
  ;this pattern is not complete and could be extended
  (hash-map :day [1 4 #(first-word %)]
            :MxT [5 8 #(string-to-int %)]
            :MnT [9 14 #(string-to-int %)]
            :AvT [15 20 #(string-to-int %)]))

first-word is another small helper function which basically just retrieves the first continous non-whitespace characters of a string:

(defn first-word
  "Returns first consecutive non-whitespace chars from string"
  (re-find #"\S+" string))

parse-line than just loops over all parts of a pattern, extracts the substrings and calls the parsing function. It recursively conjures up a hash-map with the extracted data or returns nil if some parsing error occurs.

(defn parse-line [line pattern]
  "Parse a line with data in fixed positions using pattern.
Pattern should be a map consisting of a key for the data to return,
a start and end position and a parsing function for each data element.

Returns a map with all extracted data or nil for unparsable lines."
  ; loop solution with accumulator for results
  (loop [remkeys (keys pattern) linemap {}]
    (if (empty? remkeys)
      (let [key (first remkeys)
            [start end parsefn] (get pattern key)
            value (parsefn (try
                             (subs line start end)
                             (catch Exception e nil)))]
        (if value
          (recur (rest remkeys)
                 (conj linemap
                       (hash-map key value)))
          ; silently skip any parsing errors

This then allows us to put things together: we only need to compare the difference between MxT and MnT of the current line with the previous smallest temperature spread. We will use destructuring of the result of hte parse-day results to retrieve the needed data, sprinkle in some sanity checks and are done.

(defn parse-day
  "Parse a day from a line"
  (parse-line line day-pattern))

(defn find-lowest-temperature
  "Return day in weatherfile with the smallest temperature spread"
  (with-open [rdr (io/reader weatherfile)]
    (loop [lines (line-seq rdr) minday 0 minspread 0]
      (if (empty? lines)
        (let [{mnt :MnT mxt :MxT curday :day} (parse-day (first lines))            
              curspread (when (and mnt mxt) (- mxt mnt))]
          (if (and curday curspread
                   (or (= minspread 0)
                       (< curspread minspread)))
            (recur (next lines) curday curspread)
            (recur (next lines) minday minspread)))))))

When I started working on the second task, solving the soccer issue, I did a simple copy and paste of the find-lowest-temperature, added a new pattern for extracting the data and made the small changes to adapt to the different fields. I also understand the comparison requirement to look at the absolute difference. This leads to the following functions:

(defn abs 
  "Returns the absolute value of x" 
  (if (pos? x) x (- x)))

(def soccer-team-pattern
  ; this pattern is not complete
  (hash-map :pos [1 5 #(first-word %)]
            :team [7 22 #(first-word %)]
            :fval [43 45 #(string-to-int %)]
            :aval [50 52 #(string-to-int %)]))

(defn parse-soccer-team
  "Parse a soccer-team from a line"
  (parse-line line soccer-team-pattern))

(defn find-minimum-goal-difference 
  "Return team in soccerfile with the smallest difference in for and against goals"
  (with-open [rdr (io/reader soccerfile)]
    (loop [lines (line-seq rdr) minteam 0 mindiff 0]
      (if (empty? lines)
        (let [{aval :aval fval :fval curteam :team} 
              (parse-soccer-team (first lines))            
              curdiff (when (and aval fval) (abs (- fval aval)))]
          (if (and curteam curdiff
                   (or (= mindiff 0)
                       (< curdiff mindiff)))
            (recur (next lines) curteam curdiff)
            (recur (next lines) minteam mindiff)))))))

This, of course, led straight to the insight that it should be simple to extract the slight differences and make them parameters to some find-*-difference function. The following things are differently: the parsing pattern, the extraction function for the result value and the function used to compute the difference between values. If you would want to it would also be possible to make the comparison function configurable.

(defn find-some-difference 
  "Return some result from a data file which has some lowest difference"
  [filename parse-pattern resultkey diffn]
  (with-open [rdr (io/reader filename)]
    (loop [lines (line-seq rdr)
           result nil
           mindiff 0]
      (if (empty? lines)
        (let [data-map (parse-line-map (first lines) parse-pattern)
              curresult (get data-map resultkey)
              curdiff (diffn data-map)]
          (if (and curresult curdiff
                   (or (= mindiff 0)
                       (< curdiff mindiff)))
            (recur (next lines) curresult curdiff)
            (recur (next lines) result mindiff)))))))

(defn find-mingoal-diff-fusion
  "Return team in soccerfile with the smallest goal difference, using the fusion fn."
  (find-some-difference soccerfile soccer-team-pattern :team
                        (fn [{aval :aval fval :fval curteam :team}]
                          (when (and aval fval)
                            (abs (- fval aval))))))

There there was another itch I wanted to scratch: the parse-line function has some ugliness to it. For starters, it is handling possible exceptions from subs directly. It is also checking return values for nil. Both cases are what Common Lisp would see as conditions rather than real exceptions — it is rather unfortunate that Clojure opted for the more simple, although more traditional exception concept from Java. To remedy the uglyness of parse-line we can simply replace the direct call to subs with a small handcrafted call which manages any exceptions and also change the behavior of parse-line to simply return nil for all unparsable elements. But there is more that makes parse-line ugly: I dislike the recursive nature of the solution and the linear result handover in the let declaration (well, this handover was intentional to not have a functional train-wreck of calls). I wanted to see whether I couldn’t come up with a more elegant map/reduce solution. Here you go:

(defn substring
  "Returns substring from start to end from string or nil"
  [string start end]
    (subs string start end)
    (catch Exception e "")))

(defn parse-line-reduce [line pattern]
  "Parse a line with data in fixed positions using pattern.
Pattern should be a map consisting of a key for the data to return,
a start and end position and a parsing function for each data element.

Returns a map with all extracted data which maybe empty."
  ; map-reduce version
  (reduce #(conj %1 %2)
          (concat [{}]
                 (fn [[key [start end parsefn]]]
                   {key (parsefn (substring line start end))})
                 (seq pattern)))))

We are simply mapping over the entire pattern and use argument destructuring again to extract the relevant parts of it, but this time, due to the call to seq a pattern part will be a sequence, not a map. The map call with the anonymous function will produce a sequence of hashmaps with key and parsing results, which then gets reduced to a single map. In order to use reduce, you have to provide a function taking two arguments: the first will consume the intermediate result, the second will be the next value of the sequence to reduce. This is the reason why we have this ugly concat [()] ... in front of the call to map: we need to provide the initial value for reduce which in this case is an empty hashmap. An even more concise version replaces the call to reduce with into, resulting in a version which looks pretty idiomatic to me and is also way easier to understand then the lengthy recursive version above.

(defn parse-line-map [line pattern]
  "Parse a line with data in fixed positions using pattern.
Pattern should be a map consisting of a key for the data to return,
a start and end position and a parsing function for each data element.

Returns a map with all extracted data which maybe empty."
  (into {}
         (fn [[key [start end parsefn]]]
           {key (parsefn (substring line start end))})
         (seq pattern))))

Of course, we can use a similar approach for the find-*-difference function. We could also take a slightly different approach and sort the parsing results and then treat the minimal value as the result. If we combine this with slurp the resulting code also becomes way more compact. In order to filter out only partially parseable lines, we need to supply a list of keys that the parsing result must have values for.

(defn sort-diff-map
  "Return some result from a data file which has some lowest difference"
  [filename parse-pattern desiredkeys diffn]  
  ; map-filter version
  (sort-by diffn 
           (filter #(every? (partial get %) desiredkeys)
                   (map #(parse-line-map % parse-pattern)
                        (string/split-lines (slurp filename))))))

(defn find-mingoal-map
  "Return team in soccerfile with the smallest goal difference, using the sort-map fn."
  (get (take 1 
             (sort-diff-map soccer-file soccer-team-pattern 
                            [:team :fval :aval]
                            (fn [{aval :aval fval :fval curteam :team}]
                              (when (and aval fval)
                                (abs (- fval aval))))))

Summing up, what have we seen during this kata? Clojure’s platform dependent approach to reading files, usage of regular expressions, destructuring (again), anonymous functions (again) and map/reduce. Overall not very exciting. I know it’s not a fair comparison, but I would always opt for solving such tasks with Perl, especially if they are so trivial as in this case in which you can solve each task with a one-liner, basically. There is room for using more elaborate languages (e.g. Python, Ruby, Clojure) if parsing and processing become so elaborate that it makes sense to have more structure in the code. But for a task of the size of this kata, the amount of code required is usually not worth the effort.

Posted by Holger Schauer

Defined tags for this entry: ,
Jun 7

I wanted to learn Clojure for a while but never made much progress beyond the level of where I had a running slime connection to Clojure and had a simple initial leiningen project file for a web project with ring. I just got distracted by other stuff but more importantly, I had no goal for a web project, so I already failed with thinking about what I would build. Consequently, I stopped reading Fogus and Housers nice book Joy of Clojure. After quite a while (close to a year), I stumbled upon some posts reminding me of the code katas on the pragmatic programmers web site (cf. code katas). Somehow this triggered the idea that doing these katas might be a good way to learn Clojure finally.

In the linked blog posts I’ll describe the various solutions I worked out. This post here will serve as the main overview and get updates as I move along.

The code (as well as the original markdown for these articles) can be found on github.

Kata 14 concludes my little journey into the original katas. It took far too much time writing these, as I didn’t had enough time to work on them continuously. Besides, many of the katas I’ve left out seemed rather uninteresting and some are thought experiments.

Posted by Holger Schauer

Defined tags for this entry: ,

(Page 1 of 3, totaling 39 entries)