Moving targets -- on agile conception

Two things are clear for every project and every product: there is always a start and an end. I’ll ignore the end for now and would like to take a closer look at the start. One has to wonder: where do all these projects or new products come from? Obviously, somebody finds out at some point that there is a need for something, and this something is the nucleus of the prospective development. If this idea requires significant effort, it’s typically necessary to convince the people responsible for the money that it is worthwhile to pursue the idea. Let’s take a closer look at what this something actually is and how it changes, and needs to change, before the developers can finally go and make it real.

To use a more descriptive tag, I will use the term feature idea for now, although we could just as well talk about ideas for a new product or product line. Feature ideas at a very early stage are rarely more than a single, simple statement or even just a few descriptive keywords like “I would like a button here where I can save my settings” or “remember me functionality”. Of course, whoever came up with the idea typically has second thoughts, more or less fuzzy, but these are neither communicated nor discussed, let alone finalized.

A feature idea

Quite often these second thoughts are related to some existing systems or solutions, even if it’s just to describe what the respective solutions do not provide (note that these solutions might not be your own). Typically, feature ideas are about some specific expected behavior of some system and basically boil down to what needs to be built. This is all good: feature ideas don’t need further elaboration or lengthy descriptions at such an early stage.

How do you proceed from here to the nitty-gritty details that finally need to be done? In traditional settings, getting from the idea for a larger feature to development means jumping over the budget hurdle, which usually requires detailed analysis results, including business cases, return on investment calculations, and implementation plans. The latter consist of a solution outline, milestones, resource plans and cost estimates. It is important to note that substantial effort has to be invested already just to come up with all this. The bigger the need for precise information, the bigger the required investment. And then at some point you hopefully get the okay to move ahead. But if the idea is turned down, all this investment in detailed plans was simply wasted. In agile setups, we would like to avoid that waste. Still, of course, the need to decide where the business wants to invest its money does not go away. But in order to make that decision, does senior management really need all those details? They don’t. There is a need for accuracy, not for precision (cf. this description of the difference). Asking for all the details is a sure sign of a lack of trust, yet I have never seen any senior management trying to verify that all those precise details were actually accurate, which probably explains why the demand for documents and spreadsheets is all there is. This is then typically combined with taking all plans as promises that need to be kept rather than as means to enable successful investments (command-and-control vs. plan-to-replan).

What is surely needed boils down to just one thing: understanding the value of an investment. It’s often much easier to determine what you want or can invest than to precisely estimate upfront what you need to invest (cf. this discussion of the no estimates idea, which shows the wonderfully relevant Dilbert strip on the same issue). At the top level, investment is essentially tied to strategy: e.g. in the Scaled Agile Framework (SAFe, cf. Leffingwell et al.), investment themes determine the relation between budget (resources) and the program portfolio of business epics. This essentially requires an understanding of how each possible business epic or feature idea contributes to the overall company strategy (this is basically applying the Lean principle of “optimize the whole” to feature ideas).

We have essentially danced around the big elephant in the room so far: feature ideas often describe expected behavior, but what we need to know is the value of an idea. But if all we have is a simple statement of some idea (and not the full-fledged analysis we would like to avoid), how can we know its value? The essential insight is that it might not be necessary to understand the precise value yet. The earlier in the process we are, the less necessary it is to perform lengthy and detailed analysis: determining the value of some idea is only a means to support a decision on where to invest further. At the very least, however, it should be possible to judge the value of one feature idea in relation to, or better in comparison with, other ideas, as Johanna Rothman discusses in her article on why cost is the wrong question for evaluating projects: at least in an early stage, it might be enough to use a very rough kind of value sizing, e.g. putting an “M” on one feature idea in comparison to an “L” on another. It is important to recognize that the value of ideas will change over time, in particular when looked at in relation to other things your business needs to think about, as new ideas come up, some ideas get implemented and others are discarded. Accounting for these changes over time is why planning to replan is important in agile methodologies. A concept from agile methodologies that helps make this clear is the cost of delay: the idea is that you try to understand or predict how the value contribution of some feature will change over time, and in particular what impact not pursuing the idea will have. This is most useful for prioritizing what to invest in by cost of delay: as CoD is all about loss of value over time, you will want to reconsider your priorities (and where you want to drill down further with an analysis, etc.) as time moves on. (An alternative coming from the lean camp is to use an impact-effort matrix, but such a matrix needs an estimate of the required effort and also treats value as constant over time, which it usually isn’t. Cf. this article about cost of delay vs. WSJF in SAFe for a discussion of why cost of delay and value considerations are not the same.)

Inevitably, when you try to think about the value of some feature idea, you will think about what simple statements like “remember me functionality” should actually mean. Very often the way in which we name or describe a potential feature is a confusion (or at best a conflation) of problem and solution space, and that can make it quite difficult to judge its value. As such, naive feature descriptions can quickly become a big impediment. For one thing, they are, due to being deliberately simple, usually ambiguous and often hard to understand. Does “remember me functionality” mean that I stay logged in on the website, or that the website remembers the visitor and shows ads related to their interests, or both? This is often the reason why different people attribute completely different value to an idea. Worse, such descriptions typically don’t give any clue about the need behind the wanted behavior, nor about the value this behavior brings. Just as bad, stating the feature idea in terms of expected behaviour (“I want a button that …”) limits creativity by already pointing in the direction of a possible solution: maybe there are other options which are much better suited, both from a usability point of view and with regard to the required functionality. So the conflation of problem and solution is problematic both for understanding the problem and for designing the solution.

So what is the process to move from a feature idea to the solution design? I think it’s useful to see feature idea statements as promises for analysis. The analysis of a feature idea should essentially focus on understanding the user and / or business needs the idea is related to. The stereotypical user story template “As a <type of persona>, I want to <perform X> in order <to achieve Y>” is a useful tool in this respect, as it is explicitly constructed to include and name the need (cf. Gojko Adzic’s rant Writing “As a user” does not make it a user story on how this can go wrong when taken too lightly). Now you might not have such a story yet, and maybe it’s not even clear what this simple feature idea is about. That’s the time to engage with whoever seems able to contribute, but first and foremost with the primary beneficiaries of the feature idea. Traditional requirements engineering practice would suggest involving all relevant stakeholders (prospective users, management, marketing, legal department, …) and employing whatever RE tool seems reasonable (e.g. brainstorming, workshops, open interviews, etc.). In agile setups, determining who actually is a stakeholder that needs to be heard is the responsibility of the product owner. I find that it is usually also a good idea to involve the team, because they will typically have a good understanding of the solution space and should also know about the envisioned user needs. Don’t take it too far, though: if you are still trying to analyze in order to understand the value of some idea, i.e. you are still at an early stage (the inception phase, as the disciplined agile delivery folks would have it) of your planned feature, you should probably keep it light-weight.

You will likely get a lot of input on your feature idea, much of it in the form of statements comparably simple to what you originally started out with. Understanding the needs requires going beyond the level of simple statements or simple suggestions of simple solutions (Laura Klein has a nice description of how difficult that can be: sometimes users will tell you that they want a toaster in their car, when what they really mean is that they don’t have time to make breakfast in the morning). And although the famous 5 Whys might come across as somewhat strange or invasive in the course of an interview, it might take that many attempts to actually understand the problem that needs to be solved. If you’re lucky, a potential user will offer some insight into the problem with simple statements such as “I would like a button here so I can save my settings, so I don’t have to adjust them again and again”. Further analysis might reveal that the real problem is a usability issue and the user forgets to adjust the settings too often, or that the user is a regular user who would like to spend less time on repetitive tasks. The main point here is that this process is about understanding the needs, which typically requires challenging any superficial and obvious ideas: this is both about maximizing the amount of work not done (simplicity) and about building solutions that actually satisfy the user.

Change in understanding over time

This is now where the team basically takes over: the team is responsible for working out the solution, not only constructing / implementing it, but essentially designing it. This is not to say that the product owner or other stakeholders cannot or should not contribute ideas on how to solve it, quite the contrary. It is quite essential to have stakeholders around to enable the transfer of knowledge about the needs from the users or stakeholders to the team. So while the separate paragraphs of this text might suggest a handover from product owner to development team, from needs to solutions, ideally this should be a smooth transition with lots of participation of the entire team and face-to-face communication. Tools like Roman Pichler’s product canvas are useful, but there is a reason why the agile manifesto values individuals and interactions over processes and tools. Again, for working out the solution many ideas and tools can and should be exploited, from brainstorming and UI scribbles to models and prototypes. Typically, when designing the solution, many alternatives come to mind, and different people from different backgrounds will have different opinions and favorites. This is actually good, because you can embrace this diversity to think about the best possible solution. However, the best solution might not actually be the solution you want to implement; often the simplest thing that could possibly work is a reasonable way to move forward: it minimizes the effort required, thereby enabling faster time to market and earlier feedback on whether this solution actually works (this is the tension between trying to build the best thing vs. learning fast from feedback about what actually proves to work). People with lots of experience in designing solutions will typically also come up with all kinds of challenging questions which might require drilling further down into the needs (e.g. questioning whether the feature is targeted at the right people or what the legal department has to say about some feature). This will trigger more learning about the feature under consideration. But it should not be taken as an excuse to run endless analysis or design meetings: it is often better to end a discussion with a decision, move ahead and find out later whether this decision was actually flawed and needs correction. It is also a good idea to consider the risks associated with an idea: the more risk, the closer you want to examine your options. In general, when in doubt about which option to choose, think about what data would help you decide: light-weight experiments (e.g. technical spikes or usability tests with paper-and-pen sketches) can be useful for minimizing these risks.

A brief terminology recap: an epic is just some requirement (feature idea) that is so “large” and / or uncertain that it needs to be broken down, i.e. nobody could take an epic and just implement it; the developers would have to come back and ask lots of questions. If you wanted to, you could say that after breaking down an epic all you end up with are features, but in agile circles these are usually split up into even smaller stories (cf. the image below). Of course, this is a completely arbitrary and often also highly context-dependent distinction. It’s also fairly common to use the user story template for describing epics. I used the term feature idea instead of epic to bring in some notion of time: epics start out as (feature or other) ideas. If you think about the timeline again, it’s clear that the closer you are to implementing some feature, the more details you need to know, i.e. the higher your level of understanding of the feature needs to be. And this is where the idea comes into play that you need to INVEST in your user stories, so that you can derive SMART tasks. This is obviously needed for the very next things you want to implement and, as any risk-aware tech guy or project lead will know, probably also a good idea for whatever comes next after that. Business cases are mainly needed when you are going for big investments and when it’s critical to understand the return on investment. But this brings us back to the discussion of value vs. cost and of precision vs. accuracy. Use cases, finally, are a useful tool when it comes to understanding some set of requirements in the context of a larger system. They can also be useful for driving discussions with stakeholders and are often a good starting point if you like to model your problem / design before implementation. I see use cases as a useful tool on the feature level, to cover a broader range of more detailed stories. As such it does not matter much when to describe a use case: if it helps you build your understanding of some problem, use it before drilling down to more fine-grained descriptions. If you already have a lot of details and fear losing the overview, it can also make sense to reconstruct a use case from the more detailed stories.

Uncertainty vs. time

Finally, let’s step back briefly and reconsider who is part of this solution team. It’s quite common that for feature ideas of different sizes, different people will make up the solution team. I.e. when your feature idea is concerned with some detail of a larger system, the team might just be the development team (in the Scrum sense). For bigger features, projects or programs, it’s quite common that lead developers, design and usability specialists, solution and / or enterprise architects and maybe operational staff will play an important part in working out the solution. Whether these people work in specialist teams (as e.g. SAFe assumes) or are part of the development teams is not of much importance, but having members of those teams work on the solution design is quite essential: if this is not the case, then in the best case the developers will have to rediscover and understand all the good reasons why you came up with the solution in the first place in order to get it right. In the worst case, you end up with frustrated developers who don’t take responsibility and don’t care about the built solution, because they feel they had no say in its design.


Coding katas Clojure -- Bloom filters

Kata5 is an interesting one to do in Clojure, as Bloom filters, which are a probabilistic data structure for determining set membership, are all about spending as few bits as possible (or as required) to store data. When bit manipulation is required, not many programmers would jump to Lisp or Java, and indeed most descriptions are about implementations in C or one of its derivatives. This probably does not come as a surprise, but as we will see it is not entirely justified (with respect to Lisp or Java).

But before discussing this in detail, let’s dive into the kata. The description of the kata already tells us pretty precisely what we’re supposed to build: a bunch of hash functions and an array of bits which are then set or checked. We’ll start with the hash functions. Hash functions are a dime a dozen: Java provides one, Clojure, too. Still, it is interesting to go beyond the readily provided functions and implement some hash functions ourselves. The goal here is not to build perfect hash functions, but to get a feel for what an implementation of one looks like in Clojure (as this blog post series is about Clojure, not about Computer Science; the code is also available in my github repository for Clojure codekatas).

As the task here is to build a Bloom filter for strings, all hash functions basically boil down to iterating over a sequence of characters, converting each character into a numerical value (i.e. applying int) and then using this in an accumulating computation of the total hash value. We start out with more or less the simplest approach possible: we simply sum up the integer values of the characters (also known as the Kernighan & Ritchie “lose-lose” hash algorithm).

  (defn sum-chars 
    "Sum up the chars of a given string"
    [charseq]
    (reduce + (map int charseq)))
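
A quick check of what this yields (values computed by hand, so take this as a hypothetical REPL session):

      kata5-bloom-filters.core> (sum-chars "foobar")
      633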

Nothing interesting to see, just a straight-forward map/reduce, so let’s move on and take a stab at what Dan Bernstein (djb) suggested. In its most innocent version this looks pretty similar, but uses bit-shifting and has some magic numbers thrown in for good measure.

(defn djb-string-hash
  "Use djb's method for hashing a string"
  [charseq]
  (reduce (fn [curhash charval]
            (+ (+ (bit-shift-left curhash 5) curhash)
               charval))
          (cons 5381 (map int charseq))))

This is already where it got interesting, because if you run this with the simple string “foobar”, you will run directly into an overflow. The Clojure documentation on unchecked-add could have told me so directly, of course. I had to use quite a bit of websearch-fu, wildly testing around and knocking my head on the table to come up with this version:

(defn djb-string-hash
  "Use djb's method for hashing a string"
  [charseq]
  (reduce (fn [curhash charval]
            ^long (unchecked-add
                    (unchecked-add (bit-shift-left curhash 5) curhash)
                    charval))
          (cons 5381 (map int charseq))))

This looks quite good, but is still not bulletproof, as testing it with a longer string (> 11 characters) will show. The real issue is that unchecked-add only behaves unchecked when it is handed primitive long values; here map and reduce hand it boxed numbers, so Clojure falls back to its checked math in clojure.lang.Numbers, which throws an ArithmeticException on overflow. Java does not offer unsigned numeric types either, and Clojure, leveraging the JVM, is directly affected by all this.

      kata5-bloom-filters.core> (djb-string-hash "foobar")
      6953516687550
      kata5-bloom-filters.core> (djb-string-hash "foobarfoobar")
      ArithmeticException integer overflow  
      clojure.lang.Numbers.throwIntOverflow (Numbers.java:1388)

Compare this with the behavior of Common Lisp: the straight-forward translation of the exact same naive implementation will not overflow (unless you aggressively optimize for speed over safety or promise to the compiler that only fixnum results will occur), because results are automatically promoted to bignums.

(defun djb-string-hash (charseq)
  "Use djb's method for hashing a string"
  (reduce #'(lambda (curhash charval)
          (+ (ash curhash 5) curhash charval))
       (cons 5381 (map 'list #'char-code charseq))))

Clojure does provide similar behavior with e.g. the BigInteger type; however, this is of no use here, as Clojure’s bit-operations don’t know how to handle BigInteger data: they are only defined for the primitive data types that Java provides. Worse, converting back from a BigInteger to a primitive data type (e.g. int) will either throw or, with the unchecked variant, keep only the low bits, which reinterpreted as a signed int yields -2:

kata5-bloom-filters.core> (bit-shift-left (bigint 20) 2)
IllegalArgumentException bit operation not supported for: 
       class clojure.lang.BigInt  
       clojure.lang.Numbers.bitOpsCast (Numbers.java:1008)
kata5-bloom-filters.core> (* 2 (bigint Integer/MAX_VALUE))
4294967294N
kata5-bloom-filters.core> (int (* 2 (bigint Integer/MAX_VALUE)))
IllegalArgumentException Value out of range for int: 4294967294
       clojure.lang.RT.intCast (RT.java:1115)
kata5-bloom-filters.core> (unchecked-int (* 2 (bigint Integer/MAX_VALUE)))
-2
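
If silent wrap-around (as you would get in C) is acceptable, one way out is to keep the accumulator a primitive long inside a loop, so that unchecked-add really compiles down to primitive, wrapping arithmetic. The following is just my own sketch of that idea, not part of the original kata code:

(defn djb-string-hash-wrapping
  "djb2 variant that keeps the accumulator a primitive long, so that
   unchecked-add silently wraps around instead of throwing on overflow.
   (Sketch, not from the original kata code.)"
  [charseq]
  (loop [cs (seq charseq)
         hash 5381]
    (if cs
      (recur (next cs)
             (unchecked-add (unchecked-add (bit-shift-left hash 5) hash)
                            (long (int (first cs)))))
      hash)))

For “foobar” this should still yield 6953516687550, and longer strings now simply wrap around instead of throwing.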

Similar issues arose with all of the other hash functions (i.e. sdbm and fnv), cf. the code on github — nothing interesting here, so let’s move on to the bloom filter itself.

Multiple options come to mind when thinking about the data structure to use. Let’s start out with the simplest possible option: just use a single number. This defaults to a Java long, i.e. a 64-bit number. This of course implies a rather arbitrary upper limit on the size of the bloom filter, which influences the number of possible entries and the rate of false positives, cf. this overview article about the garden variety of bloom filters.

(defn bloom-add [bloom charseq & {:keys [hashfns]
                                  :or {hashfns *hash-functions*}}]
  (reduce #(bit-set %1 %2)
          (conj (map #(% charseq) hashfns)
                bloom)))

(defn bloom-contains? [bloom charseq & {:keys [hashfns]
                                        :or {hashfns *hash-functions*}}]
  (every? #(bit-test bloom %)
          (map #(% charseq) hashfns)))

(defn build-bloom [wordfile & {:keys [hashfns]
                               :or {hashfns *hash-functions*}}]
  (reduce #(bloom-add %1 %2 :hashfns hashfns)
          (cons 0 (string/split-lines (slurp wordfile)))))

The code here is pretty straight-forward, with the possible exception that we’re mapping over a list of functions in bloom-add and bloom-contains?. We can extract this part into a small helper function, which makes the code a little more readable.

     (defn hash-string [charseq & {:keys [hashfns] 
                                   :or {hashfns *hash-functions*}}]
        (map #(% charseq) hashfns))

This very naive implementation will run into problems right away: the hash functions yield values that are themselves 64 bits in size, whereas the highest bit position that can be set in a long is 63. A straight-forward fix is to take the size of the ‘bloom filter’ (i.e. 64 bits) into account and truncate the hash values accordingly via the modulo operation, i.e. instead of calling (bit-set bloom value) we do (bit-set bloom (mod value 64)).
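
Applied to bloom-add, that adjustment might look like this (my sketch of the fix just described, reusing hash-string from above):

(defn bloom-add [bloom charseq & {:keys [hashfns]
                                  :or {hashfns *hash-functions*}}]
  (reduce (fn [b hashval]
            ;; truncate each hash value to the 64 bit positions a long offers
            (bit-set b (mod hashval 64)))
          bloom
          (hash-string charseq :hashfns hashfns)))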

Now, if you think about it, using a single number is probably not the optimal data structure: for one, we just limited ourselves to bit arrays that are 64 bits in size (which for instance implies that with the /usr/share/dict/words file you’ll end up with a filter in which practically all bits are set to 1), and due to the immutable nature of the arithmetic operations we actually allocate a lot more space than just one long, thereby very much defeating an important characteristic of Bloom filters.
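
To get a feeling for how quickly such a tiny filter fills up, recall the standard approximation for the false-positive rate of an m-bit filter with k hash functions after n insertions: (1 - e^(-k*n/m))^k. A quick sketch of this formula (mine, not from the original code):

(defn false-positive-rate
  "Approximate false-positive rate of an m-bit Bloom filter with k hash
   functions after n insertions: (1 - e^(-k*n/m))^k."
  [m k n]
  (Math/pow (- 1.0 (Math/exp (- (/ (* k n) (double m))))) k))

;; with 64 bits, 3 hash functions and just 50 words this is already roughly 0.74
(false-positive-rate 64 3 50)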

So let’s take a completely different approach and use one of Java’s mutable data structures: BitSet (i.e. java.util.BitSet, which needs to be imported). This leads to the following naive, non-thread-safe implementation:

(defn bloom-add [bloom charseq & {:keys [hashfns]
                                  :or {hashfns *hash-functions*}}]
  (let [size (.size bloom)]
    (doseq [hashval (hash-string charseq :hashfns hashfns)]
      (.set bloom (Math/abs (mod hashval size)) true))
    bloom))

(defn bloom-contains? [bloom charseq & {:keys [hashfns]
                                        :or {hashfns *hash-functions*}}]
  (let [size (.size bloom)
        hashvals (hash-string charseq :hashfns hashfns)]
    (every? #(= (.get bloom (Math/abs (mod % size))) true) hashvals)))

(defn build-bloom [wordfile & {:keys [bloom-filter size hashfns]
                               :or {size 1024
                                    hashfns *hash-functions*}}]
  (let [bloom (or bloom-filter (BitSet. size))]
    (reduce #(bloom-add %1 %2 :hashfns hashfns)
            (cons bloom (string/split-lines (slurp wordfile))))
    bloom))

We basically just exchanged the bit-set/bit-test functions for the respective BitSet methods and use a dynamic size. This hints at a possible generalization: we could consider multiple bloom filter implementations (types, if you want) that all need to support some sort of bit-setting and bit-getting operation plus a size. This would be the internal protocol, while bloom-add / bloom-contains? (and maybe build-bloom) form the external API.

Now, of course, we would like to fix the problem that this code is not thread-safe. As is made pretty clear in Fogus et al.’s book “The Joy of Clojure”, Clojure’s reference types are of no use here:

“Wrapping a mutable object in a Clojure reference type provides absolutely no guarantees for safe concurrent modification. Doing this will at best explode immediately or, worse, provide inaccurate results.”

The advice Fogus et al. offer is to use the locking macro. If we combine this with the above idea of an internal protocol, we can at least apply the locking only where it is necessary, i.e. around the calls to .get / .set.

(defprotocol BloomFilterImpl
    (bloom-size [filter])
    (bloom-bit-get [filter position])
    (bloom-bit-set [filter position value]))

(extend-type BitSet
    BloomFilterImpl
    (bloom-size [filter]
        (.size filter))
    (bloom-bit-get [filter position]
        (locking filter
            (.get filter position)))
    (bloom-bit-set [filter position value]
        (if (< position (bloom-size filter))
            (locking filter
                (.set filter position value))
            (throw (IllegalArgumentException. 
                                    "position outside of bloom filter size")))))

If you wonder why the protocol does not have the add or contains? functions, this is because these operations would be part of some dictionary protocol or some such (although it is somewhat debatable if dictionaries should guarantee the absence of false-positives).
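
For completeness, here is a sketch (my own, the post does not show this step) of how bloom-add and bloom-contains? could be rewritten against this internal protocol, so that they work for any BloomFilterImpl; whatever an implementation needs for safe modification (locking, a surrounding transaction, …) stays hidden behind or outside the protocol functions:

(defn bloom-add [bloom charseq & {:keys [hashfns]
                                  :or {hashfns *hash-functions*}}]
  (let [size (bloom-size bloom)]
    (doseq [hashval (hash-string charseq :hashfns hashfns)]
      (bloom-bit-set bloom (Math/abs (mod hashval size)) true))
    bloom))

(defn bloom-contains? [bloom charseq & {:keys [hashfns]
                                        :or {hashfns *hash-functions*}}]
  (let [size (bloom-size bloom)]
    (every? #(bloom-bit-get bloom (Math/abs (mod % size)))
            (hash-string charseq :hashfns hashfns))))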

Let’s dig some more into the concurrency issue: it’s surprisingly hard to come up with a scenario where the mutability of the BitSet could be problematic. For one, we only ever add entries by setting a single bit, and we do that in an atomic fashion that does not rely on the previous value of the BitSet in any way. For another, we don’t have any delete operation, so we can’t possibly run into the situation where some bit / some dictionary entry goes missing (assuming, of course, that all modifications to the bloom filter happen only through the functions we supplied and not by some other means directly on the BitSet outside our control). The only scenario that comes to my mind is one where a thread wants to keep the state of the bloom filter fixed for a while, i.e. for some time we want to be able to deny having seen some value / word (which another thread just tried to sneak in while we were not looking). I can’t imagine a real world usage for this scenario, but that probably says more about my creativity than about anything else.

Let’s briefly discuss the options to account for this scenario: the locking approach above is not enough, as the locking occurs as part of the getting/setting operations; there is no way in which one thread could prohibit modifications to the Bloom filter for a specified amount of time with this. Of course, as locks nest, you could add locking around the calls that add new elements to the bloom filter. The other option would be to use one of Clojure’s reference types. But as discussed above, these are not useful for protecting mutable data structures, so we would need to go back to using one of Clojure’s persistent data structures. So, let’s briefly step aside and compare the speed of Java arrays generated via Clojure with that of Clojure vectors, both holding booleans:

(defn make-random-boolean-array [size]
  (boolean-array
   (take size (repeatedly #(rand-nth [true false])))))

(defn make-random-boolean-vector [size]
  (into [] (take size (repeatedly #(rand-nth [true false])))))

(defn print-flipped-boolean-array [ba]
  (let [size (count ba)]
    (loop [indx 0
           result ""]
      (if (= indx size)
        result
        (do
          (aset ba indx (not (aget ba indx)))
          (recur (inc indx)
                 (string/join
                  [result
                   (if (aget ba indx) 1 0)])))))))

(defn print-flipped-boolean-vector [bv]
  (let [size (count bv)]
    (loop [indx 0
           vec bv
           result ""]
      (if (= indx size)
        result
        (let [newvec (assoc vec indx (not (get vec indx)))]
          (recur (inc indx)
                 newvec
                 (string/join
                  [result
                   (if (get newvec indx) 1 0)])))))))

kata5-bloom-filters.core> (time (do
                                  (print-flipped-boolean-array
                                    (make-random-boolean-array 10000))
                                  true))
"Elapsed time: 587.247368 msecs"
true
kata5-bloom-filters.core> (time (do
                                  (print-flipped-boolean-array
                                    (make-random-boolean-array 10000))
                                  true))
"Elapsed time: 592.598888 msecs"
true
kata5-bloom-filters.core> (time (do
                                  (print-flipped-boolean-vector
                                    (make-random-boolean-vector 10000))
                                  true))
"Elapsed time: 76.657272 msecs"
true
kata5-bloom-filters.core> (time (do
                                  (print-flipped-boolean-vector
                                    (make-random-boolean-vector 10000))
                                  true))
"Elapsed time: 69.666769 msecs"
true
kata5-bloom-filters.core> (time (do
                                  (print-flipped-boolean-vector
                                    (make-random-boolean-vector 10000))
                                  true))
"Elapsed time: 71.897087 msecs"
true

Now, if I run this a reasonable number of times, it appears that for simple element access the boolean vector outperforms the boolean array, even though I’m basically building up a new partial copy of the vector all the time / for all elements. That was a welcome surprise for me. So, using a vector of booleans would be a viable next possibility. The road ahead, however, is hindered by the fact that Clojure does not derive a more specific type for vectors, not even if you explicitly declare the vector to be of a certain type. You’ll always end up with a vector or a Vec:

kata5-bloom-filters.core> (into (vector-of :boolean) [true false])
[true false]
kata5-bloom-filters.core> (type *1)
clojure.core.Vec

This is problematic as we don’t want to just extend-type the general type. The way out is to use reify, which is roughly like extend-type for a single object (that description is actually quite misleading; have a look at the documentation of reify yourself, and I would also recommend reading up on this in “The Joy of Clojure”).

(defn make-bloom-vector [size]
  (let [bloom-vect
        (ref (into (vector-of :boolean)
                   (take size (repeatedly #(identity false)))))]
    (reify BloomFilterImpl
      (bloom-size [_]
        size)
      (bloom-bit-get [_ position]
        (nth @bloom-vect position))
      (bloom-bit-set [_ position value]
        (alter bloom-vect assoc position value)))))

As you can see, I’m using a closure to store the vector of booleans in a ref. We then deref the vector on get and alter it on set. This is deliberately set up so that any transaction handling (i.e. dosync) occurs outside of the implementation details, to account for the scenario discussed above. Of course, whether refs are really best suited for handling any particular concurrency situation ultimately depends on the application-specific context and concurrency requirements.
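
A hypothetical usage sketch (not from the post) that makes the caller-controlled transaction boundary explicit:

(def bloom (make-bloom-vector 1024))

;; the dosync is supplied by the caller, not by the implementation
(dosync
  (doseq [hashval (hash-string "foobar")]
    (bloom-bit-set bloom (Math/abs (mod hashval 1024)) true)))

(every? #(bloom-bit-get bloom (Math/abs (mod % 1024)))
        (hash-string "foobar"))
;; => true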

I will leave it at that. This time we have seen quite a few more features that are pretty much unique to Clojure: explicit protocols, reification of instances, and references, i.e. using Clojure’s STM. There was also quite a bit of discussion of the consequences of Clojure’s decision to leverage the JVM, some bad (overflows), some nice (plugging in a readily available data structure). Overall, this kata was quite an interesting exercise.

