Generic programming in Haskell

Generic programming is a powerful way to define a function that works in an analogous way for a class of types. In this article, I describe the latest approach to generic programming that is implemented in GHC. This approach goes back to the paper A Generic Deriving Mechanism for Haskell by José Pedro Magalhães, Atze Dijkstra, Johan Jeuring, and Andres Löh.

This article is a writeup of a Theory Lunch talk I gave on 4 February 2016. As usual, the source of this article is a literate Haskell file, which you can download, load into GHCi, and play with.

Motivation

Parametric polymorphism allows you to write functions that deal with values of any type. An example of such a function is the reverse function, whose type is [a] -> [a]. You can apply reverse to any list, no matter what types the elements have.

However, parametric polymorphism does not allow your functions to depend on the structure of the concrete types that are used in place of type variables. So values of these types are always treated as black boxes. For example, the reverse function only reorders the elements of the given list. A function of type [a] -> [a] could also drop elements (like the tail function does) or duplicate elements (like the cycle function does), but it could never invent new elements (except for ⊥) or analyze elements.

Now there are situation where a function is suitable for a class of types that share certain properties. For example, the sum function works for all types that have a notion of binary addition. Haskell uses type classes to support such functions. For example, the Num class provides the method (+), which is used in the definition of sum, whose type Num a => [a] -> a contains a respective class constraint.

The methods of a class have to be implemented separately for every type that is an instance of the class. This is reasonable for methods like (+), where the implementations for the different instances differ fundamentally. However, it is unfortunate for methods that are implemented in an analogous way for most of the class instances. An example of such a method is (==), since there is a canonical way of checking values of algebraic data types for equality. It works by first comparing the outermost data constructors of the two given values and if they match, the individual fields. Only when the data constructors and all the fields match, the two values are considered equal.

For several standard classes, including Eq, Haskell provides the deriving mechanism to generate instances with default method implementations whose precise functionality depends on the structure of the type. Unfortunately, there is no possibility in standard Haskell to extend this deriving mechanism to user-defined classes. Generic programming is a way out of this problem.

Prerequisites

For generic programming, we need several language extensions. The good thing is that only one of them, DeriveGeneric, is specific to generic programming. The other ones have uses in other areas as well. Furthermore, DeriveGeneric is a very small extension. So the generic programming approach we describe here can be considered very lightweight.

We state the full set of necessary extensions with the following pragma:

{-# LANGUAGE DefaultSignatures,
             DeriveGeneric,
             FlexibleContexts,
             TypeFamilies,
             TypeOperators #-}

Apart from these language extensions, we need the module GHC.Generics:

import GHC.Generics

Our running example

As our running example, we pick serialization and deserialization of values. Serialization means converting a value into a bit string, and deserialization means parsing a bit string in order to get back a value.

We introduce a type Bit for representing bits:

data Bit = O | I deriving (Eq, Show)

Furthermore, we define the class of all types that support serialization and deserialization as follows:

class Serializable a where

    put :: a -> [Bit]

    get :: [Bit] -> (a, [Bit])

There is a canonical way of serializing values of algebraic data types. It works by first encoding the data constructor of the given value as a sequence of bits and then serializing the individual fields. To show this approach in action, we define an algebraic data type Tree, which is a type of labeled binary trees:

data Tree a = Leaf | Branch (Tree a) a (Tree a) deriving Show

An instantiation of Serializable for Tree that follows the canonical serialization approach can be carried out as follows:

instance Serializable a => Serializable (Tree a) where

    put Leaf                     = [O]
    put (Branch left root right) = [I]       ++
                                   put left  ++
                                   put root  ++
                                   put right

    get (O : bits) = (Leaf, bits)
    get (I : bits) = (Branch left root right, bits''') where

        (left,  bits')   = get bits
        (root,  bits'')  = get bits'
        (right, bits''') = get bits''

Of course, it quickly becomes cumbersome to provide such an instance declaration for every algebraic data type that should use the canonical serialization approach. So we want to implement the canonical approach once and for all and make it easily usable for arbitrary types that are amenable to it. Generic programming makes this possible.

Representations

An algebraic data type is essentially a sum of products where the terms “sum” and “product” are understood as follows:

  • A sum is a variant type. In Haskell, Either is the canonical type constructor for binary sums, and the empty type Void from the void package is the nullary sum.

  • A product is a tuple type. In Haskell, (,) is the canonical type constructor for binary products, and () is the nullary product.

The key idea of generic programming is to map types to representations that make the sum-of-products structure explicit and to implement canonical behavior based on these representations instead of the actual types.

The GHC.Generics module defines a number of type constructors for constructing representations:

data V1 p

infixr 5 :+:
data (:+:) f g p = L1 (f p) | R1 (g p)

data U1 p = U1

infixr 6 :*:
data (:*:) f g p = f p :*: g p

newtype K1 i a p = K1 { unK1 :: a }

newtype M1 i c f p = M1 { unM1 :: f p }

All of these type constructors take a final parameter p. This parameter is relevant only when dealing with higher-order classes. In this article, however, we only discuss generic programming with first-order classes. In this case, the parameter p is ignored. The different type constructors play the following roles:

  • V1 is for the nullary sum.

  • (:+:) is for binary sums.

  • U1 is for the nullary product.

  • (:*:) is for binary products.

  • K1 is a wrapper for fields of algebraic data types. Its parameter i used to provide some information about the field at the type level, but is now obsolete.

  • M1 is a wrapper for attaching meta information at the type level. Its parameter i denotes the kind of the language construct the meta information refers to, and its parameter c provides access to the meta information.

The GHC.Generics module furthermore introduces a class Generic, whose instances are the types for which a representation exists. Its definition is as follows:

class Generic a where

  type Rep a :: * -> *

  from :: a -> (Rep a) p

  to :: (Rep a) p -> a

A type Rep a is the representation of the type a. The methods from and to convert from values of the actual type to values of the representation type and vice versa.

To see all this in action, we make Tree a an instance of Generic:

instance Generic (Tree a) where

    type Rep (Tree a) =
        M1 D D1_Tree (
            M1 C C1_Tree_Leaf U1
            :+:
            M1 C C1_Tree_Branch (
                M1 S NoSelector (K1 R (Tree a))
                :*:
                M1 S NoSelector (K1 R a)
                :*:
                M1 S NoSelector (K1 R (Tree a))
            )
        )

    from Leaf                     = M1 (L1 (M1 U1))
    from (Branch left root right) = M1 (
                                        R1 (
                                        M1 (
                                            M1 (K1 left)
                                            :*:
                                            M1 (K1 root)
                                            :*:
                                            M1 (K1 right)
                                        ))
                                    )

    to (M1 (L1 (M1 U1)))      = Leaf
    to (M1 (
            R1 (
            M1 (
                M1 (K1 left)
                :*:
                M1 (K1 root)
                :*:
                M1 (K1 right)
            ))
        ))                    = Branch left root right

The types D1_Tree, C1_Tree_Leaf, and C1_Tree_Branch are type-level representations of the type constructor Tree, the data constructor Leaf, and the data constructor Branch, respectively. We declare them as empty types:

data D1_Tree
data C1_Tree_Leaf
data C1_Tree_Branch

We need to make these types instances of the classes Datatype and Constructor, which are part of GHC.Generics as well. These classes provide a link between the type-level representations of type and data constructors and the meta information related to them. This meta information particularly covers the identifiers of the type and data constructors, which are needed when implementing canonical implementations for methods like show and read. The instance declarations for the Tree-related types are as follows:

instance Datatype D1_Tree where

  datatypeName _ = "Tree"

  moduleName _ = "Main"

instance Constructor C1_Tree_Leaf where

  conName _ = "Leaf"

instance Constructor C1_Tree_Branch where

  conName _ = "Branch"

Instantiating the Generic class as shown above is obviously an extremely tedious task. However, it is possible to instantiate Generic completely automatically for any given algebraic data type, using the deriving syntax. This is what the DeriveGeneric language extension makes possible.

So instead of making Tree a an instance of Generic by hand, as we have done above, we could have declared the Tree type as follows in the first place:

data Tree a = Leaf | Branch (Tree a) a (Tree a)
              deriving (Show, Generic)

Implementing canonical behavior

As mentioned above, we implement canonical behavior based on representations. Let us see how this works in the case of the Serializable class.

We introduce a new class Serializable' whose methods provide serialization and deserialization for representation types:

class Serializable' f where

    put' :: f p -> [Bit]

    get' :: [Bit] -> (f p, [Bit])

We instantiate this class for all the representation types:

instance Serializable' U1 where

    put' U1 = []

    get' bits = (U1, bits)

instance (Serializable' r, Serializable' s) =>
         Serializable' (r :*: s) where

    put' (rep1 :*: rep2) = put' rep1 ++ put' rep2

    get' bits = (rep1 :*: rep2, bits'') where

        (rep1, bits')  = get' bits
        (rep2, bits'') = get' bits'

instance Serializable' V1 where

    put' _ = error "attempt to put a void value"

    get' _ = error "attempt to get a void value"

instance (Serializable' r, Serializable' s) =>
         Serializable' (r :+: s) where

    put' (L1 rep) = O : put' rep
    put' (R1 rep) = I : put' rep

    get' (O : bits) = let (rep, bits') = get' bits in
                      (L1 rep, bits')
    get' (I : bits) = let (rep, bits') = get' bits in
                      (R1 rep, bits')

instance Serializable' r => Serializable' (M1 i a r) where

    put' (M1 rep) = put' rep

    get' bits = (M1 rep, bits') where

        (rep, bits') = get' bits

instance Serializable a => Serializable' (K1 i a) where

    put' (K1 val) = put val

    get' bits = (K1 val, bits') where

        (val, bits') = get bits

Note that in the case of K1, the context mentions Serializable, not Serializable', and the methods put' and get call put and get, not put' and get'. The reason is that the value wrapped in K1 has an ordinary type, not a representation type.

Accessing canonical behavior

We can now apply canonical behavior to ordinary types using the methods from and to from the Generic class. For example, we can implement functions defaultPut and defaultGet that provide canonical serialization and deserialization for all instances of Generic:

defaultPut :: (Generic a, Serializable' (Rep a)) =>
              a -> [Bit]
defaultPut = put' . from

defaultGet :: (Generic a, Serializable' (Rep a)) =>
              [Bit] -> (a, [Bit])
defaultGet bits = (to rep, bits') where

    (rep, bits') = get' bits

We can use these functions in instance declarations for Serializable. For example, we can make Tree a an instance of Serializable in the following way:

instance Serializable a => Serializable (Tree a) where
    
    put = defaultPut

    get = defaultGet

Compared to the instance declaration we had initially, this one is a real improvement, since we do not have to implement the desired behavior of put and get by hand anymore. However, it still contains boilerplate code in the form of the trivial method declarations. It would be better to establish defaultPut and defaultGet as defaults in the class declaration:

class Serializable a where

    put :: a -> [Bit]
    put = defaultPut

    get :: [Bit] -> (a, [Bit])
    get = defaultGet

However, this is not possible, since the types of defaultPut and defaultGet are less general than the types of put and get, as they put additional constraints on the type a. Luckily, GHC supports the language extension DefaultSignatures, which allows us to give default implementations that have less general types than the actual methods (and consequently work only for those instances that are compatible with these less general types). Using DefaultSignatures, we can declare the Serializable class as follows:

class Serializable a where

    put :: a -> [Bit]
    default put :: (Generic a, Serializable' (Rep a)) =>
                   a -> [Bit]
    put = defaultPut

    get :: [Bit] -> (a, [Bit])
    default get :: (Generic a, Serializable' (Rep a)) =>
                   [Bit] -> (a, [Bit])
    get = defaultGet

With this class declaration in place, we can make Tree a an instance of Serializable as follows:

instance Serializable a => Serializable (Tree a)

With the minor extension DeriveAnyClass, which is provided by GHC starting from Version 7.10, we can even use the deriving keyword to instantiate Serializable for Tree a. As a result, we only have to write the following in order to define the Tree type and make it an instance of Serializable:

data Tree a = Leaf | Branch (Tree a) a (Tree a)
              deriving (Show, Generic, Serializable)

So finally, we can use our own classes like the Haskell standard classes regarding the use of deriving clauses, except that we have to additionally derive an instance declaration for Generic.

Specialized implementations

Usually, not all instances of a class should or even can be generated by means of generic programming, but some instances have to be crafted by hand. For example, making Int an instance of Serializable requires manual work, since Int is not an algebraic data type.

However, there is no problem with this, since we still have the opportunity to write explicit instance declarations, despite the presence of a generic solution. This is in line with the standard deriving mechanism: you can make use of it, but you are not forced to do so. So we can have the following instance declaration, for example:

instance Serializable Int where

    put val = replicate val I ++ [O]

    get bits = (length is, bits') where

        (is, O : bits') = span (== I) bits

Of course, the serialization approach we use here is not very efficient, but the instance declaration illustrates the point we want to make.

5 thoughts on “Generic programming in Haskell

  1. Robert Peszek

    Thank you for your blog, I plan to read more. I loved reading your post on constraints.

    I have noticed some minor typos:

    • The link to “the latest approach” appears to be broken now.
    • M1 (definition) as stated does not have c as indicated (“c provides access to the meta information”), but it has a instead.
    • The definition of get for Int does not seem to be total, as the pattern match 0 : bits' may fail.

    Also, reading this post, I got the impression (more reading between lines, I guess) that the generic implementation of Serializable somehow gets implemented with the magic of DeriveAnyClass, which is clearly not the case. Maybe clarifying which boilerplate is automatable and which is not would make it a bit more clear.

    Like

    Reply
    1. Wolfgang Jeltsch Post author

      Glad that you like my blog. ☺

      I fixed the link and the definition of M1.

      The definition of get for Int is correct. The list of following bits cannot start with I, since span removes all leading I-bits. So the pattern match can only fail if the list of following bits is empty. However, this would mean an illegal encoding was given, and the function should fail in this case.

      Like

      Reply
  2. Politron

    All examples in the net are related with tree leafs and even worst mathematics reference. Is that hard and complex this subject. Can I see a simple example of how to process a data type Male and Female in a generic code where both of them share age and name but not sex?

    Like

    Reply

When replying to another comment, please press that comment’s “Reply” button.