# On the implementation of C-sets

$$

$$

The idea of a “\mathsf{C}-set” was introduced in an earlier post. To recap, a \mathsf{C}-set is a functor from a small category \mathsf{C} to \mathsf{Set}. When implementing this idea on a computer, we restrict it to a functor from a finitely presented category \mathsf{C} to \mathsf{FinSet}. The object of this blog post is to explain how we implement such \mathsf{C}-sets as a data structure in the Julia programming language.

To describe a functor from a finitely presented category, it suffices to give the value of the functor on the objects of the category and on the *generating* morphisms. Knowing this data, we can derive the value of the functor on all other morphisms, because all other morphisms are finite compositions of the generating morphisms. Notice that this is a non-unique way of serializing the data of a C-set; we could be working with a presentation of \mathsf{C} with redundant generators, for instance.

In Catlab, an object of \mathsf{FinSet} is just a wrapper around a natural number, which we will write in mathematical notation as [n]. A map from [n] to [m] is a length-n array of integers ranging from 1 to m. Therefore, in order to represent a functor from the category defined bytes

we could use a data structure that looks like this:

The vertices of `g::Graph`

are `1:g.nv`

and the edges are `1:g.ne`

. If `j`

is in `1:g.ne`

, then `g.src[j]`

is the source of `j`

, and `g.tgt[j]`

is the target. This is actually the way that a common data structure for graphs stores the data of a graph (an edge list).

However, if that was all there was to C-sets, there would not be much reason to use them. Sure, from an aesthetic standpoint it is elegant to unify the implementation of many different data structures, but handwriting a data structure like `Graph`

is not so hard. Writing functions to add and modify vertices and edges is also not so hard. The advantages of implementing C-sets generically start to really shine when we talk upgrade the edge list implementation to an *adjacency list* implementation, which adds an index that caches the inverses of `src`

and `tgt`

. This data structure looks like

```
struct Graph
nv::Int
ne::Int
src::Vector{Int}
tgt::Vector{Int}
src_inv::Vector{Vector{Int}}
tgt_inv::Vector{Vector{Int}}
end
```

If `i`

is between 1 and `g.nv`

(where `g::Graph`

), then `g.src_inv[i]`

is a list of all of the edges that have `i`

as their source, and `g.tgt_inv[i]`

is a list of all the edges with `i`

as their target. Now, whenever you add or modify a vertex or edge, you have to update the indexes as well, which can be somewhat tricky to do correctly. However, with the generic C-set implementation in `Catlab.jl`

, the code to add and modify each data structure is automatically generated, and the only thing that an implementor of a new C-set-based structure has to do is give memorable names to the automatically generated functions (here as in the rest of CS, the one thing that cannot be automated is naming!).

To sum up, while from a mathematical perspective the purpose of C-sets is that they generalize many interesting mathematical objects, from a scientific computing perspective the purpose of C-sets is that tedious and error prone code that keeps track of indexes can be automatically generated.

There’s an old computer science joke which says that there are two problems in computer science:

- Cache invalidation
- Naming
- Off-by-one errors

CSets solves the first of the two.

## The core data structure

The question we will answer in this section is how do *automatically* derive a data structure like `Graph`

from a presentation like `SchGraph`

.^{1}

Our approach is to look at the structure like a database (the viewpoint in Databases are categories). Each object in the category corresponds to a table, and each morphism out of that object corresponds to a column in that table. Therefore, for a graph, we have two tables. The table for vertices has no columns, and the table for edges has two columns. To store the tables, we turn to the StructArrays.jl package.^{2} For instance, the table for edges is stored using the type

This type of data structure should be familiar to most data scientists. In Python, it is what is provided by pandas, and in R it is called a data frame. Logically, it is a list of records, however it is actually implemented by a list of columns, each column holding one field of the record.

All of the tables for the objects are stored in a NamedTuple with keys of the names of the objects. So, the entire type that stores that data of a graph looks like

```
NamedTuple{(:V,:E),Tuple{
StructVector{NamedTuple{(),Tuple{}}},
StructVector{NamedTuple{(:src,:tgt),Tuple{Int,Int}}}}}
```

The indexes we store in a more straightforward way, just a NamedTuple with a field corresponding to each indexed morphism, so for graphs it would be

I suppose that we could store the indexes in StructArrays grouped by codomain, just like the data is stored in StructArrays grouped by domain, but we haven’t done that yet, for no particular reason.

To actually implement this, we use the following data type (note: this does not appear in Catlab because in Catlab we also deal with attributes).

```
struct CSet{CD <: CatDesc, Indexed, TablesType <: NamedTuple, IndexType <: NamedTuple}
tables::TablesType
index::IndexType
function CSet{CD, Indexed}()
tables = make_empty_tables(CD)
index = make_empty_index(CD,Indexed)
CSet{CD,Indexed,typeof(tables),typeof(index)}(tables,index)
end
end
```

The idea is that `TablesType`

and `IndexType`

are uniquely specified by `CD`

and `Indexed`

, however it is not possible to do sophisticated enough type wrangling to write this into the struct, so instead the types are generated whenever we construct a new CSet. The types of TablesType and IndexType are in the type of a concrete CSet so that when the compiler encounters a function that handles CSets, it has all the type information it needs to compile the function to something very fast.

But what is `CatDesc`

? In order to parameterize `CSet`

by a (presentation of a) category, we need a parameter that represents a (presentation of a) category. One choice for this would be a struct that looked like this:

In this structure, the domain and codomain of `hom[i]`

are `ob[dom[i]]`

and `ob[codom[i]]`

, respectively. However, in Julia not all values can parameterize types. For reasons I do not quite understand, a tuple of ints or a tuple of symbols can parameterize a type, but a tuple of tuples cannot, and anything with vectors is Right Out (this at least makes sense because vectors are mutable). One approach would be to have a separate parameter in `CSet`

for tuples of `ob`

, `hom`

, `dom`

, and `codom`

, and indeed the first iteration of C-sets did exactly that. But with some cleverness, we can fit that into single parameter.

We do this because Julia allows types to be parameterized by other types, so while Julia rejects

it accepts

We can even recover the behavior of the first example by overloading `getproperty`

:

```
function Base.getproperty(T::Type{CatDesc{Ob,Hom,Dom,Codom}},key::Symbol)
where {Ob,Hom,Dom,Codom}
@match key begin
:ob => Ob
:hom => Hom
:dom => Dom
:codom => Codom
_ => getfield(T,key)
end
end
```

Then in functions that take in a `CSet`

, we can inspect the category \mathsf{C} using those properties, as in the following example

```
function print_tables(cs::CSet{CD}) where {CD}
for ob in CD.ob
println("$ob = ")
println(cs.tables[ob])
end
end
```

Now that you have the definition of `CatDesc`

, we leave it as an exercise to the reader to implement `make_empty_tables`

and `make_empty_index`

. Or you could look at the source code for Catlab!

## Operations on `CSet`

s

In this section, we will give an example of how we write generic code to operate on C-sets.

The basic operation of modifying a `CSet`

is adding data to it. To be more precise, let A be an object of \mathsf{C}, and let F : C \to \mathsf{FinSet}. We call an element of F(A) a *part*. In order to add a part to F(A), we must also modify the definition of F(g) for all maps g with domain A. In other words, we must add an entire row to the table in slot A of F. If x \in F(A), we call F(g)(x) a *subpart* of x; in the data structure, the subparts are the elements in the rows of the tables. With all this in mind, the function to add a part has the following signature (or at least, this is one of the signatures, more signatures are available in the actual implementation).

To call this on a graph, we might invoke it like this:

```
add_edge!(g::Graph,s::Int,t::Int) = add_part!(g, :E, (src=s,tgt=t))
add_vertex!(g::Graph) = add_part!(g, :V, (;))
```

Now, we want to make this as efficient as handwritten functions. Therefore, before we talk about the implementation of `add_part!`

, let’s look at how we might handwrite `add_edge!`

and `add_vertex!`

.

```
function add_edge!(g::Graph,s::Int,t::Int)
# make sure that the source and target exist
@assert 1 <= s <= length(g.tables.V) && 1 <= t <= length(g.tables.V)
push!(g.tables.E, (src=s,tgt=t))
part_num = length(g.tables.E)
# The indices are kept sorted for easy access
insertsorted!(g.index.src[s],part_num)
insertsorted!(g.index.tgt[t],part_num)
return part_num
end
function add_vertex!(g::Graph)
push!(g.tables.V, (;))
part_num = length(g.tables.V)
push!(g.index.src,[])
push!(g.index.tgt,[])
return part_num
end
```

That wasn’t too bad, but there are some subtleties that could easily trip you up. For instance, in `add_vertex!`

you have to add new empty arrays to the index, so that when you add edges, the edges can be added to those arrays. Also, we want to additionally have functions that add many edges at once, or many vertices, and those come with more complexity. And in real library code (Graphs.jl) there are even more things that must be dealt with. There are clear benefits to writing this once, generically.

On the other hand, it’s unclear how we could write generic code that approaches the performance of this handwritten code. We would need to loop over the morphisms, and execute conditionals to check whether morphisms were indexed. While we might get the same asymptotic complexity, this would be a significant performance cost.

Luckily, Julia provides us a mechanism to have the best of both worlds: the performance of the hand-written function written in a generic way. To do this, we use an `@generated`

function to automatically generate different code for each type that a function could be called with. This is similar to a macro, but while a macro only has access to the syntax of its arguments, a `@generated`

function has access to the *type* of its arguments. This is best illustrated by an example, but also if you have not encountered macros in Julia, I encourage you to read up a little on them, because many of the techniques carry over to generated functions.

```
@generated function unrolled_addto!(x::StaticVector{T,n},y::StaticVector{T,n}) where {T,n}
code = Expr(:block)
for i in 1:n
push!(code.args, :(x[$i] += y[$i]))
end
code
end
```

`StaticVector{T,n}`

is a vector with element type `T`

and length `n`

. If `x::StaticVector{Int,3}`

and `y::StaticVector{Int,3}`

, then calling `unrolled_addto!(x,y)`

is equivalent to running

To summarize, an `@generated`

function generates code at compile-time (or in Julia terms, type-dispatch time) based on the types of its arguments, and then that code is run at run-time. The code is then cached, so the next time the function is called with those types, it just runs the compiled code immediately.

Now we will define `add_part!`

. First of all, there is a trick that we use to get access to the object that we are adding a part to:

```
add_part!(cs::CSet,ob::Symbol,subparts::NamedTuple) = _add_part!(cs,Val(ob),subparts)
@generated function _add_part!(cs::CSet{CD,Indexed},::Val{ob},subparts::NamedTuple{columns}) where
{CD, Indexed, ob, columns}
...
end
```

Val is a useful little type with definition

`_add_part!`

itself has 5 parts.

- We check that the subparts exist.
- We add the new part with its subparts to the right table.
- We add new empty indices for all of the indexed morphisms with codomain of the type of the part.
- We add the new part to the indices for all of the indexed morphisms with domain of the type of the part.
- We return the number of the new part.

Without further ado, we implement `_add_part!`

. We use some functions which do not exist, their definition is left to the reader.

```
@generated function _add_part!(cs::CSet{CD,Indexed},::Val{ob},subparts::NamedTuple{columns}) where
{CD, Indexed, ob, columns}
# Make sure that we have all the subparts
@assert get_outgoing_morphisms(CD,ob) == columns
code = Expr(:block)
# Part 1
for hom in columns
push!(code.args, :(@assert 1 <= subparts.$hom <= length(cs.tables.$(codom(CD,hom)))))
end
# Part 2
push!(code.args, :(push!(cs.tables.$ob, subparts)))
push!(code.args, :(part_num = length(cs.tables.$ob)))
# Part 3
for hom in CD.hom
if codom(CD,hom) == ob && hom ∈ Indexed
push!(code.args, :(push!(cs.index.$hom,[])))
end
end
# Part 4
for hom in columns
if hom ∈ Indexed
push!(code.args, :(insertsorted!(cs.index.$hom,part_num)))
end
end
# Part 5
push!(code.args, :(return part_num))
code
end
```

This generates code very similar to the handwritten code that we wrote above, automatically. We use this strategry to implement almost all of the generic operations on `CSet`

s, and this is what makes `CSet`

s competitive with handwritten data structures.

## Summary

We have illustrated how the theory of C-sets can be implemented in a concrete data structure, and then manipulated both generically and efficiently with `@generated`

functions. We hope that the reader has gained an appreciation for the advantages of the C-set data structure, and also more generally for how `@generated`

code brings new tools to the table. In future posts, we will talk about how we add attributes to C-sets, and what that means both categorically and technically. To give a teaser, attributes allow us to talk about *decorated* graphs, where each edge has, for instance, an associated real number. So stay tuned!

## Footnotes

In Catlab, C-sets come in two flavors: those with attributes, and those without attributes. Instead of categories, attributed C-sets are based off of things called “schemas”. In later posts, we will discuss attributes, so this is not relevant now; we mention it so that the reader will not be caught off-guard while reading the Catlab source code.↩︎

One problem with

`StructVector`

s is that they do not support structs with no fields. So, for instance, the table for vertices,`StructVector{NamedTuple{(),Tuple{}}}`

does not work. From the point of view of`StructArrays.jl`

, this is perfectly reasonable: there was no reason for them to expect that zero-sized tuples would be a desirable thing to have. In any case, we get around this in Catlab with a bit of a hack, but for the purposes of this blog post we will assume that`StructVectors`

s support empty tuples.↩︎