C-sets for data analysis: relational data and conjunctive queries

databases
c-sets
attributed-c-sets
Not just useful for graphs, C-sets are a general-purpose tool for data analysis offering the functionality of an in-memory relational database. In this post, we illustrate Catlab’s new capabilities for querying C-sets and we explain the categorical underpinnings of conjunctive queries.
Author

Evan Patterson

Published

December 28, 2020

So far on the blog we have motivated \mathsf{C}-sets mainly as a unifying abstraction that encompasses graphs, wiring diagrams, Petri nets, and other graph-like objects. But what makes attributed \mathsf{C}-sets so useful is that they are a joint generalization of two essential data structures in computer science: graphs, as we have seen, but also data frames, a mainstay in any modern environment for data analysis.

A data frame represents a table of data as a set of named columns, where different columns may have different data types. Here is an excerpt from a data frame with four columns, provided by the Sitka dataset in the R package MASS.

395×4 DataFrame
370 rows omitted
Row size time tree treat
Float64 Int64 Int64 Cat…
1 4.51 152 1 ozone
2 4.98 174 1 ozone
3 5.41 201 1 ozone
4 5.9 227 1 ozone
5 6.15 258 1 ozone
6 4.24 152 2 ozone
7 4.2 174 2 ozone
8 4.68 201 2 ozone
9 4.92 227 2 ozone
10 4.96 258 2 ozone
11 3.98 152 3 ozone
12 4.36 174 3 ozone
13 4.79 201 3 ozone
384 4.28 227 77 control
385 4.54 258 77 control
386 4.5 152 78 control
387 4.8 174 78 control
388 5.28 201 78 control
389 5.83 227 78 control
390 6.16 258 78 control
391 2.99 152 79 control
392 3.61 174 79 control
393 4.48 201 79 control
394 4.91 227 79 control
395 5.06 258 79 control

Mathematically, a data frame is a multispan of sets whose apex is a finite set. A multispan, in turn, is a span that may have a number of legs different than two. The Sitka data frame is a multispan with four legs:

The apex is the set [n] := \{1,\dots,n\} tabulating the n = 395 rows of the data frame. The first leg of the multispan, the function \mathrm{size}: [n] \to \mathbb{R}, encodes the first column of the table, a vector in \mathbb{R}^n.

Although data frames are useful in isolation, many datasets have a relational structure that is better expressed by a collection of interlinked tables than by a single table. Arguably that is true even of this simple example. The Sitka dataset is longitudinal data, consisting of repeated measurements over time of the sizes of Sitka spruce trees, some of which were grown in ozone-enriched chambers (ozone) and others not (control). Thus, the data has a two-level structure: there is a set of trees and, for each tree, a set of measurements. The two types of units, trees and measurements, should be recorded in separate tables, where each measurement is linked to the tree that it measures. Graphically, the schema \mathsf{C} is:

Let’s use Catlab to set up the corresponding attributed \mathsf{C}-set. The complete schema, including data attributes, is:

Code
@present SchSitka(FreeSchema) begin
  (Numerical, Time, Treatment)::AttrType

  Tree::Ob
  treat::Attr(Tree, Treatment)
  
  Measurement::Ob
  tree::Hom(Measurement, Tree)
  size::Attr(Measurement, Numerical)
  time::Attr(Measurement, Time)
end

@acset_type SitkaData(SchSitka, index=[:tree])

Notice that the assigned treatment (treat) is properly an attribute of the trees, not of the measurements as the original data frame would suggest. Having defined the schema and corresponding Julia type, we load the data from the data frame (sitka_df) into the attributed \mathsf{C}-set (sitka):

Code
sitka = SitkaData{Float64,Int,String}()

# Add trees and measurements, including data attributes of the latter.
trees = unique!(sort(sitka_df.tree))
@assert trees == 1:last(trees)
add_parts!(sitka, :Tree, length(trees))
add_parts!(sitka, :Measurement, nrow(sitka_df),
           tree=sitka_df.tree, size=sitka_df.size, time=sitka_df.time)

# Assign treatment to each tree, checking for data consistency.
for tree in trees
  measurements = incident(sitka, tree, :tree)
  sitka[tree, :treat] = only(unique(sitka_df.treat[measurements]))
end

import Tables
map(Tables.rowcount, tables(sitka))
(Tree = 79, Measurement = 395)

The resulting tables, excerpts of which are shown below, comprise 395 measurements spread across 79 trees.

Code
pretty_tables(sitka, crop=:both, display_size=(15,80))
┌──────┬───────┐
│ Tree │ treat │
├──────┼───────┤
│    1 │ ozone │
│    2 │ ozone │
│    3 │ ozone │
│    4 │ ozone │
│    5 │ ozone │
│    6 │ ozone │
│    7 │ ozone │
│  ⋮   │   ⋮   │
└──────┴───────┘
 72 rows omitted
┌─────────────┬──────┬──────┬──────┐
│ Measurement │ tree │ size │ time │
├─────────────┼──────┼──────┼──────┤
│           1 │    1 │ 4.51 │  152 │
│           2 │    1 │ 4.98 │  174 │
│           3 │    1 │ 5.41 │  201 │
│           4 │    1 │  5.9 │  227 │
│           5 │    1 │ 6.15 │  258 │
│           6 │    2 │ 4.24 │  152 │
│           7 │    2 │  4.2 │  174 │
│      ⋮      │  ⋮   │  ⋮   │  ⋮   │
└─────────────┴──────┴──────┴──────┘
                    388 rows omitted

The data munging was simple enough, with one qualification. In order to consistently set the treat attribute of the trees, we had to verify that each tree is actually assigned a unique treatment value in the original table. Nothing about the structure of the table guarantees that. This subtlety illustrates a key strength of the relational data model: even in the absence of equational axioms, the schema can enforce logical constraints that a single table cannot, eliminating redundancy and reducing the risk of transcription errors.

Of course, flattened views of relational data are often needed for particular data analyses. These views are easily obtained by making queries against the \mathsf{C}-set. For example, the following query recovers the original data frame.

Code
select_all = @relation (size=size, time=time, tree=t, treat=treat) begin
  Tree(_id=t, treat=treat)
  Measurement(tree=t, size=size, time=time)
end

query(sitka, select_all, table_type=DataFrame) == sitka_df
true

The query joins the Tree and Measurement tables along their respective trees, carrying along the data attributes of both tables.

Readers familiar with relational databases will recognize the foregoing ideas. Relational data is ubiquitous, and relational databases, in the form of SQL databases like MySQL and SQLite, constitute a large industry. The mathematics of relational databases as \mathsf{C}-sets or similar categorical structures is also now well established.1 Despite this, relational databases play only a minor role in the data analysis environments offered by languages like Python, R, and Julia. A typical analysis might begin with a query against a SQL database, returning a single large table, but will then proceed by ad hoc manipulation of data frames. If a query involving multiple tables (or multiple copies of the same table) is needed, it is carried out manually through a sequence of joins. With relational databases, by contrast, queries are specified declaratively and the database system does the work of translating the query into an efficient algorithm.

Important though they are, data frames are the lowest common denominator of data science tooling. We can see no reason for this state of affairs besides tradition and the tendency of SQL databases to be siloed from other systems. We hope that \mathsf{C}-sets will be a useful tool for data analysis, serving as an in-memory relational database that interoperates smoothly with the JuliaData ecoystem. Indeed, the \mathsf{C}-sets in Catlab do not compete with this ecosystem but rather depend on it: the tables of a \mathsf{C}-set are provided by TypedTables.jl and may be exchanged for any other columnar table compliant with Tables.jl.

In the remainder of this post, we explain how conjunctive queries work and illustrate them on a dataset with more complex relational structure than the Sitka data. We then explain the categorical underpinnings of conjunctive queries, particularly the close connection between joins in relational databases and finite limits in the category of sets.

Conjunctive queries on C-sets

The mutagenesis dataset, available through the Relational Dataset Repository, is a collection of molecular graphs from a study on the mutagenicity of certain chemical compounds. The dataset has a three-level relational structure: there is a set of molecules, each of which has a set of atoms, which may be connected pairwise by chemical bonds.

The complete schema includes data attributes for each of the three types.

Code
@present SchMutagenesis(FreeSchema) begin
  (Indicator, Categorical, Name, Numerical)::AttrType
  
  Molecule::Ob
  molecule_id::Attr(Molecule, Name)
  (ind1, inda)::Attr(Molecule, Indicator)
  (logp, lumo)::Attr(Molecule, Numerical)
  mutagenic::Attr(Molecule, Indicator)
  
  Atom::Ob
  molecule::Hom(Atom, Molecule)
  atom_id::Attr(Atom, Name)
  atom_type::Attr(Atom, Categorical)
  element::Attr(Atom, Name)
  charge::Attr(Atom, Numerical)
  
  Bond::Ob
  (atom1, atom2)::Hom(Bond, Atom)
  bond_type::Attr(Bond, Categorical)
end

@acset_type MutagenesisData(SchMutagenesis,
  index=[:atom1, :atom2, :molecule],
  unique_index=[:atom_id, :molecule_id])

The attributes molecule_id and atom_id do not appear to be meaningful but are carried over from the primary keys in the original SQL database.

The data is loaded into Catlab by making three SELECT * from [table]; queries on the SQL database and mapping the primary keys to integer IDs. We omit this routine code, which should in any case be automated in the future. The result is an ACSet mutagenesis having the following dimensions.

Code
map(Tables.rowcount, tables(mutagenesis))
(Molecule = 188, Atom = 4893, Bond = 5243)

Here are excerpts from each of the tables:

Code
pretty_tables(mutagenesis, crop=:both, display_size=(15,80))
┌──────────┬─────────────┬───────┬───────┬──────┬────────┬───────────┐
│ Molecule │ molecule_id │  ind1 │  inda │ logp │   lumo │ mutagenic │
├──────────┼─────────────┼───────┼───────┼──────┼────────┼───────────┤
│        1 │          d1 │  true │ false │ 4.23 │ -1.246 │      true │
│        2 │         d10 │  true │ false │ 4.62 │ -1.387 │      true │
│        3 │        d100 │ false │ false │ 2.68 │ -1.034 │     false │
│        4 │        d101 │  true │ false │ 6.26 │ -1.598 │      true │
│        5 │        d102 │  true │ false │  2.4 │ -3.172 │      true │
│        6 │        d103 │  true │ false │ 4.69 │ -1.487 │      true │
│        7 │        d104 │  true │ false │ 6.26 │ -1.546 │      true │
│    ⋮     │      ⋮      │   ⋮   │   ⋮   │  ⋮   │   ⋮    │     ⋮     │
└──────────┴─────────────┴───────┴───────┴──────┴────────┴───────────┘
                                                      181 rows omitted
┌──────┬──────────┬─────────┬───────────┬─────────┬────────┐
│ Atom │ molecule │ atom_id │ atom_type │ element │ charge │
├──────┼──────────┼─────────┼───────────┼─────────┼────────┤
│    1 │        3 │  d100_1 │        22 │       c │ -0.128 │
│    2 │        3 │ d100_10 │         3 │       h │  0.132 │
│    3 │        3 │ d100_11 │        29 │       c │  0.002 │
│    4 │        3 │ d100_12 │        22 │       c │ -0.128 │
│    5 │        3 │ d100_13 │        22 │       c │ -0.128 │
│    6 │        3 │ d100_14 │        22 │       c │ -0.128 │
│    7 │        3 │ d100_15 │        22 │       c │  0.202 │
│  ⋮   │    ⋮     │    ⋮    │     ⋮     │    ⋮    │   ⋮    │
└──────┴──────────┴─────────┴───────────┴─────────┴────────┘
                                           4886 rows omitted
┌──────┬───────┬───────┬───────────┐
│ Bond │ atom1 │ atom2 │ bond_type │
├──────┼───────┼───────┼───────────┤
│    1 │     1 │    12 │         7 │
│    2 │     1 │    24 │         1 │
│    3 │     3 │     4 │         7 │
│    4 │     4 │     5 │         7 │
│    5 │     4 │     9 │         1 │
│    6 │     5 │     6 │         7 │
│    7 │     5 │    10 │         1 │
│  ⋮   │   ⋮   │   ⋮   │     ⋮     │
└──────┴───────┴───────┴───────────┘
                   5236 rows omitted

We are now going to make queries against the mutagenesis data. Generally speaking, Catlab supports evaluating conjunctive queries on attributed \mathsf{C}-sets. Although more restrictive than arbitrary SQL queries, conjunctive queries play in a central role in the theory and practice of relational databases. They are precisely the queries that can be expressed in regular logic, the subsystem of first-order logic with connectives \exists, \wedge, \top. Alternatively, as we will see later, they are the queries that can be computed as finite limits in \mathsf{Set}.

In Catlab, conjunctive queries are most easily specified using the @relation macro, which has the general form:

@relation (out1=var1, out2=var2, ) [where (x,y,z,)] begin
  T₁(col1=var3, col2=var4, )
  
  Tₙ(col1=var5, col2=var6, )
end

Every relation statement in the macro body specifies a table T_i and a subset of its columns, with each column assigned to a variable. To evaluate the query, you take the cartesian product of the selected tables T_1, \dots, T_n, then filter the rows so that any two columns assigned to the same variable are equal. Finally, you return a table that exposes a subset of the variables, given by the output names in the first clause. The full set of variables is optionally stated in the where clause; if it is omitted, the set of variables is inferred.

Conjunctive queries are more intuitive than this description may make them appear. Returning to the mutagenesis data, we might assume that molecular graphs are undirected, so that for every for bond between atoms a_1 and a_2, there is also a bond between atoms a_2 and a_1. We can check this assumption using the following query.

Code
symmetric_pairs = @relation (atom1=a1, atom2=a2) begin
  Bond(atom1=a1, atom2=a2)
  Bond(atom1=a2, atom2=a1)
end

nrow(query(mutagenesis, symmetric_pairs))
0

In fact, no bond in this dataset has a corresponding bond in the opposite direction. The creators of the data presumably left the oppositely oriented bonds implicit.

A more fundamental constraint is that any two bonded atoms must belong to the same molecule.

Code
bond_within_molecule = @relation (Bond=b) begin
  Bond(_id=b, atom1=a1, atom2=a2)
  Atom(_id=a1, molecule=m)
  Atom(_id=a2, molecule=m)
end

result = query(mutagenesis, bond_within_molecule)
nrow(result) == nparts(mutagenesis, :Bond)
true

At least this assumption is satisfied by the data! The schema presented earlier has no equations, but we could have added an equation explitictly stating the assumption:

@present SchMutagenesis(FreeSchema) begin
  [...]
  
  compose(atom1, molecule) == compose(atom2, molecule)
end

By default, the variables in a query may take any values, but variables can be fixed to specific values using the optional third argument to query. For example, the following query returns all carbon atoms belonging to a mutagenic molecule, together with the charges of the atoms.

Code
atoms_in_molecule = @relation (Atom=a, molecule=m, charge=charge) begin
  Atom(_id=a, element=element, charge=charge, molecule=m)
  Molecule(_id=m, mutagenic=mutagenic)
end

query(mutagenesis, atoms_in_molecule, (element="c", mutagenic=true))
1833×3 DataFrame
1808 rows omitted
Row Atom molecule charge
Int64 Int64 Float64
1 27 4 -0.121
2 28 4 -0.091
3 29 4 0.009
4 30 4 0.009
5 31 4 0.009
6 32 4 -0.121
7 33 4 -0.121
8 34 4 -0.122
9 35 4 -0.091
10 36 4 -0.091
11 37 4 -0.091
12 38 4 -0.121
13 39 4 -0.091
1822 4855 188 -0.116
1823 4856 188 -0.086
1824 4857 188 0.014
1825 4858 188 -0.086
1826 4866 188 -0.116
1827 4870 188 0.014
1828 4871 188 -0.086
1829 4872 188 -0.086
1830 4873 188 -0.116
1831 4874 188 -0.116
1832 4875 188 -0.116
1833 4876 188 -0.116

The mutagenesis data is essentially a collection of graphs, one for each molecule, where the vertices are atoms and the edges are bonds. The conventional way to process such data would be to convert it into a graph, using a specialized data structure for directed graphs. In Catlab, this conversion is easily achieved by functorial data migration in its basic, contravariant form:

Code
using Catlab.Graphs

g = Graph(mutagenesis, Dict(:V => :Atom, :E => :Bond),
                       Dict(:src => :atom1, :tgt => :atom2))
(V = nv(g), E = ne(g))
(V = 4893, E = 5243)

The graph’s connected components correspond to the molecules. An advantage of the graph representation is that you can directly apply any graph analytics procedure that expects a graph as input. However, in contrast to other systems, in Catlab there are no further advantages to performing this conversion. Traversing the derived graph will be no faster or slower than traversing the MutagenesisData object, because graphs in Catlab are \mathsf{C}-sets isomorphic to the atom-bond substructure of the mutagenesis structure. This what we meant at the outset when we said that attributed \mathsf{C}-sets are a joint generalization of graphs and data frames.

The unification of graphs with relational data cuts in both directions. On the one hand, since graphs in Catlab are \mathsf{C}-sets, we can efficiently evaluate conjunctive queries against them. As a simple example, here are all the directed paths of length two in the derived graph:

Code
paths2 = @relation (vert1=v1, vert2=v2, vert3=v3) begin
  E(src=v1, tgt=v2)
  E(src=v2, tgt=v3)
end

query(g, paths2)
5539×3 DataFrame
5514 rows omitted
Row vert1 vert2 vert3
Int64 Int64 Int64
1 1 12 20
2 1 12 25
3 3 4 5
4 3 4 9
5 4 5 6
6 4 5 10
7 5 6 7
8 5 6 11
9 6 7 8
10 6 7 14
11 7 8 3
12 7 8 13
13 7 14 18
5528 4888 4889 4890
5529 4888 4891 4892
5530 4889 4890 4877
5531 4889 4890 4878
5532 4890 4877 4881
5533 4890 4877 4886
5534 4890 4878 4879
5535 4891 4892 4884
5536 4891 4892 4893
5537 4892 4893 4885
5538 4892 4893 4887
5539 4893 4887 4888

On the other hand, we can formulate richer queries against the mutagenesis data, where more data attributes are available. The following query refines the previous one to find all directed paths from a carbon atom to a nitrogen atom to another carbon atom. The resulting table is much smaller.

Code
paths2 = @relation (atom1=a1, atom2=a2, atom3=a3) begin
  Bond(atom1=a1, atom2=a2)
  Bond(atom1=a2, atom2=a3)
  Atom(_id=a1, element=elem1)
  Atom(_id=a2, element=elem2)
  Atom(_id=a3, element=elem3)
end

query(mutagenesis, paths2, (elem1="c", elem2="n", elem3="c"))
41×3 DataFrame
16 rows omitted
Row atom1 atom2 atom3
Int64 Int64 Int64
1 163 164 165
2 169 170 163
3 620 624 608
4 768 769 787
5 786 766 767
6 1105 1104 1086
7 1315 1319 1320
8 1316 1321 1320
9 1347 1348 1366
10 1365 1345 1346
11 1403 1406 1407
12 1407 1408 1392
13 1407 1408 1412
30 3113 3102 3103
31 3185 3184 3166
32 3362 3368 3378
33 3377 3366 3357
34 3595 3601 3608
35 3607 3600 3590
36 4029 4019 4020
37 4398 4385 4402
38 4546 4550 4534
39 4837 4838 4839
40 4888 4891 4892
41 4892 4893 4887

Conjunctive queries as limits

The theory of relational databases is category theory. This statement is true in many ways and from many perspectives, but we will focus on just one: the equivalence between conjunctive queries on relational databases and finite limits in the category of sets. This equivalence is almost immediate if you are familiar with the relevant concepts; it is certainly not new (Kato 1983; Alagic and Alagic 1991). Nevertheless, it is a fruitful observation for both category theory and relational databases. In one direction, conjunctive queries inherit an intuitive diagrammatic syntax through the operad of undirected wiring diagrams. Conversely, any general-purpose package for computational category theory, which Catlab aims to be, ought to be able to efficiently compute finite limits of sets. Algorithms for doing so have been extensively developed by database theorists and engineers.

Here is a dictionary between conjunctive queries of different kinds and finite limits in \mathsf{Set} of different shapes.

  • cross joins of tables X and Y are products

  • joins2 along attributes a_X: X \to A and a_Y: Y \to A are pullbacks

  • multi-way joins along attributes a_i: X_i \to A, for i=1,\dots,n, are wide pullbacks

  • general conjunctive queries are limits of finite bipartite diagrams

The last statement, about conjunctive queries and bipartite diagrams, requires elaboration. We will break it down into two steps. First, conjunctive queries can be represented by undirected wiring diagrams (UWDs); in Catlab, they actually are UWDs.3 Second, the UWD translates into a bipartite diagram, whose limit is the result of the query.

As an illustration, recall this conjunctive query on the mutagenesis data.

Code
bond_within_molecule = @relation (Bond=b) begin
  Bond(_id=b, atom1=a1, atom2=a2)
  Atom(_id=a1, molecule=m)
  Atom(_id=a2, molecule=m)
end

bond_within_molecule isa UndirectedWiringDiagram
true

The tables in the query correspond to boxes (unfilled circles) in the UWD and the variables correspond to junctions (filled circles).

We won’t say much more about UWDs here. For additional examples of diagrammatic query syntax, see the related earlier post by Andrew Baas on UWDs and SQL queries.

In the second step, the UWD translates into a bipartite diagram in \mathsf{Set}, where the boxes of the UWD become vertices in the diagram’s top layer and the junctions become vertices in the bottom layer.

The result R of the conjunctive query is then the limit of this bipartite diagram, which can be thought of as a “generalized pullback.”

In technical terms, isomorphism classes of data frames (multispans) constitute an algebra of the operad of UWDs, where composition is given by limits of the induced bipartite diagrams.

All this is to say what a conjunctive query is but not how to compute it. There are many ways to evaluate a conjunctive query. The simplest, sketched in the previous section, is to take the cartesian product of the sets in the top layer of the bipartite diagram, then filter the tuples, keeping only those that project to equal values under the functions over any given set in the bottom layer. In a special case, this procedure is known to category theorists as “computing a pullback by taking an equalizer of a product.” However, it is probably the worst possible algorithm for computing a limit, generally being extremely inefficient.

The inefficiency arises from having to iterate over the elements of a cartesian product X_1 \times \cdots \times X_n, whose cardinality grows exponentially in n. To avoid this, the conjunctive query should be decomposed into a sequence of joins, a problem known as query planning. Catlab currently employs a simple greedy heuristic for choosing joins.

Joins, or equivalently pullbacks, may themselves be computed in many ways, of which the most common are (Mishra and Eich 1992):

  1. Nested loop join: the naive algorithm that iterates over all pairs of elements
  2. Sort-merge join: both vectors of attributes are sorted, then merged together by synchronized iteration
  3. Hash join: a hash map (index of preimages) is built for one attribute, then the other attribute vector is iterated, looking up preimages for each element

All three join algorithms are implemented in Catlab, for two-way and multi-way joins. Both the sort-merge join and the hash join are generally much faster than the nested loop join. The hash join performs especially well when an attribute is already indexed in the attributed \mathsf{C}-set, as happens for the source and target functions in a graph.

Outlook

Despite its simplistic query planning, Catlab’s conjunctive queries on \mathsf{C}-sets are already fairly usable. Over time we will implement better query planning algorithms, as well as more general kinds of queries. Possibilities include unions of conjunctive queries, the class of queries expressible in coherent logic, and recursive queries, as in Datalog.

As we have seen, relational databases enjoy two key advantages over isolated data frames. First, relational schemas enable the data’s encoding to mirror its underlying relational structure, which eliminates redundancy in storage and promotes internal consistency. Second, declarative query languages make it easy to ask questions about the data and view it through different lenses. On the other hand, SQL databases and logic programming languages like Prolog and Datalog have generally suffered from the Achilles’ heel of all domain-specific languages (DSLs): you cannot step outside the DSL without abandoning the system entirely. Yet stepping outside the DSL is often necessary, since even the most expressive query languages have their limitations.

We think that \mathsf{C}-sets can offer the best of both worlds to data analysis, providing the benefits of the relational data model while existing as ordinary data structures in a general-purpose programming language. If you cannot express your query using the @relation macro, you can always just write the for loops or the map calls! Due to the design of the Julia programming language and the \mathsf{C}-sets in Catlab, doing so should cause no performance degradations. In this way, we hope that \mathsf{C}-sets will eventually become a valuable component of the JuliaData ecosystem.

References

Alagic, Suad, and Mara Alagic. 1991. “Joins as Pullbacks.” In Third Workshop on Foundations of Models and Languages for Data and Objects, Aigen, Austria, 23.-27. September 1991, edited by Jutta Göers, Andreas Heuer, and Gunter Saake, 91/3:197–207. Informatik-Berichte Des IfI. Technische Universität Clausthal.
Kato, Akihiko. 1983. AN ABSTRACT RELATIONAL MODEL AND NATURAL JOIN FUNCTORS .” Bulletin of Informatics and Cybernetics 20 (3/4): 95–106. https://doi.org/10.5109/13349.
Mishra, Priti, and Margaret H. Eich. 1992. “Join Processing in Relational Databases.” ACM Computing Surveys 24 (1): 63–113. https://doi.org/10.1145/128762.128764.
Schultz, Patrick, David I. Spivak, Christina Vasilakopoulou, and Ryan Wisnesky. 2016. “Algebraic Databases.”
Spivak, David I. 2014. Category Theory for the Sciences. The MIT Press. London, England: MIT Press.
Spivak, David I. 2012. “Functorial Data Migration.” Information and Computation 217 (August): 31–51. https://doi.org/10.1016/j.ic.2012.05.001.

Footnotes

  1. See the line of research by David Spivak and collaborators (David I. Spivak 2012; David I. Spivak 2014; Schultz et al. 2016) and also the earlier blog post by Owen Lynch on the mathematics of attributed \mathsf{C}-sets.↩︎

  2. In this post, by “join” we always mean both “inner join” and “equi-join.” Catlab does not yet support outer joins or joins with predicates other than equality.↩︎

  3. And UWDs are, in turn, \mathsf{C}-sets for a schema \mathsf{C} with tables Box, Junction, Port, and OuterPort. This is a circularity but not a vicious one. It does imply that we could evaluate conjunctive queries on conjunctive queries (!), for whatever that is worth.↩︎