Row | size | time | tree | treat |
---|---|---|---|---|
Float64 | Int64 | Int64 | Cat… | |
1 | 4.51 | 152 | 1 | ozone |
2 | 4.98 | 174 | 1 | ozone |
3 | 5.41 | 201 | 1 | ozone |
4 | 5.9 | 227 | 1 | ozone |
5 | 6.15 | 258 | 1 | ozone |
6 | 4.24 | 152 | 2 | ozone |
7 | 4.2 | 174 | 2 | ozone |
8 | 4.68 | 201 | 2 | ozone |
9 | 4.92 | 227 | 2 | ozone |
10 | 4.96 | 258 | 2 | ozone |
11 | 3.98 | 152 | 3 | ozone |
12 | 4.36 | 174 | 3 | ozone |
13 | 4.79 | 201 | 3 | ozone |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
384 | 4.28 | 227 | 77 | control |
385 | 4.54 | 258 | 77 | control |
386 | 4.5 | 152 | 78 | control |
387 | 4.8 | 174 | 78 | control |
388 | 5.28 | 201 | 78 | control |
389 | 5.83 | 227 | 78 | control |
390 | 6.16 | 258 | 78 | control |
391 | 2.99 | 152 | 79 | control |
392 | 3.61 | 174 | 79 | control |
393 | 4.48 | 201 | 79 | control |
394 | 4.91 | 227 | 79 | control |
395 | 5.06 | 258 | 79 | control |
C-sets for data analysis: relational data and conjunctive queries
$$
$$
So far on the blog we have motivated \mathsf{C}-sets mainly as a unifying abstraction that encompasses graphs, wiring diagrams, Petri nets, and other graph-like objects. But what makes attributed \mathsf{C}-sets so useful is that they are a joint generalization of two essential data structures in computer science: graphs, as we have seen, but also data frames, a mainstay in any modern environment for data analysis.
A data frame represents a table of data as a set of named columns, where different columns may have different data types. Here is an excerpt from a data frame with four columns, provided by the Sitka
dataset in the R package MASS.
Mathematically, a data frame is a multispan of sets whose apex is a finite set. A multispan, in turn, is a span that may have a number of legs different than two. The Sitka
data frame is a multispan with four legs:
The apex is the set [n] := \{1,\dots,n\} tabulating the n = 395 rows of the data frame. The first leg of the multispan, the function \mathrm{size}: [n] \to \mathbb{R}, encodes the first column of the table, a vector in \mathbb{R}^n.
Although data frames are useful in isolation, many datasets have a relational structure that is better expressed by a collection of interlinked tables than by a single table. Arguably that is true even of this simple example. The Sitka
dataset is longitudinal data, consisting of repeated measurements over time of the sizes of Sitka spruce trees, some of which were grown in ozone-enriched chambers (ozone
) and others not (control
). Thus, the data has a two-level structure: there is a set of trees and, for each tree, a set of measurements. The two types of units, trees and measurements, should be recorded in separate tables, where each measurement is linked to the tree that it measures. Graphically, the schema \mathsf{C} is:
Let’s use Catlab to set up the corresponding attributed \mathsf{C}-set. The complete schema, including data attributes, is:
Notice that the assigned treatment (treat
) is properly an attribute of the trees, not of the measurements as the original data frame would suggest. Having defined the schema and corresponding Julia type, we load the data from the data frame (sitka_df
) into the attributed \mathsf{C}-set (sitka
):
Code
sitka = SitkaData{Float64,Int,String}()
# Add trees and measurements, including data attributes of the latter.
trees = unique!(sort(sitka_df.tree))
@assert trees == 1:last(trees)
add_parts!(sitka, :Tree, length(trees))
add_parts!(sitka, :Measurement, nrow(sitka_df),
tree=sitka_df.tree, size=sitka_df.size, time=sitka_df.time)
# Assign treatment to each tree, checking for data consistency.
for tree in trees
measurements = incident(sitka, tree, :tree)
sitka[tree, :treat] = only(unique(sitka_df.treat[measurements]))
end
import Tables
map(Tables.rowcount, tables(sitka))
(Tree = 79, Measurement = 395)
The resulting tables, excerpts of which are shown below, comprise 395 measurements spread across 79 trees.
┌──────┬───────┐
│ Tree │ treat │
├──────┼───────┤
│ 1 │ ozone │
│ 2 │ ozone │
│ 3 │ ozone │
│ 4 │ ozone │
│ 5 │ ozone │
│ 6 │ ozone │
│ 7 │ ozone │
│ ⋮ │ ⋮ │
└──────┴───────┘
72 rows omitted
┌─────────────┬──────┬──────┬──────┐
│ Measurement │ tree │ size │ time │
├─────────────┼──────┼──────┼──────┤
│ 1 │ 1 │ 4.51 │ 152 │
│ 2 │ 1 │ 4.98 │ 174 │
│ 3 │ 1 │ 5.41 │ 201 │
│ 4 │ 1 │ 5.9 │ 227 │
│ 5 │ 1 │ 6.15 │ 258 │
│ 6 │ 2 │ 4.24 │ 152 │
│ 7 │ 2 │ 4.2 │ 174 │
│ ⋮ │ ⋮ │ ⋮ │ ⋮ │
└─────────────┴──────┴──────┴──────┘
388 rows omitted
The data munging was simple enough, with one qualification. In order to consistently set the treat
attribute of the trees, we had to verify that each tree is actually assigned a unique treatment value in the original table. Nothing about the structure of the table guarantees that. This subtlety illustrates a key strength of the relational data model: even in the absence of equational axioms, the schema can enforce logical constraints that a single table cannot, eliminating redundancy and reducing the risk of transcription errors.
Of course, flattened views of relational data are often needed for particular data analyses. These views are easily obtained by making queries against the \mathsf{C}-set. For example, the following query recovers the original data frame.
Code
true
The query joins the Tree
and Measurement
tables along their respective trees, carrying along the data attributes of both tables.
Readers familiar with relational databases will recognize the foregoing ideas. Relational data is ubiquitous, and relational databases, in the form of SQL databases like MySQL and SQLite, constitute a large industry. The mathematics of relational databases as \mathsf{C}-sets or similar categorical structures is also now well established.1 Despite this, relational databases play only a minor role in the data analysis environments offered by languages like Python, R, and Julia. A typical analysis might begin with a query against a SQL database, returning a single large table, but will then proceed by ad hoc manipulation of data frames. If a query involving multiple tables (or multiple copies of the same table) is needed, it is carried out manually through a sequence of joins. With relational databases, by contrast, queries are specified declaratively and the database system does the work of translating the query into an efficient algorithm.
Important though they are, data frames are the lowest common denominator of data science tooling. We can see no reason for this state of affairs besides tradition and the tendency of SQL databases to be siloed from other systems. We hope that \mathsf{C}-sets will be a useful tool for data analysis, serving as an in-memory relational database that interoperates smoothly with the JuliaData ecoystem. Indeed, the \mathsf{C}-sets in Catlab do not compete with this ecosystem but rather depend on it: the tables of a \mathsf{C}-set are provided by TypedTables.jl and may be exchanged for any other columnar table compliant with Tables.jl.
In the remainder of this post, we explain how conjunctive queries work and illustrate them on a dataset with more complex relational structure than the Sitka data. We then explain the categorical underpinnings of conjunctive queries, particularly the close connection between joins in relational databases and finite limits in the category of sets.
Conjunctive queries on C-sets
The mutagenesis dataset, available through the Relational Dataset Repository, is a collection of molecular graphs from a study on the mutagenicity of certain chemical compounds. The dataset has a three-level relational structure: there is a set of molecules, each of which has a set of atoms, which may be connected pairwise by chemical bonds.
The complete schema includes data attributes for each of the three types.
Code
@present SchMutagenesis(FreeSchema) begin
(Indicator, Categorical, Name, Numerical)::AttrType
Molecule::Ob
molecule_id::Attr(Molecule, Name)
(ind1, inda)::Attr(Molecule, Indicator)
(logp, lumo)::Attr(Molecule, Numerical)
mutagenic::Attr(Molecule, Indicator)
Atom::Ob
molecule::Hom(Atom, Molecule)
atom_id::Attr(Atom, Name)
atom_type::Attr(Atom, Categorical)
element::Attr(Atom, Name)
charge::Attr(Atom, Numerical)
Bond::Ob
(atom1, atom2)::Hom(Bond, Atom)
bond_type::Attr(Bond, Categorical)
end
@acset_type MutagenesisData(SchMutagenesis,
index=[:atom1, :atom2, :molecule],
unique_index=[:atom_id, :molecule_id])
The attributes molecule_id
and atom_id
do not appear to be meaningful but are carried over from the primary keys in the original SQL database.
The data is loaded into Catlab by making three SELECT * from [table];
queries on the SQL database and mapping the primary keys to integer IDs. We omit this routine code, which should in any case be automated in the future. The result is an ACSet mutagenesis
having the following dimensions.
Here are excerpts from each of the tables:
┌──────────┬─────────────┬───────┬───────┬──────┬────────┬───────────┐
│ Molecule │ molecule_id │ ind1 │ inda │ logp │ lumo │ mutagenic │
├──────────┼─────────────┼───────┼───────┼──────┼────────┼───────────┤
│ 1 │ d1 │ true │ false │ 4.23 │ -1.246 │ true │
│ 2 │ d10 │ true │ false │ 4.62 │ -1.387 │ true │
│ 3 │ d100 │ false │ false │ 2.68 │ -1.034 │ false │
│ 4 │ d101 │ true │ false │ 6.26 │ -1.598 │ true │
│ 5 │ d102 │ true │ false │ 2.4 │ -3.172 │ true │
│ 6 │ d103 │ true │ false │ 4.69 │ -1.487 │ true │
│ 7 │ d104 │ true │ false │ 6.26 │ -1.546 │ true │
│ ⋮ │ ⋮ │ ⋮ │ ⋮ │ ⋮ │ ⋮ │ ⋮ │
└──────────┴─────────────┴───────┴───────┴──────┴────────┴───────────┘
181 rows omitted
┌──────┬──────────┬─────────┬───────────┬─────────┬────────┐
│ Atom │ molecule │ atom_id │ atom_type │ element │ charge │
├──────┼──────────┼─────────┼───────────┼─────────┼────────┤
│ 1 │ 3 │ d100_1 │ 22 │ c │ -0.128 │
│ 2 │ 3 │ d100_10 │ 3 │ h │ 0.132 │
│ 3 │ 3 │ d100_11 │ 29 │ c │ 0.002 │
│ 4 │ 3 │ d100_12 │ 22 │ c │ -0.128 │
│ 5 │ 3 │ d100_13 │ 22 │ c │ -0.128 │
│ 6 │ 3 │ d100_14 │ 22 │ c │ -0.128 │
│ 7 │ 3 │ d100_15 │ 22 │ c │ 0.202 │
│ ⋮ │ ⋮ │ ⋮ │ ⋮ │ ⋮ │ ⋮ │
└──────┴──────────┴─────────┴───────────┴─────────┴────────┘
4886 rows omitted
┌──────┬───────┬───────┬───────────┐
│ Bond │ atom1 │ atom2 │ bond_type │
├──────┼───────┼───────┼───────────┤
│ 1 │ 1 │ 12 │ 7 │
│ 2 │ 1 │ 24 │ 1 │
│ 3 │ 3 │ 4 │ 7 │
│ 4 │ 4 │ 5 │ 7 │
│ 5 │ 4 │ 9 │ 1 │
│ 6 │ 5 │ 6 │ 7 │
│ 7 │ 5 │ 10 │ 1 │
│ ⋮ │ ⋮ │ ⋮ │ ⋮ │
└──────┴───────┴───────┴───────────┘
5236 rows omitted
We are now going to make queries against the mutagenesis data. Generally speaking, Catlab supports evaluating conjunctive queries on attributed \mathsf{C}-sets. Although more restrictive than arbitrary SQL queries, conjunctive queries play in a central role in the theory and practice of relational databases. They are precisely the queries that can be expressed in regular logic, the subsystem of first-order logic with connectives \exists, \wedge, \top. Alternatively, as we will see later, they are the queries that can be computed as finite limits in \mathsf{Set}.
In Catlab, conjunctive queries are most easily specified using the @relation
macro, which has the general form:
@relation (out1=var1, out2=var2, …) [where (x,y,z,…)] begin
T₁(col1=var3, col2=var4, …)
⋮
Tₙ(col1=var5, col2=var6, …)
end
Every relation statement in the macro body specifies a table T_i and a subset of its columns, with each column assigned to a variable. To evaluate the query, you take the cartesian product of the selected tables T_1, \dots, T_n, then filter the rows so that any two columns assigned to the same variable are equal. Finally, you return a table that exposes a subset of the variables, given by the output names in the first clause. The full set of variables is optionally stated in the where
clause; if it is omitted, the set of variables is inferred.
Conjunctive queries are more intuitive than this description may make them appear. Returning to the mutagenesis data, we might assume that molecular graphs are undirected, so that for every for bond between atoms a_1 and a_2, there is also a bond between atoms a_2 and a_1. We can check this assumption using the following query.
Code
0
In fact, no bond in this dataset has a corresponding bond in the opposite direction. The creators of the data presumably left the oppositely oriented bonds implicit.
A more fundamental constraint is that any two bonded atoms must belong to the same molecule.
Code
true
At least this assumption is satisfied by the data! The schema presented earlier has no equations, but we could have added an equation explitictly stating the assumption:
@present SchMutagenesis(FreeSchema) begin
[...]
compose(atom1, molecule) == compose(atom2, molecule)
end
By default, the variables in a query may take any values, but variables can be fixed to specific values using the optional third argument to query
. For example, the following query returns all carbon atoms belonging to a mutagenic molecule, together with the charges of the atoms.
Code
Row | Atom | molecule | charge |
---|---|---|---|
Int64 | Int64 | Float64 | |
1 | 27 | 4 | -0.121 |
2 | 28 | 4 | -0.091 |
3 | 29 | 4 | 0.009 |
4 | 30 | 4 | 0.009 |
5 | 31 | 4 | 0.009 |
6 | 32 | 4 | -0.121 |
7 | 33 | 4 | -0.121 |
8 | 34 | 4 | -0.122 |
9 | 35 | 4 | -0.091 |
10 | 36 | 4 | -0.091 |
11 | 37 | 4 | -0.091 |
12 | 38 | 4 | -0.121 |
13 | 39 | 4 | -0.091 |
⋮ | ⋮ | ⋮ | ⋮ |
1822 | 4855 | 188 | -0.116 |
1823 | 4856 | 188 | -0.086 |
1824 | 4857 | 188 | 0.014 |
1825 | 4858 | 188 | -0.086 |
1826 | 4866 | 188 | -0.116 |
1827 | 4870 | 188 | 0.014 |
1828 | 4871 | 188 | -0.086 |
1829 | 4872 | 188 | -0.086 |
1830 | 4873 | 188 | -0.116 |
1831 | 4874 | 188 | -0.116 |
1832 | 4875 | 188 | -0.116 |
1833 | 4876 | 188 | -0.116 |
The mutagenesis data is essentially a collection of graphs, one for each molecule, where the vertices are atoms and the edges are bonds. The conventional way to process such data would be to convert it into a graph, using a specialized data structure for directed graphs. In Catlab, this conversion is easily achieved by functorial data migration in its basic, contravariant form:
Code
(V = 4893, E = 5243)
The graph’s connected components correspond to the molecules. An advantage of the graph representation is that you can directly apply any graph analytics procedure that expects a graph as input. However, in contrast to other systems, in Catlab there are no further advantages to performing this conversion. Traversing the derived graph will be no faster or slower than traversing the MutagenesisData
object, because graphs in Catlab are \mathsf{C}-sets isomorphic to the atom-bond substructure of the mutagenesis structure. This what we meant at the outset when we said that attributed \mathsf{C}-sets are a joint generalization of graphs and data frames.
The unification of graphs with relational data cuts in both directions. On the one hand, since graphs in Catlab are \mathsf{C}-sets, we can efficiently evaluate conjunctive queries against them. As a simple example, here are all the directed paths of length two in the derived graph:
Code
Row | vert1 | vert2 | vert3 |
---|---|---|---|
Int64 | Int64 | Int64 | |
1 | 1 | 12 | 20 |
2 | 1 | 12 | 25 |
3 | 3 | 4 | 5 |
4 | 3 | 4 | 9 |
5 | 4 | 5 | 6 |
6 | 4 | 5 | 10 |
7 | 5 | 6 | 7 |
8 | 5 | 6 | 11 |
9 | 6 | 7 | 8 |
10 | 6 | 7 | 14 |
11 | 7 | 8 | 3 |
12 | 7 | 8 | 13 |
13 | 7 | 14 | 18 |
⋮ | ⋮ | ⋮ | ⋮ |
5528 | 4888 | 4889 | 4890 |
5529 | 4888 | 4891 | 4892 |
5530 | 4889 | 4890 | 4877 |
5531 | 4889 | 4890 | 4878 |
5532 | 4890 | 4877 | 4881 |
5533 | 4890 | 4877 | 4886 |
5534 | 4890 | 4878 | 4879 |
5535 | 4891 | 4892 | 4884 |
5536 | 4891 | 4892 | 4893 |
5537 | 4892 | 4893 | 4885 |
5538 | 4892 | 4893 | 4887 |
5539 | 4893 | 4887 | 4888 |
On the other hand, we can formulate richer queries against the mutagenesis
data, where more data attributes are available. The following query refines the previous one to find all directed paths from a carbon atom to a nitrogen atom to another carbon atom. The resulting table is much smaller.
Code
Row | atom1 | atom2 | atom3 |
---|---|---|---|
Int64 | Int64 | Int64 | |
1 | 163 | 164 | 165 |
2 | 169 | 170 | 163 |
3 | 620 | 624 | 608 |
4 | 768 | 769 | 787 |
5 | 786 | 766 | 767 |
6 | 1105 | 1104 | 1086 |
7 | 1315 | 1319 | 1320 |
8 | 1316 | 1321 | 1320 |
9 | 1347 | 1348 | 1366 |
10 | 1365 | 1345 | 1346 |
11 | 1403 | 1406 | 1407 |
12 | 1407 | 1408 | 1392 |
13 | 1407 | 1408 | 1412 |
⋮ | ⋮ | ⋮ | ⋮ |
30 | 3113 | 3102 | 3103 |
31 | 3185 | 3184 | 3166 |
32 | 3362 | 3368 | 3378 |
33 | 3377 | 3366 | 3357 |
34 | 3595 | 3601 | 3608 |
35 | 3607 | 3600 | 3590 |
36 | 4029 | 4019 | 4020 |
37 | 4398 | 4385 | 4402 |
38 | 4546 | 4550 | 4534 |
39 | 4837 | 4838 | 4839 |
40 | 4888 | 4891 | 4892 |
41 | 4892 | 4893 | 4887 |
Conjunctive queries as limits
The theory of relational databases is category theory. This statement is true in many ways and from many perspectives, but we will focus on just one: the equivalence between conjunctive queries on relational databases and finite limits in the category of sets. This equivalence is almost immediate if you are familiar with the relevant concepts; it is certainly not new (Kato 1983; Alagic and Alagic 1991). Nevertheless, it is a fruitful observation for both category theory and relational databases. In one direction, conjunctive queries inherit an intuitive diagrammatic syntax through the operad of undirected wiring diagrams. Conversely, any general-purpose package for computational category theory, which Catlab aims to be, ought to be able to efficiently compute finite limits of sets. Algorithms for doing so have been extensively developed by database theorists and engineers.
Here is a dictionary between conjunctive queries of different kinds and finite limits in \mathsf{Set} of different shapes.
cross joins of tables X and Y are products
joins2 along attributes a_X: X \to A and a_Y: Y \to A are pullbacks
multi-way joins along attributes a_i: X_i \to A, for i=1,\dots,n, are wide pullbacks
general conjunctive queries are limits of finite bipartite diagrams
The last statement, about conjunctive queries and bipartite diagrams, requires elaboration. We will break it down into two steps. First, conjunctive queries can be represented by undirected wiring diagrams (UWDs); in Catlab, they actually are UWDs.3 Second, the UWD translates into a bipartite diagram, whose limit is the result of the query.
As an illustration, recall this conjunctive query on the mutagenesis data.
Code
true
The tables in the query correspond to boxes (unfilled circles) in the UWD and the variables correspond to junctions (filled circles).
We won’t say much more about UWDs here. For additional examples of diagrammatic query syntax, see the related earlier post by Andrew Baas on UWDs and SQL queries.
In the second step, the UWD translates into a bipartite diagram in \mathsf{Set}, where the boxes of the UWD become vertices in the diagram’s top layer and the junctions become vertices in the bottom layer.
The result R of the conjunctive query is then the limit of this bipartite diagram, which can be thought of as a “generalized pullback.”
In technical terms, isomorphism classes of data frames (multispans) constitute an algebra of the operad of UWDs, where composition is given by limits of the induced bipartite diagrams.
All this is to say what a conjunctive query is but not how to compute it. There are many ways to evaluate a conjunctive query. The simplest, sketched in the previous section, is to take the cartesian product of the sets in the top layer of the bipartite diagram, then filter the tuples, keeping only those that project to equal values under the functions over any given set in the bottom layer. In a special case, this procedure is known to category theorists as “computing a pullback by taking an equalizer of a product.” However, it is probably the worst possible algorithm for computing a limit, generally being extremely inefficient.
The inefficiency arises from having to iterate over the elements of a cartesian product X_1 \times \cdots \times X_n, whose cardinality grows exponentially in n. To avoid this, the conjunctive query should be decomposed into a sequence of joins, a problem known as query planning. Catlab currently employs a simple greedy heuristic for choosing joins.
Joins, or equivalently pullbacks, may themselves be computed in many ways, of which the most common are (Mishra and Eich 1992):
- Nested loop join: the naive algorithm that iterates over all pairs of elements
- Sort-merge join: both vectors of attributes are sorted, then merged together by synchronized iteration
- Hash join: a hash map (index of preimages) is built for one attribute, then the other attribute vector is iterated, looking up preimages for each element
All three join algorithms are implemented in Catlab, for two-way and multi-way joins. Both the sort-merge join and the hash join are generally much faster than the nested loop join. The hash join performs especially well when an attribute is already indexed in the attributed \mathsf{C}-set, as happens for the source and target functions in a graph.
Outlook
Despite its simplistic query planning, Catlab’s conjunctive queries on \mathsf{C}-sets are already fairly usable. Over time we will implement better query planning algorithms, as well as more general kinds of queries. Possibilities include unions of conjunctive queries, the class of queries expressible in coherent logic, and recursive queries, as in Datalog.
As we have seen, relational databases enjoy two key advantages over isolated data frames. First, relational schemas enable the data’s encoding to mirror its underlying relational structure, which eliminates redundancy in storage and promotes internal consistency. Second, declarative query languages make it easy to ask questions about the data and view it through different lenses. On the other hand, SQL databases and logic programming languages like Prolog and Datalog have generally suffered from the Achilles’ heel of all domain-specific languages (DSLs): you cannot step outside the DSL without abandoning the system entirely. Yet stepping outside the DSL is often necessary, since even the most expressive query languages have their limitations.
We think that \mathsf{C}-sets can offer the best of both worlds to data analysis, providing the benefits of the relational data model while existing as ordinary data structures in a general-purpose programming language. If you cannot express your query using the @relation
macro, you can always just write the for
loops or the map
calls! Due to the design of the Julia programming language and the \mathsf{C}-sets in Catlab, doing so should cause no performance degradations. In this way, we hope that \mathsf{C}-sets will eventually become a valuable component of the JuliaData ecosystem.
References
Footnotes
See the line of research by David Spivak and collaborators (David I. Spivak 2012; David I. Spivak 2014; Schultz et al. 2016) and also the earlier blog post by Owen Lynch on the mathematics of attributed \mathsf{C}-sets.↩︎
In this post, by “join” we always mean both “inner join” and “equi-join.” Catlab does not yet support outer joins or joins with predicates other than equality.↩︎
And UWDs are, in turn, \mathsf{C}-sets for a schema \mathsf{C} with tables
Box
,Junction
,Port
, andOuterPort
. This is a circularity but not a vicious one. It does imply that we could evaluate conjunctive queries on conjunctive queries (!), for whatever that is worth.↩︎