JuMP-ing with AlgebraicJulia I: The IJKLM model
In this blog post, we will look at how AlgebraicJulia can bring new tools to mathematical programming, using JuMP.
Mathematical programming, and more specifically linear programming (LP) and mixed-integer programming (MIP), is a staple of process optimization across dozens of industries: electricity distribution, public transportation planning, airline scheduling, employee shift scheduling, supply chain planning, manufacturing, and more. For all but the simplest models, users need to generate the LP/MIP problems via an algebraic modeling language. These languages allow the user to write down a model close to how it might be presented in an operations research or management science textbook. The language then applies simplifications and, once the model is in a standard form, passes it off to a dedicated solver program. There are many modeling languages; some of the best-known commercial ones are GAMS and AMPL.
In all algebraic modeling languages, sets are a fundamental way to organize variables and constraints. For example, in the famous diet problem, the set of foodstuffs may be used to index the decision variables corresponding to the amount of each food item in the diet. Both the GAMS documentation and AMPL documentation contain significant sections related to the creation, traversal, and manipulation of sets and subsets. While the similarity of many tasks in model building to database operations has been noticed for several decades, most clearly in Fourer (1997), support for more complex operations including n-ary products and relations has remained limited, and modelers often use ad hoc techniques which increase model generation times, sometimes prohibitively. In this post, we show how to use acsets, a categorical data structure described by Patterson, Lynch, and Fairbanks (2021), and categorical operations to formally and efficiently generate mathematical programming models.
The IJKLM model
In a post on the GAMS blog, a comparison was made between GAMS and several open-source modeling languages, including JuMP. The initial JuMP implementation performed poorly due to inefficient Julia code. The JuMP dev team responded with their own post on the JuMP website, presenting a fast solution based on DataFrames.jl. In this post we will see how to use tools from AlgebraicJulia to accomplish the same modeling task. The original data and code are at justine18/performance_experiment.
The model is given as:
$$
\begin{aligned}
\min \ z &= 1 \\
\sum_{(j,k):(i,j,k) \in \mathcal{IJK}} \ \sum_{l:(j,k,l) \in \mathcal{JKL}} \ \sum_{m:(k,l,m) \in \mathcal{KLM}} x_{i,j,k,l,m} &\ge 0 & \forall \ i \in \mathcal{I} \\
x_{i,j,k,l,m} &\ge 0 & \forall \ (i,j,k) \in \mathcal{IJK}, \ l:(j,k,l) \in \mathcal{JKL}, \ m:(k,l,m) \in \mathcal{KLM}
\end{aligned}
$$
The GAMS blog post calls subsets of Cartesian products “maps”, but we will use the standard term “relations” for subsets of n-ary products.
Data generation
First we load some packages: `DataFrames` for data frames, `Distributions` for sampling binomial random variates, `JuMP` to set up the model, and `HiGHS` for a solver. `Catlab` and `DataMigrations` are the two AlgebraicJulia packages we will use.
The first step is to generate synthetic data according to the method from the original repo. The probability of drawing all zeros at the given model sizes is vanishingly small, but there is a check for it anyway; maybe a cosmic ray will pass through your processor at a bad time.
Like the original code, the sets I, J, K, L, M are vectors of strings. There are three relations, IJK, JKL, and KLM, which are "sparse" subsets of the corresponding products.
Code
```julia
# sample a 0/1 vector over the product A × B × C, retrying until it is not all zeros
SampleBinomialVec = function(A, B, C, p = 0.05)
    vec = rand(Binomial(1, p), length(A) * length(B) * length(C))
    while sum(vec) == 0
        vec = rand(Binomial(1, p), length(A) * length(B) * length(C))
    end
    return vec
end

n = 100 # something large
m = 20

# sets I, J, K, L, M
I = ["i$x" for x in 1:n]
J = ["j$x" for x in 1:m]
K = ["k$x" for x in 1:m]
L = ["l$x" for x in 1:m]
M = ["m$x" for x in 1:m]

# make IJK
IJK = DataFrame(Iterators.product(I, J, K))
rename!(IJK, [:i, :j, :k])
IJK.value = SampleBinomialVec(I, J, K)
filter!(:value => v -> v != 0, IJK)
select!(IJK, Not(:value))

# make JKL
JKL = DataFrame(Iterators.product(J, K, L))
rename!(JKL, [:j, :k, :l])
JKL.value = SampleBinomialVec(J, K, L)
filter!(:value => v -> v != 0, JKL)
select!(JKL, Not(:value))

# make KLM
KLM = DataFrame(Iterators.product(K, L, M))
rename!(KLM, [:k, :l, :m])
KLM.value = SampleBinomialVec(K, L, M)
filter!(:value => v -> v != 0, KLM)
select!(KLM, Not(:value))
```
The “intuitive” formulation
As given in the original GAMS blog post, this is the naive formulation that relies on nested for loops. As remarked in the JuMP blog, this is equivalent to taking two inner joins. Another way to look at it is as finding “paths” through the relations, such that they match on common elements. Each formulation will be benchmarked.
Code
```julia
@benchmark let
    x_list = [
        (i, j, k, l, m)
        for (i, j, k) in eachrow(IJK)
        for (jj, kk, l) in eachrow(JKL) if jj == j && kk == k
        for (kkk, ll, m) in eachrow(KLM) if kkk == k && ll == l
    ]
    model = JuMP.Model()
    set_silent(model)
    @variable(model, x[x_list] >= 0)
    @constraint(
        model,
        [i in I],
        sum(x[k] for k in x_list if k[1] == i) >= 0
    )
end
```
```
BenchmarkTools.Trial: 10 samples with 1 evaluation.
 Range (min … max):  524.072 ms … 543.738 ms  ┊ GC (min … max): 2.33% … 2.33%
 Time  (median):     533.789 ms               ┊ GC (median):    2.31%
 Time  (mean ± σ):   534.375 ms ± 6.926 ms    ┊ GC (mean ± σ):  2.42% ± 0.61%

  ▁▁ ▁ █ ▁ ▁ █ ▁
  ██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁█▁▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█ ▁
  524 ms          Histogram: frequency by time          544 ms <

 Memory estimate: 253.95 MiB, allocs estimate: 6603758.
```
The DataFrames version
The JuMP blog authors used a version based on two inner joins to vastly improve computation speed.
Code
```julia
ijklm_df = DataFrames.innerjoin(
    DataFrames.innerjoin(IJK, JKL; on = [:j, :k]),
    KLM;
    on = [:k, :l],
)

@benchmark let
    ijklm = DataFrames.innerjoin(
        DataFrames.innerjoin(IJK, JKL; on = [:j, :k]),
        KLM;
        on = [:k, :l],
    )
    model = JuMP.Model()
    set_silent(model)
    ijklm[!, :x] = @variable(model, x[1:size(ijklm, 1)] >= 0)
    for df in DataFrames.groupby(ijklm, :i)
        @constraint(model, sum(df.x) >= 0)
    end
end
```
```
BenchmarkTools.Trial: 3506 samples with 1 evaluation.
 Range (min … max):  1.214 ms … 4.442 ms    ┊ GC (min … max): 0.00% … 45.74%
 Time  (median):     1.331 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.423 ms ± 365.885 μs  ┊ GC (mean ± σ):  4.65% ± 10.30%

  ▆█▆▃▁                                                     ▁
  ▇▅█████▇▆▆▆▅▅▁▁▁▄▄▁▃▅▃▃▁▁▄▅▅▇▇▆▅▅▁▁▁▁▁▁▁▁▁▁▁▃▃▄▅▅▆▆▇▅▆▆▆█▆▆ █
  1.21 ms        Histogram: log(frequency) by time     3.17 ms <

 Memory estimate: 1.84 MiB, allocs estimate: 17137.
```
The Acsets version
Acsets (Attributed C-Sets) are a categorical data structure provided in the ACSets.jl library, and imported and extended with machinery from applied category theory in Catlab.jl. One of the advantages of using acsets and categorical machinery in general is that they have a natural graphical presentation, which in many cases is very readable and illuminating. If this is your first exposure to acsets, a gentle introduction can be found at Graphs and C-sets I: What is a graph?.
We use `@present` to make a schema for the acset which will store the data. Note that each "set" has become an object in the schema, and that the relations are also objects. There are projections (homs) from each relation object to the sets involved in that relation.
Code
```julia
@present IJKLMSch(FreeSchema) begin
    (I,J,K,L,M,IJK,JKL,KLM)::Ob
    IJK_I::Hom(IJK,I)
    IJK_J::Hom(IJK,J)
    IJK_K::Hom(IJK,K)
    JKL_J::Hom(JKL,J)
    JKL_K::Hom(JKL,K)
    JKL_L::Hom(JKL,L)
    KLM_K::Hom(KLM,K)
    KLM_L::Hom(KLM,L)
    KLM_M::Hom(KLM,M)
end

Catlab.to_graphviz(IJKLMSch, graph_attrs=Dict(:dpi=>"72",:ratio=>"expand",:size=>"8"))
```
Using `@acset_type` will programmatically generate a data type and methods specific to the provided schema. We then use `@acset` to build a data instance on our schema.
Code
```julia
@acset_type IJKLMData(IJKLMSch, index=[:IJK_I,:IJK_J,:IJK_K,:JKL_J,:JKL_K,:JKL_L,:KLM_K,:KLM_L,:KLM_M])

ijklm_acs = @acset IJKLMData begin
    I = n
    J = m
    K = m
    L = m
    M = m
    IJK = nrow(IJK)
    IJK_I = [parse(Int, i[2:end]) for i in IJK.i]
    IJK_J = [parse(Int, j[2:end]) for j in IJK.j]
    IJK_K = [parse(Int, k[2:end]) for k in IJK.k]
    JKL = nrow(JKL)
    JKL_J = [parse(Int, j[2:end]) for j in JKL.j]
    JKL_K = [parse(Int, k[2:end]) for k in JKL.k]
    JKL_L = [parse(Int, l[2:end]) for l in JKL.l]
    KLM = nrow(KLM)
    KLM_K = [parse(Int, k[2:end]) for k in KLM.k]
    KLM_L = [parse(Int, l[2:end]) for l in KLM.l]
    KLM_M = [parse(Int, m[2:end]) for m in KLM.m]
end
```
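As a quick sanity check (a sketch, not part of the original post; it assumes the `IJK`, `JKL`, `KLM` data frames and `ijklm_acs` from above are in scope), we can confirm that the acset has one part per row of each relation using `nparts` from the acset interface:

```julia
# nparts counts the parts (rows) of a given object in the acset;
# each relation object should have exactly one part per data frame row
@assert nparts(ijklm_acs, :IJK) == nrow(IJK)
@assert nparts(ijklm_acs, :JKL) == nrow(JKL)
@assert nparts(ijklm_acs, :KLM) == nrow(KLM)

# subpart retrieves a whole column, e.g. the I-indices of the IJK relation
subpart(ijklm_acs, :IJK_I)
```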
Conjunctive Queries on Acsets
The critical thing the JuMP devs did to speed things up was to replace the for loops with two inner joins to find the "paths" through the relations. How can we do this with acsets? One way is to execute a conjunctive query on the acset to get the same result. This is described in the blog post C-sets for data analysis: relational data and conjunctive queries.
Code
```julia
connected_paths_query = @relation (i=i,j=j,k=k,l=l,m=m) begin
    IJK(IJK_I=i, IJK_J=j, IJK_K=k)
    JKL(JKL_J=j, JKL_K=k, JKL_L=l)
    KLM(KLM_K=k, KLM_L=l, KLM_M=m)
end

Catlab.to_graphviz(connected_paths_query, box_labels=:name, junction_labels=:variable, graph_attrs=Dict(:dpi=>"72",:size=>"3.5",:ratio=>"expand"))
```
While the blog post should be consulted for a complete explanation, the conjunctive query is expressed using an undirected wiring diagram (UWD), visualized above. Nodes (labeled ovals) in the UWD correspond to tables (primary keys) in the acset. Junctions (labeled dots) correspond to variables. Ports, which are unlabeled in this graphical depiction, are where wires connect junctions to nodes; these correspond to columns of the table they are connected to. Outer ports, the wires that run "off the page", are the columns that will be returned as the result of the query. Conceptually, the rows returned by the query come from filtering the Cartesian product of the tables (nodes) such that variables in columns match according to ports that share a junction.
The JuMP blog post notes that while the data frames version doesn't resemble the nested summation, it is arguably just as readable, especially if the columns were related to the process being modeled. We suggest that the acsets version is at least as readable, since the data schema and query diagram directly represent the data that parameterizes the optimization model. Furthermore, because the schema of the acset is known at compile time, incorrect queries (or other operations) on acsets are caught as compile-time errors.
The query can then be evaluated on the specific acset instance. We can confirm that both the acsets and data frame methods return the same number of rows.
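The collapsed code block here likely resembles the following sketch (assuming `ijklm_acs`, `connected_paths_query`, and the `ijklm_df` join result from earlier are in scope); Catlab's `query` evaluates a UWD against an acset and returns a `DataFrame`:

```julia
# evaluate the conjunctive query on the acset instance
ijklm_query = query(ijklm_acs, connected_paths_query)

# both methods should agree on the number of "paths" through the relations
nrow(ijklm_query) == nrow(ijklm_df)
```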
Now we can go ahead and see how fast the acsets version is. That the acset-based query is right on the tail of the `DataFrames` version is a performance win for the acsets library, since it is a generic conjunctive query engine over very general data structures (acsets are in general much more complex than a single data frame, due to the presence of multiple tables connected by foreign keys).
Code
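The benchmarked code is collapsed here; it presumably mirrors the DataFrames version, with the inner joins replaced by the conjunctive query. A sketch under that assumption:

```julia
@benchmark let
    # query returns a DataFrame, so model building proceeds as before
    ijklm = query(ijklm_acs, connected_paths_query)
    model = JuMP.Model()
    set_silent(model)
    ijklm[!, :x] = @variable(model, x[1:nrow(ijklm)] >= 0)
    for df in DataFrames.groupby(ijklm, :i)
        @constraint(model, sum(df.x) >= 0)
    end
end
```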
```
BenchmarkTools.Trial: 2759 samples with 1 evaluation.
 Range (min … max):  1.549 ms … 9.245 ms    ┊ GC (min … max): 0.00% … 20.42%
 Time  (median):     1.685 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.808 ms ± 446.566 μs  ┊ GC (mean ± σ):  5.04% ± 10.74%

   ▂██▆▃                                   ▁▁▁               ▁
  ▅▅██████▅▅▄▆▅▅▄▁▃▅▄▄▁▃▁▁▁▃▁▃▁▃▄▃▁▆▅▅▄▆▄▁▁▁▁▁▁▁▃▁▃▆▇█████▇▆▆ █
  1.55 ms        Histogram: log(frequency) by time     3.38 ms <

 Memory estimate: 2.74 MiB, allocs estimate: 23500.
```
Data Migrations between Acsets
While the query execution in the previous section is already quite useful in practice, it has one downside: the returned object is a `DataFrame`. While this is appropriate for many cases, one of the benefits of acsets is that, from one point of view, they are in-memory relational databases, and are therefore capable of representing data more complex than can be expressed in a single table. It would therefore be nice to execute a data migration from one type of acset, where "type" means the schema, to another that can carry along the further data we need. For more details on data migration, see Evan Patterson's Topos Institute colloquium talk "Categories of diagrams in data migration and computational physics".
In this context, if the “set” objects (I, J, etc) were further connected to other tables, maybe, say, a list of suppliers, or a list of materials, or even a process graph of downstream work, it would be inconvenient at least, if we lost that relational information during a query. In that case, we’d really want to return another acset on a different schema that is precisely the right shape for what we want to do.
In this simple case, the schema we want is shown below. We’ll think of the object IJKLM as being mapped to the subset of the n-ary product that has those “paths” through the relations we are seeking.
Code
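The schema code is collapsed here; judging from the objects and morphisms referenced by the migration below, it likely resembles the following sketch:

```julia
@present IJKLMRelSch(FreeSchema) begin
    (IJKLM,I,J,K,L,M)::Ob
    # projections from the combined relation onto each set
    i::Hom(IJKLM,I)
    j::Hom(IJKLM,J)
    k::Hom(IJKLM,K)
    l::Hom(IJKLM,L)
    m::Hom(IJKLM,M)
end
```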
Now we formulate the data migration using AlgebraicJulia/DataMigrations.jl. While we cannot rigorously explain data migration here, the idea is: if one has a C-Set (an instance of data on schema C) and wants to migrate it to a D-Set, one must specify a data migration functor F.
Here, C is our schema `IJKLMSch` and D is `IJKLMRelSch`. The functor F is a mapping from D to the category of diagrams on C; formally, we denote it $F : D \rightarrow \text{Diag}^{\text{op}}(C)$. Each object in D is assigned a diagram in C, and each morphism in D is assigned a contravariant morphism of diagrams.
Code
```julia
M = @migration IJKLMRelSch IJKLMSch begin
    IJKLM => @join begin
        ijk::IJK
        jkl::JKL
        klm::KLM
        i::I
        j::J
        k::K
        l::L
        m::M
        IJK_I(ijk) == i
        IJK_J(ijk) == j
        JKL_J(jkl) == j
        IJK_K(ijk) == k
        JKL_K(jkl) == k
        KLM_K(klm) == k
        JKL_L(jkl) == l
        KLM_L(klm) == l
        KLM_M(klm) == m
    end
    I => I
    J => J
    K => K
    L => L
    M => M
    i => i
    j => j
    k => k
    l => l
    m => m
end;
```
A diagram is itself a functor $D : J \rightarrow C$, where $J$ is (usually) a small category, and it picks out an instance of its shape in C. We can plot the largest such diagram, the one to which the object IJKLM in D is mapped. Note the similarity to the conjunctive query visualized as a UWD earlier. In particular, the "relation" elements must agree on the relevant "set" elements via their morphisms. The object in C to which each object in J corresponds is given by the text after the colon in each node.
Because of the simplicity of the schema `IJKLMRelSch`, the contravariant morphisms of diagrams simply pick out the object in D associated with the source of each morphism. Likewise, the natural-transformation part of each morphism of diagrams selects, for each object, its identity morphism.
We run the data migration to move data from the schema `IJKLMSch` to `IJKLMRelSch` using the function `migrate`, and check that the result has the same number of records as the other methods.
Code
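The collapsed code is likely along the following lines (the acset type name `IJKLMRelData` is an assumption); `migrate` takes the target acset type, the source acset, and the migration:

```julia
# generate an acset type for the target schema (hypothetical name)
@acset_type IJKLMRelData(IJKLMRelSch, index=[:i,:j,:k,:l,:m])

ijklm_migrated = migrate(IJKLMRelData, ijklm_acs, M)

# same number of records as the query and join methods
nparts(ijklm_migrated, :IJKLM) == nrow(ijklm_df)
```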
```
true
```
Once again, let’s benchmark. The data migration is slightly slower than the conjunctive query method, but data migrations can express a much richer language of data manipulation than conjunctive queries are capable of.
Code
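The benchmark body is collapsed; a sketch of what it plausibly contains, assuming the hypothetical `IJKLMRelData` type from above and using `incident` to group rows by their `I` element:

```julia
@benchmark let
    ijklm = migrate(IJKLMRelData, ijklm_acs, M)
    model = JuMP.Model()
    set_silent(model)
    @variable(model, x[1:nparts(ijklm, :IJKLM)] >= 0)
    for i in parts(ijklm, :I)
        # incident finds the IJKLM parts whose i-projection equals this i
        @constraint(model, sum(x[incident(ijklm, i, :i)]) >= 0)
    end
end
```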
```
BenchmarkTools.Trial: 1380 samples with 1 evaluation.
 Range (min … max):  3.212 ms … 10.213 ms   ┊ GC (min … max): 0.00% … 18.52%
 Time  (median):     3.347 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.621 ms ± 654.179 μs  ┊ GC (mean ± σ):  6.28% ± 10.98%

   ▃▇██▇▅▄▂ ▁  ▁                            ▂▂▁▁              ▁
  █████████▇▅▅▄▅▅▁▁▅▁▄▅▁▅▁▄▁▁▄▁▁▄▁▁▁▄▁▁▁▄▄▅▆██▇████████▆▆█▇▇▆ █
  3.21 ms        Histogram: log(frequency) by time     5.36 ms <

 Memory estimate: 7.33 MiB, allocs estimate: 42773.
```
To summarize, acsets and the tools for working with them provided by AlgebraicJulia can greatly ease the development of complex mathematical programming models. Relations, unions, and other concepts from set theory are elegantly generalized by category theory, and can help develop more correct, formal, and expressive models. While the optimization "model" examined in this article is fairly abstract, we will investigate models of more practical interest in future installments on the blog.
Acknowledgements
I would like to thank Kevin Carlson for his assistance regarding DataMigrations.jl.