0. Name Disambiguation and Source-Paper Confirmation
0.1 Which paper this document mainly corresponds to
This document mainly corresponds to:
- Fischer et al., ICML 2019, DL2: Training and Querying Neural Networks with Logic
In this note, DL2 refers by default to the system introduced in that paper:
- it takes declarative logical constraints $\phi$ as its object;
- it translates constraints into a numerical objective that is nonnegative, differentiable almost everywhere, and such that “the constraint is satisfied if and only if the loss is 0”;
- it uses the same “logic $\to$ loss” mechanism to support both training and querying.
DL2 is generally understood as Deep Learning with Differentiable Logic.
Its focus is not to propose one standalone trick for one specific task, but to turn “how logical constraints enter neural networks” into a unified interface.
0.2 Its place in the reading path
If we regard Hu et al. 2016 -> Semantic Loss -> DL2 as a progressively systematized line of work on logical constraints, then the question this paper answers is:
Can logic be treated not merely as one particular kind of output loss, but as a unified constraint language, so that the same piece of logic can both participate in training and be used directly to query network behavior?
In other words:
- Hu 2016 is more like:
rule -> teacher distribution -> student - Semantic Loss is more like:
rule -> exact semantic loss -> parameter update - DL2 is more like:
declarative constraint -> translated loss + optimizer -> training / querying
Therefore, the real advance of this paper is not “writing one more loss,” but rather:
- extending the object of constraint from output probabilities to more general numerical terms;
- letting logic enter not only training, but also querying;
- letting training target not only fixed points in the training set, but also global constraints over input regions outside the training set.
0.5 Minimal Problem Setup and Notation
To keep the note self-contained, we first fix the minimal setup used below.
Let the neural-network parameter be $$ \theta, $$ let the input variable be denoted by $$ x, $$ and if the constraint also contains auxiliary variables that need to be searched, perturbed, or quantified over, denote them by $$ z. $$
DL2’s logical language does not start directly from Boolean variables, but from numerical terms.
A term is written as
$$
t(x,z,\theta),
$$
and it can be:
- a constant;
- an input coordinate;
- a network output probability $p_\theta(x)_i$;
- a pre-softmax logit;
- an internal neuron activation;
- an algebraic combination of several numerical terms.
This note assumes by default that a term is differentiable with respect to the variables and parameters at least on almost all points.
Two terms can form comparison constraints:
$$
t=t',\qquad t\le t',\qquad t\ne t',\qquad t More complex constraints are formed by combining these atomic constraints through
$$
\wedge,\qquad \vee,\qquad \neg
$$
and are written uniformly as
$$
\phi(x,z,\theta).
$$ Because DL2 simultaneously covers Define
$$
\phi_{\text{eq}}(u)=\bigl(u=(1,1)\bigr),
\qquad
u=(u_0,u_1)\in\mathbb R^2.
$$ This example has only one purpose: In the semi-supervised CIFAR-100 setting, define a group of semantic group probabilities. Then impose the following constraint on each group:
$$
p_g(x)<\varepsilon\ \vee\ p_g(x)>1-\varepsilon.
$$ Its intuition is: Define the query: Its meaning is: This example is best suited to show: The question this paper truly wants to answer is: When an expert already knows that a network should satisfy some numerical-logical property, can these constraints be written in a unified declarative way and used for both training and querying, without having to design a separate task-specific loss for each type of constraint? The authors target scenarios like the following: If we look only at Logic-Net or Semantic Loss, each of them has already solved part of the problem, but neither is systematic enough. The limitations of Logic-Net are: The limitations of Semantic Loss are: More generally, many soft logic encodings are “continuous” but not necessarily “optimizable.” If a logic translation can still produce zero gradient when a constraint is violated, then it is not truly usable for optimization. This is also the most important dividing line between DL2 and methods that merely “write logic as a continuous expression.” The most important thing about this figure is not the page layout, but that it makes the query objects of DL2 explicit as three categories: This shows that the logical object of DL2 is not a “label post-processing rule,” but rather: all of these can be expressed as unified numerical constraints. In other words, what this figure is really saying is: What is usually most worth looking at on this page is not the specific numbers, but two things: What the authors want to emphasize is: Taken together with Figure 1, this page basically makes the difference between DL2 and the previous two papers clear: The most fundamental shift in DL2’s perspective is: Therefore, the objects constrained by DL2 can be: For example, the following constraints are all natural objects in DL2: $$
p_\theta(x)_1 > p_\theta(x)_2,
$$
$$
\mathrm{sum}(x)=\mathrm{sum}(G(x)),
$$
$$
\mathrm{NN}(i).l_1[0,1,1,31]=0.
$$ This means that DL2’s logic does not constrain only “whether the class is correct,” but can directly constrain the numerical behavior both inside and outside the network. DL2’s constraint form is uniformly: This naturally supports: For example, the people-group constraint above
$$
p_g(x)<\varepsilon\ \vee\ p_g(x)>1-\varepsilon
$$
is a typical disjunction: DL2 is more general in a sense that goes beyond “it can write more symbols.” More precisely, it extends three dimensions: More general constraint objects More general constraint shapes More general usage modes Only by stacking these three points together do we get the system-level character of DL2. For each constraint $\phi$, DL2 associates a nonnegative loss
$$
L(\phi),
$$
and requires it to satisfy two core properties: Taken together, these two properties mean: For comparison constraints, the basic translation given by DL2 is:
$$
L(t\le t')=\max(t-t',0),
$$
$$
L(t\ne t')=\xi\cdot [t=t'],
$$
where $\xi>0$ is a constant and $[t=t']$ is the indicator function. The remaining comparison constraints are obtained by logical-equivalence rewriting:
$$
L(t=t') := L(t\le t' \wedge t'\le t),
$$
$$
L(t This step is crucial, because DL2 does not casually assign one penalty term to each symbol. Instead, it first preserves logical equivalence and only then performs a unified translation. For conjunction and disjunction, DL2 defines:
$$
L(\phi'\wedge \phi'')=L(\phi')+L(\phi''),
$$
$$
L(\phi'\vee \phi'')=L(\phi')\cdot L(\phi'').
$$ The meanings of these two definitions are: For negation, DL2 does not directly invent a new penalty term, but first rewrites it by logical equivalence, for example:
$$
L(\neg(t\le t'))=L(t' The purpose of doing so is: The paper gives the key conclusion: Theorem 1: $L(\phi)=0$ if and only if $\phi$ is satisfied. The significance of this theorem is not just that “the loss is beautifully designed,” but that: Without this property, both training and querying could only be understood as some heuristic approximation, rather than direct optimization of constraint satisfaction. Continue with the equality toy above:
$$
\phi_{\text{eq}}(u)=\bigl(u=(1,1)\bigr),\qquad u=(u_0,u_1).
$$ Under the PSL encoding compared in the paper, one may obtain
$$
L_{\text{PSL}}(\phi)(u)=\max(u_0+u_1-1,0).
$$ If the optimization starting point is
$$
u=(0.2,0.2),
$$
then
$$
u_0+u_1-1\le 0,
$$
so the gradient is 0 and optimization stops; however, $u$ does not satisfy the target assignment at that point. Under the DL2 translation, by contrast,
$$
L(\phi_{\text{eq}})(u)=|u_0-1|+|u_1-1|.
$$ At the same point
$$
u=(0.2,0.2)
$$
the gradient is still nonzero, so optimization will continue pushing $u$ toward the satisfying point. This example is meant to convey only one sentence: DL2 does not simply write logic as a continuous expression. It explicitly considers whether, when a constraint is violated, the gradient can still keep optimization moving forward. Let the training distribution be $D$, the search space be $A$, and the constraint be
$$
\phi(x,z,\theta).
$$ DL2 aims to find parameters $\theta$ such that for a random sample $x\sim D$, the constraint holds for all $z\in A$ as much as possible:
$$
\arg\max_\theta\Pr_{x\sim D}\bigl[\forall z\in A.\ \phi(x,z,\theta)\bigr].
$$ This objective can be rewritten equivalently as minimizing the probability that “there exists a counterexample”:
$$
\arg\min_\theta\Pr_{x\sim D}\bigl[\exists z\in A.\ \neg\phi(x,z,\theta)\bigr].
$$ This step is crucial because it rewrites the training problem into: If for each given $x,\theta$, one can find a worst counterexample
$$
z^*(x,\theta)=\arg\max_{z\in A}[\neg\phi(x,z,\theta)],
$$
then the outer problem becomes:
$$
\arg\max_\theta \Pr_{x\sim D}\bigl[\phi(x,z^*(x,\theta),\theta)\bigr].
$$ DL2 interprets this process as a two-player game: After approximation by a loss, the inner problem is written as
$$
z^*(x,\theta)=\arg\min_{z\in A} L(\neg\phi)(x,z,\theta),
$$
and the outer problem is approximated as
$$
\arg\min_\theta \mathbb E_{x\sim T}\bigl[L(\phi)(x,z^*(x,\theta),\theta)\bigr],
$$
where $T$ is the training set. Therefore, the essence of DL2 training is not one static regularization term, but rather: Directly minimizing
$$
L(\neg\phi)
$$
can sometimes be difficult. The example given in the paper is the local robustness constraint:
$$
\phi((x,y),z,\theta)=\|x-z\|_\infty\le \varepsilon \Rightarrow \mathrm{logit}_\theta(z)_y>\delta.
$$ Under the direct translation, one obtains
$$
L(\neg\phi)
=
\max(0,\|x-z\|_\infty-\varepsilon)
+
\max(0,\mathrm{logit}_\theta(z)_y-\delta).
$$ The problem here is: Therefore, the key engineering treatment in DL2 is: This step does not change the logic semantics, but changes the optimization interface: Once the inner search space becomes a convex set $B$, DL2 uses projected gradient descent (PGD). Its role is: So in DL2, PGD is not an auxiliary trick, but a core component of whether training is tractable. If we compress the training algorithm in the paper into the shortest main line, it is: This is the point that most deserves to be singled out for DL2 relative to the previous two papers. In much prior work, constraints mainly act only on the training samples themselves. to be searched for counterexamples that violate the constraint. That is, the “globality” of DL2 mainly comes from: For example, consider a Lipschitz constraint:
$$
\forall z\in B_\varepsilon(x),\ z'\in B_\varepsilon(x').\
\|f(z)-f(z')\|_2 < L\|z-z'\|_2.
$$ What is being constrained here is no longer a fixed pair of points in the training set, but all pairs of points in two neighborhoods. DL2 designs a declarative query language close to SQL, whose basic shape is: Here: The DSL also provides some syntactic sugar: Therefore, in the paper,
$$
\mathrm{class}(NN(x))=y
$$
can essentially be understood as: DL2’s querying is not another independent system. The only difference is: So the unity of training and querying is not a slogan, but the fact that: Continue with the generator disagreement query above: Its logical structure can be decomposed into three layers: Domain constraint Cross-network behavioral constraint Return object This shows that the real role of the query is not to “look up a label,” but to: In querying, DL2 still compiles constraints into a loss, but the optimizer is changed to The reason is straightforward: Therefore, the standard querying workflow is: This point must be distinguished clearly from Semantic Loss. Semantic Loss optimizes: DL2 does not optimize this kind of “distribution-level satisfying mass.” In other words: This is also different from Logic-Net. DL2 contains no: Its chain is more like: $$
\phi
\Longrightarrow
L(\phi)
\Longrightarrow
\text{adversary / optimizer}
\Longrightarrow
\theta \ \text{or}\ z.
$$ Therefore, what DL2 really systematizes is: If we look only at one loss evaluation, DL2 is indeed evaluating violation at one concrete assignment point. But once quantified variables and adversary search are taken into account, it can become global: So a more accurate statement is: This is a completely different source of “globality” from that of Semantic Loss. In one sentence: Especially relative to PSL, the key strength of DL2 is not that it is “harder,” but that: The three main logic papers can be compressed into: $$
\text{Logic-Net}: \phi \to q^\star \to \theta,
$$
$$
\text{Semantic Loss}: \phi \to P_p(\phi) \to \theta,
$$
$$
\text{DL2}: \phi \to L(\phi) \to (\theta\ \text{or}\ z).
$$ The last line is the most important, because it simultaneously supports: In the semi-supervised CIFAR-100 experiments, DL2 uses semantic group constraints such as people / trees. Therefore, although unlabeled data do not provide exact class labels, they still provide: The results reported in the paper are: The unsupervised experiment in the paper is a shortest-distance regression task on a graph. This shows that: This result is important because it pushes DL2 from “a classification trick with added rules” to a more general framework for structural learning. In the supervised experiments, the authors consider: In terms of results, the core gain of DL2 is not that “prediction accuracy improves on all tasks,” but rather that: This shows that the value of DL2 mainly lies in: In the querying experiments, DL2 is tested on multiple datasets, multiple networks, and multiple query templates. Although DL2 provides a unified logic $\to$ loss translation, the practical difficulties still lie in optimization: That is, DL2 solves “how to express things uniformly and optimize them differentiably,” not “all constraints suddenly become cheap.” The paper states clearly in the experiments that: Put more plainly: For querying, if some query times out or no solution is found, this does not automatically mean: More likely reasons are: Therefore, DL2 querying should be understood more as: Compared with Logic-Net and Semantic Loss, DL2 is stronger, but the cost is also more obvious: Therefore, DL2 is more like a framework or system than just “a prettier formula.” The current repository already contains local DL2 resources: The code organization in the repository corresponds to the paper structure: From the perspective of reading order, the safest entry path is: The advantage of doing this is: The best first step is a minimal The reasons are: The most natural implementation directory is still: A safe minimal structure could be: The most important thing here is not to pile up all functionality first, but to make these three points clear first: If we keep only the most important sentences, then the skeleton of this paper is: So what it truly solves is: How to lift “the interaction between logical constraints and neural networks” from a single task-specific trick into a unified declarative system that is trainable and queryable. And the real price it pays is: The stronger the expressiveness, the heavier the inner-loop search during training, convex-set projection, query optimization, and engineering complexity become.0.5.1 Three running examples used throughout this note
loss translation, training, and querying, this note fixes three minimal examples, each serving a different part of the discussion.Example 1: an equality toy for explaining the geometry of loss translation
Example 2: a people-group constraint for explaining the training interface
For example,
$$
p_{\text{people}}(x)
=
p_\theta(x)_{\text{baby}}
+p_\theta(x)_{\text{boy}}
+p_\theta(x)_{\text{girl}}
+p_\theta(x)_{\text{man}}
+p_\theta(x)_{\text{woman}}.
$$
Example 3: a generator disagreement query for explaining querying
FIND i[100]
WHERE i[:] in [-1, 1],
class(N1(G(i)), c1),
N1(G(i)).p[c1] > 0.3,
class(N2(G(i)), c2),
N2(G(i)).p[c2] > 0.3
RETURN G(i)
1. The Core Problem the Paper Wants to Solve
1.1 The intuitive question
1.2 Why earlier logical methods are still not enough
The issue especially emphasized by the authors is:
2. Two Key Figures in the Paper
2.1 Figure 1: what DL2 querying is actually searching for

2.2 The query benchmark page: why it is more like a system than a single trick

3. What DL2’s Logical Language Is Actually Constraining
3.1 The basic unit of constraint is not a Boolean output bit, but a numerical term
3.2 The logical structure comes from Boolean combinations of comparison constraints
3.3 Where exactly it is more general than earlier work
It is not limited to output class probabilities, but can directly constrain inputs, logits, internal neurons, and cross-network relations.
It is not limited to linear output rules, but can also express nonlinear conditions, norm-based conditions, neighborhood conditions, and interpolation-based conditions.
Logic can be used both for training and for querying.
4. From Logical Constraints to Loss: DL2’s Core Translation
4.1 The translation target: nonnegative, differentiable almost everywhere, and 0 if and only if satisfied
4.2 How atomic comparison constraints are translated
4.3 How Boolean combinations are translated
4.4 Why Theorem 1 matters
4.5 Why it is better suited to gradient descent than some soft logic encodings
5. How DL2 Enters Training
5.1 The training objective is first rewritten as “find counterexamples”
5.2 Inner and outer optimization: optimizer vs adversary
5.3 Why convex input constraints are pulled out of the loss
5.4 Why PGD appears here
5.5 Why it can do global training “outside the training set”
DL2, however, allows:
Constraints of this kind have essentially gone beyond “adding one extra loss on data points.”
6. How DL2 Performs Querying
6.1 The basic form of the query DSL
FIND z1[m1], ..., zk[mk]
WHERE φ(z1, ..., zk)
[INIT z1 = c1, ..., zk = ck]
[RETURN t(z)]
FIND defines the variables to search and their shapes;WHERE writes the logical constraints;INIT provides initial values;RETURN specifies the final object to return; if omitted, the search variables themselves are returned by default.
, denotes conjunction;in denotes a box constraint;class denotes the argmax class of a network at a given input.
6.2 Why querying and training use the same core
It reuses the same core as training:
6.3 How to read the most typical query example
FIND i[100]
WHERE i[:] in [-1, 1],
class(N1(G(i)), c1),
N1(G(i)).p[c1] > 0.3,
class(N2(G(i)), c2),
N2(G(i)).p[c2] > 0.3
RETURN G(i)
$i$ is a 100-dimensional noise vector, and each dimension lies in $[-1,1]$.
The generator output $G(i)$ must be classified by two classifiers into different classes, both with relatively high confidence.
What is returned is not the noise itself, but the generated sample $G(i)$.
6.4 Why querying uses L-BFGS-B
L-BFGS-B.
L-BFGS-B is more suitable for this kind of continuous constrained search.
7. What It Is Actually Optimizing
7.1 It is not optimizing the “total probability mass of satisfying worlds”
What DL2 optimizes is:
7.2 It is not a teacher distribution either
7.3 Why it is both local and potentially global
In that sense, it is pointwise, rather than summing over an entire set of worlds as Semantic Loss does.
8. Its Essential Differences from Logic-Net, Semantic Loss, and PSL
8.1 The differences from Logic-Net and Semantic Loss
Dimension
Logic-Net
Semantic Loss
DL2
Rule entry point
teacher posterior
enters the output loss directly
unified declarative constraint language
Intermediate object
teacher distribution $q^\star$
total probability mass of satisfying worlds
violation loss produced by constraint translation
Primary object
output distribution
set of worlds over a discrete output distribution
inputs, outputs, neurons, and their numerical relations
Supports querying
no
no
yes
Naturally supports constraints outside the training set
weak
weak
strong
Suitable for regression / physical constraints
moderate
relatively weak
more natural
Logic-Net: first change the distribution, then change the parameters;Semantic Loss: directly optimize the total probability mass of satisfying worlds;DL2: compile general logical constraints into a unified loss, and then use it for training or querying.8.2 The differences from PSL / XSAT / Bach et al.
Method
Core object
Main issue
DL2 relative advantage
PSL
soft truth over $[0,1]$
can produce zero gradients while still unsatisfied
the zeros of the loss are strictly aligned with constraint satisfaction, and the optimization geometry is more stable
XSAT
numerical satisfiability
atomic losses are discrete and nondifferentiable
suitable for gradient methods
Bach et al.
soft form of linear constraints
mainly handles linear conjunctions
DL2 supports more general Boolean combinations and nonlinear numerical terms
8.3 A compressed relation among three logic-constraint papers
9. Why This Paper Is Effective
9.1 Semi-supervised learning: it lets unlabeled data carry higher-level semantic group constraints
Its essence is not to tell the network “which exact class this is,” but to tell the network:
9.2 Unsupervised learning: structural properties alone can approach supervised solutions
The network must predict the shortest distance from a source node to all nodes, but during training it is not given labels directly. It is given only the structural properties that the shortest-distance function must satisfy.
9.3 Supervised learning: it can significantly improve constraint accuracy
9.4 Querying: it also turns “network behavior analysis” into a unified interface
What the authors emphasize is not that “every query succeeds,” but that:
10. Computational Cost and Limitations
10.1 Elegant constraint translation does not mean optimization is always easy
10.2 Global constraints are clearly more expensive than training-set constraints
T-type training-set constraints are more like “adding structure on top of existing sample relations”;G-type global constraints are more like “doing region-level counterexample search while training.”10.3 Query failure does not imply unsatisfiability
10.4 Higher expressiveness also means a higher barrier to use
11. Correspondence with the Local Repository and Code
training/supervised/semisupervised/unsupervised/querying/
README.md to clarify the system boundary;querying/README.md to build intuition for the DSL as quickly as possible;training/README.md, and then return to the nested optimization in the paper.
12. Suggestions for a Minimal Reproduction
12.1 What is best to do first
query-only toy.
12.2 A safest minimal reproduction path
12.3 The three comparisons most worth doing
query DSL vs query API
See the difference in how the same constraint is expressed in the high-level and low-level interfaces.only training-set constraints vs constraints with global quantified variables
See where the “globality” of DL2 actually comes from.semantic loss-style constraints vs DL2-style constraints
See the difference in expressiveness and engineering difficulty between “output-layer satisfying probability mass” and a “declarative numerical constraint system.”12.4 If we want to align with the toy path in the current repository
repro/04_dl2_toy/
README.md
notes.md
config.py
toy_problem.py
dl2_bridge.py
run_query.py
run_train.py
experiment.py
results/
13. A One-Page Compressed Summary