logic_net_toy: Mechanism, Code Mapping, and Critical Analysis

The central question in logic_net_toy is how hand-written rules can be converted into trainable signals in a low-label two-dimensional classification task. The implementation uses these rules to construct a teacher distribution, which then affects the parameter updates of a student model.

The implementation does not encode rules as a direct scalar penalty. Instead, it follows the pipeline:

$$ \text{rule} \longrightarrow q_r(\cdot \mid x) \longrightarrow q_R(\cdot \mid x) \longrightarrow t_\theta(\cdot \mid x) \longrightarrow \mathrm{KL}\!\left(t_\theta(\cdot \mid x)\,\|\,p_\theta(\cdot \mid x)\right) \longrightarrow \theta. $$

Here, $q_r(\cdot \mid x)$ is the distribution induced by a single rule, $q_R(\cdot \mid x)$ is the aggregated rule distribution, $t_\theta(\cdot \mid x)$ is the teacher distribution, $p_\theta(\cdot \mid x)$ is the student distribution, and $\mathrm{KL}$ denotes the Kullback-Leibler divergence.

1. Task Construction: Why Start with a Two-Dimensional Toy Problem

Code entry points:

data.py: decision_score, make_labels, make_balanced_dataset, create_datasets
model.py: TinyMLP

The data-generating function is defined in decision_score:

$$ s(x) = 1.1x_1 -0.55x_2 +0.35\sin(2.6x_2) -0.12x_1^2. $$

Labels are assigned by:

$$ y=\mathbf{1}\!\left[s(x)>0\right]. $$

The true decision boundary is not purely linear. It contains linear, sinusoidal, and quadratic terms. This design is important: a linear rule can provide a useful directional bias, but it cannot exactly reproduce the data-generating process.
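The gap between a linear rule and the true boundary can be checked numerically. The sketch below transcribes $s(x)$ from the formula above; the function signatures and the rule `linear_only_label` (which keeps only the linear part of the score) are assumptions made for illustration, not the contents of data.py or rules.py.

```python
import math


def decision_score(x1: float, x2: float) -> float:
    """s(x) from the formula above: linear, sinusoidal, and quadratic terms."""
    return 1.1 * x1 - 0.55 * x2 + 0.35 * math.sin(2.6 * x2) - 0.12 * x1 ** 2


def make_label(x1: float, x2: float) -> int:
    """y = 1[s(x) > 0]."""
    return int(decision_score(x1, x2) > 0.0)


def linear_only_label(x1: float, x2: float) -> int:
    """Hypothetical rule that keeps only the linear part of s(x)."""
    return int(1.1 * x1 - 0.55 * x2 > 0.0)


point = (0.2, 0.6)
true_label = make_label(*point)         # the sinusoidal term lifts the score above 0
rule_label = linear_only_label(*point)  # the linear part alone is negative here
print(true_label, rule_label)
```

At this point the linear part of the score is negative while the full score is positive, so the rule and the label disagree. Regions like this are what make the rule approximate rather than exact.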

A critical question is why the true boundary is not used directly as the rule. If the rule were identical to the true boundary, the experiment would mainly show that access to the correct answer is useful. The current setting is more informative because the rules are approximate. It tests whether an imperfect rule can still improve the student model through the teacher mechanism.

Balanced sampling is implemented by make_balanced_dataset. It prevents class imbalance from masking the effect of rules, making the comparison between the baseline and the logic-guided model easier to interpret. However, this also limits realism. In practical settings, class distributions are often imbalanced, and rule guidance may change the trade-off between precision and recall. A useful extension would include imbalanced datasets and report class-wise metrics.
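One simple way to obtain a balanced dataset is rejection sampling: draw points uniformly and keep each one only while its class still has room. The sketch below assumes this strategy and a sampling square of $[-2,2]^2$; neither is confirmed by the source, and make_balanced_dataset may work differently.

```python
import math
import random


def decision_score(x1: float, x2: float) -> float:
    return 1.1 * x1 - 0.55 * x2 + 0.35 * math.sin(2.6 * x2) - 0.12 * x1 ** 2


def sample_balanced(n_per_class: int, low: float = -2.0, high: float = 2.0, seed: int = 0):
    """Return 2 * n_per_class labeled points with exactly n_per_class per class."""
    rng = random.Random(seed)
    buckets = {0: [], 1: []}
    while min(len(buckets[0]), len(buckets[1])) < n_per_class:
        x1, x2 = rng.uniform(low, high), rng.uniform(low, high)
        y = int(decision_score(x1, x2) > 0.0)
        if len(buckets[y]) < n_per_class:  # drop points of a class that is already full
            buckets[y].append((x1, x2))
    return [(p, 0) for p in buckets[0]] + [(p, 1) for p in buckets[1]]


data = sample_balanced(20)
print(len(data), sum(y for _, y in data))
```

An imbalanced variant, as suggested above, would only need different per-class quotas.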

The model is a small Multi-Layer Perceptron (MLP):

$$ p_\theta(y \mid x)=\mathrm{softmax}(f_\theta(x))_y. $$

This choice keeps the model simple enough for the effect of rule guidance to be inspected directly. It also means that the experiment primarily evaluates the interface between rules and training, rather than the behavior of the method under high-capacity models.

2. Rule Representation: Why Rules Are Not Hard Labels

Code entry points:

rules.py: RuleSpec, rule_margin, soft_rule_probability_for_rule, rule_distribution

A rule is represented by RuleSpec:

$$ r=(a_r,b_r,c_r,\kappa_r,w_r,\alpha_r). $$

The components have the following roles:

$(a_r,b_r,c_r)$ defines a linear boundary.
$\kappa_r$ (the positive_class field) specifies which class is favored on the positive side of the boundary.
$w_r$ (the weight field) controls the rule weight during aggregation.
$\alpha_r$ (the temperature_scale field) scales the rule temperature relative to the global temperature $T$.

The rule margin is:

$$ m_r(x)=a_rx_1+b_rx_2+c_r. $$

This is implemented by rule_margin. The margin does more than indicate which side of the boundary a sample lies on. It also preserves distance from the boundary, which is later converted into a soft probability through a sigmoid function.

The effective temperature is:

$$ \tau_r=T\alpha_r. $$

The probability of being on the positive side of the rule is:

$$ \rho_r(x)=\sigma\!\left(\tau_r m_r(x)\right). $$

The field positive_class then maps the positive side of the rule into the probability of class 1:

$$ q_r(y=1\mid x)= \begin{cases} \rho_r(x), & \kappa_r=1,\\ 1-\rho_r(x), & \kappa_r=0. \end{cases} $$

The single-rule distribution is:

$$ q_r(\cdot\mid x) = \left(1-q_r(y=1\mid x),\ q_r(y=1\mid x)\right). $$
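The chain from margin to single-rule distribution can be sketched without any framework. The class name `ToyRuleSpec` and the function signatures below are hypothetical stand-ins; only the formulas for $m_r$, $\tau_r$, $\rho_r$, and $q_r$ are taken from the text.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyRuleSpec:
    a: float
    b: float
    c: float
    positive_class: int             # kappa_r
    weight: float = 1.0             # w_r
    temperature_scale: float = 1.0  # alpha_r


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def rule_margin(rule: ToyRuleSpec, x1: float, x2: float) -> float:
    """m_r(x) = a_r x1 + b_r x2 + c_r."""
    return rule.a * x1 + rule.b * x2 + rule.c


def rule_distribution(rule: ToyRuleSpec, x1: float, x2: float, temperature: float):
    """Return (q_r(y=0|x), q_r(y=1|x))."""
    tau = temperature * rule.temperature_scale
    rho = sigmoid(tau * rule_margin(rule, x1, x2))
    q1 = rho if rule.positive_class == 1 else 1.0 - rho
    return (1.0 - q1, q1)


rule = ToyRuleSpec(a=1.0, b=-0.5, c=0.0, positive_class=1)
on_boundary = rule_distribution(rule, 0.0, 0.0, temperature=2.0)    # margin 0
off_boundary = rule_distribution(rule, 1.0, 0.0, temperature=2.0)   # margin 1
print(on_boundary, off_boundary)
```

On the boundary the rule is maximally uncertain, at $(0.5, 0.5)$, and its confidence grows smoothly with the margin.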

The critical design question is why rules are not converted into hard labels. Hard labels would treat rules as absolute truth, while the rules in this toy problem are only approximate. A soft distribution preserves uncertainty near the rule boundary. Because it stays strictly positive for both classes, multiplying it into the student distribution reweights the student's prediction rather than erasing it, so the current student prediction remains part of the teacher construction.

This design solves one problem: imperfect rules can still be used as directional bias. It also leaves an open issue: $T$ and $\alpha_r$ are manually specified. Because $\tau_r$ multiplies the margin, it acts as a sharpness (inverse-temperature) parameter. If it is too high, the sigmoid saturates and the rule becomes overconfident even close to its boundary; if it is too low, $\rho_r(x)$ stays near $0.5$ and the rule has little effect. A natural extension is to calibrate rule confidence from validation data or to adapt the temperature based on empirical rule reliability.
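A quick numerical check makes the temperature trade-off concrete. The sketch below evaluates $\rho_r(x)=\sigma(T\alpha_r m_r(x))$ at a fixed small margin for three values of $T$; the margin and the temperature values are arbitrary illustrative choices.

```python
import math


def sigmoid(z: float) -> float:
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def positive_side_probability(margin: float, temperature: float, temperature_scale: float = 1.0) -> float:
    """rho_r(x) = sigmoid(T * alpha_r * m_r(x))."""
    return sigmoid(temperature * temperature_scale * margin)


margin = 0.2  # a point fairly close to the rule boundary
probabilities = [positive_side_probability(margin, t) for t in (0.5, 2.0, 10.0)]
print([round(p, 3) for p in probabilities])  # [0.525, 0.599, 0.881]
```

The same point moves from nearly uninformative to quite confident purely through $T$, which is why an uncalibrated temperature can make an approximate rule look more certain than it is.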

3. Rule Aggregation: Why It Is Not Simple Voting

Code entry points:

rules.py: aggregate_rule_distribution, aggregated_hard_rule_prediction

Multiple rules are not averaged directly. They are aggregated in log space:

$$ \ell_R(y\mid x) = \sum_{r\in R}w_r\log q_r(y\mid x). $$

The aggregated rule distribution is:

$$ q_R(y\mid x) = \frac{\exp(\ell_R(y\mid x))} {\sum_{k\in\{0,1\}}\exp(\ell_R(k\mid x))}. $$

Equivalently:

$$ q_R(y\mid x) \propto \prod_{r\in R}q_r(y\mid x)^{w_r}. $$

The corresponding code is:

log_rule = log_rule + rule.weight * torch.log(distribution.clamp_min(1e-6))
return F.softmax(log_rule, dim=1)

This is a weighted product-of-experts aggregation. If several rules support the same class, their evidence is multiplicatively reinforced. If a high-weight rule strongly rejects a class, that class's probability is substantially suppressed.

This design addresses the rule-fusion problem. Simple averaging can dilute consistent evidence by mixing strong and weak rules. Multiplicative aggregation gives more weight to agreement among rules.
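The difference between multiplicative aggregation and averaging shows up already with two agreeing rules. The sketch below is a framework-free version of the log-space aggregation; `aggregate_average` is a hypothetical baseline added only for comparison.

```python
import math


def aggregate_product_of_experts(rule_dists, weights, eps: float = 1e-6):
    """q_R(y|x) proportional to prod_r q_r(y|x)^{w_r}, computed in log space."""
    log_scores = [0.0, 0.0]
    for (q0, q1), w in zip(rule_dists, weights):
        log_scores[0] += w * math.log(max(q0, eps))
        log_scores[1] += w * math.log(max(q1, eps))
    shift = max(log_scores)  # subtract the max for a stable softmax
    exps = [math.exp(s - shift) for s in log_scores]
    total = sum(exps)
    return (exps[0] / total, exps[1] / total)


def aggregate_average(rule_dists, weights):
    """Hypothetical baseline: weighted arithmetic mean of the rule distributions."""
    total = sum(weights)
    q1 = sum(w * q[1] for q, w in zip(rule_dists, weights)) / total
    return (1.0 - q1, q1)


two_agreeing_rules = [(0.2, 0.8), (0.2, 0.8)]
poe = aggregate_product_of_experts(two_agreeing_rules, [1.0, 1.0])
avg = aggregate_average(two_agreeing_rules, [1.0, 1.0])
print(round(poe[1], 3), round(avg[1], 3))  # 0.941 0.8
```

Averaging two rules that each assign $0.8$ to class 1 still yields $0.8$, while the product yields $0.64/(0.64+0.04)\approx 0.941$: agreement counts as additional evidence.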

The cost is sensitivity to bad rules. A wrong high-weight rule can sharply reduce the probability of the correct class. The use of clamp_min(1e-6) prevents $\log 0$, but it only solves a numerical issue, not the semantic risk of unreliable rules. Possible improvements include calibrating rule weights on validation data, learning rule reliability, or adding uncertainty floors to prevent a single rule from dominating the teacher.
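The sensitivity to a confidently wrong rule, and one possible reading of an "uncertainty floor", can be sketched as follows. The floor used here mixes each rule distribution with the uniform distribution before aggregation; this particular form and the value `floor=0.4` are assumptions for illustration, not part of the implementation.

```python
import math


def aggregate_product_of_experts(rule_dists, weights, eps: float = 1e-6):
    log_scores = [0.0, 0.0]
    for (q0, q1), w in zip(rule_dists, weights):
        log_scores[0] += w * math.log(max(q0, eps))
        log_scores[1] += w * math.log(max(q1, eps))
    shift = max(log_scores)
    exps = [math.exp(s - shift) for s in log_scores]
    total = sum(exps)
    return (exps[0] / total, exps[1] / total)


def apply_floor(q, floor: float):
    """Hypothetical uncertainty floor: mix a rule distribution with the uniform one."""
    return tuple((1.0 - floor) * qi + floor * 0.5 for qi in q)


# Two moderately confident good rules and one confidently wrong, high-weight rule.
rule_dists = [(0.2, 0.8), (0.2, 0.8), (0.99, 0.01)]
weights = [1.0, 1.0, 2.0]

raw = aggregate_product_of_experts(rule_dists, weights)
floored = aggregate_product_of_experts([apply_floor(q, 0.4) for q in rule_dists], weights)
print(raw[1], floored[1])
```

Without the floor, the single bad rule drives the class-1 probability close to zero despite two rules supporting it. With the floor the aggregate is still pulled toward the wrong class but far less sharply, which leaves more room for the student and the labels to disagree.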

4. Teacher Construction: Why the Student Also Enters the Teacher

Code entry point:

rules.py: build_teacher_probs

The teacher is not an independent model. It is constructed from the current student distribution and the aggregated rule distribution:

$$ t_\theta(y\mid x) = \frac{ p_\theta(y\mid x)\,q_R(y\mid x)^\lambda }{ \sum_{k\in\{0,1\}}p_\theta(k\mid x)\,q_R(k\mid x)^\lambda }. $$

In log space, up to an additive term that does not depend on $y$:

$$ \log t_\theta(y\mid x) = \log p_\theta(y\mid x) +\lambda\log q_R(y\mid x) + \mathrm{const}(x). $$

The corresponding code is:

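student_probs = F.softmax(student_logits, dim=1)
rule_probs = aggregate_rule_distribution(x, rule_specs, temperature)
# The two log terms are not shown in this excerpt; the clamping below is an assumption.
log_student = torch.log(student_probs.clamp_min(1e-6))
log_rule = torch.log(rule_probs.clamp_min(1e-6))
teacher_logits = log_student + rule_strength * log_rule
return F.softmax(teacher_logits, dim=1)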

A key question is why the teacher is not simply set to the rule distribution $q_R(\cdot\mid x)$. If the teacher were only the rule distribution, training would reduce to making the model imitate the rules. When rules are wrong, this would directly bias the model in the wrong direction. The current construction keeps $p_\theta(\cdot\mid x)$ in the teacher, so the teacher becomes the current student belief reweighted by rule evidence.

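The rule strength $\lambda$ controls the strength of this reweighting. If $\lambda=0$, the teacher reduces to the student distribution. If $\lambda=1$, the teacher is the normalized product of the student and rule distributions. As $\lambda$ grows, the rule term dominates and, wherever $q_R$ is not uniform, the teacher concentrates on the class favored by the rules, approaching a one-hot target rather than $q_R$ itself.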
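The effect of $\lambda$ can be verified on a single example. The sketch below implements $t_\theta \propto p_\theta\, q_R^\lambda$ for one input in plain Python; the function name `build_teacher` and the example probabilities are illustrative assumptions.

```python
import math


def build_teacher(student, rule, rule_strength: float, eps: float = 1e-6):
    """t(y) proportional to p(y) * q_R(y)^lambda, computed in log space."""
    logits = [
        math.log(max(p, eps)) + rule_strength * math.log(max(q, eps))
        for p, q in zip(student, rule)
    ]
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return tuple(e / total for e in exps)


student = (0.6, 0.4)  # the student leans toward class 0
rule = (0.2, 0.8)     # the rules lean toward class 1

t_zero = build_teacher(student, rule, 0.0)    # equals the student
t_one = build_teacher(student, rule, 1.0)     # proportional to (0.12, 0.32)
t_large = build_teacher(student, rule, 50.0)  # nearly one-hot on class 1
print(t_zero, t_one, t_large)
```

With $\lambda=1$ the teacher is $(0.12, 0.32)/0.44 \approx (0.273, 0.727)$: the rule evidence outweighs the student's mild preference without discarding it.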

This design addresses the conflict between rules and data-driven learning. Rules do not overwrite the model; they reshape its target distribution with adjustable strength. The limitation is that the teacher depends on the current student. Early in training, when the student is weak, the teacher may also be unreliable. If the rule and student are both wrong in the same region, distillation can reinforce the error. A possible improvement is to gate the teacher by confidence or rule-student agreement.

5. Training Objective: Why Use KL Distillation

Code entry points:

trainer.py: train_logic_guided, distill_weight_at

The supervised loss on the labeled batch $B_l$ is:

$$ \mathcal{L}_{\mathrm{sup}} = \frac{1}{|B_l|} \sum_{(x_i,y_i)\in B_l} -\log p_\theta(y_i\mid x_i). $$

The corresponding code is:

supervised_loss = F.cross_entropy(labeled_logits, yb_l)

The distillation term uses the concatenation of labeled and unlabeled inputs:

$$ X_R=\mathrm{concat}(X_l,X_u). $$

The teacher is computed on $X_R$, and the distillation loss is:

$$ \mathcal{L}_{\mathrm{distill}} = \frac{1}{|X_R|} \sum_{x\in X_R} \mathrm{KL}\!\left(t_\theta(\cdot\mid x)\,\|\,p_\theta(\cdot\mid x)\right). $$

The corresponding code is:

x_rule = torch.cat((xb_l, xb_u), dim=0)
teacher_probs = build_teacher_probs(...)
distill_loss = F.kl_div(
    F.log_softmax(student_logits, dim=1),
    teacher_probs,
    reduction="batchmean",
)

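The KL direction is $\mathrm{KL}(\text{teacher}\,\|\,\text{student})$: the teacher is the target, and the student is the distribution being pulled toward it. In F.kl_div, the first argument holds the student's log-probabilities and the second holds the target probabilities, which yields exactly this direction. Because the teacher is held fixed, minimizing this term is equivalent to minimizing cross-entropy against the teacher's soft targets, since the two differ only by the teacher's entropy. The reverse direction, $\mathrm{KL}(\text{student}\,\|\,\text{teacher})$, weights the log-ratio by the student's own probabilities, so it penalizes the student less for neglecting classes it currently considers unlikely, which weakens the intended role of the teacher as a soft target.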
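The asymmetry of the two KL directions, and the link between the forward direction and soft-target cross-entropy, can be checked on one example. The distributions below are arbitrary illustrative values.

```python
import math


def kl(p, q) -> float:
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def cross_entropy(target, pred) -> float:
    return -sum(ti * math.log(pi) for ti, pi in zip(target, pred) if ti > 0.0)


def entropy(p) -> float:
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


teacher = (0.9, 0.1)
student = (0.5, 0.5)

forward = kl(teacher, student)  # direction used by the distillation loss
reverse = kl(student, teacher)  # the reversed direction, for comparison
print(round(forward, 4), round(reverse, 4))

# Forward KL equals soft-target cross-entropy minus the (constant) teacher entropy.
identity_gap = forward - (cross_entropy(teacher, student) - entropy(teacher))
```

The two directions give different values for the same pair of distributions, so swapping the arguments of the loss changes the objective rather than merely its scale.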

The total loss is:

$$ \mathcal{L}_t = (1-\pi_t)\mathcal{L}_{\mathrm{sup}} +\pi_t\mathcal{L}_{\mathrm{distill}}. $$

The distillation weight uses a ramp-up schedule:

$$ \pi_t = \pi_{\max} \cdot \min\left(1,\frac{t}{\max(1,T_{\mathrm{ramp}})}\right). $$

Here, $t$ is the current epoch (not to be confused with the teacher $t_\theta$), $\pi_{\max}$ corresponds to max_distill_weight, and $T_{\mathrm{ramp}}$ corresponds to ramp_up_epochs.

This schedule addresses a stability issue. At the beginning of training, the student has not yet formed a reliable distribution, and therefore the student-dependent part of the teacher is also unreliable. The supervised term first establishes a basic classifier; rule influence is then increased gradually.
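The ramp-up formula and the loss mix translate directly into two short functions. The sketch below follows the equations above; the exact signature of distill_weight_at in trainer.py and whether epochs are counted from 0 or 1 are assumptions.

```python
def distill_weight_at(epoch: int, max_distill_weight: float, ramp_up_epochs: int) -> float:
    """pi_t = pi_max * min(1, t / max(1, T_ramp))."""
    return max_distill_weight * min(1.0, epoch / max(1, ramp_up_epochs))


def total_loss(supervised: float, distill: float, distill_weight: float) -> float:
    """L_t = (1 - pi_t) * L_sup + pi_t * L_distill."""
    return (1.0 - distill_weight) * supervised + distill_weight * distill


schedule = [distill_weight_at(e, max_distill_weight=0.5, ramp_up_epochs=10) for e in (0, 5, 10, 20)]
print(schedule)  # [0.0, 0.25, 0.5, 0.5]
```

The inner max(1, ...) guards against division by zero when ramp_up_epochs is 0; in that case the full weight applies from epoch 1 onward.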

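The teacher is constructed inside a with torch.no_grad() block and uses student_logits.detach(). Although $t_\theta$ depends on $\theta$ through $p_\theta$, the gradient is stopped, so within the current step the teacher is treated as a fixed target $\bar{t}$:

$$ \bar{t}(\cdot\mid x)=\mathrm{stopgrad}\!\left(t_\theta(\cdot\mid x)\right),\qquad \nabla_\theta\,\bar{t}(\cdot\mid x)=0. $$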

This makes the distillation loss update only the student distribution, without backpropagating through the teacher construction. It improves stability by preventing the model from simultaneously moving the target and chasing it. The cost is that rule strength, rule temperature, and rule weights are not adapted through the training loss. Learning rule reliability would require an additional differentiable path or an outer calibration procedure.
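Under this fixed-target treatment, the gradient of the distillation term with respect to the student logits has a simple closed form. The notation here is introduced only for this derivation: $\bar{t}$ is the detached teacher for one input, $f=f_\theta(x)$ are the student logits, and $p=\mathrm{softmax}(f)$.

$$ \mathrm{KL}(\bar{t}\,\|\,p) = \sum_k \bar{t}_k\log\bar{t}_k - \sum_k \bar{t}_k\log p_k, \qquad \frac{\partial \log p_k}{\partial f_j}=\mathbf{1}[k=j]-p_j. $$

$$ \frac{\partial}{\partial f_j}\,\mathrm{KL}(\bar{t}\,\|\,p) = -\sum_k \bar{t}_k\left(\mathbf{1}[k=j]-p_j\right) = p_j-\bar{t}_j. $$

The per-sample gradient on the logits is therefore the difference between the student and the teacher, and it vanishes exactly when $p=\bar{t}$. If the teacher were not detached, an extra term from $\partial \bar{t}/\partial f$ would appear, which is the moving-target effect that the detachment avoids.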

6. Unlabeled Samples: Why They Can Contribute to Training

Code entry point:

trainer.py: train_logic_guided

Unlabeled samples do not enter the supervised loss:

$$ x_u\notin B_l, \quad\text{so } \mathcal{L}_{\mathrm{sup}} \text{ does not depend on } x_u. $$

They enter through the rule-distillation path:

$$ x_u \longrightarrow q_R(\cdot\mid x_u) \longrightarrow t_\theta(\cdot\mid x_u) \longrightarrow \mathcal{L}_{\mathrm{distill}}. $$

This addresses the low-label setting. Unlabeled samples have no observed labels, but rules can still provide distributional preferences on these samples. The teacher then turns these preferences into soft targets for the student.

The important caveat is that the contribution of unlabeled data depends on rule quality and teacher construction. If the rules are systematically wrong in some region, more unlabeled data can amplify the wrong bias. The method is therefore not simply improved by adding more unlabeled samples; the reliability of rule guidance becomes increasingly important.
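The two paths can be made explicit in a small framework-free sketch: the supervised term ignores unlabeled points, while the distillation term changes when they are added. The functions `toy_student` and `toy_rule` are hypothetical stand-ins for the model and the aggregated rules.

```python
import math


def sigmoid(z: float) -> float:
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def toy_student(x):
    """Hypothetical untrained student: uniform over both classes."""
    return (0.5, 0.5)


def toy_rule(x):
    """Hypothetical aggregated rule: favors class 1 when x1 > 0."""
    q1 = sigmoid(x[0])
    return (1.0 - q1, q1)


def teacher(p, q, lam: float):
    scores = [pi * qi ** lam for pi, qi in zip(p, q)]
    total = sum(scores)
    return [s / total for s in scores]


def kl(t, p) -> float:
    return sum(ti * math.log(ti / pi) for ti, pi in zip(t, p) if ti > 0.0)


def batch_losses(labeled, unlabeled, lam: float = 1.0):
    # Supervised term: labeled pairs only.
    sup = sum(-math.log(toy_student(x)[y]) for x, y in labeled) / len(labeled)
    # Distillation term: labeled and unlabeled inputs concatenated.
    x_rule = [x for x, _ in labeled] + list(unlabeled)
    distill = sum(
        kl(teacher(toy_student(x), toy_rule(x), lam), toy_student(x)) for x in x_rule
    ) / len(x_rule)
    return sup, distill


labeled = [((0.0, 0.0), 1)]
unlabeled = [(2.0, 0.0)]
sup_a, distill_a = batch_losses(labeled, [])
sup_b, distill_b = batch_losses(labeled, unlabeled)
print(sup_a, distill_a, sup_b, distill_b)
```

Adding the unlabeled point leaves the supervised loss unchanged but makes the distillation loss positive, because the rule expresses a preference at that point. Whether that preference is correct is exactly the caveat raised above.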

7. Evaluation Design: How to Check Whether Rules Matter

Code entry points:

trainer.py: evaluate
rules.py: hard_rule_prediction_for_rule, aggregated_hard_rule_prediction

The function evaluate reports not only model accuracy but also rule accuracy and model-rule agreement:

$$ \mathrm{Acc}_{\mathrm{rule}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\!\left[\hat{y}_R(x_i)=y_i\right]. $$

$$ \mathrm{Agree}_{\mathrm{model,rule}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\!\left[\hat{y}_\theta(x_i)=\hat{y}_R(x_i)\right]. $$

Here, $\hat{y}_R(x)$ is the hard prediction of the aggregated rules and $\hat{y}_\theta(x)$ is the class predicted by the model. These metrics are essential. Model accuracy alone does not reveal whether performance changes are caused by rule guidance or by ordinary training variability. Rule accuracy measures whether the rules are reliable. Model-rule agreement measures whether the student is being pulled toward the rules.
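Both metrics are plain match rates between two prediction lists. The sketch below uses made-up predictions; the helper name `match_rate` is hypothetical and the actual evaluate function may compute these differently.

```python
def match_rate(first, second) -> float:
    """Fraction of positions where the two label sequences agree."""
    if len(first) != len(second):
        raise ValueError("sequences must have the same length")
    return sum(int(a == b) for a, b in zip(first, second)) / len(first)


labels = [0, 1, 1, 0]
rule_preds = [0, 1, 0, 0]   # hard predictions of the aggregated rules
model_preds = [0, 1, 0, 1]  # hard predictions of the student

rule_accuracy = match_rate(rule_preds, labels)
model_accuracy = match_rate(model_preds, labels)
model_rule_agreement = match_rate(model_preds, rule_preds)
print(rule_accuracy, model_accuracy, model_rule_agreement)  # 0.75 0.5 0.75
```

In this made-up case the model agrees with the rules more often than it agrees with the labels, which is the pattern to look for when rules may be pulling the student away from the data.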

The current evaluation can be strengthened. It does not directly report teacher accuracy against labels or teacher entropy. If the teacher is overconfident and frequently wrong, KL distillation can be harmful. Useful additional diagnostics include:

teacher accuracy on the validation set
teacher entropy
rule-student disagreement regions
sensitivity curves under bad rules

These diagnostics would help distinguish useful rule information from rule-induced bias.
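The first two diagnostics are cheap to compute once teacher probabilities are available. The sketch below is one possible form; the function name, the input layout, and the tie-breaking toward class 0 are assumptions.

```python
import math


def teacher_diagnostics(teacher_probs, labels):
    """Return (teacher accuracy, mean teacher entropy in nats) for binary teachers."""
    correct = 0
    entropy_sum = 0.0
    for (t0, t1), y in zip(teacher_probs, labels):
        correct += int(int(t1 > t0) == y)  # ties resolve to class 0
        entropy_sum += -sum(t * math.log(t) for t in (t0, t1) if t > 0.0)
    n = len(labels)
    return correct / n, entropy_sum / n


teacher_probs = [(0.9, 0.1), (0.4, 0.6), (0.2, 0.8)]
labels = [0, 1, 0]
accuracy, mean_entropy = teacher_diagnostics(teacher_probs, labels)
print(accuracy, mean_entropy)
```

Low entropy combined with low accuracy is the warning sign described above: a teacher that is confident and frequently wrong.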

8. What Questions the Current Rule Sets Answer

Code entry point:

rules.py: RULE_SETS

The implementation contains five rule sets:

single_good: a single rule with the correct direction
single_bad: a single rule with the wrong direction
multi_good: several broadly reasonable rules
multi_mixed: reasonable rules mixed with an incorrect rule
multi_bad: several systematically wrong rules

These settings address different questions:

single_good tests whether one approximately correct rule helps low-label learning.
single_bad tests whether the teacher can be misled by an incorrect rule.
multi_good tests whether multiple rules are more stable than one rule.
multi_mixed tests whether product-of-experts aggregation is robust to a single incorrect rule among otherwise reasonable ones.
multi_bad tests whether systematically wrong knowledge can overpower supervised learning.

The most informative settings are multi_mixed and multi_bad. If the method works only under single_good, it depends heavily on rule correctness. If it remains stable under multi_mixed, the teacher construction has some robustness to imperfect knowledge.

9. Main Limitations and Possible Improvements

First, rule weights are fixed. RuleSpec.weight is manually specified and cannot be calibrated from data. A natural extension is to learn rule reliability or update $w_r$ from validation statistics.

Second, rule temperature is fixed. rule_temperature and temperature_scale control the softness of each rule, but the implementation does not assess whether these values are appropriate. A calibration curve or boundary-error-based temperature adjustment would make the rule distribution more defensible.

Third, the teacher is fixed within each step. detach and no_grad improve stability, but they block gradients from the distillation loss to the rule-construction parameters. If rule weights are to be learned, an additional learnable module or outer optimization loop is needed.

Fourth, the diagnostics are not yet sufficient for error localization. The current implementation reports rule accuracy and model-rule agreement, but it does not diagnose the teacher directly. In bad-rule experiments, it is important to know whether performance degradation comes from the rule distribution, the teacher construction, or an overly large distillation weight.

Fifth, the rules are linear geometric rules. Real tasks may require nonlinear, compositional, group-level, or statistical constraints. A transfer to other settings should preserve the interface of rule-to-distribution, distribution-to-teacher, and teacher-to-student distillation, rather than copying the specific rule form.

10. Transferring the Method to Feature Selection

For feature selection, the transferable object is not the two-dimensional classification rule, but the interface:

$$ \text{feature prior} \longrightarrow q_R(S\mid x) \longrightarrow t_\theta(S\mid x) \longrightarrow \mathrm{KL}\!\left(t_\theta(S\mid x)\,\|\,p_\theta(S\mid x)\right). $$

Here, $S$ denotes the selected feature subset, $p_\theta(S\mid x)$ is the selection distribution produced by the current feature selector, and $q_R(S\mid x)$ is the rule distribution induced by prior knowledge.

In feature selection, rules may express constraints such as:

some features should appear as a group
redundant features should not be selected together
certain features are more reliable under specific conditions
selected features should satisfy monotonicity, sparsity, or domain constraints

For example, if $z_j\in\{0,1\}$ indicates whether feature $j$ is selected, then a co-occurrence rule can be represented as:

$$ q_R(z_a=1,z_b=1\mid x)\uparrow. $$

This increases the rule distribution probability that features $a$ and $b$ are selected together.

A redundancy constraint can be represented as:

$$ q_R(z_a=1,z_b=1\mid x)\downarrow. $$

This decreases the rule distribution probability that features $a$ and $b$ are selected together.
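One way to make the arrows concrete is a pairwise factor over the two selection indicators. The sketch below assumes the parametrization $q_R(z_a,z_b)\propto\exp(\beta z_a z_b)$, where $\beta>0$ encodes co-occurrence and $\beta<0$ encodes redundancy; this form and the name `pairwise_rule_distribution` are illustrative assumptions, not part of logic_net_toy.

```python
import math
from itertools import product


def pairwise_rule_distribution(beta: float):
    """q_R(z_a, z_b) proportional to exp(beta * z_a * z_b) over the four joint states."""
    scores = {z: math.exp(beta * z[0] * z[1]) for z in product((0, 1), repeat=2)}
    total = sum(scores.values())
    return {z: s / total for z, s in scores.items()}


neutral = pairwise_rule_distribution(0.0)         # uniform: 0.25 per state
co_occurrence = pairwise_rule_distribution(1.0)   # raises q_R(z_a=1, z_b=1)
redundancy = pairwise_rule_distribution(-1.0)     # lowers q_R(z_a=1, z_b=1)
print(neutral[(1, 1)], co_occurrence[(1, 1)], redundancy[(1, 1)])
```

The magnitude of $\beta$ plays the same role as the rule temperature in the toy problem: it sets how strongly the prior is asserted.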

The transfer can follow four steps:

Define a feature rule specification, analogous to RuleSpec.
Convert each feature rule into a selection distribution $q_r(S\mid x)$, analogous to rule_distribution.
Aggregate multiple feature rules into $q_R(S\mid x)$, analogous to aggregate_rule_distribution.
Construct a teacher $t_\theta(S\mid x)$ from the current selector distribution $p_\theta(S\mid x)$ and the rule distribution, analogous to build_teacher_probs.

The central transferable idea is teacher reweighting:

$$ t_\theta(S\mid x) \propto p_\theta(S\mid x)\,q_R(S\mid x)^\lambda. $$

Thus, rules do not replace the feature selector. They reweight the selector’s current preference distribution.

The main redesign concerns the distribution space. In logic_net_toy, there are only two classes, so $q_R(\cdot\mid x)$ is a distribution over two outcomes. In feature selection with $d$ candidate features, the subset space has size $2^d$, so direct enumeration of all $S$ is infeasible for all but very small $d$. Practical approximations include:

independent Bernoulli selection: $p_\theta(S\mid x)=\prod_j p_\theta(z_j\mid x)$
group-level selection: select feature groups first, then refine within each group
top-k relaxed selection: approximate discrete selection with a continuous relaxation
Gumbel-Softmax or Concrete distributions: preserve differentiable training paths
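For the independent Bernoulli option, the teacher reweighting factorizes if the rule prior is also assumed to factorize over features, which is a simplifying assumption that excludes pairwise rules. In that case $t_\theta(z_j=1)\propto p_j\,q_j^\lambda$ per feature, at cost $O(d)$ instead of $2^d$. The sketch below checks the factorized form against brute-force enumeration for a small $d$; all names and values are illustrative, and probabilities are assumed to lie strictly between 0 and 1.

```python
from itertools import product


def factorized_teacher(p, q, lam: float):
    """Per-feature teacher marginals t(z_j = 1) when both p and q_R factorize."""
    marginals = []
    for pj, qj in zip(p, q):
        on = pj * qj ** lam
        off = (1.0 - pj) * (1.0 - qj) ** lam
        marginals.append(on / (on + off))
    return marginals


def enumerated_teacher_marginals(p, q, lam: float):
    """Brute-force reference: enumerate all 2^d subsets (only viable for tiny d)."""
    d = len(p)
    weights = {}
    for z in product((0, 1), repeat=d):
        w = 1.0
        for zj, pj, qj in zip(z, p, q):
            w *= pj * qj ** lam if zj else (1.0 - pj) * (1.0 - qj) ** lam
        weights[z] = w
    total = sum(weights.values())
    return [sum(w for z, w in weights.items() if z[j]) / total for j in range(d)]


selector_probs = [0.5, 0.3, 0.8]  # p_theta(z_j = 1 | x)
prior_probs = [0.9, 0.5, 0.2]     # assumed factorized rule prior q_j(z_j = 1)

fast = factorized_teacher(selector_probs, prior_probs, lam=1.5)
slow = enumerated_teacher_marginals(selector_probs, prior_probs, lam=1.5)
print(fast, slow)
```

A neutral prior of $0.5$ leaves a feature's selection probability unchanged, mirroring the $\lambda=0$ case of the toy teacher. Rules that couple features break the factorization and would need one of the other approximations in the list.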

The main lesson from this toy problem is not the specific two-dimensional rule. It is the training pattern: convert domain knowledge into a probabilistic bias over the selection space, construct a teacher distribution, and distill that teacher into the selector instead of hard-coding the rules as irreversible filters.