Symbolic metaprogram search improves learning efficiency and explains rule learning in humans
Humans acquire a wide variety of concepts throughout their lives, many of which are well-described as rules, i.e. symbolic expressions in a kind of mental language or language of thought1. Category learning, for example, can be described as learning a rule which accepts or rejects potential category members based on their individual features2,3,4,5. Similarly, procedure learning can be described as acquiring a rule for which behaviors to sequence together in what order6,7,8. Theory learning can also be described as acquiring a network of rules explaining the relationships between various causes and effects9,10,11. The exact scope of human rule-learning is unclear: even if rules can describe a wide variety of concepts12,13, theories of rule-learning face a number of challenges14,15,16. Moreover, exactly how many concepts are actually represented using rules is an often difficult empirical question, as seen, e.g., in debates over how humans process past-tense constructions in English17,18. Even so, rules are a significant part of humans’ cognitive landscape.
Moreover, many of the rules people learn are algorithmically rich. They go beyond associative pairings or even simple logical or arithmetic formulae to encode a series of steps with a variety of algorithmic content19. For example, the rules children learn for basic arithmetic require pattern matching, conditional reasoning, iteration, recursion, maintaining state, and caching partial results. Beyond logic and mathematics, these sorts of complex rules appear in domains as varied as game playing, social reasoning, food preparation, and natural language understanding.
Theories of how people acquire algorithmically rich rules must not only explain task performance but must also capture other hallmarks of human learning. While there are many, we focus here on three. First, the representations should be interpretable in ways that support the kinds of composition, explanation, sharing, and reuse we see in humans1,20,21,22,23. Second, learning should also be possible from sparse data on the scale that people realistically encounter24,25. Third, learning should require only moderate amounts of computation and search, consistent with human limits on thinking time and cognitive resources25,26.
One theory of rule-learning treats the language of thought as a sort of mental programming language, such that learning proceeds by constructing program-like representations. For example, the concept LIFT could be a simple program combining primitives for CAUSE, GO, and UP to mean, roughly, “cause to go up”27. This approach makes human learning analogous12 to program induction28—discovering programs to explain data. Humans learning new rules, much like computer programmers writing new programs, fluidly operate over a broad space of computations and appear to efficiently construct interpretable structures from sparse data19. Symbolic programs provide interpretable hypotheses by decomposing complex computations into discrete and semantically meaningful parts—i.e. simpler computations—that support modular explanation, reuse, and sharing29. Program-induction models are also typically data efficient, learning from relatively few observations. Human learning has been modeled as program induction in many domains, including structure discovery30, number acquisition31, rule learning32, physical reasoning33, memory34, and cultural transmission35. Such models have even been applied in domains seemingly resistant to program-based approaches, such as perceptual learning36,37,38,39, language learning40,41,42, and motor learning43,44.
Despite successes, program induction models face a fundamental obstacle: the hard problem of search. The space of possible programs grows exponentially in both program length and the number of primitive operators; it is unclear how to narrow the search space to prevent combinatorial explosions45. While continuous weights and differentiable error functions scale gradient-based search to arbitrarily complex neural networks46, no effective methods exist for the highly discontinuous spaces of symbolic programs. The need for effective search mechanisms is so intense that it has been hypothesized as a motivating force behind play47 and childhood48, highlighting just how significant it is that program induction models lack this ability.
To help address this problem, this paper focuses on a hypothesis about a class of representations which might help people search efficiently over program-like content. More specifically, we hypothesize that in addition to object-level content, people directly incorporate sophisticated forms of reasoning into their hypotheses. We predict that doing so reshapes inductive biases by simplifying relevant hypotheses49 and making them easier to find.
This hypothesis does not fit cleanly into the classic Marr levels50. It makes a theoretical claim not about a general computational problem or specific representation but instead about a class of representations, i.e. something between a computational and algorithmic-level claim. While many algorithmic-level details, such as the specific search algorithm, the particular domain, and even the content of individual metaprimitives, are significantly less important to our claims, we assume that algorithmic concept learning does involve a serial search process that cannot involve too many steps. These are algorithmic-level constraints on human thinking and we seek an algorithm that is consistent with them.
We therefore instantiate a version of this hypothesis in a model called MPL (MetaProgram Learner), which incorporates metaprograms—programs that revise programs—into its representation language. We test MPL against humans alongside recent and classic baselines on a benchmark of 100 program induction problems.
Before describing MPL, we present the task domain and outline our benchmark. The domain consists of list functions51,52,53,54, where learners encounter datasets pairing input and output lists of numbers. To see how learning in this domain might resemble program induction, consider \(\mathcal{F}\), a list function where:
$$[1,\,3,\,9,\,7] \xrightarrow{\mathcal{F}} [1,\,1,\,3,\,3,\,9,\,9,\,7,\,7] \tag{1}$$
Brief observation leads most people to a strong hypothesis. They notice that values in the output appear twice consecutively, suggesting duplication. Each input element also appears in the output in the same order. Together, these features suggest an iterative process like: repeat every element two times in order of appearance. This rule seemingly has no strong competitors, a sense that grows after seeing more examples:
$$[1,\,3,\,9,\,7] \xrightarrow{\mathcal{F}} [1,\,1,\,3,\,3,\,9,\,9,\,7,\,7] \tag{2}$$
$$[6,\,9,\,2,\,8,\,0,\,5] \xrightarrow{\mathcal{F}} [6,\,6,\,9,\,9,\,2,\,2,\,8,\,8,\,0,\,0,\,5,\,5] \tag{3}$$
$$[9,\,2] \xrightarrow{\mathcal{F}} [9,\,9,\,2,\,2] \tag{4}$$
People see up to eleven examples in our experiments, but nearly all participants acquire this rule within three examples. Program induction models might hypothesize that learners represent it with a program like:
$$\mathcal{F} = (\lambda\ \texttt{xs}\ (\texttt{if}\ (\texttt{empty}\ \texttt{xs})\ \texttt{xs}\ [(\texttt{head}\ \texttt{xs}),\,(\texttt{head}\ \texttt{xs}) \mid (\mathcal{F}\ (\texttt{tail}\ \texttt{xs}))])) \tag{5}$$
(λ xs …) uses the λ operator from λ-calculus, which here creates a function taking a list, xs, as input. (if (empty xs) …) tests whether xs is empty. If so, \(\mathcal{F}\) returns xs; there is nothing to duplicate. Otherwise, [(head xs), (head xs) ∣ …] creates a list repeating xs’ first element, or head, twice ([x,… ∣ zs] prepends x,… to the list zs). (\(\mathcal{F}\) (tail xs)) completes the list by recursively applying \(\mathcal{F}\) to xs’ remaining items, or tail.
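For concreteness, the same hypothesis can be rendered in an ordinary programming language. The sketch below is ours, not part of MPL's primitive language; the function name f and the Python list encoding are illustrative assumptions.

```python
def f(xs):
    """Eq. (5) rendered in Python: repeat every element twice, preserving order."""
    if not xs:                       # (if (empty xs) xs ...): an empty list has nothing to duplicate
        return xs
    # [(head xs), (head xs) | (F (tail xs))]: duplicate the head, then recurse on the tail
    return [xs[0], xs[0]] + f(xs[1:])

assert f([1, 3, 9, 7]) == [1, 1, 3, 3, 9, 9, 7, 7]
```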
Some list functions are harder to learn. Consider \(\mathcal{G}\):
$$[7,\,9,\,0,\,2,\,6,\,8,\,3,\,4,\,6] \xrightarrow{\mathcal{G}} [0,\,9,\,7,\,4,\,6,\,3] \tag{6}$$
Some people may notice that the output contains a subset of the input elements, but there seems to be no obvious pattern. Unlike with \(\mathcal{F}\), it is difficult to form strong hypotheses without more data:
$$[7,\,9,\,0,\,2,\,6,\,8,\,3,\,4,\,6] \xrightarrow{\mathcal{G}} [0,\,9,\,7,\,4,\,6,\,3] \tag{7}$$
$$[1,\,7,\,8,\,2,\,5,\,6,\,1] \xrightarrow{\mathcal{G}} [8,\,7,\,1,\,4,\,5,\,1] \tag{8}$$
$$[6,\,7,\,1,\,3,\,2,\,0,\,8,\,9,\,4,\,5] \xrightarrow{\mathcal{G}} [1,\,7,\,6,\,4,\,2,\,8] \tag{9}$$
Many people remain puzzled even after studying these examples. About half of our participants never acquire a rule for \(\mathcal{G}\); the others usually need three to five examples. Those who do acquire it may notice several unlikely coincidences. First, \(\mathcal{G}\) does not trivially map every input to the same output. Second, input length varies but the output always has six elements. Third, many but not all input elements appear in the output (perhaps \(\mathcal{G}\) filters elements using some test or shuffling operator). Fourth, shared elements differ in order, so filtering seems unlikely. Fifth, fixed positions in the input are copied to fixed positions in the output. Element 1 becomes element 3, 2 stays 2, 3 becomes 1, 5 stays 5, and 7 becomes 6. Finally, output element 4 is always 4.
Each observation identifies a simple pattern produced by aligning shared structure in the data. Putting them together leads to the rule: the output lists input elements 3, 2, and 1, then the number 4, then input elements 5 and 7. While this rule explains the data, it seems unusual. We can nevertheless model it as the program:
$$\mathcal{G} = (\lambda\ \texttt{xs}\ (\texttt{swap}\ 3\ 1\ (\texttt{replace}\ 4\ 4\ (\texttt{cut}\ 6\ (\texttt{take}\ 7\ \texttt{xs}))))) \tag{10}$$
It again uses λ to create a function binding xs, (λ xs …). Working from the inside out in the remaining expression, (take 7 xs) takes the first seven elements, (cut 6 …) removes the sixth, (replace 4 4 …) replaces the fourth with a 4, and (swap 3 1 …) swaps the first and third. Composing a few simple operations represents an unlikely concept that can still be learned from sparse data.
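As a concreteness check, here is a Python sketch of Eq. (10) under a 1-indexed reading of take, cut, replace, and swap; the function name g and the encoding are illustrative assumptions rather than MPL's own definitions.

```python
def g(xs):
    """Eq. (10) rendered in Python: (swap 3 1 (replace 4 4 (cut 6 (take 7 xs))))."""
    ys = xs[:7]                   # (take 7 xs): keep the first seven elements
    ys = ys[:5] + ys[6:]          # (cut 6 ...): delete the sixth element
    ys[3] = 4                     # (replace 4 4 ...): overwrite the fourth element with 4
    ys[0], ys[2] = ys[2], ys[0]   # (swap 3 1 ...): exchange the first and third elements
    return ys

assert g([7, 9, 0, 2, 6, 8, 3, 4, 6]) == [0, 9, 7, 4, 6, 3]
```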
Like other classic domains such as numerical functions55,56,57,58 and Boolean functions2,3,5,49,59, list functions might superficially seem abstract and focused on a narrow corner of human cognition, but they are well suited to empirical study and modeling of how people learn rules. Numbers and sequences both have a long and productive history in the study of human learning8,37,60,61,62,63. List functions are in fact particularly useful for testing the sorts of program-learning models of concept learning which have now been deployed to explain rule-learning in dozens of domains11,12,19,27,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44. They provide a general and well-controlled setting where problems vary widely in difficulty and algorithmic content (the domain is Turing-universal) and can be tested easily in humans and machines. Indeed, many bear a strong resemblance to everyday tasks such as sifting out junk mail (filtering); counting the books on a shelf (folding separate items into a composite result); alphabetizing a list of names (sorting by a criterion); and decorating a tray of cupcakes (mapping a transformation over a collection of items). Being analogous, however, does not mean that we claim that the tasks are equivalent. In more naturalistic cases, it seems likely that context-specific knowledge effects64 aid learning, but our results (including the replication described in Supplemental Note 8) show that in this domain, as in many others, people can rapidly acquire and apply rules from sparse data. Human performance on our task in particular is far above chance and remains interesting in its own right.
We conducted a study of human and machine concept learning by constructing a benchmark of 100 list functions that vary widely in learnability (Fig. 1). The set includes \(\mathcal{F}\) and \(\mathcal{G}\), so the discussion above is relevant to the entire benchmark. Our primary goals in constructing this benchmark were to collect functions that: demonstrate broad variation in learning difficulty for humans (i.e. are not dominated by floor/ceiling effects); can be described with a small set of primitives; and are easy enough to learn that the performance of program induction models is likewise not dominated by floor/ceiling effects. Moreover, testing our hypothesis requires problems where we can compare solutions which do and do not incorporate representations of structured reasoning. Most benchmark problems thus emphasize reasoning techniques which MPL can leverage during search. We compared MPL’s performance on this benchmark to leading alternative explanations of human rule-learning.
Building on the idea of learning over program-like representations, our approach to concept learning draws on three core insights inspired by the techniques of human programmers19.
First, most program learning models search over programs composed of object-level primitives, such as head and take in Eqs. (5) & (10). Assuming search operators are fully parameterized, programs can also be described using the decisions required to produce them during search. These decisions describe how to construct a program, namely by repeating the search process producing it. While this process is typically implicit in search algorithms, programmers often consider it explicitly, discussing transformations and their effects—e.g. swapping iteration for recursion or extracting repeated code into a shared function—in addition to actual code.
Second, many search algorithms apply a single generic operator, e.g. enumerating from a grammar or sampling from a distribution. Some bias search toward the best hypotheses discovered so far: consider Markov chain Monte Carlo’s accept/reject step65; particle filtering’s resampling66; or genetic programming’s tournaments67. Even so, learning inefficiently relies on accumulating small, often random, local changes. By contrast, human programmers can flexibly combine hundreds of structured techniques for revising programs68. Many cater to particular problems and specify context-dependent solutions, much like high-level actions in hierarchical planning69.
Third, many search algorithms begin without regard for available data, e.g. starting from the lexicographically first program or a random sample. Such hypothesis-driven learning generates proposals independently of data70. These methods are very general but must discover relevant structure by chance rather than by inferring it from data. By contrast, data-driven learning generalizes input/output pairs directly into a program using some inference technique, e.g. for detecting recurrent structure. It minimizes search but requires strong assumptions that sharply constrain which programs can be learned from what data. Human programming techniques supersede both approaches in many ways: they are often designed to expose latent structure and can be flexibly composed to apply to nearly any problem. They can thus be rooted directly in the data whose structure needs to be explained (e.g. “recursive data can be rewritten like so” or “if data contains repetition with minor differences, perhaps those differences can be abstracted away.”).
Given these observations, we hypothesize that people extend languages of object-level primitives with patterns of structured transformation called metaprimitives. Some metaprimitives might simplify repeated structure; others might memorize data for further analysis or to encode exceptions. On this view, primitives and metaprimitives can be freely composed into expressions called metaprograms that combine object-level content and structured transformation. Metaprimitives operate on structures built of primitives, so a metaprogram can always be evaluated to produce a program without metaprimitives. That is, metaprimitives provide an alternative way of expressing certain programs, shifting the inductive bias so that they become easier to describe.
While the introduction of new object-level primitives also shifts the inductive bias49,71, metaprimitives can capture different kinds of bias from object-level primitives. In particular, object-level primitives cannot leverage the internal structure of their arguments; they must treat those arguments as black boxes. By contrast, metaprimitives are program transformations and so can change their behavior based on this internal structure. For example, the MPL model includes a metaprimitive called AntiUnify, whose primary effect is to introduce variables into programs. There are many ways to do this, and considering them all would require a long search. AntiUnify, however, uses the structure of the input program to decide where to introduce variables without additional search. That is, AntiUnify uses the structure of its arguments to ignore portions of the search space which methods using only object-level primitives would otherwise have to consider.
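To make this concrete, the sketch below shows one simple way anti-unification can use argument structure to decide where variables go. It operates over terms encoded as nested tuples and illustrates the general technique of least-general generalization; it is an assumption-laden sketch, not MPL's AntiUnify implementation.

```python
def antiunify(t1, t2, table=None):
    """Least-general generalization of two terms encoded as nested tuples
    ('op', arg1, arg2, ...). Matching structure is kept; mismatches become variables."""
    if table is None:
        table = {}
    if t1 == t2:
        return t1
    if (isinstance(t1, tuple) and isinstance(t2, tuple)
            and len(t1) == len(t2) and t1[0] == t2[0]):
        # Same operator and arity: keep the operator and generalize argument by argument.
        return (t1[0],) + tuple(antiunify(a, b, table) for a, b in zip(t1[1:], t2[1:]))
    # Structural mismatch: introduce a variable, reusing it for a repeated pair of subterms.
    return table.setdefault((repr(t1), repr(t2)), f"x{len(table)}")
```

For instance, anti-unifying ('pair', 1, 1) with ('pair', 3, 3) keeps the operator and maps both mismatches to the same variable, yielding ('pair', 'x0', 'x0'), without searching over alternative placements of variables.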
Metaprimitives thus take advantage of all three insights above. First, they make program transformations an explicit part of the language instead of leaving them only implicitly available as search operators. Second, just as languages typically contain many primitives, they can also contain many metaprimitives, each expressing a different program manipulation. Third, if some metaprimitives can memorize data, other metaprimitives can extract information from those data and learn more efficiently than using primitives alone by introducing different kinds of inductive bias. By encoding search operators reminiscent of data-driven search and embedding them into the language of a hypothesis-driven learner, metaprimitives perhaps combine the best of both approaches.
To evaluate these ideas, we implement MPL, a symbolic learner which extends traditional program induction approaches by incorporating metaprimitives. We seek to investigate the usefulness of the metaprimitive approach rather than to make strong claims about any specific metaprimitive. The particular metaprimitives implemented here (Table 1; Supplementary Note 2) thus capture relatively simple patterns of reasoning inspired by operators in inductive logic programming72, analytical induction73, automated theorem proving74, and refactoring techniques in software engineering68. In practice, some metaprimitives do more work than others but each describes an operation for reasoning about program structure.
Program-induction-based models of concept learning often use languages whose primitives (and in this case, metaprimitives) are closely related to the concepts being studied. This can be seen, for example, in recent work on learning in the domains of number31,75, logic49,76, and geometry37,77, among others. The claim is not that these limited languages constitute a learner’s entire mental repertoire, nor that the studied domain is the only one in which humans are capable of learning. Nor is the claim that the simple existence of computational primitives (or metaprimitives) is enough to explain human learning, or that any existing model is sufficient to explain all of human learning. They are instead case studies comparing a plausible set of primitives and learning dynamics against human learners in a particular domain. We take the same approach in introducing metaprimitives.
Metaprimitives are useful for working with list functions because they capture patterns of reasoning (e.g. simple forms of structure mapping, composition, generalization) that are useful for reasoning about lists specifically or about programs generally, similar to human code manipulation techniques. Previous learning systems embed these operators directly into search algorithms and apply them in stereotypical patterns. Explicit metaprimitives allow MPL significantly more flexibility than previous models.
Figure 2A–C illustrates MPL using \(\mathcal{F}\), described earlier. Given examples (Fig. 2A), MPL learns a metaprogram (Fig. 2B) combining primitives—namely the empty program, ε—and metaprimitives. MemorizeAll adds data directly to a program, making their latent structure available to other metaprimitives. Recurse hypothesizes that rules involving certain limited transformations of linearly recursive structures (e.g. elementwise transformations of lists, unary numbers, strings) can themselves be recursively decomposed into simpler rules. Here, it captures people’s observation that each input element explains two consecutive output elements by aligning and unrolling input/output lists. This change reveals latent structure but introduces many new rules. AntiUnify is helpful here. It uses anti-unification—an important program synthesis technique78,79—to compute a least-general generalization that systematically aligns shared structure across rules into a single general rule. For example, comparing F[1∣[3, 9, 7]] ≈ [1, 1∣(F[3, 9, 7])] and F[3∣[9, 7]] ≈ [3, 3∣(F[9, 7])] reveals a common structure: the first element is repeated twice, and the rest of the list is processed recursively. AntiUnify discovers a corresponding rule, F [x ∣ y] ≈ [x, x ∣ (F y)], by similarly aligning common structure and generalizing over differences.
Because metaprimitives represent program transformations, applying a series of metaprimitives produces intermediate results and then a final program that both explains the data and can be applied to novel inputs (Fig. 2C). Because MPL can freely mix primitives and metaprimitives, it can also learn programs directly, e.g. for problems where available metaprimitives are not applicable.
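Schematically, and only as a sketch of the interface rather than MPL's implementation, metaprogram evaluation can be thought of as folding a sequence of program-to-program transformations over the empty program; the names memorize_all, recurse, and anti_unify in the comment below are hypothetical stand-ins for the metaprimitives discussed above.

```python
def evaluate_metaprogram(metaprimitives, program=None):
    """Apply each metaprimitive (a function from programs to programs) in turn,
    starting from the empty program; the result is an ordinary object-level program."""
    for transform in metaprimitives:
        program = transform(program)
    return program

# e.g. evaluate_metaprogram([memorize_all(examples), recurse, anti_unify])
# would embed the data, unroll its recursive structure, then generalize the result.
```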
Figure 2D–F repeats the process for \(\mathcal{G}\). While \(\mathcal{G}\) is complex to describe in English, its metaprogram is even simpler than \(\mathcal{F}\)’s. Lacking recursive structure, \(\mathcal{G}\) can be described using structural alignment alone. After encoding data with MemorizeAll, a call to AntiUnify is sufficient. The resulting program, however, is more complex than the one for \(\mathcal{F}\). MPL is sensitive to this complexity, which helps to explain why \(\mathcal{G}\) is harder to learn than \(\mathcal{F}\). While the metaprogram is simple, the resulting program is complex enough that learning it requires observing a sufficient amount of data.
To balance simplicity and fit, MPL models learning as MAP inference in a Bayesian posterior over metaprograms. Computing the posterior exactly is intractable; MPL approximates it using Markov Chain Monte Carlo (MCMC) over programs42,76 extended to the space of metaprograms. Monte Carlo methods are notable as rational process models80, addressing computational-level concerns with psychologically plausible methods. This approach might appear to suffer from the problem that we identified earlier of learning inefficiently via small, local changes. Searching over metaprograms, however, helps to address this problem. Because metaprimitives can encode arbitrary program transformations, even small changes can have large, non-local impacts on the resulting program.
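A minimal sketch of one step of this kind of search is shown below, assuming a symmetric proposal distribution and an unnormalized log posterior (e.g. a description-length prior over metaprograms plus the likelihood of the observed input/output pairs). It illustrates the generic Metropolis-Hastings accept/reject dynamics rather than MPL's exact proposals or scoring.

```python
import math
import random

def mh_step(current, current_score, propose, log_posterior):
    """One Metropolis-Hastings step over metaprograms: a small edit to the
    metaprogram can still produce a large, non-local change in the evaluated program."""
    candidate = propose(current)
    candidate_score = log_posterior(candidate)
    # Accept with probability min(1, exp(candidate_score - current_score)).
    if math.log(random.random()) < candidate_score - current_score:
        return candidate, candidate_score
    return current, current_score
```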