Draft version, Sep 4, 2022

The provided data are extracted from CD proofs as basis for experiments with other systems and techniques, for example, machine learning or computing tree edit distance measures. A particular working thesis to be investigated with such techniques is formulated at the end of this outline, after the necessary concepts have been introduced.

We call the format of the provided data *FragSeq*, suggesting
*sequences of proof fragments*.

The considered data are composed of pieces that are extracted individually from proofs. Hence we outline their content in the following from the perspective of a single given proof, that is, a D-term that proves a given goal from given axioms. As a particular example we consider a proof of problem LCL044-1. Its axioms are, in Prolog notation:

1 | i(A,i(n(A),B)) |

2 | i(A,i(B,A)) |

3 | i(i(A,i(B,C)),i(B,i(A,C))) |

4 | i(i(A,B),i(i(C,A),i(C,B))) |

5 | i(i(A,B),i(i(n(A),B),B)) |

Its goal is:

i(a,n(n(a))) |

We generalize D-terms to permit, in addition to leaf nodes
representing an axiom, also leaf nodes considered as *open*,
which we label by 0.

We ascribe to the proof a sequence of construction steps that lead
from a one-node tree consisting of a single open leaf to the proof,
where in each step an open leaf is replaced either by an axiom leaf or
by an inner node with two open children. We call the D-terms appearing
in this sequence, where all but the last have at least one open leaf,
*proof fragments*, or briefly *frags*.

We assume a specific sequence of construction steps, characterized by the replaced node in each step the leaf that appears first in depth-first left-to-right order. (There is actually a vast number of ways to ascribe a sequence of construction steps to a proof tree. For an example tree of size 7 more than 2 million such sequences are possible, such that we defer the consideration of sequences characterized in alternate ways to future work).

The following image shows the D-terms of the proof sequence for the considered proof of LCL044-1, which appears as last member, step 8, in the sequence.

The *IPT (in-place theorem)* of a node of a D-term with respect to
a goal and axioms is its conclusion formula under the most general
substitution that takes account of the goal instantiation at the root of
the D-term, of copies of axioms at leaves that are not open, and of the
relationships induced by detachment.

The *frontier* of a proof fragment is the sequence of IPTs at its
open leafs (in order as these appear depth-first left-to-right) with
respect to the overall proof goal and the given axioms.

The following table shows for each proof fragment in the sequence the
operation that was applied to close the first open leaf of its
predecessor and the resulting frontier. The operations are as follows:
*d* for replacement by an inner node with two open successors;
*axiom(N)* for replacement by a leaf representing axiom *N*;
or the special value
*init* for the first member of the sequence.

1 | init | [i(a,n(n(a)))] |

2 | d | [i(A,i(a,n(n(a)))), A] |

3 | d | [i(B,i(A,i(a,n(n(a))))), B, A] |

4 | axiom(5) | [i(C,i(a,n(n(a)))), i(n(C),i(a,n(n(a))))] |

5 | d | [i(D,i(C,i(a,n(n(a))))), D, i(n(C),i(a,n(n(a))))] |

6 | axiom(3) | [i(a,i(C,n(n(a)))), i(n(C),i(a,n(n(a))))] |

7 | axiom(1) | [i(n(n(a)),i(a,n(n(a))))] |

8 | axiom(2) | [] |

The construction steps can also be viewed as operations that are directly performed on frontiers. For example, viewed in this way, the frontier of step 3 in the above sequence is obtained from that of 2 by replacing the first member with i(B,Y) and B, where B is a fresh variable and Y is bound to the replaced first member. The frontier of step 4 is obtained from that of 3 by applying the MGU of its first member with a fresh copy of axiom 5 and then removing this first member.

The scope of variables occurring in the formulas associated with a
D-term cover the whole D-term (for tableaux, such variables are known
as *rigid*). Hence, a variable may occur in different members of a
frontier, expressing constraints between these.

In addition, although different frontiers represent different stages of the proof construction, we use here global variable names to display them, which facilitates to track the changes from one stage to the next.

Our data sets do not just represent the sequence of frontiers that
leads to the proof, but also negative examples, data for steps that do
not lead to the proof. We represent these as a special kind of partial
proofs, which we call *nonfrags*, and also provide their
frontiers. A nonfrag is obtained from a frag, an actual partial proof
in the sequence of frags that leads to the full proof, by a single
step with an operation other than that used to obtain the successor of
the frag in the sequence. An additional prerequisite for the nonfrags
is that they have a defined most general theorem with respect to the
overall goal, and thus have a defined frontier. As an example,
consider step 3 of our example. Its successor, step 4 is
obtained from 3 by the *axiom(5)* operation. Hence, the
"other" possible steps would be *d*,
*axiom(1)*, *axiom(2)*,
*axiom(3)*, *axiom(4)*. For each of these other steps, the
data set contains a nonfrag with the frag of 3 as parent. Their
respective frontiers are:

[i(C,i(B,i(A,i(a,n(n(a)))))), C, B, A] |

[B, n(B)] |

[i(a,n(n(a))), A] |

[i(a,i(A,n(n(a)))), A] |

[i(C,n(n(a))), i(a,C)] |

A particular working thesis to be addressed with the data is the conjecture that there are general characteristics of and relationships between frontiers and also the axioms that help to distinguish between the step that leads from a proof fragment to its proper successor and the misdirected steps leading to a "nonfrag". If such a distinction is possible, it may be applied to guide a theorem prover.

- fragseq_doc.txt.
Documentation of the data format
*FragSeq*. - outputs_of_show_frontiers_LCL044-1.txt, outputs_of_show_frontiers_LCL082-1.txt. Outputs of the show_frontiers program for example proofs, which indicate frontiers of frags and nonfrags in different human-readable forms.
- show_frontiers.pl. A small program to extract and print out information about the frontiers in human-readable form from the data.
- fragseq_triples_ccs_sgcd.pl.gz. FragSeq data for proofs of 176 CD problems in the TPTP, obtained by CCS and SGCD. (gzipped to 6 MB; expands into 71 MB; 797,646 facts, 176 proofs; 34,894 frags; 68,949 nonfrags). Created from the proof collection results_ccs_sgcd.pl.
- fragseq_triples_ccs_sgcd_LCL082-1.pl. Brief excerpt from fragseq_triples_ccs_sgcd.pl with just the data for problem LCL082-1, e.g., for debugging.
- mk_fragdata.pl Module to generate FragSeq data from proofs represented with the result/5 predicate (specified in exp/cdex_testing).
- Proof collections, represented with the result/5 predicate
(specified in
exp/cdex_testing).
- results_ccs_sgcd.pl. 176 CD proofs obtained with CCS and SGCD. After n-simplification, which took effect for 2 of the proofs. At most one proof per problem. Chosen among various results of experiments by the following criteria: preference if solution by CCS (in order found by CCS configured for minimal csize), smaller csize, smaller tsize, smaller height. The problems are from TPTP 8.0.0.
- results_prover9.pl. 168 CD proofs obtained with Prover9. After n-simplification, which took effect for 72 of the proofs. At most one proof per problem. The problems are from TPTP 8.0.0.