V. Benzaken et al.

we denote by f@i the unique subtree t of f such that t = s_i or t = l_i[f']. The set of identifiers of a forest f is then defined as Ids(f) = {i | ∃t. f@i = t}.

Henceforth we consider only well-formed forests, and we identify a node with its identifier.

Definition 2.4 (Root id). Let t be a tree. If t = s_i or t = l_i[f], we define RootId(t) = i.
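As an illustration, the notions f@i, Ids(f) and RootId(t) can be sketched in Python. The class names Str and Elem, the function names, and the sample document are assumptions made for this sketch. It also assumes that identifiers are unique within a forest, as the phrase "unique subtree" suggests.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple, Union


@dataclass(frozen=True)
class Str:
    """A string leaf s_i: string content carrying identifier i."""
    value: str
    ident: int


@dataclass(frozen=True)
class Elem:
    """An element l_i[f]: label l, identifier i, child forest f."""
    label: str
    ident: int
    children: Tuple["Tree", ...] = ()


Tree = Union[Str, Elem]
Forest = Tuple[Tree, ...]


def root_id(t: Tree) -> int:
    """RootId(t): the identifier carried by the root of t."""
    return t.ident


def subtree_at(f: Forest, i: int) -> Optional[Tree]:
    """f@i: the subtree of f rooted at identifier i, or None if absent."""
    for t in f:
        if t.ident == i:
            return t
        if isinstance(t, Elem):
            found = subtree_at(t.children, i)
            if found is not None:
                return found
    return None


def ids(f: Forest) -> Set[int]:
    """Ids(f): every identifier i for which f@i is defined."""
    result: Set[int] = set()
    for t in f:
        result.add(t.ident)
        if isinstance(t, Elem):
            result |= ids(t.children)
    return result


# Hypothetical document: book_1[ title_2["Types"_3] year_4["2013"_5] ]
doc: Forest = (
    Elem("book", 1, (
        Elem("title", 2, (Str("Types", 3),)),
        Elem("year", 4, (Str("2013", 5),)),
    )),
)
```

On this sample, subtree_at(doc, 2) returns the title element, ids(doc) is {1, 2, 3, 4, 5}, and root_id(doc[0]) is 1. Returning None for an unknown identifier is a choice of the sketch; the text only defines f@i where such a subtree exists.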
Types and Validation

In this work, we present our approach for an abstract model of types, namely regular tree grammars. It is well known that regular tree grammars encompass most features of well-established schema languages such as DTDs, XML Schema, RELAX NG, and the regular expression types of XDuce and CDuce. This is documented, for instance, in Murata et al. [2005], from which we borrow the definition of regular tree grammar.

Definition 2.5 (Regular tree grammar). A regular tree grammar is a pair (S, E) where S is a set of distinguished names (actually, nonterminal metavariables) and E is a set of production rules of the form {X_1 → R_1, ..., X_n → R_n} such that:

(1) each R_i is either the terminal String, denoting string content, or the terminal Any, denoting any tree, or l[r], where l ranges over valid element names and r is a regular expression over the nonterminal symbols X_1, ..., X_n, that is:

    r ::=  ε   |   r r   |   r|r   |   r*   |   X_i

(henceforth, we use r+ for r r* and r? for ε|r);

(2) S ⊆ {X_1, ..., X_n} is the set of start symbols;

(3) for any two production rules with the same left-hand side

The intuition is that a regular tree grammar describes (i.e., "types") a set of trees of the data model. Notice that the left-hand sides of the rules in E need not be pairwise distinct. Allowing two rules to share a left-hand side lets us freely take the union of two sets of rules and also simplifies some definitions. Furthermore, given a regular tree grammar, it is always possible to rewrite it into an equivalent one for which condition 3 holds: two rules X_i → l[r] and X_i → l[r'] can be merged into the single rule X_i → l[r|r'].

Definition 2.6 (Names of a regular expression). Given a regular expression r, we denote by Names(r) the set of nonterminals occurring in it, namely:
    Names(ε) = ∅
    Names(r_1 r_2) = Names(r_1) ∪ Names(r_2)
    Names(r_1 | r_2) = Names(r_1) ∪ Names(r_2)
    Names(r*) = Names(r)
    Names(X) = {X}

By extension, given a set of rules E = {X_0 → R_0, ..., X_n → R_n}, we define Names(E) = ⋃_{i ∈ {0,...,n}} Names(R_i).

Definition 2.7 (Defined name). Given a rule X → R, we call X the defined name of the rule, written Dn(X → R). By extension, given a set E = {X_0 → R_0, ..., X_n → R_n}, we define Dn(E) = {X_0, ..., X_n}.
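The two definitions above translate almost directly into code. The following Python sketch is one possible encoding, not a reference implementation: the class names, the encoding of the terminals as the strings "String" and "Any", and the sample rules are all assumptions. Since the equations define Names only on regular expressions, the sketch further assumes that a right-hand side l[r] contributes Names(r) and that String and Any contribute no names.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Union


@dataclass(frozen=True)
class Eps:
    """The empty sequence ε."""


@dataclass(frozen=True)
class Name:
    """A nonterminal X."""
    name: str


@dataclass(frozen=True)
class Seq:
    """Concatenation r_1 r_2."""
    left: "Regex"
    right: "Regex"


@dataclass(frozen=True)
class Alt:
    """Union r_1 | r_2."""
    left: "Regex"
    right: "Regex"


@dataclass(frozen=True)
class Star:
    """Kleene star r*."""
    body: "Regex"


Regex = Union[Eps, Name, Seq, Alt, Star]


@dataclass(frozen=True)
class Label:
    """A right-hand side l[r]."""
    label: str
    content: Regex


# Assumed encoding: the terminals are the plain strings "String" and "Any".
Rhs = Union[str, Label]


@dataclass(frozen=True)
class Rule:
    """A production rule X -> R."""
    lhs: str
    rhs: Rhs


def names(r: Regex) -> FrozenSet[str]:
    """Names(r), one case per defining equation."""
    if isinstance(r, Eps):
        return frozenset()
    if isinstance(r, Name):
        return frozenset({r.name})
    if isinstance(r, Star):
        return names(r.body)
    if isinstance(r, (Seq, Alt)):
        return names(r.left) | names(r.right)
    raise TypeError(f"not a regular expression: {r!r}")


def names_of_rules(rules: Iterable[Rule]) -> FrozenSet[str]:
    """Names(E): union of the names of all right-hand sides.

    Assumption: l[r] contributes Names(r); String and Any contribute nothing.
    """
    result: FrozenSet[str] = frozenset()
    for rule in rules:
        if isinstance(rule.rhs, Label):
            result |= names(rule.rhs.content)
    return result


def dn(rules: Iterable[Rule]) -> FrozenSet[str]:
    """Dn(E): the set of left-hand sides of the rules."""
    return frozenset(rule.lhs for rule in rules)


# Hypothetical rule set, for illustration only.
rules = [
    Rule("Book", Label("book", Seq(Name("Title"), Star(Name("Author"))))),
    Rule("Title", Label("title", Name("Text"))),
    Rule("Author", Label("author", Name("Text"))),
    Rule("Text", "String"),
]
```

For this rule set, names_of_rules(rules) is {Title, Author, Text} and dn(rules) is {Book, Title, Author, Text}. Frozen sets are used so that the results can themselves be stored in sets or used as dictionary keys.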
Note that in general, Names(E) ⊆ Dn(E).

We also say that r is a regular expression over (S, E) if r is a regular expression over names in Dn(E). We denote by L(r) the language recognized by the regular expression r. We use W, X, Y, Z to range over names, and Greek letters to range over sets of rules. As (S, E) represents a regular tree grammar we ...