Principles of constructing systems for generating DTDs for a collection of XML documents are discussed. Methods and algorithms for creating DTDs are developed. A DTD generation system for a collection of XML documents is developed. This system can be used effectively both for solving applied problems and for theoretical studies.

PROGRAMMING AND COMPUTER SOFTWARE Vol. 31 No. 4 2005 LEONOV, KHUSNUTDINOV

of other elements and attributes. The DTDs for attributes, elements with mixed content, and elements with content of type #PCDATA are not difficult to construct. Therefore, in what follows, we consider generating DTDs for elements that contain elements.

More precisely, this problem can be formulated as follows. Let an element X occur in an XML document (or in a collection of XML documents) n times, and let s_1, s_2, …, s_n be the corresponding sequences of elements nested in this element. It is required to construct a regular expression (the DTD of the element X) for the set of sequences I = {s_1, s_2, …, s_n} that describes all these sequences.

A regular expression is an encoding of a set of sequences of symbols. The syntax of regular expressions is based on the metasymbols ?, +, and *, which denote the number of repetitions of a symbol ("0 or 1," "1 or more," and "0, 1, or more," respectively), the metasymbol | for disjunction, and the metasymbols ( and ) for grouping symbols. For example, the regular expression (ab)+(c|d) encodes the set of sequences {abc, abd, ababc, ababd, abababc, abababd, …}. A rigorous definition of a regular expression can be found, for example, in [14].

Automatically creating a brief and precise DTD for an element containing elements is a rather complicated task.
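The requirement that a regular expression describe all sequences of the set I can be checked mechanically. The following is a minimal sketch, not part of the system discussed here: it assumes that each nested element is abbreviated by a single character and that Python's `re` syntax is used, which agrees with the metasymbols ?, +, *, |, and parentheses introduced above. The function name `describes_all` is hypothetical.

```python
import re


def describes_all(content_model: str, sequences: list[str]) -> bool:
    """Return True if the expression matches every sequence in full.

    Assumption: each nested element is written as one character,
    so a sequence of elements is an ordinary string.
    """
    pattern = re.compile(content_model)
    # fullmatch requires the whole sequence to match, not just a prefix.
    return all(pattern.fullmatch(s) is not None for s in sequences)


# The example expression from the text, written without spaces.
EXAMPLE = "(ab)+(c|d)"

# Sequences listed in the text as encoded by the expression.
listed = ["abc", "abd", "ababc", "ababd", "abababc", "abababd"]

print(describes_all(EXAMPLE, listed))   # True
print(describes_all(EXAMPLE, ["ab"]))   # False: the final c or d is missing
```

Such a check only confirms that a candidate expression covers the observed sequences; it says nothing about whether the expression is brief or reflects the structure of the element.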
The point is that simple, straightforward approaches to DTD generation give rise to cumbersome regular expressions that are not adequate to the internal structure of the element and are quite different from those that would be suggested by a human in a similar situation. The following example clarifies this point.

Example 1. One of the straightforward approaches consists in constructing for the given element a regular expression that corresponds exactly to all sequences of its nested elements (and only these sequences) that occur in the collection of XML documents. For an XML document, we consider a list of references. Let this document consist of a sequence of elements <paper>, each of which in turn contains one nested element <title> and one or several nested elements <author>. For the sake of brevity, <title> and <author> are denoted by the symbols t and a, respectively. Let the element <paper> occur five times in the XML document, and let these occurrences be associated with the following sequences of nested elements: t, ta, taa, taaa, and taaaa. In this case, the above-described approach results in the regular expression t | ta | taa | taaa | taaaa, which can be simplified to the expression t | t(a | a(a | a(a | aa))). Clearly, this regular expression is to...