We describe a first attempt at using techniques from computational linguistics to analyze the undeciphered proto-Elamite script. Using hierarchical clustering, n-gram frequencies, and LDA topic models, we both replicate results obtained by manual decipherment and reveal previously unobserved relationships between signs. This demonstrates the utility of these techniques as an aid to manual decipherment.
We propose a novel data-augmentation technique for neural machine translation based on ROT-k ciphertexts. ROT-k is a simple letter substitution cipher that replaces a letter in the plaintext with the kth letter after it in the alphabet. We first generate multiple ROT-k ciphertexts, using different values of k, for the plaintext, which is the source side of the parallel data. We then leverage this enciphered training data along with the original parallel data via multi-source training to improve neural machine translation. Our method, CipherDAug, uses a co-regularization-inspired training procedure, requires no external data sources other than the original training data, and uses a standard Transformer to outperform strong data augmentation techniques on several datasets by a significant margin. This technique combines easily with existing approaches to data augmentation, and yields particularly strong results in low-resource settings.
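The ROT-k substitution described in this abstract can be illustrated with a minimal Python sketch. It assumes the 26-letter ASCII alphabet with wrap-around and leaves all other characters unchanged; the abstract does not specify the alphabet or the treatment of non-letters, and the function names `rot_k` and `encipher_sources` are hypothetical, not taken from the CipherDAug implementation.

```python
import string


def rot_k(text: str, k: int) -> str:
    """Replace each ASCII letter with the k-th letter after it, wrapping at 'z'.

    Assumption: a 26-letter alphabet; digits, spaces, and punctuation pass through.
    """
    k %= 26
    lower = string.ascii_lowercase
    upper = string.ascii_uppercase
    # Map the alphabet onto a copy of itself rotated left by k positions.
    table = str.maketrans(
        lower + upper,
        lower[k:] + lower[:k] + upper[k:] + upper[:k],
    )
    return text.translate(table)


def encipher_sources(sources, keys):
    """Build one enciphered copy of the source sentences per key k.

    Returns a dict mapping each k to the list of ROT-k ciphertexts.
    """
    return {k: [rot_k(sentence, k) for sentence in sources] for k in keys}


if __name__ == "__main__":
    sources = ["the cat sat", "a dog ran"]
    for k, enciphered in encipher_sources(sources, keys=[1, 2]).items():
        print(k, enciphered)
```

Each enciphered copy would be paired with the unchanged target side, so the augmented data needs nothing beyond the original parallel corpus. How those copies enter multi-source training and the co-regularization-inspired objective is not covered by this sketch.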
We introduce a language modeling architecture which operates over sequences of images, or over multimodal sequences of images with associated labels. We use this architecture alongside other embedding models to investigate a category of signs called complex graphemes (CGs) in the undeciphered proto-Elamite script. We argue that CGs have meanings which are at least partly compositional, and we discover novel rules governing the construction of CGs. We find that a language model over sign images produces more interpretable results than a model over text or over sign images and text, which suggests that the names given to signs may be obscuring signals in the corpus. Our results reveal previously unknown regularities in proto-Elamite sign use that can inform future decipherment efforts, and our image-aware language model provides a novel way to abstract away from biases introduced by human annotators.
We show that an ε-free, chain-free synchronous context-free grammar (SCFG) can be converted into a weakly equivalent synchronous tree-adjoining grammar (STAG) which is prefix lexicalized. This transformation at most doubles the grammar's rank and cubes its size, but we show that in practice the size increase is only quadratic. Our results extend Greibach normal form from CFGs to SCFGs and prove new formal properties about SCFG, a formalism with many applications in natural language processing.
This paper presents a novel analysis of clausal coordination with shared arguments using synchronous tree adjoining grammar (STAG), a pairing of a tree adjoining grammar (TAG) for syntax and a TAG for semantics. In clausal coordination, one or more arguments can be shared by the verbal predicates of the conjuncts, as in "Sue likes and Kim hates Pete", where an object argument "Pete" is shared by "likes" and "hates". As the predicate-argument structure must be represented within each predicative elementary tree in TAG, modelling argument sharing across clauses poses an interesting challenge for TAG. A widely adopted approach within the TAG literature at present is to employ the conjoin operation (Sarkar and Joshi, 1996, Proceedings of COLING '96, 610-615), a non-standard tree-composing operation in TAG. This operation applies across elementary trees to identify and merge the arguments from each clause, yielding a derivation structure in which the shared arguments are combined with multiple elementary trees and a derived tree in which the shared arguments are dominated by multiple verbal projections. In contrast, our STAG analysis pairs a syntactic elementary tree that participates in the derivation of clausal coordination with a semantic elementary tree that includes a λ-term to abstract over the shared argument. This allows the sharing of arguments in coordination to be instantiated in semantics, without being represented in syntax in the form of multiple dominance, utilizing only the standard TAG operations, substitution and adjoining.