Inverted index structures are the mainstay of modern text retrieval systems. They can be constructed quickly using off-line merge-based methods, and provide efficient support for a variety of querying modes. In this paper we examine the task of on-line index construction: that is, how to build an inverted index when the underlying data must be continuously queryable, and the documents must be indexed and available for search as soon as they are inserted. When straightforward approaches are used, document insertions become increasingly expensive as the size of the database grows. This paper describes a mechanism based on controlled partitioning that can be adapted to suit different balances of insertion and querying operations, and is faster and scales better than previous methods. Using experiments on 100 GB of web data we demonstrate the efficiency of our methods in practice, showing that they dramatically reduce the cost of on-line index construction.
Inverted index structures are a core element of current text retrieval systems. They can be constructed quickly using offline approaches, in which one or more passes are made over a static set of input data, and, at the completion of the process, an index is available for querying. However, there are search environments in which even a small delay in timeliness cannot be tolerated, and the index must always be queryable and up to date. Here we describe and analyze a geometric partitioning mechanism for online index construction that provides a range of tradeoffs between costs, and can be adapted to different balances of insertion and querying operations. Detailed experimental results are provided that show the extent of these tradeoffs, and that these new methods can yield substantial savings in online indexing costs.
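The abstract does not spell out the partitioning rule, so the following is only a minimal sketch of one way a geometrically bounded set of index partitions could behave. It assumes an in-memory buffer of `buffer_capacity` postings and a growth ratio `ratio`, with level k allowed to hold at most (ratio - 1) * ratio^k * buffer_capacity postings. The class name `GeometricIndex`, its methods, and the whitespace tokenizer are all hypothetical, and the dictionaries stand in for on-disk partitions.

```python
from collections import defaultdict


class GeometricIndex:
    """Toy on-line inverted index with geometrically sized partitions (illustrative only)."""

    def __init__(self, buffer_capacity=4, ratio=3):
        self.b = buffer_capacity          # assumed: postings held in memory before a flush
        self.r = ratio                    # assumed: geometric growth ratio between levels
        self.buffer = defaultdict(list)   # term -> doc ids not yet flushed
        self.buffered = 0                 # number of postings currently in the buffer
        self.partitions = []              # partitions[k] is a dict (term -> doc ids) or None
        self.next_doc_id = 0

    def _capacity(self, k):
        # Size bound for level k under the assumed geometric rule.
        return (self.r - 1) * self.r ** k * self.b

    @staticmethod
    def _size(partition):
        return sum(len(ids) for ids in partition.values())

    @staticmethod
    def _merge(older, newer):
        # Concatenate postings lists, keeping older doc ids first.
        merged = {term: list(ids) for term, ids in older.items()}
        for term, ids in newer.items():
            merged.setdefault(term, []).extend(ids)
        return merged

    def _flush(self):
        # Push the buffer down the levels, merging until the result fits a level's bound.
        carry = dict(self.buffer)
        self.buffer = defaultdict(list)
        self.buffered = 0
        k = 0
        while True:
            if k == len(self.partitions):
                self.partitions.append(None)
            if self.partitions[k] is not None:
                carry = self._merge(self.partitions[k], carry)
                self.partitions[k] = None
            if self._size(carry) <= self._capacity(k):
                self.partitions[k] = carry
                return
            k += 1

    def insert(self, text):
        # Index one document; it is searchable as soon as this call returns.
        doc_id = self.next_doc_id
        self.next_doc_id += 1
        for term in set(text.lower().split()):
            self.buffer[term].append(doc_id)
            self.buffered += 1
        if self.buffered >= self.b:
            self._flush()
        return doc_id

    def search(self, term):
        # A query must consult every partition plus the in-memory buffer.
        term = term.lower()
        hits = []
        for partition in self.partitions:
            if partition:
                hits.extend(partition.get(term, []))
        hits.extend(self.buffer.get(term, []))
        return sorted(hits)
```

In this sketch a larger `ratio` means fewer partitions for a query to consult but more data re-merged on each flush, and a smaller `ratio` shifts the cost the other way, which is one plausible reading of the insertion/query tradeoff the abstract refers to.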
In certain English finite complement clauses, inclusion of the complementizer that is optional. Previous research has identified various factors that influence when native speakers tend to produce or omit the complementizer, including syntactic weight, clause juncture constraints, and predicate frequency. The present study addresses the question to what extent German and Spanish learners of English as a second language (L2) produce and omit the complementizer under similar conditions. 3,622 instances of English adjectival, object, and subject complement constructions were retrieved from the International Corpus of English and the German and Spanish components of the International Corpus of Learner English. A logistic regression model suggests that L2 learners’ and natives’ production is largely governed by the same factors. However, in comparison with native speakers, L2 learners display a lower rate of complementizer omission. They are more impacted by processing-related factors such as complexity and clause juncture, and less sensitive to verb-construction cue validity.
In this paper, we investigate evolutionarily recent changes in the distributions of speech sounds in the world's languages. In particular, we explore the impact of language contact in the past two millennia on today's distributions. Based on three extensive databases of phonological inventories, we analyse the discrepancies between the distribution of speech sounds of ancient and reconstructed languages, on the one hand, and those in present-day languages, on the other. Furthermore, we analyse the degree to which the diffusion of speech sounds via language contact played a role in these discrepancies. We find evidence for substantive differences between ancient and present-day distributions, as well as for the important role of language contact in shaping these distributions over time. Moreover, our findings suggest that the distributions of speech sounds across geographic macro-areas were homogenized to an observable extent in recent millennia. Our findings suggest that what we call the Implicit Uniformitarian Hypothesis, at least with respect to the composition of phonological inventories, cannot be held uncritically. Linguists who would like to draw inferences about human language based on present-day cross-linguistic distributions must consider their theories in light of even short-term language evolution.
This article is part of the theme issue ‘Reconstructing prehistoric languages’.
This study examines the factors that govern the variable presence of the complementizer that in English object-, subject-, and adjectival complement constructions as in (1) to (3):
(1) a. I thought that Nick likes candy.
    b. I thought Ø Nick likes candy.
(2) a. The problem is that Nick doesn't like candy.
    b. The problem is Ø Nick doesn't like candy.
(3) a. I'm glad that Stefan likes candy.
    b. I'm glad Ø Stefan likes candy.
The conditions under which native speakers (NS) decide to realize or drop the complementizer have been intensively studied (e.g., Jaeger 2010; Tagliamonte and Smith 2005; Thompson and Mulac 1991; Torres Cacoullos and Walker 2009), while few studies have investigated this phenomenon in non-native speakers (NNS) (e.g., Durham 2011; Wulff, Lester, and Martinez-Garcia 2014). In the present study, we therefore address the following research questions:
- What factors govern that-variation in intermediate-level German and Spanish L2 learners of English?
- How do these learners' preferences compare to those of native speakers? More specifically, under what conditions, how much, and why do learners deviate from native speaker behavior?