Brain-computer interfaces and other augmentative and alternative communication devices introduce language-modeling challenges distinct from those of other character-entry methods. In particular, the acquired EEG (electroencephalogram) signal is noisier, which in turn makes user intent harder to decipher. To adapt to this condition, we propose to maintain an ambiguous history at every time step and to employ word information, in addition to a character language model, to produce a more robust prediction system. We present preliminary results comparing this proposed Online-Context Language Model (OCLM) to algorithms currently used in this setting. Evaluations of both perplexity and predictive accuracy show promising results when handling ambiguous histories in order to provide the front end with a distribution over the next character the user might type.
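To make the idea of predicting over an ambiguous history concrete, here is a minimal sketch, not the authors' implementation: the next-character distribution is marginalized over weighted history hypotheses and interpolated between a character language model and a word-based predictor. The `char_lm` and `word_model` interfaces and the mixing weight `lam` are hypothetical placeholders.

```python
# Sketch: mix a character LM with word information over ambiguous histories.
# char_lm and word_model are assumed interfaces, not real libraries.
from collections import defaultdict

ALPHABET = list("abcdefghijklmnopqrstuvwxyz") + ["_"]  # "_" stands for space

def next_char_distribution(histories, char_lm, word_model, lam=0.5):
    """histories: list of (text, prob) pairs -- the ambiguous history kept
    for the current time step. Returns a normalized distribution over the
    next character, marginalized over histories."""
    dist = defaultdict(float)
    for text, h_prob in histories:
        char_probs = char_lm.next_char_probs(text)    # P(c | history), char LM
        word_probs = word_model.next_char_probs(text) # from word completions
        for c in ALPHABET:
            mixed = lam * char_probs.get(c, 0.0) + (1 - lam) * word_probs.get(c, 0.0)
            dist[c] += h_prob * mixed
    z = sum(dist.values()) or 1.0
    return {c: p / z for c, p in dist.items()}
```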
Writers often repurpose material from existing texts when composing new documents. Because most documents have more than one source, we cannot trace these connections using only models of document-level similarity. Instead, this paper considers methods for local text reuse detection (LTRD): detecting localized regions of lexically or semantically similar text embedded in otherwise unrelated material. In extensive experiments, we study the relative performance of four classes of neural and bag-of-words models on three LTRD tasks: detecting plagiarism, modeling journalists' use of press releases, and identifying scientists' citation of earlier papers. We conduct evaluations on three existing datasets and a new, publicly available citation localization dataset. Our findings shed light on a number of previously unexplored questions in the study of LTRD, including the importance of incorporating document-level context for predictions, the applicability of off-the-shelf neural models pretrained on "general" semantic textual similarity tasks such as paraphrase detection, and the trade-offs between more efficient bag-of-words and feature-based neural models and slower pairwise neural models.
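To illustrate what "local" means here, below is a minimal sketch of a bag-of-words LTRD baseline: slide fixed-size windows over both documents and flag window pairs whose term-frequency cosine similarity exceeds a threshold. The window size, stride, and threshold are illustrative choices, not values from the paper.

```python
# Sketch: bag-of-words local reuse detection via sliding-window cosine similarity.
import math
from collections import Counter

def windows(tokens, size=50, stride=25):
    """Yield (offset, term-frequency Counter) for each window of the token list."""
    for i in range(0, max(1, len(tokens) - size + 1), stride):
        yield i, Counter(tokens[i:i + size])

def cosine(a, b):
    num = sum(a[t] * b.get(t, 0) for t in a)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def local_reuse(doc_a, doc_b, threshold=0.6):
    """Return aligned window offsets where doc_b plausibly reuses text from doc_a."""
    return [(i, j)
            for i, wa in windows(doc_a)
            for j, wb in windows(doc_b)
            if cosine(wa, wb) >= threshold]
```

The all-pairs comparison is quadratic in document length, which reflects the efficiency trade-off the abstract mentions between cheap bag-of-words scoring and slower pairwise neural models.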
This paper proposes a model of information cascades as directed spanning trees (DSTs) over observed documents. In addition, we propose a contrastive training procedure that exploits the partial temporal ordering of node infections in lieu of labeled training links. This combination of model and unsupervised training makes it possible to improve on models that use infection times alone and to exploit arbitrary features of the nodes and of the text content of messages in information cascades. With only basic node and time-lag features similar to those of previous models, unsupervised training of the DST model achieves performance comparable to strong baselines on a blog network inference task. Unsupervised training with additional content features achieves significantly better results, reaching half the accuracy of a fully supervised model.
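One way such a contrastive procedure could be instantiated is sketched below; this is an assumption-laden illustration, not the paper's objective. For each infected node, some time-consistent candidate parent must outscore every time-inconsistent candidate by a margin, so only the partial temporal ordering, never a labeled link, supervises the edge scorer. `edge_score` is a hypothetical featurized scorer (e.g., a linear model over node and text features).

```python
# Sketch: hinge-style contrastive loss using only infection order, no labeled links.
def contrastive_loss(cascade, edge_score, margin=1.0):
    """cascade: list of (node, infection_time) pairs sorted by time.
    edge_score(u, v): assumed callable scoring a candidate edge u -> v."""
    loss = 0.0
    for i, (v, t_v) in enumerate(cascade):
        if i == 0:
            continue  # the seed node has no parent
        consistent = [edge_score(u, v) for u, t_u in cascade if t_u < t_v]
        violating = [edge_score(u, v) for u, t_u in cascade if t_u >= t_v and u != v]
        if not consistent or not violating:
            continue
        best_ok = max(consistent)  # best time-consistent parent candidate
        for bad in violating:      # every time-violating candidate
            loss += max(0.0, margin - (best_ok - bad))
    return loss
```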
Networks mediate many aspects of society. For example, social networking services (SNS) like Twitter and Facebook have greatly helped people connect with family, friends, and the outside world. Public policy diffuses over institutional and social networks that connect political actors in different areas. Inferring network structure is thus essential for understanding the transmission of ideas and information, which in turn can answer questions about communities, collective actions, and influential social participants. Since many networks are not directly observed, we often rely on indirect evidence, such as the timing of messages between participants, to infer latent connections. The textual content of messages, especially the reuse of text originating elsewhere, is one source of such evidence.

This thesis contributes techniques for detecting evidence of text reuse and for modeling the underlying network structure. We propose methods to model text reuse with accidental and intentional lexical and semantic mutations. For lexical similarity detection, an n-gram shingling algorithm is proposed to detect "locally" reused passages, rather than near-duplicate documents, embedded within the larger text output of network nodes; a sketch of this idea follows below. For semantic similarity, we use an attention-based neural network to likewise detect embedded reused texts.

When modeling network structure, we are interested in inferring different levels of detail: individual links between participants, the structure of a specific information cascade, or global network properties. We propose a contrastive training objective for conditional models of edges in information cascades that has the flexibility to answer these questions and is also capable of incorporating rich node and edge features. Last but not least, network embedding methods prove to be a good way to learn representations of nodes while preserving structure, node and edge properties, and side information. We propose a self-attention, Transformer-based neural network trained to predict the next activated node in a given cascade, learning node embeddings in the process.
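As a rough illustration of the n-gram shingling idea referenced above, here is a minimal sketch, under assumed parameters rather than the thesis's tuned values: hash the word 5-grams of a corpus into an inverted index, then report documents sharing several shingles with a query, which signals a locally reused passage rather than a near-duplicate document.

```python
# Sketch: n-gram shingling with an inverted index for local reuse detection.
from collections import defaultdict

def shingles(tokens, n=5):
    """Hashed word n-grams of a token list."""
    return [hash(tuple(tokens[i:i + n])) for i in range(len(tokens) - n + 1)]

def build_index(corpus):
    """corpus: dict mapping doc_id -> token list; returns shingle -> postings."""
    index = defaultdict(list)
    for doc_id, tokens in corpus.items():
        for pos, sh in enumerate(shingles(tokens)):
            index[sh].append((doc_id, pos))
    return index

def find_reused_passages(query_tokens, index, min_hits=3):
    """Group shingle matches by document; several aligned matches in one
    document suggest a locally reused passage shared with the query."""
    hits = defaultdict(list)
    for qpos, sh in enumerate(shingles(query_tokens)):
        for doc_id, pos in index.get(sh, []):
            hits[doc_id].append((qpos, pos))
    return {d: m for d, m in hits.items() if len(m) >= min_hits}
```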