Ilia Sucholutsky scite author profile

Most real-world datasets, and particularly those collected from physical systems, are full of noise, packet loss, and other imperfections. However, most specification mining, anomaly detection and other such algorithms assume, or even require, perfect data quality to function properly. Such algorithms may work in lab conditions when given clean, controlled data, but will fail in the field when given imperfect data. We propose a method for accurately reconstructing discrete temporal or sequential system traces affected by data loss, using Long Short-Term Memory Networks (LSTMs). The model works by learning to predict the next event in a sequence of events, and uses its own output as an input to continue predicting future events. As a result, this method can be used for data restoration even with streamed data. Such a method can reconstruct even long sequences of missing events, and can also help validate and improve data quality for noisy data. The output of the model will be a close reconstruction of the true data, and can be fed to algorithms that rely on clean data. We demonstrate our method by reconstructing automotive CAN traces consisting of long sequences of discrete events. We show that given even small parts of a CAN trace, our LSTM model can predict future events with an accuracy of almost 90%, and can successfully reconstruct large portions of the original trace, greatly outperforming a Markov Model benchmark. We separately feed the original, lossy, and reconstructed traces into a specification mining framework to perform downstream analysis of the effect of our method on state-of-the-art models that use these traces for understanding the behavior of complex systems.
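
The autoregressive idea in this abstract (predict the next event, then feed that prediction back in as input) can be illustrated with a much simpler model. The sketch below is not the authors' LSTM; it assumes a first-order Markov model, loosely in the spirit of the benchmark the abstract mentions, and the event IDs and function names (`fit_transitions`, `fill_gap`) are hypothetical.

```python
from collections import Counter, defaultdict

def fit_transitions(trace):
    """Count how often each event is immediately followed by each other event."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(trace, trace[1:]):
        counts[prev][nxt] += 1
    return counts

def fill_gap(counts, last_seen, gap_length):
    """Predict up to gap_length events, feeding each prediction back as the next input."""
    filled = []
    current = last_seen
    for _ in range(gap_length):
        followers = counts.get(current)
        if not followers:  # context never seen in training: stop rather than guess
            break
        current = followers.most_common(1)[0][0]
        filled.append(current)
    return filled

# Hypothetical event IDs standing in for discrete CAN messages.
clean_trace = ["A", "B", "C", "A", "B", "C", "A", "B", "C"]
model = fit_transitions(clean_trace)

# Suppose four events were lost right after an observed "A".
reconstructed = fill_gap(model, last_seen="A", gap_length=4)
print(reconstructed)
```

A sequence model such as an LSTM would replace the transition table with a learned predictor that conditions on a longer history, but the feedback loop in `fill_gap` would keep the same shape.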

Soft-Label Dataset Distillation and Text Dataset Distillation

Sucholutsky

Schonlau

2021

GPT is an effective tool for multilingual psychological text analysis

Rathje

Mirea

Sucholutsky

et al. 2023

Preprint

The social and behavioral sciences have been increasingly using automated text analysis to measure psychological constructs in text. We explore whether GPT, the large-language model underlying the artificial intelligence chatbot ChatGPT, can be used as a tool for automated psychological text analysis in various languages. Across 15 datasets (n = 31,789 manually annotated tweets and news headlines), we tested whether GPT-3.5 and GPT-4 can accurately detect psychological constructs (sentiment, discrete emotions, and offensiveness) across 12 languages (English, Arabic, Indonesian, and Turkish, as well as eight African languages including Swahili, Amharic, Yoruba and Kinyarwanda). We found that GPT performs much better than English-language dictionary-based text analysis (r = 0.66-0.75 for correlations between manual annotations and GPT-4, as opposed to r = 0.20-0.30 for correlations between manual annotations and dictionary methods). Further, GPT performs nearly as well as or better than several fine-tuned machine learning models, though GPT had poorer performance in African languages and in comparison to more recent fine-tuned models. Overall, GPT may be superior to many existing methods of automated text analysis, since it achieves relatively high accuracy across many languages, requires no training data, and is easy to use with simple prompts (e.g., “is this text negative?”) and little coding experience. We provide sample code for analyzing text with the GPT application programming interface. GPT and other large-language models may be the future of psychological text analysis, and may help facilitate more cross-linguistic research with understudied languages.

Text Mining with n-gram Variables

2017

Text mining is the process of turning free text into numerical variables and then analyzing them with statistical techniques. We introduce the command ngram, which implements the most common approach to text mining, the “bag of words”. An n-gram is a contiguous sequence of words in a text. Broadly speaking, ngram creates hundreds or thousands of variables, each recording how often the corresponding n-gram occurs in a given text. This is more useful than it sounds. We illustrate ngram with the categorization of text answers from two open-ended questions.
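
To make the bag-of-words idea concrete, here is a minimal Python sketch of turning texts into n-gram count variables. It is not the `ngram` command described in the abstract and does not reproduce its options; the helper names (`ngrams`, `ngram_matrix`) and the sample answers are assumptions made for illustration only.

```python
from collections import Counter

def ngrams(text, n):
    """Return all contiguous n-word sequences in a text, lowercased."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def ngram_matrix(texts, max_n=2):
    """Build one count variable per n-gram (n = 1..max_n) across all texts."""
    per_text = [
        Counter(g for n in range(1, max_n + 1) for g in ngrams(t, n))
        for t in texts
    ]
    vocab = sorted(set().union(*per_text))
    # One row per text, one column per n-gram; a Counter returns 0 for absent keys.
    rows = [[counts[g] for g in vocab] for counts in per_text]
    return vocab, rows

# Hypothetical open-ended survey answers.
answers = ["no time to exercise", "no time no money"]
vocab, rows = ngram_matrix(answers)

for gram, column in zip(vocab, zip(*rows)):
    print(f"{gram!r}: {list(column)}")
```

Even two short answers produce ten variables here, which hints at why real text collections yield hundreds or thousands of columns.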

'Less Than One'-Shot Learning: Learning N Classes From M < N Samples

Sucholutsky

Schonlau

2021

AAAI

Deep neural networks require large training sets but suffer from high computational cost and long training times. Training on much smaller training sets while maintaining nearly the same accuracy would be very beneficial. In the few-shot learning setting, a model must learn a new class given only a small number of samples from that class. One-shot learning is an extreme form of few-shot learning where the model must learn a new class from a single example. We propose the 'less than one'-shot learning task where models must learn N new classes given only M < N samples

Predicting Human Similarity Judgments Using Large Language Models

Marjieh

Sucholutsky

Sumers

et al. 2022

Preprint

Similarity judgments provide a well-established method for accessing mental representations, with applications in psychology, neuroscience and machine learning. However, collecting similarity judgments can be prohibitively expensive for naturalistic datasets as the number of comparisons grows quadratically in the number of stimuli. One way to tackle this problem is to construct approximation procedures that rely on more accessible proxies for predicting similarity. Here we leverage recent advances in language models and online recruitment, proposing an efficient domain-general procedure for predicting human similarity judgments based on text descriptions. Intuitively, similar stimuli are likely to evoke similar descriptions, allowing us to use description similarity to predict pairwise similarity judgments. Crucially, the number of descriptions required grows only linearly with the number of stimuli, drastically reducing the amount of data required. We test this procedure on six datasets of naturalistic images and show that our models outperform previous approaches based on visual information.
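
The core intuition (similar stimuli evoke similar descriptions) can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses bag-of-words cosine similarity as a stand-in for the language-model representations the abstract refers to, and the stimulus names and descriptions are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

# One text description per stimulus: collection cost grows linearly.
descriptions = {
    "img_cat": "a small furry cat sitting on a sofa",
    "img_dog": "a small furry dog sitting on a rug",
    "img_car": "a red car parked on a street",
}
vectors = {name: Counter(text.split()) for name, text in descriptions.items()}

# All pairwise similarities are then computed, not collected from participants.
predicted = {
    (i, j): cosine(vectors[i], vectors[j]) for i, j in combinations(vectors, 2)
}

for pair, score in predicted.items():
    print(pair, round(score, 2))
```

With n stimuli, this needs n descriptions but yields n(n-1)/2 pairwise predictions, which is the linear-versus-quadratic saving the abstract describes.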

ConvART: Improving Adaptive Resonance Theory for Unsupervised Image Clustering

Sucholutsky

Schonlau

2018

J. Comp. Vis. Imag. Sys.

While supervised learning techniques have become increasingly adept at separating images into different classes, these techniques require large amounts of labelled data which may not always be available. We propose a novel neuro-dynamic method for unsupervised image clustering by combining 2 biologically-motivated models: Adaptive Resonance Theory (ART) and Convolutional Neural Networks (CNN). ART networks are unsupervised clustering algorithms that have high stability in preserving learned information while quickly learning new information. Meanwhile, a major property of CNNs is their translation and distortion invariance, which has led to their success in the domain of vision problems. By embedding convolutional layers into an ART network, the useful properties of both networks can be leveraged to identify different clusters within unlabelled image datasets and classify images into these clusters. In exploratory experiments, we demonstrate that this method greatly increases the performance of unsupervised ART networks on a benchmark image dataset.
