Tokenization is a fundamental preprocessing step for almost all NLP tasks. In this paper, we propose efficient algorithms for the WordPiece tokenization used in BERT, from single-word tokenization to general text (e.g., sentence) tokenization. When tokenizing a single word, WordPiece uses a longest-match-first strategy, known as maximum matching. The best known algorithms so far are O(n²) (where n is the input length) or O(nm) (where m is the maximum vocabulary token length). We propose a novel algorithm whose tokenization complexity is strictly O(n). Our method is inspired by the Aho-Corasick algorithm. We introduce additional linkages on top of the trie built from the vocabulary, allowing smart transitions when the trie matching cannot continue. For general text, we further propose an algorithm that combines pre-tokenization (splitting the text into words) and our linear-time WordPiece method into a single pass. Experimental results show that our method is 8.2x faster than HuggingFace Tokenizers and 5.1x faster than TensorFlow Text on average for general text tokenization.
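To make the longest-match-first (maximum matching) strategy concrete, here is a minimal sketch of the straightforward quadratic baseline for single-word WordPiece tokenization, using BERT's "##" convention for word-internal pieces. The vocabulary and example word are illustrative only; the paper's contribution is an O(n) method that avoids re-scanning by adding Aho-Corasick-style failure links to the vocabulary trie, which this sketch does not implement.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first (MaxMatch) WordPiece tokenization of one word.
    This is the O(n^2) baseline; the paper replaces the backtracking scan with
    precomputed failure links on a trie to achieve O(n)."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest substring starting at `start` that is in the vocabulary.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # BERT marks word-internal pieces with "##"
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece matches at this position: whole word -> [UNK]
        tokens.append(match)
        start = end
    return tokens


vocab = {"un", "##aff", "##able", "##ffa", "##ble"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
```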
We show that small and shallow feedforward neural networks can achieve near state-of-the-art results on a range of unstructured and structured language processing tasks while being considerably cheaper in memory and computational requirements than deep recurrent models. Motivated by resource-constrained environments like mobile phones, we showcase simple techniques for obtaining such small neural network models, and investigate different tradeoffs when deciding how to allocate a small memory budget.
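As a rough illustration of the scale of model the abstract refers to (not the authors' exact architectures, features, or training setup), the sketch below shows a shallow bag-of-features classifier: embed discrete feature ids, average, apply one small hidden layer, and predict with a softmax. All dimensions are placeholder values chosen to show how quickly the parameter count fits a small memory budget.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, EMB, HID, CLASSES = 10_000, 16, 32, 4   # illustrative sizes, not the paper's

E  = rng.normal(0, 0.1, (VOCAB, EMB))     # feature embeddings
W1 = rng.normal(0, 0.1, (EMB, HID))       # single hidden layer
W2 = rng.normal(0, 0.1, (HID, CLASSES))   # output layer

def predict(feature_ids):
    """Forward pass for one example: bucketed feature ids -> class probabilities."""
    h = E[feature_ids].mean(axis=0)        # average embedding of the active features
    h = np.maximum(h @ W1, 0.0)            # ReLU hidden layer
    logits = h @ W2
    p = np.exp(logits - logits.max())      # numerically stable softmax
    return p / p.sum()

print(predict(np.array([3, 17, 4096])))    # probabilities over CLASSES labels

# Total parameters: VOCAB*EMB + EMB*HID + HID*CLASSES ≈ 160k floats, which
# quantizes to well under a megabyte -- the kind of budget the abstract targets.
```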
Recent neural network-based language models have benefited greatly from scaling up the size of training datasets and the number of parameters in the models themselves. Scaling can be complicated due to various factors including the need to distribute computation on supercomputer clusters (e.g., TPUs), prevent bottlenecks when infeeding data, and ensure reproducible results. In this work, we present two software libraries that ease these issues: t5x simplifies the process of building and training large language models at scale while maintaining ease of use, and seqio provides a task-based API for simple creation of fast and reproducible training data and evaluation pipelines. These open-source libraries have been used to train models with hundreds of billions of parameters on datasets with multiple terabytes of training data. Along with the libraries, we release configurations and instructions for T5-like encoder-decoder models as well as GPT-like decoder-only architectures. t5x and seqio are open source and available at https://github.com/google-research/t5x and https://github.com/google/seqio, respectively.
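To give a flavor of the task-based API the abstract describes, here is a hedged sketch of registering a seqio task and reading it back as a dataset. The task name, TFDS dataset, key mapping, and vocabulary path are placeholders (not from the paper), and exact call signatures may vary across seqio versions.

```python
import functools
import seqio

# Placeholder vocabulary; any SentencePiece model file would do here.
vocab = seqio.SentencePieceVocabulary("gs://my-bucket/spm.model")

seqio.TaskRegistry.add(
    "my_text_to_text_task",                              # hypothetical task name
    source=seqio.TfdsDataSource(tfds_name="my_dataset:1.0.0"),  # placeholder TFDS dataset
    preprocessors=[
        functools.partial(
            seqio.preprocessors.rekey,
            key_map={"inputs": "question", "targets": "answer"}),
        seqio.preprocessors.tokenize,
        seqio.preprocessors.append_eos_after_trim,
    ],
    output_features={
        "inputs": seqio.Feature(vocabulary=vocab),
        "targets": seqio.Feature(vocabulary=vocab),
    },
)

# Any trainer (t5x included) can then request the preprocessed, tokenized dataset.
ds = seqio.get_mixture_or_task("my_text_to_text_task").get_dataset(
    sequence_length={"inputs": 512, "targets": 64},
    split="train",
    shuffle=True,
)
```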