Transformers, which are popular for language modeling, have recently been explored for vision tasks, e.g., the Vision Transformer (ViT) for image classification. The ViT model splits each image into a fixed-length sequence of tokens and then applies multiple Transformer layers to model their global relations for classification. However, ViT achieves inferior performance compared with CNNs when trained from scratch on a midsize dataset (e.g., ImageNet). We find this is because: 1) the simple tokenization of input images fails to model important local structures (e.g., edges and lines) among neighboring pixels, leading to low training sample efficiency; 2) the redundant attention backbone design of ViT leads to limited feature richness under fixed computation budgets and limited training samples. To overcome these limitations, we propose a new Tokens-To-Token Vision Transformer (T2T-ViT), which introduces 1) a layer-wise Tokens-to-Token (T2T) transformation that progressively structurizes the image into tokens by recursively aggregating neighboring tokens into one token (Tokens-to-Token), such that local structure represented by surrounding tokens can be modeled and the token length can be reduced; 2) an efficient backbone with a deep-narrow structure for vision transformers, motivated by CNN architecture design after extensive study. Notably, T2T-ViT reduces the parameter count and MACs of vanilla ViT by half, while achieving an improvement of more than 2.5% when trained from scratch on ImageNet. It also outperforms ResNets and achieves performance comparable to MobileNets when trained directly on ImageNet. For example, a T2T-ViT of size comparable to ResNet50 achieves 80.7% accuracy on ImageNet.
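The aggregation step described above can be sketched in a few lines. The following is a minimal, hypothetical illustration (not the authors' implementation, which also applies a Transformer layer between aggregation steps and uses learned projections): a sequence of tokens is reshaped back to its 2D grid, and each overlapping k x k neighborhood of tokens is concatenated into a single longer token, shrinking the sequence length. The function name and parameter defaults here are assumptions for illustration only.

```python
import numpy as np

def tokens_to_token(tokens, h, w, k=3, stride=2, pad=1):
    """One hypothetical Tokens-to-Token aggregation step.

    Reshape the (h*w, c) token sequence to an (h, w, c) grid, then merge
    each overlapping k x k neighborhood of tokens into one concatenated
    token. With stride < k the windows overlap, so local structure shared
    by surrounding tokens is carried into the merged token while the
    sequence length shrinks.
    """
    n, c = tokens.shape
    assert n == h * w, "token count must match the grid size"
    grid = tokens.reshape(h, w, c)
    grid = np.pad(grid, ((pad, pad), (pad, pad), (0, 0)))
    out_h = (h + 2 * pad - k) // stride + 1
    out_w = (w + 2 * pad - k) // stride + 1
    out = np.empty((out_h * out_w, k * k * c))
    for i in range(out_h):
        for j in range(out_w):
            # Concatenate the k x k neighborhood into one longer token.
            patch = grid[i * stride:i * stride + k, j * stride:j * stride + k, :]
            out[i * out_w + j] = patch.reshape(-1)
    return out, out_h, out_w

tokens = np.random.randn(16 * 16, 64)          # 256 tokens, 64-dim each
merged, h2, w2 = tokens_to_token(tokens, 16, 16)
print(merged.shape)                             # (64, 576): 4x fewer, 9x longer tokens
```

Applying this step recursively (with a projection to keep the channel dimension bounded, as the paper's fixed computation budget requires) progressively reduces the token length while baking local neighborhoods into each token.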
A smart face mask that can conveniently monitor breath information is beneficial for maintaining personal health and preventing the spread of disease. However, some challenges still need to be addressed before such devices can be of practical use. One key challenge is to develop a pressure sensor that is easily triggered by low pressure and has excellent stability as well as electrical and mechanical properties. In this study, a wireless smart face mask is designed by integrating an ultrathin self‐powered pressure sensor and a compact readout circuit with a normal face mask. The pressure sensor is the thinnest (total compressed thickness of ≈5.5 µm) and lightest (total weight of ≈4.5 mg) electrostatic pressure sensor capable of achieving a peak open‐circuit voltage of up to ≈10 V when stimulated by airflow, which enables miniaturization of the readout circuit and makes the breath‐monitoring system portable and wearable. To demonstrate the capabilities of the smart face mask, it is used to wirelessly measure and analyze the various breath conditions of multiple testers.
Wearable Healthcare Devices
Convenient breath monitoring via wearable devices is helpful for personal healthcare, especially during the COVID‐19 pandemic. In article number 2107758, Kenjiro Fukuda, Takao Someya, and co‐workers develop a wearable smart face mask based on an ultrathin self‐powered pressure sensor with high output ability, and various breath conditions from multiple testers are wirelessly detected and analyzed.