We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards, with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume of diverse egocentric video footage publicly available to the research community. Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event. Furthermore, we present a host of new benchmark challenges centered around understanding the first-person visual experience in the past (querying an episodic memory), present (analyzing hand-object manipulation, audio-visual conversation, and social interactions), and future (forecasting activities). By publicly sharing this massive annotated dataset and benchmark suite, we aim to push the frontier of first-person perception.
We propose an unsupervised speaker adaptation method inspired by the neural Turing machine for end-to-end (E2E) automatic speech recognition (ASR). The proposed model contains a memory block that holds speaker i-vectors extracted from the training data and reads relevant i-vectors from the memory through an attention mechanism. The resulting memory vector (M-vector) is concatenated to the acoustic features or to the hidden layer activations of an E2E neural network model. The E2E ASR system is based on the joint connectionist temporal classification and attention-based encoder-decoder architecture. M-vector and i-vector results are compared when inserting them at different layers of the encoder neural network, using the WSJ and TED-LIUM2 ASR benchmarks. We show that M-vectors, which do not require an auxiliary speaker embedding extraction system at test time, achieve similar word error rates (WERs) to i-vectors for single-speaker utterances and significantly lower WERs for utterances in which there are speaker changes.
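As a concrete illustration of the memory read described in this abstract, below is a minimal PyTorch sketch in which an encoder activation queries a memory of training-speaker i-vectors via dot-product attention and returns the weighted combination (the M-vector). The module name, dimensions, and dot-product scoring are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the M-vector memory read: an attention mechanism
# queries a memory of speaker i-vectors and returns a weighted combination.
# All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class MemoryVector(nn.Module):
    def __init__(self, ivector_dim: int, query_dim: int):
        super().__init__()
        # Project the encoder state into i-vector space to form the query.
        self.query_proj = nn.Linear(query_dim, ivector_dim)

    def forward(self, hidden: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, query_dim) encoder activation
        # memory: (num_speakers, ivector_dim) i-vectors from the training data
        query = self.query_proj(hidden)          # (batch, ivector_dim)
        scores = query @ memory.t()              # (batch, num_speakers)
        weights = torch.softmax(scores, dim=-1)  # attention over training speakers
        return weights @ memory                  # M-vector: (batch, ivector_dim)


# Usage: concatenate the M-vector to hidden activations (or acoustic features).
memory = torch.randn(200, 100)            # 200 training-speaker i-vectors, dim 100
layer = MemoryVector(ivector_dim=100, query_dim=320)
h = torch.randn(8, 320)                   # a batch of encoder activations
augmented = torch.cat([h, layer(h, memory)], dim=-1)  # (8, 420)
```

Because the memory and attention weights are computed from the network's own activations, no external i-vector extractor is needed at test time, which is what allows the method to track speaker changes within an utterance.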
MP2/6-31G** and CCSD(T)/6-311++G**//MP2/6-31G** calculations have been used to investigate the H-abstraction reaction by OH from CH3OCH3 (DME), whereas the MP2/6-31G**//MP2/6-31G** and PMP2/6-31G**//MP2/6-31G** levels have been used to model the H-abstraction reaction from (CH3)3COCH3 (MTBE) by OH. The methodology used has been shown to be adequate for reproducing the experimental geometrical parameters of the reactants and the C-H bond energies. The reaction rate constants for DME, calculated using transition state theory, reproduce the reported experimental results. The fact that H-abstraction is favored from the methoxy group of MTBE over the tert-butyl group has also been reproduced.
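For orientation, the conventional transition-state-theory expression for a bimolecular rate constant such as DME + OH is the generic textbook form below; the specific tunneling correction and partition functions used in the work are not reported in the abstract, so the symbols here are placeholders, not values from this study.

```latex
% Conventional TST rate constant for the bimolecular H-abstraction
% DME + OH -> products (generic form):
k(T) \;=\; \kappa(T)\,\frac{k_{\mathrm{B}}T}{h}\,
\frac{Q^{\ddagger}(T)}{Q_{\mathrm{DME}}(T)\,Q_{\mathrm{OH}}(T)}\,
\exp\!\left(-\frac{E_0}{k_{\mathrm{B}}T}\right)
```

Here κ(T) is a tunneling correction, the Q terms are partition functions per unit volume for the transition state and the separated reactants, and E0 is the barrier height including zero-point energy.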