publications | Wentao Wang

2024

A systematic investigation of learnability from single child linguistic input

Yulu Qin, Wentao Wang, and Brenden M. Lake

Cognitive Science conference, 2024

Language models (LMs) have demonstrated remarkable proficiency in generating linguistically coherent text, sparking discussions about their relevance to understanding human language learnability. However, a significant gap exists between the training data for these models and the linguistic input a child receives. LMs are typically trained on data that is orders of magnitude larger and fundamentally different from child-directed speech (Warstadt & Bowman, 2022; Warstadt et al., 2023; Frank, 2023a). Addressing this discrepancy, our research focuses on training LMs on subsets of a single child’s linguistic input. Previously, Wang, Vong, Kim, and Lake (2023) found that LMs trained in this setting can form syntactic and semantic word clusters and develop sensitivity to certain linguistic phenomena, but they only considered LSTMs and simpler neural networks trained from just one single-child dataset. Here, to examine the robustness of learnability from single-child input, we systematically train six different model architectures on five datasets (3 single-child and 2 baselines). We find that the models trained on single-child datasets showed consistent results that matched with previous work, underscoring the robustness of forming meaningful syntactic and semantic representations from a subset of a child’s linguistic input.

Grounded language acquisition through the eyes and ears of a single child

Wai Keen Vong, Wentao Wang, A. Emin Orhan, and Brenden M. Lake

Science, 2024

Starting around 6 to 9 months of age, children begin acquiring their first words, linking spoken words to their visual counterparts. How much of this knowledge is learnable from sensory input with relatively generic learning mechanisms, and how much requires stronger inductive biases? Using longitudinal head-mounted camera recordings from one child aged 6 to 25 months, we trained a relatively generic neural network on 61 hours of correlated visual-linguistic data streams, learning feature-based representations and cross-modal associations. Our model acquires many word-referent mappings present in the child’s everyday experience, enables zero-shot generalization to new visual referents, and aligns its visual and linguistic conceptual systems. These results show how critical aspects of grounded word meaning are learnable through joint representation and associative learning from one child’s input.
@article{saycam-multimodal,
  title = {Grounded language acquisition through the eyes and ears of a single child},
  author = {Vong, Wai Keen and Wang, Wentao and Orhan, A. Emin and Lake, Brenden M.},
  journal = {Science},
  url = {https://www.science.org/doi/10.1126/science.adi1374},
  bibtex_show = true,
  year = {2024},
  kind = {2024}
}

Self-supervised learning of video representations from a child’s perspective

A. Emin Orhan, Wentao Wang, Alex N. Wang, Mengye Ren, and Brenden M. Lake

Cognitive Science conference, 2024

Children learn powerful internal models of the world around them from a few years of egocentric visual experience. Can such internal models be learned from a child’s visual experience with highly generic learning algorithms or do they require strong inductive biases? Recent advances in collecting large-scale, longitudinal, developmentally realistic video datasets and generic self-supervised learning (SSL) algorithms are allowing us to begin to tackle this nature vs. nurture question. However, existing work typically focuses on image-based SSL algorithms and visual capabilities that can be learned from static images (e.g. object recognition), thus ignoring temporal aspects of the world. To close this gap, here we train self-supervised video models on longitudinal, egocentric headcam recordings collected from a child over a two year period in their early development (6-31 months). The resulting models are highly effective at facilitating the learning of action concepts from a small number of labeled examples; they have favorable data size scaling properties; and they display emergent video interpolation capabilities. Video models also learn more robust object representations than image-based models trained with the exact same data. These results suggest that important temporal aspects of a child’s internal model of the world may be learnable from their visual experience using highly generic learning algorithms and without strong inductive biases.

2023

Finding Structure in One Child’s Linguistic Experience

Wentao Wang, Wai Keen Vong, Najoung Kim, and Brenden M. Lake

Cognitive Science, 2023

Neural network models have recently made striking progress in natural language processing, but they are typically trained on orders of magnitude more language input than children receive. What can these neural networks, which are primarily distributional learners, learn from a naturalistic subset of a single child’s experience? We examine this question using a recent longitudinal dataset collected from a single child, consisting of egocentric visual data paired with text transcripts. We train both language-only and vision-and-language neural networks and analyze the linguistic knowledge they acquire. In parallel with findings from Elman’s (1990) seminal work, the neural networks form emergent clusters of words corresponding to syntactic (nouns, transitive and intransitive verbs) and semantic categories (e.g., animals and clothing), based solely on one child’s linguistic input. The networks also acquire sensitivity to acceptability contrasts from linguistic phenomena such as determiner-noun agreement and argument structure. We find that incorporating visual information produces an incremental gain in predicting words in context, especially for syntactic categories that are comparatively more easily grounded such as nouns and verbs, but the underlying linguistic representations are not fundamentally altered. Our findings demonstrate which kinds of linguistic knowledge are learnable from a snapshot of a single child’s real developmental experience.
@article{saycam-text,
  title = {Finding Structure in One Child's Linguistic Experience},
  author = {Wang, Wentao and Vong, Wai Keen and Kim, Najoung and Lake, Brenden M.},
  journal = {Cognitive Science},
  url = {https://psyarxiv.com/85k3y},
  bibtex_show = true,
  year = {2023},
  kind = {2023}
}

Older

2020

Data-to-Text Generation with Style Imitation

Shuai Lin, Wentao Wang, Zichao Yang, Xiaodan Liang, Frank F. Xu, Eric Xing, and Zhiting Hu

In Findings of the Association for Computational Linguistics: EMNLP 2020, Nov 2020

Recent neural approaches to data-to-text generation have mostly focused on improving content fidelity while lacking explicit control over writing styles (e.g., sentence structures, word choices). More traditional systems use templates to determine the realization of text. Yet manual or automatic construction of high-quality templates is difficult, and a template acting as hard constraints could harm content fidelity when it does not match the record perfectly. We study a new way of stylistic control by using existing sentences as “soft” templates. That is, a model learns to imitate the writing style of any given exemplar sentence, with automatic adaptions to faithfully describe the record. The problem is challenging due to the lack of parallel data. We develop a neural approach that includes a hybrid attention-copy mechanism, learns with weak supervisions, and is enhanced with a new content coverage constraint. We conduct experiments in restaurants and sports domains. Results show our approach achieves stronger performance than a range of comparison methods. Our approach balances well between content fidelity and style control given exemplars that match the records to varying degrees.
@inproceedings{lin-etal-2020-data,
  title = {Data-to-Text Generation with Style Imitation},
  author = {Lin, Shuai and Wang, Wentao and Yang, Zichao and Liang, Xiaodan and Xu, Frank F. and Xing, Eric and Hu, Zhiting},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2020},
  month = nov,
  year = {2020},
  address = {Online},
  publisher = {Association for Computational Linguistics},
  url = {https://aclanthology.org/2020.findings-emnlp.144},
  doi = {10.18653/v1/2020.findings-emnlp.144},
  pages = {1589--1598},
  bibtex_show = true,
  kind = {Older}
}

2019

Texar: A Modularized, Versatile, and Extensible Toolkit for Text Generation

Zhiting Hu, Haoran Shi, Bowen Tan, Wentao Wang, Zichao Yang, Tiancheng Zhao, Junxian He, Lianhui Qin, Di Wang, Xuezhe Ma, Zhengzhong Liu, Xiaodan Liang, Wanrong Zhu, Devendra Sachan, and Eric Xing

In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Jul 2019

We introduce Texar, an open-source toolkit aiming to support the broad set of text generation tasks that transform any inputs into natural language, such as machine translation, summarization, dialog, content manipulation, and so forth. With the design goals of modularity, versatility, and extensibility in mind, Texar extracts common patterns underlying the diverse tasks and methodologies, creates a library of highly reusable modules and functionalities, and allows arbitrary model architectures and algorithmic paradigms. In Texar, model architecture, inference, and learning processes are properly decomposed. Modules at a high concept level can be freely assembled or plugged in/swapped out. Texar is thus particularly suitable for researchers and practitioners to do fast prototyping and experimentation. The versatile toolkit also fosters technique sharing across different text generation tasks. Texar supports both TensorFlow and PyTorch, and is released under Apache License 2.0 at https://www.texar.io.
@inproceedings{hu-etal-2019-texar,
  title = {{T}exar: A Modularized, Versatile, and Extensible Toolkit for Text Generation},
  author = {Hu, Zhiting and Shi, Haoran and Tan, Bowen and Wang, Wentao and Yang, Zichao and Zhao, Tiancheng and He, Junxian and Qin, Lianhui and Wang, Di and Ma, Xuezhe and Liu, Zhengzhong and Liang, Xiaodan and Zhu, Wanrong and Sachan, Devendra and Xing, Eric},
  booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations},
  month = jul,
  year = {2019},
  address = {Florence, Italy},
  publisher = {Association for Computational Linguistics},
  url = {https://aclanthology.org/P19-3027},
  doi = {10.18653/v1/P19-3027},
  pages = {159--164},
  bibtex_show = true,
  kind = {Older}
}