AI/ML-Readiness for Neuroimaging of Language, National Institutes of Health


This supplement relates to neuroimaging datasets that shed light on the organization of the semantic system. The rich, multimodal information in these data can be crucial in revealing how brain regions and networks operate when processing multiple types of concepts under a variety of conditions. Advanced AI/ML techniques, such as deep learning (DL), can be especially valuable in revealing underlying patterns in complex, high-dimensional, and dynamic data. Applying these methods to neuroimaging data, however, is nontrivial. By some estimates, data preparation tasks such as finding, cleaning, relabeling, reformatting, and reorganizing data can occupy 70%-90% of a scientist's time. Here, we aim to process and prepare these rich datasets to make them “ML-ready,” facilitating the application of ML methods.


Faculty - AI Institute: Prof. Amit P. Sheth, Prof. Christian O'Reilly

Faculty - Psychology Department: Prof. Rutvik H. Desai, Prof. Svetlana Shinkareva

Graduate Students: Deepa Tilwani, Nayeem Mohammad


Funding: NIH, $287,300


Semantic processing is highly distributed in the brain and is intertwined with cognitive domains such as language processing, episodic memory, social cognition, action-perception systems, and executive systems. Hence, the data obtained from neuroimaging are inherently complex and multifaceted. While traditional univariate neuroimaging analysis methods based on linear regression have provided a wealth of findings, advanced ML methods have unparalleled power and potential to provide new insights into the functional organization of the brain. We aim to adopt the latest ML operations (ML Ops) strategies to implement a neuroimaging data processing pipeline that eases further processing of the data by DL methods and further sharing of the data through public repositories.


Specific Aims:

Dataset 1: Traditionally, neuro- and psycholinguistic studies examine language by presenting isolated words or sentences that are carefully controlled. This study was aimed at understanding naturalistic language processing in context, including processing at lexical, phrasal, sentential, and discourse levels. The dataset consists of fMRI data acquired from 50 participants while they listened to short narratives, with additional data collection ongoing. Data from over 500 participants will be collected, and the methods developed under the current project are expected to be applied to the entire dataset.

Planned Analyses: We will transform these data so that they can be used in a DL framework. We will use DL methods to analyze the time series from voxels in regions of interest (ATL, AG, IFG, pMTG) to understand their relationship with a range of psycholinguistic variables in the naturalistic context. Using state-of-the-art techniques, we will derive distributed representations of the narratives (capturing the meaning of the stories in a high-dimensional semantic space) and use them to identify the relevant neural representations of objects and entities in the neural data.
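
As a rough illustration of what "DL-ready" could mean for Dataset 1, the sketch below aligns a per-word psycholinguistic variable with the fMRI sampling grid and pairs it with a standardized ROI time series. It is a minimal sketch under stated assumptions, not the project's pipeline: the function names (zscore, bin_words_to_trs, make_windows), the averaging of word values within each TR, and the fixed-width windowing are all hypothetical choices, and the hemodynamic delay is ignored for brevity.

```python
from statistics import mean, pstdev


def zscore(series):
    """Standardize one ROI time series to zero mean and unit variance."""
    mu = mean(series)
    sigma = pstdev(series)
    if sigma == 0:
        # A constant signal carries no variance; return zeros instead of dividing by 0.
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]


def bin_words_to_trs(onsets, values, tr, n_volumes):
    """Average a per-word variable within each TR; TRs with no words get 0.0.

    onsets: word onset times in seconds from the start of the scan (assumed).
    values: one psycholinguistic value per word (e.g., a frequency score).
    tr: repetition time in seconds.
    """
    sums = [0.0] * n_volumes
    counts = [0] * n_volumes
    for onset, value in zip(onsets, values):
        idx = int(onset // tr)
        if 0 <= idx < n_volumes:  # drop words outside the acquisition window
            sums[idx] += value
            counts[idx] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def make_windows(features, targets, width):
    """Pair each window of `width` consecutive TRs with the target at its last TR."""
    return [
        (features[i:i + width], targets[i + width - 1])
        for i in range(len(features) - width + 1)
    ]


if __name__ == "__main__":
    # Toy values only; real inputs would come from stimulus annotations and ROI masks.
    regressor = bin_words_to_trs(
        onsets=[0.2, 1.1, 2.5, 2.9, 5.0],
        values=[1.0, 3.0, 2.0, 4.0, 6.0],
        tr=2.0,
        n_volumes=4,
    )
    roi_signal = zscore([1.0, 2.0, 3.0, 4.0])
    for window, target in make_windows(regressor, roi_signal, width=2):
        print(window, round(target, 3))
```

Each (window, target) pair is one training example for a sequence model. In practice the stimulus regressor would typically be shifted or convolved to account for the hemodynamic response before windowing.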

Dataset 2: Lesion-symptom mapping (LSM), or an examination of the location of brain lesions and associated deficits, is a powerful method for understanding brain function and has direct translational relevance. This dataset consists of two prongs: one is a detailed characterization of the lesioned anatomy of 79 left hemisphere stroke survivors, and the second is a battery of behavioral measures on these patients. Lesions are traced by a neurologist and normalized to standard space using our cutting-edge in-house software tools (NiiStat;

Planned Analyses: DL has a unique potential to reveal complex relationships between lesion locations, impaired structural and functional connectivity, and behavior, over and above what is revealed by univariate analyses. The data are currently in traditional neuroimaging formats that can be read by LSM/CSM software such as NiiStat. We will transform the lesion location, structural connectome, functional connectome, and behavioral data to a DL-ready format. Then, we will train a DL network to predict behavior given lesion location and connectomes as input. We expect that this will provide new insights into how lesions to regions and connections relate to behavior and to the potential for recovery.
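
The transformation step described above can be sketched as follows. This is a hedged illustration only: the source does not specify the target format, so the region-wise lesion load, the use of the strict upper triangle of each connectome, and the function names (lesion_load, upper_triangle, build_example) are assumptions made for the example. It turns one patient's lesion mask, two connectomes, and a behavioral score into a flat feature vector and a label.

```python
def upper_triangle(matrix):
    """Flatten the strict upper triangle of a square, symmetric connectome."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def lesion_load(lesion_mask, atlas_labels, n_regions):
    """Fraction of each atlas region's voxels that are lesioned.

    lesion_mask: flat list of 0/1 voxel values in standard space.
    atlas_labels: flat list of region indices (1..n_regions), 0 for background.
    """
    lesioned = [0] * n_regions
    total = [0] * n_regions
    for is_lesioned, region in zip(lesion_mask, atlas_labels):
        if region == 0:
            continue  # voxels outside the atlas are ignored
        total[region - 1] += 1
        lesioned[region - 1] += is_lesioned
    return [les / tot if tot else 0.0 for les, tot in zip(lesioned, total)]


def build_example(lesion_mask, atlas_labels, structural, functional, behavior):
    """Concatenate lesion and connectome features; return (features, label)."""
    n_regions = len(structural)
    features = (
        lesion_load(lesion_mask, atlas_labels, n_regions)
        + upper_triangle(structural)
        + upper_triangle(functional)
    )
    return features, behavior


if __name__ == "__main__":
    # Toy three-region patient; real masks and connectomes are far larger.
    features, label = build_example(
        lesion_mask=[1, 1, 0, 0, 1, 0, 1],
        atlas_labels=[1, 1, 1, 2, 2, 3, 0],
        structural=[[0.0, 0.2, 0.5], [0.2, 0.0, 0.1], [0.5, 0.1, 0.0]],
        functional=[[1.0, 0.3, -0.4], [0.3, 1.0, 0.6], [-0.4, 0.6, 1.0]],
        behavior=0.75,
    )
    print(len(features), label)
```

Stacking these pairs across patients yields a feature matrix and a target vector that a network could be trained on. With a cohort of this size, a compact representation like this one may be preferable to voxel-level inputs, although that is a design choice rather than something the project states.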

Dissemination: We expect to make ML/DL-ready, de-identified data, along with the associated metadata, publicly available on databases such as LONI/IDA and NeuroVault for statistical maps. The data will be searchable within the Neuroscience Information Framework, which aggregates neuroscience data repositories, consistent with FAIR (Findable, Accessible, Interoperable, and Reusable) principles [56]. We aim to use the CC0 1.0 license (Universal Public Domain Dedication), which we believe provides maximum dissemination.

More Information: For more information on data contact '' and for analysis pipeline contact ''

Abstract published in refereed conference proceedings

Tilwani, D., Goswami, R., O'Reilly, C., Riccardi, N., Yang, X., Shalin, V., Shinkareva, S., Sheth, A.P., & Desai, R.H. (2023). Predicting Language Outcomes from MRI Post-Stroke: A Machine Learning Approach. OHBM 2023 Annual Meeting, Montreal, Canada, 22-26 July 2023.
Poster [1]