Predicting the Functional Impact of Genetic Variation Within Intrinsically Disordered Protein Regions

Alex Holehouse, M.Sc. Ph.D.

Project Overview:

This project addresses a critical barrier to progress in understanding the molecular basis of disease. Intrinsically disordered proteins and protein regions (collectively referred to here as IDRs) do not fold into a fixed three-dimensional structure. IDRs are found in more than 50% of human proteins, where they play a wide range of essential roles. Despite their importance, IDRs are frequently disregarded in mechanistic molecular biology.

IDR-truncated constructs may be used in vitro or, more commonly in cellular studies, IDRs are simply ignored while the role of folded domains is examined directly. A major weakness of this approach is that IDRs play a wide range of critical roles in human disease, both through mutational dysregulation and as essential components of many viral proteins. This raises a challenge: we need to understand how IDRs contribute to human disease, yet we are currently ill-equipped to do so.

My lab is addressing this challenge by developing and applying novel computational technology that guides experiments aimed at understanding how IDRs contribute to human disease. We combine molecular biophysics, large-scale sequence annotations, and deep learning to interrogate how mutations could alter the biophysical behavior of IDRs. We do this to develop testable mechanistic models, with the ultimate goal of directly interpreting mutations.

In this project we propose to extend our tools and apply them to three model systems: a set of understudied human kinases implicated in cancers (in collaboration with Ben Majors, Dept. of Cell Biology); p62/SQSTM1, a key autophagic receptor that aggregates in a range of neurodegenerative diseases (in collaboration with Chris Weill, Dept. of Neurology); and the SARS-CoV-2 nucleocapsid protein, a core component of the viral packaging machinery that we believe offers a general pan-coronavirus therapeutic target (in collaboration with Andrea Soranno, Dept. of Biochemistry and Molecular Biophysics).

Progress Report:

Our first year of funding was provided to develop computational techniques to identify sequence features within unstructured regions that could be used to explain function and dysfunction in human disease. In our original proposal, Aim 1 focused on the development of computational technology, while Aim 2 focused on applying that technology to a set of cancer-associated proteins. The long-term goal for both this proposal and the original is to demonstrate that our computational tools have sufficient fidelity to make meaningful inferences about how an IDR sequence may contribute to function and dysfunction in human disease.

The pandemic interfered with our ability to test our predictions for the cancer-related proteins directly. Specifically, experimental work within my lab and with collaborators essentially stalled until April 2021. As a result, while we have predictions across the proposed 747 proteins, we have not yet been able to test them. To address this limitation going forward, we have identified specific collaborators at Washington University who are already working on systems that align with our original goals, with a view to rapidly obtaining clinically important experimental validation of our technology. In contrast to the unavoidable challenges on the experimental front, we have made strong progress in the development and application of our computational methods.

First, we have designed and implemented a new high-performance disorder predictor (an essential piece of technology for our pipeline) and a general deep-learning framework to map functional annotation to sequence (two separate papers, both submitted/on bioRxiv). We have also finalized initial versions of our tools to predict transient helicity and identify local subregions enriched for specific amino acids, as described in our original proposal for the first funding period. The development of these tools has empowered my group to make discoveries and test our overarching hypotheses regarding the mapping of sequence-to-function.
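As a purely illustrative sketch of what identifying local subregions enriched for specific amino acids can look like in its simplest form, the Python function below scans a sequence with a fixed-length sliding window and reports windows in which a chosen set of residues exceeds a threshold fraction. This is not our tool or its algorithm: the function name `enriched_windows`, the window length, and the threshold are hypothetical choices made only for this example.

```python
def enriched_windows(sequence, residues, window=10, threshold=0.5):
    """Return (start, end, fraction) for every window whose fraction of
    `residues` is at least `threshold`.

    Coordinates are 0-based and end-exclusive. Illustrative only; the
    default window and threshold are arbitrary assumptions.
    """
    targets = set(residues)
    hits = []
    if window <= 0 or len(sequence) < window:
        return hits

    # Count target residues in the first window, then update incrementally.
    count = sum(1 for aa in sequence[:window] if aa in targets)
    for start in range(len(sequence) - window + 1):
        if start > 0:
            # Slide by one: drop the residue that left, add the one that entered.
            count -= sequence[start - 1] in targets
            count += sequence[start + window - 1] in targets
        fraction = count / window
        if fraction >= threshold:
            hits.append((start, start + window, fraction))
    return hits


if __name__ == "__main__":
    # Toy sequence with a cysteine-rich stretch in the middle.
    for start, end, fraction in enriched_windows(
        "AAAACCCCCCAAAA", "C", window=5, threshold=0.8
    ):
        print(start, end, fraction)
```

Overlapping hits could be merged into contiguous regions, and a realistic analysis would also need a statistical background model; both are omitted here for brevity.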

Specifically, we have made three distinct scientific discoveries with broad implications for human health and disease. First, we discovered that viruses can evolve rapidly through compensatory changes in disordered regions that lead to analogous functional outcomes despite large changes in amino acid sequence, a physical mechanism we have termed conformational buffering. Second, we discovered that SARS-CoV-2 viral entry is dependent on a cysteine-rich disordered region in the membrane-bound spike protein, and that this molecular signature appears to underlie a general physical mechanism for the catalysis of membrane fusion. Finally, using our computational methods, we identified IDRs in the SARS-CoV-2 nucleocapsid protein, a protein essential for viral genome packaging, that are predicted to undergo self-association and bind RNA. This work led us to propose a new physical mechanism for the early stages of virion assembly. Beyond these specific vignettes, we have applied our growing understanding of how IDRs work, coupled with novel computational technology, to uncover IDR-driven biological function across a wide range of systems; some of this work is published, but much of it is not yet.

In short, despite my group’s broad and ambitious goal of developing a general understanding of how IDRs drive biological function, we have submitted or published sixteen primary research articles (five as corresponding author) since June of 2020. We have papers in press at Cell and under review or in revision at Nature, PNAS, Developmental Cell, eLife, Cell Systems, and others. We highlight this progress simply to illustrate that the general hypothesis we are pursuing, that IDR function can be decoded from amino acid sequence, appears to be both correct and fruitful. It is allowing my group to make a wide range of fundamental discoveries while further developing our technological approaches to better ask and answer new questions.